Convert Broken HTML to XHTML

I recently made the decision to refactor a 600-page software manual. That’s a daunting task, so why did I do it? The old format was barely working, inflexible, required a truly awful proprietary tool, and cost the company considerable time and money when changes (such as translations) were required.
The underlying pages were in HTML, or at least the closest thing to HTML that still actually worked. In reality, the code was awful; there were broken tags and redundant tags all over the place. The editor in question (developed by a small company in Hawaii) is nothing more than a wrapper around Microsoft’s free HTML Help Workshop tool. I decided to clean up the HTML (read: convert it to XHTML), dump the editor, and dynamically build the manual the same way I’ve done at companies in the past. This is an ongoing project, but here’s how I handled the task of cleaning up ~600 HTML files so that they were valid XHTML.
Resources:
HTML - Special Entity Codes
HTML Tidy
Online RegExr Test Tool and Interactive Tutorial
Sublime Text Editor
[Image: running a Regular Expression replace in the Sublime Text editor]
Step 1: Clean up the HTML with HTML Tidy
HTML Tidy is a convenient way to repair poor HTML. It doesn’t fix everything, but it does help, and it makes the code look a lot better because it fixes much of the indentation. So the first thing I did was run HTML Tidy on all the files.
I ran the command below in Git Bash after turning off word wrap in the Tidy settings file. Even with word wrap turned off, HTML Tidy inserted more newlines than you’d expect, so it isn’t perfect, but it made a big difference.
$ find /C/Manual -type f -name "*.htm" -exec tidy -f errors.txt -m -utf8 -i {} \;
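If you would rather script this step than rely on find, here is a minimal Python sketch of the same batch run. It assumes tidy is on your PATH. The function names and the per-file error log naming are my own illustration and are not part of HTML Tidy; the tidy flags are the same ones used in the command above.

```python
from pathlib import Path
import subprocess


def find_htm_files(root):
    """Return every .htm file under root, sorted for a stable order."""
    return sorted(p for p in Path(root).rglob("*.htm") if p.is_file())


def tidy_command(path):
    """Build the tidy call for one file, using the flags from the find command."""
    # Assumption: one error log per file, so a later run does not
    # overwrite the log written for an earlier file.
    error_log = str(path) + ".errors.txt"
    return ["tidy", "-f", error_log, "-m", "-utf8", "-i", str(path)]


def tidy_all(root):
    """Run tidy in place on every .htm file under root."""
    for path in find_htm_files(root):
        # tidy exits with 1 on warnings and 2 on errors, so a non-zero
        # exit status is expected here and is not treated as a failure.
        subprocess.run(tidy_command(path))


# Example call (the path is hypothetical): tidy_all("C:/Manual")
```

Writing a separate log per file is a design choice on my part: with a single shared errors.txt, each tidy run replaces the log from the previous file.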
Note that you can remove the HTML Tidy watermark pretty easily using find/replace in Sublime. And that is a nice segue to the next step.
Step 2: Simple Find/Replace in Sublime
Using the “Find in Files…” feature, it’s easy to make simple text substitutions in Sublime. For example, to be XHTML compliant, I needed to convert &nbsp; to &#160;, <br> to <br />, and make many similar substitutions.
I also needed to remove some tags outright, such as the tags added when someone pasted text from Microsoft Word into the editor.
Sublime will help you figure out the syntax to match just the current open file, all open files, or a whole directory structure. For example, in the “Where” box for Replace, you might enter c:\directory\test,*.htm to match all .htm files.
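The same kind of literal substitution can also be sketched in a few lines of Python, which is handy if you want to repeat the pass later. The replacement pairs below are illustrative assumptions about typical HTML-to-XHTML fixes, not my complete list, and the function name is made up for this example.

```python
# Illustrative (old, new) pairs; extend this list with your own substitutions.
REPLACEMENTS = [
    ("&nbsp;", "&#160;"),  # named entity to numeric entity
    ("<br>", "<br />"),    # self-close empty elements
    ("<hr>", "<hr />"),
]


def apply_replacements(text, pairs=REPLACEMENTS):
    """Apply each literal find/replace pair to text, in order."""
    for old, new in pairs:
        text = text.replace(old, new)
    return text


sample = "Line one<br>Line&nbsp;two"
print(apply_replacements(sample))
```

Because these are plain string replacements, running the pass a second time leaves already-converted text such as <br /> unchanged.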
Step 3: RegEx Find/Replace in Sublime
Simple find/replace actions got me part way there, but they wouldn’t solve all the issues I had to deal with in the broken HTML. The next step was to use Regular Expressions to enable some more sophisticated corrections.
One example was attributes within HTML tags (such as size, height, etc.) whose values weren’t enclosed in quotation marks. Browsers will tolerate that transgression, but it’s not valid XHTML. I had to find a quick way to add quotes around these attribute values in ~600 files. The answer was find/replace using regular expressions in Sublime.
Find: (size=)([0-9]+)
This creates two capturing groups: “size=” is the first, and one or more digits (0-9) is the second. The + matters; without it, the second group would capture only the first digit.
Replace: $1"$2"
This replace command encloses the second capturing group in quotation marks. For example, size=100 becomes size="100".
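For comparison, here is a minimal Python sketch of the same idea using the re module. Note that Python writes back-references as \1 and \2 rather than $1 and $2. The function name, the default list of attribute names, and the lookahead that skips values like 50% are my own assumptions for illustration; they go a little beyond the Sublime pattern above.

```python
import re


def quote_numeric_attributes(html, names=("size", "height", "width")):
    """Wrap unquoted, purely numeric attribute values in double quotes."""
    pattern = re.compile(
        r"\b(" + "|".join(names) + r")"  # group 1: the attribute name
        r"=([0-9]+)"                     # group 2: one or more digits
        r"(?=[\s/>])"                    # value must end at whitespace, / or >
    )
    return pattern.sub(r'\1="\2"', html)


print(quote_numeric_attributes("<font size=100>"))
print(quote_numeric_attributes("<img height=20 width=30>"))
```

Values that are already quoted are left alone, because the pattern requires a digit immediately after the equals sign.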
Well that’s all for now. I hope you found this helpful. I encourage you to try the RegExr online tool; it’s helpful when refining regular expressions.