Learn how to eliminate the extra HTML tags generated by Microsoft Word 2000 and later versions
Eliminate Clutter and Clean Up HTML Fluff in Microsoft Word-Generated HTML Documents, Specifically for Emails

This post is a distillation of lessons learned while reverse engineering an email template and making it somewhat editable. By the end of this post, you will know how to tame a Microsoft Word-generated HTML template so that you can use it for your own purposes.

Perhaps you created a Microsoft Word document, used the HTML it generated, then lost the original and now need to make changes directly in the HTML. Or you are in the unfortunate position of not having access to the original Word document but needing a cleaner version of its HTML. Or you want clean HTML as a matter of principle. Or it's just a fascinating intellectual exercise. You could also be using Word to create simple web pages or landing pages, knowing that cleaner, more semantic HTML can help you rank better on search engines and reduce page loading times for your users. Whatever the reason, you want clean HTML.

People at work asked me to automate an email that we need to send periodically. Of course, I said this wouldn't be a problem, and it wasn't supposed to be. I asked them to provide the HTML of the email they'd like to send, so that I could add some dynamic text and send it off.

Turns out the task I thought would be the simplest wasn't that simple.

The template I received was sent as an email from Microsoft Outlook. There were some simple styling issues, such as inconsistent fonts in some of the text, which I thought would be a simple fix. When I viewed the source of the email, I was horrified to see it filled with junk I'm not used to seeing. I tried to resolve the issues with my basic CSS skills, reasoning that I didn't need to optimize the template: it was doing its job, and no one would care about the source as long as it rendered beautifully. I didn't want to spend time handling CSS quirks by hand, since I'm not that great at CSS and I knew I would get stuck in a back-and-forth loop of fixing minor quirks that weren't that important.
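To get a feel for how much of a file is Word-specific markup, a rough sketch like the one below can count a few tell-tale patterns. The patterns (`mso-` style properties, `<o:p>` tags, `Mso*` class names, conditional comments) are ones Word commonly emits, but the list is an assumption and far from exhaustive, and `count_word_markup` is a hypothetical helper name, not part of any tool mentioned in this post.

```python
import re

# Patterns commonly seen in Word/Outlook-generated HTML.
# This list is illustrative only, not a complete inventory.
WORD_MARKERS = {
    "mso- style properties": re.compile(r"mso-[a-z-]+\s*:", re.IGNORECASE),
    "<o:p> tags": re.compile(r"<o:p\b", re.IGNORECASE),
    "Mso* class names": re.compile(r"class\s*=\s*[\"']?Mso\w+", re.IGNORECASE),
    "conditional comments": re.compile(r"<!--\[if\b", re.IGNORECASE),
}


def count_word_markup(html):
    """Return how many times each Word-specific pattern occurs in the HTML."""
    return {label: len(pattern.findall(html)) for label, pattern in WORD_MARKERS.items()}


if __name__ == "__main__":
    sample = (
        '<p class=MsoNormal style="mso-margin-top-alt:auto">'
        "<span>Hi<o:p></o:p></span></p>"
        "<!--[if gte mso 9]><xml></xml><![endif]-->"
    )
    for label, count in count_word_markup(sample).items():
        print(f"{label}: {count}")
```

Running the same count before and after a cleanup pass gives a quick, if crude, measure of how much clutter was removed.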

In the first iteration, I made some minor changes and used online tools like https://html-cleaner.com/ to clean up some of the mess while keeping the formatting intact.

In the second iteration, I attempted to redo the entire template by creating a CSS grid with https://grid.layoutit.com/ to get the layout right. This wasn't too bad. I spent perhaps 20 minutes defining the grid. Considering I was also learning the grid tool on the fly, that was fairly quick.

This is the HTML it created:

CSS Grid Layout HTML

And this is the CSS that I was blessed with. Good thing I don't have to spend too much brainpower on what's going on in here, as long as it works.

CSS Grid Styles

I then manually populated the template with my desired content and basic CSS styling, and was able to create a modern, well-formatted HTML email template from scratch that I was proud of.

Until I tested sending an email with the template in Outlook. That's when I discovered that many email clients, Microsoft Outlook among them, can't render modern CSS well. I spent a good few hours on this, with no meaningful result.

Back to the drawing board.

Finally, I gave it another crack with HTML Tidy. This is the tool that cleaned up the HTML well while keeping enough of the semantic elements that a human can easily read them and make changes to the CSS. However, there is a trick to this that I will reveal shortly.

If you're tech-savvy, you can download and install the tidy utility on Windows, Linux, or Mac. Or you can use an online version of the tool at https://infohound.net/tidy/ (I've done both).
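For the installed command-line version, a minimal Python sketch of the invocation might look like this. It assumes the `tidy` executable is on your PATH; `tidy_command` and `run_tidy` are hypothetical helper names, and the file paths are placeholders. The `--clean` and `--word-2000` options are the same ones discussed later in this post.

```python
import shutil
import subprocess


def tidy_command(input_path, output_path, word_2000=True):
    """Build the argument list for a tidy run (nothing is executed here)."""
    cmd = ["tidy", "-q", "--clean", "yes"]
    if word_2000:
        # Ask tidy to strip Word 2000-style markup.
        cmd += ["--word-2000", "yes"]
    cmd += ["-o", output_path, input_path]
    return cmd


def run_tidy(input_path, output_path, word_2000=True):
    """Run tidy if it is installed; return its exit status, or None if it is missing."""
    if shutil.which("tidy") is None:
        return None
    # tidy exits with 1 when it only has warnings, so check=True is not used.
    result = subprocess.run(tidy_command(input_path, output_path, word_2000))
    return result.returncode


if __name__ == "__main__":
    # "email.html" and "email-clean.html" are placeholder file names.
    print(tidy_command("email.html", "email-clean.html"))
```

Keeping the command construction separate from the run makes it easy to print the command and check it before touching any files.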

infohound.net/tidy

Select the 'Word 2000' option and click Tidy!

You'll get a page that shows the warnings as well as the cleaned HTML. Copy and paste the cleaned HTML into JSBin.com, a free web-based editor for quickly making and previewing changes to an HTML document.

Just look at how clean that HTML is on the left, with the rendered preview on the right.

And this is the clean HTML that was generated. How clean!

But there is one problem that you may have noticed: all the formatting is gone. Do you want to redo all the formatting by hand? Well, yes and no. There's no 100% automated way to do this that I know of, but you can get some of the styles back and then tweak them by hand.

But this requires working around an issue with tidy. When you run tidy with both the --clean y and --word-2000 y options selected, the CSS style rules that preserve some of the original formatting are generated but not embedded in the document. I don't know why; I assume it's a bug.

But if you first run tidy on the document with only the --clean y option, you do get a CSS style tag as below:

Copy those CSS styles, then run tidy again with the --word-2000 option. Paste the styles from the first run into the resulting document by hand, and you'll get something better.
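The copy-and-paste step can also be scripted. The sketch below is one possible approach, not something tidy provides: it pulls every `<style>` block out of the first run's output and inserts it before `</head>` in the second run's output. The function names are hypothetical, and the regular expressions assume reasonably well-formed markup, which is usually the case after tidy has run.

```python
import re

STYLE_RE = re.compile(r"<style\b[^>]*>.*?</style>", re.IGNORECASE | re.DOTALL)
HEAD_CLOSE_RE = re.compile(r"</head\s*>", re.IGNORECASE)


def extract_style_blocks(html):
    """Return every <style>...</style> block found in the HTML, in order."""
    return STYLE_RE.findall(html)


def inject_style_blocks(html, style_blocks):
    """Insert the style blocks just before </head>, or at the very top if there is no head."""
    if not style_blocks:
        return html
    styles = "\n".join(style_blocks) + "\n"
    match = HEAD_CLOSE_RE.search(html)
    if match is None:
        return styles + html
    return html[:match.start()] + styles + html[match.start():]


if __name__ == "__main__":
    # Stand-ins for the output of the first run (--clean only)
    # and the second run (with --word-2000 as well).
    first_pass = (
        '<html><head><style type="text/css">p.c1 {font-weight: bold}</style></head>'
        '<body><p class="c1">Footer</p></body></html>'
    )
    second_pass = "<html><head><title>t</title></head><body><p>Footer</p></body></html>"
    print(inject_style_blocks(second_pass, extract_style_blocks(first_pass)))
```

The class names in the copied rules may not match the markup from the second run, so expect to adjust selectors or add class attributes by hand afterwards.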

You may not be able to see it, but the footer text is bold. As I said, it's not ideal, but the additional CSS rules give you a little more to start with. It didn't work very well in this fabricated example, but it worked fairly well on my original email, with the background colors coming through. It's hit and miss.

If you like this sort of reverse engineering and would like to get updates on random interesting stuff I'm working on and what I'm learning along the way, give me a follow on Twitter @shuuabe.🐤