How to Clean MS Office Specific Tags from HTML

MS Word is more than just a word processor for your offline documents. It also supports editing and publishing content to blog platforms such as Blogger, WordPress and SharePoint Blog, as well as emailing documents.

Though not ideal for this task, it also supports editing web pages (*.html, *.htm, *.xml). As a matter of fact, after installing MS Office, some of these web file types may end up associated with MS Word if no other program is already associated with them.

One problem that arises with this web-editing support is the inclusion of MS Office-specific tags, which increase the size of web pages and emails in HTML format. It occurs even more often when you copy and paste from a Word document into a post editor on a web page (e.g. in Blogger, WordPress etc.).

A typical paste retains all the MS Office-specific tags. They are not visible in the post editor (apart from the Word-specific formatting they produce), but you can see them by viewing the raw HTML code of the post.

Why Does Microsoft Include the Office-Specific Tags?

The MS Office-specific tags are included to retain Word-specific features such as certain formatting and styles, so that you can edit the document as it was originally should the need arise.

As displayed in the image below, it’s actually not just tags: comments are included too. The commented text usually makes up the bulk of the content.

[Image: Tags and comments in an HTML file saved by MS Word]
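
If you want to check whether a piece of HTML carries this kind of markup, a rough check can be sketched in Python. The patterns below cover markup Word is commonly seen to emit (conditional comments, namespaced tags like `<o:p>`, `mso-` style properties and `Mso...` class names); the function name `find_office_markers` and the sample string are made up for illustration, and the list of patterns is not exhaustive.

```python
import re

# Patterns for markup that Word commonly emits when saving or pasting HTML.
# This list is an assumption for illustration, not a complete catalogue.
OFFICE_MARKERS = {
    "conditional comment": re.compile(r"<!--\[if [^\]]*\]>", re.IGNORECASE),
    "office namespace tag": re.compile(r"</?[ovw]:[a-z0-9]+", re.IGNORECASE),
    "mso style property": re.compile(r"mso-[a-z0-9-]+\s*:", re.IGNORECASE),
    "Mso class name": re.compile(r"""class\s*=\s*["']?Mso[A-Za-z0-9]+"""),
}


def find_office_markers(html):
    """Count how often each kind of Office-specific markup appears."""
    return {name: len(pattern.findall(html))
            for name, pattern in OFFICE_MARKERS.items()}


# Hypothetical sample resembling what Word produces.
sample = (
    '<p class=MsoNormal style="mso-margin-top-alt:auto">Hello<o:p></o:p></p>'
    "<!--[if gte mso 9]><xml><w:WordDocument></w:WordDocument></xml><![endif]-->"
)
counts = find_office_markers(sample)
for name, count in counts.items():
    print(f"{name}: {count}")
```

Any non-zero count is a hint that the HTML came out of Word and is worth cleaning before publishing.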

Another thing is that the tags and comments are also included when you copy and paste the content from Word into your post editor; the document doesn’t have to be saved as a web page first (*.html, *.htm). This is actually the most common way people end up unwittingly including these tags in their web pages.

Since we often need to optimize our content for fast loading, not to mention ensure compatibility, it’s recommended to remove these tags and, even more so, the comments, since they serve no purpose there.

Personally, I’ve seen these tags cause some strange errors on this blog, so I always make a point of not including them after using Word to proofread my content. To remove them you can use any of the following measures:

Option 1: When you don’t need to retain any Formatting

1. When copying and pasting to your online post editor, use Ctrl+Shift+V instead of Ctrl+V. That should ideally paste the content as plain text only, i.e. without any of the Word formatting and tags.

2. The plain text pasting doesn’t always seem to work (I don’t know if it’s a browser thing), in which case it’s safer to paste the content into a plain text editor first (e.g. Notepad, Notepad++, Sublime Text etc.) and then copy it from there into your post editor.
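
If you already have a file that Word saved as HTML, the same "plain text only" result can be approximated with a short Python script. This is only a sketch using the standard library's `html.parser`; the class `TextOnly`, the function `to_plain_text` and the sample string are hypothetical names made up for this example, and the set of tags treated as line breaks is a guess you may need to adjust.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect visible text; tags, comments and style blocks are dropped."""

    SKIP = {"head", "style", "script"}           # content we never want
    BLOCK = {"p", "div", "li", "tr", "h1", "h2", "h3"}  # end with a newline

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            if self.skip_depth:
                self.skip_depth -= 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def to_plain_text(html):
    """Return the text of an HTML string, one non-empty line per block."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    lines = (line.strip() for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)


# Hypothetical sample resembling a document saved by Word.
sample = (
    "<html><head><style>p.MsoNormal {margin:0cm;}</style></head><body>"
    "<p class=MsoNormal>First <b>paragraph</b><o:p></o:p></p>"
    "<!--[if gte mso 9]><xml><w:WordDocument/></xml><![endif]-->"
    "<p class=MsoNormal>Tom &amp; Jerry</p></body></html>"
)
print(to_plain_text(sample))
```

Like the plain text paste, this throws away all formatting, so it only fits the case where you don’t need to keep any of it.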

Option 2: When you need to retain (some) Formatting

I’m using the word “some” because not all Word formatting may be supported on your web page. Things like bullets, lists and tables work just fine in most cases, but you still need to remove those useless tags.

1. Save as a Filtered HTML

Saving the document as filtered HTML removes the MS Office-specific tags while still retaining most of the formatting and editing functionality.

In Word, go to:

 Save As > Web Page, Filtered (*.htm, *.html) 

You can then open the saved web page with a plain text editor or an HTML editor, and copy and paste the HTML code into your HTML post editor.

[Image: Compare: same file saved in normal HTML vs filtered HTML]

If you only need a specific element from the web page (e.g. a table), you can just copy that part of the code then paste it in the post editor where you need it to appear.

2. Use an Online Clean up Tool

There are many online services that can clean HTML for you automatically, including removing those MS Office-specific tags. For my uses I like using this one as it does a good job cleaning up Word tables into HTML-friendly tables.

As convenient as they are, these online “cleaners” may change some of your formatting, or may require you to play around with some settings to get your desired output.

There’s also a privacy risk if such a service is not clear on how it handles your content on its servers: for instance, does it store your content, and if so, for how long, and does it analyse it in any way?

If you’re paranoid about all this, the first two options should suffice.
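
For those who would rather not upload anything, a similar cleanup can be roughed out locally in Python. Treat the following as a minimal sketch, not a replacement for the options above: the function name `clean_word_html` and the sample are hypothetical, the regular expressions only target the most common Word leftovers (comments, `<o:p>`-style tags, `Mso...` classes and `mso-` style properties), and because it works on raw text it could also touch body text that happens to look like an `mso-` declaration.

```python
import re


def clean_word_html(html):
    """Strip common Word-only markup while leaving ordinary HTML tags alone."""
    # Comments, including <!--[if gte mso 9]> ... <![endif]--> blocks.
    html = re.sub(r"<!--.*?-->", "", html, flags=re.DOTALL)
    # Conditionals written outside comments: <![if !supportLists]>, <![endif]>.
    html = re.sub(r"<!\[(?:if|endif)[^>]*>", "", html, flags=re.IGNORECASE)
    # Namespaced Office tags such as <o:p>, </o:p>, <w:...>, <v:...>.
    html = re.sub(r"</?[ovwm]:[^>]*>", "", html, flags=re.IGNORECASE)
    # class attributes whose value starts with "Mso" (e.g. MsoNormal).
    html = re.sub(
        r"""\s+class\s*=\s*(?:"Mso[^"]*"|'Mso[^']*'|Mso[^\s>]*)""", "", html)
    # mso-* declarations inside style attributes and style blocks.
    html = re.sub(
        r"""mso-[a-z0-9-]+\s*:[^;"']*;?\s*""", "", html, flags=re.IGNORECASE)
    # style attributes that the previous step left empty.
    html = re.sub(r"""\s+style\s*=\s*(?:"\s*"|'\s*')""", "", html)
    return html


# Hypothetical sample resembling a paste from Word.
sample = (
    '<p class=MsoNormal style="mso-margin-top-alt:auto;color:red">'
    "Hello<o:p></o:p></p>"
    "<!--[if gte mso 9]><xml><w:WordDocument/></xml><![endif]-->"
    '<p class="MsoNormal" style="mso-bidi-font-weight:normal">World</p>'
)
cleaned = clean_word_html(sample)
print(cleaned)
```

Ordinary formatting such as `color:red` survives, so it is worth comparing the output against the original page before publishing it.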
