How to Avoid MS Office Specific Tags in Web Pages

MS Word is more than just a word processor for your offline documents. It supports editing and uploading content to various blog platforms such as Blogger, WordPress and SharePoint Blog, as well as emailing documents. Though not ideal for the task, it also supports editing web pages (*.html, *.htm, *.xml). In fact, after installing MS Office, some of these web file types may end up associated with MS Word if no other program is already associated with them.

One problem that arises with this web-editing support is the inclusion of MS Office specific tags that increase the size of web pages and HTML emails. This happens even more often when you copy and paste from a Word document into a post editor on a web page (e.g. in Blogger or WordPress). A typical paste retains all the MS Office specific tags. Apart from the Word specific formatting they produce, these tags are not visible in the post editor, but you can see them by viewing the raw HTML code of the post. They are included to retain Word specific features such as certain formatting and styles, so that you can edit the document as it was originally should the need arise.
Tags and comments in an HTML file saved by MS Word

As displayed in the image above, it's actually not just tags; comments are included as well, and the commented text usually makes up the bulk of the content. The tags and comments are also included when you copy and paste content from Word into your post editor; the document doesn't have to be saved as a web page first (*.html, *.htm). This is actually the most common way people end up unwittingly including these tags in their websites.
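
If you want to check whether a piece of HTML carries this kind of markup, a rough check along these lines may help. It's only a sketch: the patterns below (conditional comments mentioning `mso`, namespaced tags such as `<o:p>`, `mso-` style properties and `Mso` class names) are typical of what Word produces but are my own selection and not an exhaustive list, and the function name `find_office_markup` is made up for this example.

```python
import re

# Patterns typical of Word-generated HTML (an assumption, not an exhaustive list).
OFFICE_PATTERNS = {
    "conditional comment": re.compile(r"<!--\[if[^\]]*mso[^\]]*\]>", re.IGNORECASE),
    "namespaced tag": re.compile(r"</?(?:o|w|v|m):[a-z0-9]+", re.IGNORECASE),
    "mso- style property": re.compile(r"mso-[a-z0-9-]+\s*:", re.IGNORECASE),
    "Mso class name": re.compile(r"class\s*=\s*[\"']?Mso[A-Za-z0-9]+", re.IGNORECASE),
}


def find_office_markup(html):
    """Return a dict mapping each pattern label to the number of matches found."""
    counts = {}
    for label, pattern in OFFICE_PATTERNS.items():
        hits = pattern.findall(html)
        if hits:
            counts[label] = len(hits)
    return counts


if __name__ == "__main__":
    sample = (
        '<!--[if gte mso 9]><xml><w:WordDocument></w:WordDocument></xml><![endif]-->'
        '<p class=MsoNormal style="mso-margin-top-alt:auto">Hello<o:p></o:p></p>'
    )
    print(find_office_markup(sample))
```

An empty result suggests the HTML is probably clean; any non-empty result is a hint to look at the raw HTML more closely.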

Since we often need to optimize our content for fast loading, not to mention ensure compatibility, it's recommended to remove these tags, and even more so the comments, since they serve no purpose on a web page. Personally, I've seen these tags cause some strange errors on this blog, so I always make a point of not including them after using Word to proofread my content. To remove them you can use any of the following measures:

A. When you don't need to retain any Formatting

1. When copying and pasting into your online post editor, use Ctrl+Shift+V instead of Ctrl+V. That should ideally paste the content as plain text only, i.e. without any of the Word formatting and tags.

2. Plain text pasting doesn't always seem to work (I don't know if it's a browser thing), in which case it's safer to first paste the content into a plain text editor (e.g. Notepad, Notepad++ or Sublime Text), then copy it from there into your post editor.
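
If you already have the Word-generated HTML in a file or on hand as a string, the same "plain text only" result can be approximated with a few lines of Python. This is a minimal sketch using only the standard library `html.parser` module; the class name `TextOnly`, the function name `html_to_text` and the choice of which tags start a new line are my own assumptions, and it makes no attempt to handle every quirk of Word's output.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect text content and drop every tag and comment."""

    # Tags that should start a new line in the plain text (an assumption).
    BLOCK_TAGS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}
    # Tags whose content is not readable text.
    SKIPPED_TAGS = {"style", "script"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self.skip_depth += 1
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Comments never reach this method, so they are dropped automatically.
        if not self.skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Return the text of an HTML fragment, one non-empty line per block."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    lines = (" ".join(line.split()) for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)


if __name__ == "__main__":
    word_html = (
        '<p class=MsoNormal><b>Hello</b> world<o:p></o:p></p>'
        '<!--[if gte mso 9]><xml>junk</xml><![endif]-->'
        '<p>Second</p>'
    )
    print(html_to_text(word_html))
```

The output can then be pasted into the post editor like any other plain text.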

B. When you need to retain (some) Formatting

I'm using the word "some" because not all Word formatting may be supported on your web page. Things like bullets, lists and tables do work just fine in most cases, but you still need to remove those useless tags. To do this:

1. Save the document as filtered HTML - this will remove all the MS Office specific tags while still retaining most of the formatting and editing functionality. To do this go to:
Save As > Web Page, Filtered (*.htm, *.html)
Save As... Filtered Web Page

You can then open the saved web page with a plain text editor/HTML editor and copy and paste the HTML code into your post editor while in HTML mode. If you only need a specific element from the web page (e.g. a table), you can just copy that part of the code then paste it in the post editor where you need it to appear.
Compare: Same file saved in Normal HTML and Filtered HTML

2. There are many online services that can clean your HTML for you automatically, including removing those MS Office specific tags. As convenient as they are, these online "cleaners" may change some of your formatting or may require you to play around with some settings to get your desired output. There's also a privacy risk if such a service is not clear on how it handles your content on its servers: for instance, does it store your content, and if so, for how long, and does it analyse it in any way? If you're paranoid about all this, the first two options should suffice.
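
If you'd rather keep your content on your own machine, a small local script can do a rough version of the same cleanup. The sketch below is not how any particular online cleaner works; it is just a handful of regular expressions, under the assumption that the unwanted markup consists of HTML comments, namespaced tags such as `<o:p>`, `mso-` style properties and `Mso` class names. The function name `strip_office_markup` is made up for this example. Regular expressions are a blunt tool for HTML, so check the result before publishing it. Note in particular that it drops every comment, including any CSS that sits inside comment markers, and that it would also touch visible text that happens to look like an `mso-` property.

```python
import re


def strip_office_markup(html):
    """Remove typical Word-specific markup from an HTML string (rough sketch)."""
    # 1. Drop all comments, including <!--[if gte mso 9]> ... <![endif]--> blocks.
    html = re.sub(r"<!--.*?-->", "", html, flags=re.DOTALL)
    # 2. Drop bare conditional markers such as <![if !supportLists]> and <![endif]>.
    html = re.sub(r"<!\[(?:if[^\]]*|endif)\]>", "", html, flags=re.IGNORECASE)
    # 3. Drop namespaced tags (<o:p>, </o:p>, <w:...>) but keep any text inside them.
    html = re.sub(r"</?(?:o|w|v|m):[^>]*>", "", html, flags=re.IGNORECASE)
    # 4. Drop mso-* declarations, e.g. inside style attributes.
    html = re.sub(r"\s*mso-[a-z0-9-]+\s*:[^;\"']*;?", "", html, flags=re.IGNORECASE)
    # 5. Drop class attributes whose value starts with "Mso".
    html = re.sub(r"\s+class\s*=\s*(?:\"Mso[^\"]*\"|'Mso[^']*'|Mso[^\s>]*)", "", html)
    # 6. Drop style attributes left empty by step 4.
    html = re.sub(r"\s+style\s*=\s*(?:\"\s*\"|'\s*')", "", html)
    return html


if __name__ == "__main__":
    dirty = (
        '<p class=MsoNormal style="mso-margin-top-alt:auto;color:red">Hi<o:p></o:p></p>'
        '<!--[if gte mso 9]><xml>x</xml><![endif]-->'
    )
    print(strip_office_markup(dirty))
```

Because nothing leaves your computer, this avoids the privacy question entirely, at the cost of being less thorough than a dedicated tool.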