How to OCR with MS Office

The other day, a high school teacher hired me to type up some past examination papers that she was compiling into a book. There were quite a lot of them: a whole file's worth, to be exact.

Knowing this would be a challenge in terms of the time and effort I'd have to set aside, I decided to look for a much quicker solution. The first thing that came to mind was OCR (Optical Character Recognition).

OCR is basically the identification of text from image files. In layman’s terms, think of it as converting images to text.

OCR can thus save you time and money that you'd otherwise spend typing or outsourcing to professionals. In my case, it cut the workload of this particular job by about 70-80%, and the saving would have been higher were it not for a few wrongly identified characters and the need to touch up some diagrams.

For my OCR needs, I went with MS Office. I had considered other options before settling on it, such as the feature-rich PDF-XChange Editor, which bundles OCR with its PDF viewer. Ultimately, however, they all proved less capable than the OCR engine in MS Office, which was both faster and more accurate.

I suspect that's because they use the free Tesseract OCR engine, which, while powerful in its own right, tends to be outperformed by commercial alternatives.

Getting Started: Microsoft Office OCR Options

MS Office does OCR in two ways:

  • Using OneNote
  • Using Microsoft Office Document Imaging (MODI)

Any MS Office version of OneNote from 2007 onwards will do for this purpose. Note, however, that starting with Office 2019, the OneNote app for Windows 10 superseded the earlier Office versions, and it doesn't include the OCR option.

For MODI, things are a little different, as it was discontinued long ago. MS Office 2007 was the last version to feature it. However, you don't necessarily need MS Office 2007, as MODI can be installed separately and used alongside newer versions of MS Office.


What You’ll Need

  • OneNote (any version from Office 2007-2016)
  • For the standalone OCR function, you’ll need SharePoint Designer 2007 to install MODI. You can get it from two sources:
    • The standalone SharePoint Designer 2007, which Microsoft provided as a free download. MS has since pulled the link from their download center; get it from Softpedia instead.
    • If you’ve a licensed copy of MS Office 2007 already, you can use it instead of having to download SharePoint Designer 2007.
  • An image to OCR
  • A scanner if you want to OCR during the scanning process.

1. How to OCR with OneNote

  1. Launch OneNote and start by creating a New Note.
  2. In the ribbon, go to the Insert tab and insert the image to OCR.
    A screenshot showing the insert picture button inside OneNote's ribbon
  3. Inside the note, right-click the inserted image and select Copy Text from Picture.
    A screenshot showing the copy text from picture item in OneNote's context menu
  4. Open MS Word or a text editor and paste the text that has been recognized.
  5. You can alternatively search the text within OneNote instead of copy-pasting it elsewhere. To do that, right-click the inserted image and select Make Text in Image Searchable, then select the language the text is in.
    A screenshot showing the Make Text in Image Searchable item in OneNote's context menu
  6. You can then use Ctrl+F to search for text inside the image. If it finds a match, it will be highlighted.

NOTE: If you need a different language, see the language support section below for how to install additional language packs.

2. How to OCR with Microsoft Office Document Imaging (MODI)

Step 1: Installing MODI

  1. Run your SharePoint Designer 2007 or MS Office 2007 installer.
  2. Select the Customize installation option.
  3. Set all the available options to Not Available, then expand Office Tools and set Microsoft Office Document Imaging to Run all from My Computer.
    A screenshot showing programs available for installation in Office 2007 installer
  4. Now leave it to install.

Step 2: OCR with MODI

MODI can run OCR in either of two ways:

  • By running the OCR on image files
  • By connecting to your scanner and automatically running the OCR once the document has finished scanning

i. OCR an Image

MODI only OCRs images that are in TIFF (*.tif, *.tiff) format. If your picture is in another format (e.g. JPEG, PNG, GIF), you can use one of the many free image tools available online (XnView, IrfanView etc.) to convert it to TIFF.

You can even use Paint to do the conversion. Just open the image with Paint, choose Save as, then select Other Formats. In the save dialog, select the TIFF type and save the image.

Once you have your images in this format, do the following:
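
If you have a whole folder of pictures to convert, doing it one by one in Paint gets tedious. Here's a minimal Python sketch of a batch conversion, assuming you have Python and the third-party Pillow library installed (neither is part of the MS Office workflow described here). The function names and the list of extensions are my own choices for illustration.

```python
from pathlib import Path
from typing import List

# File extensions treated as convertible input (an assumption; adjust as needed).
SOURCE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".bmp"}


def tiff_path_for(image_path: Path) -> Path:
    """Return the path of the TIFF that will sit next to the source image."""
    return image_path.with_suffix(".tif")


def convert_to_tiff(image_path: Path) -> Path:
    """Save a copy of one image as TIFF and return the new file's path."""
    # Imported here so the path helper above works even without Pillow.
    from PIL import Image

    target = tiff_path_for(image_path)
    with Image.open(image_path) as img:
        # GIFs and some PNGs are palette-based; convert to RGB for a plain TIFF.
        img.convert("RGB").save(target, format="TIFF")
    return target


def convert_folder(folder: Path) -> List[Path]:
    """Convert every supported image directly inside a folder."""
    converted = []
    for path in sorted(folder.iterdir()):
        if path.suffix.lower() in SOURCE_EXTENSIONS:
            converted.append(convert_to_tiff(path))
    return converted
```

Calling convert_folder(Path("scans")), where "scans" is a hypothetical folder name, should leave a .tif copy next to each original picture.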

  1. Open the Start menu programs list and, inside Microsoft Office Tools, open Microsoft Office Document Imaging.
  2. Inside MODI, click the Open icon and select your TIFF image from the dialog.
    A screenshot showing the Open image icon in Microsoft Office Document Imaging
  3. Once the image is loaded inside MODI, click the Recognize Text Using OCR button.
    A screenshot showing the OCR button in Microsoft Office Document Imaging
  4. Give it time to do the OCR. Once it’s done, click the Send Text to Word button.
    A screenshot showing the Send Text to Word button in Microsoft Office Document Imaging
  5. A dialog will pop up with options to send the text. If the TIFF had multiple pages, make sure to select the All Pages option. If the image had pictures/diagrams inside it that you’d wish to export too, check the option to Maintain pictures in output. Click the OK button.
    A screenshot showing the Send Text to Word options window
  6. The recognized text and any pictures it may have found will be exported to an HTML file, which opens in whichever version of Word you have installed.

ii. OCR directly from the Scanner

  1. Connect your scanner and load the item to scan.
  2. Open the Start menu programs list and, inside Microsoft Office Tools, open Microsoft Office Document Scanning.
  3. In the scanning window, click the Scanner button and select your scanner.
    A screenshot showing the document scanner window in Microsoft Office Document Imaging
  4. Depending on the nature of the item you're scanning, select a suitable color preset for it: Color, Grayscale or Black and White.
  5. Click the Scanning button. Your item will be scanned, and after it's done, OCR will be run automatically.
  6. The recognized text will then be opened in MODI. Finish by clicking the Send Text to Word button to transfer the recognized text and any pictures to Word.

NOTE:
In the default save folder, you’ll find the HTML file containing the OCR information. Check inside the corresponding HTML folder for any pictures such as diagrams that MODI will have exported.
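
If you'd rather have the recognized text as plain text than as a Word-flavoured HTML file, you can strip the markup yourself. The sketch below uses only Python's standard library and is just one way to do it. I haven't documented the exact structure of the HTML that MODI exports, so treat the result as a rough extraction that may need tidying.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> content."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep only non-empty text that is outside skipped elements.
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def html_to_text(html: str) -> str:
    """Return the text of an HTML document, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

To use it, read the exported file into a string (the file name and its encoding depend on your setup) and pass that string to html_to_text.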

Language Support for MODI and OneNote

The OCR feature in OneNote and MODI ships with built-in support for only three languages: English, French and Spanish. By default, it uses the language of your MS Office installation.

You can change the language MODI uses for the OCR by doing the following:

  1. Open Microsoft Office Document Imaging.
  2. Go to Tools > Options…
  3. Select the OCR tab and then select the OCR Language dropdown.
A screenshot showing the OCR language options in Microsoft Office Document Imaging settings
Select OCR Language

For other languages, particularly those written in a different script from English, such as Greek, Korean, Chinese, Japanese, Arabic and the Cyrillic-based Slavic languages (Russian, Bulgarian, Serbian, Ukrainian), you'll have to install the corresponding Language Pack before they will work with OneNote or MODI.

1. Installing OCR Language Packs for OneNote

To install a language pack for OneNote's OCR, do the following:

  1. Open OneNote and go to: Options > Language
  2. Add the language from the drop-down menu, then when it appears inside the languages box, click the Not Installed link in the Proofing column.
A screenshot showing how to add a new language in OneNote's options

Doing that will take you to the Microsoft Office support site, where you can download the free language packs.

A screenshot showing language pack download options for Office

Make sure to download the correct language pack for the version of MS Office you're using, i.e. the 32-bit or 64-bit package that matches your installation.

2. Installing OCR Language Packs for MODI

For MODI, the process is a little complicated, but there’s an excellent guide on how to go about installing the language packs here.

If all this sounds like a lot of work, you can opt to use Tesseract, which supports a wide range of languages. Tesseract is a command-line tool, however, but you can find a couple of GUIs (versions with a user interface) for it online.
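
If the command line doesn't put you off, here's a rough Python sketch of what a Tesseract run looks like. It assumes Tesseract is installed and on your PATH, along with the language data you need, and it relies on Tesseract's basic usage of an image, an output base name and a -l language code such as eng or fra. The function names are my own.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image, output_base, language="eng"):
    """Build the argument list for: tesseract <image> <output_base> -l <language>."""
    return ["tesseract", str(image), str(output_base), "-l", language]


def ocr_image(image, language="eng"):
    """Run Tesseract on one image and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract was not found on PATH")
    image = Path(image)
    # Tesseract appends ".txt" to the output base name itself.
    output_base = image.with_suffix("")
    subprocess.run(build_tesseract_command(image, output_base, language), check=True)
    text_file = output_base.with_name(output_base.name + ".txt")
    return text_file.read_text(encoding="utf-8")
```

For example, ocr_image("page1.tif", "fra") would run Tesseract on a hypothetical page1.tif using the French language data and return the contents of the resulting page1.txt.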

Author

Kelvin Muriuki is a web content developer that's passionate about keeping the internet a useful place. He is the founder and editor of Journey Bytes, a tech blog and web design agency. Feel free to connect with him regarding the content appearing on this page or on web and content development.
