The other day, a high school teacher gave me some past examination papers to type up for a book she was compiling. There were quite a lot of them: a whole file’s worth, to be exact.
Knowing how much time and effort this would take, I decided to look for a much quicker solution. The first thing that came to mind was OCR (Optical Character Recognition).
OCR is basically the identification of text from image files. In layman’s terms, think of it as converting images to text.
OCR can thus save you time and money that you’d otherwise spend typing or outsourcing to professionals. In my case, I was able to reduce the workload of this particular job by about 70-80%, and the figure would have been higher were it not for a few wrongly identified characters and some diagrams that needed touching up.
For my OCR needs, I went with MS Office. I had considered other options before settling on it, such as the feature-rich PDF-XChange Editor, which bundles OCR with its PDF viewer. Ultimately, however, they all proved less capable than the OCR engine in MS Office, which was more accurate and faster.
I suppose that could be attributed to them using the free Tesseract OCR engine which, while powerful in its own right, tends to be outperformed by commercial alternatives.
Getting Started: Microsoft Office OCR Options
MS Office does OCR in two ways:
- Using OneNote
- Using Microsoft Office Document Imaging (MODI)
Any version of OneNote that shipped with MS Office, from 2007 onward, will do for this purpose. Note, however, that starting with Office 2019, the OneNote app for Windows 10 superseded the earlier Office versions, and it doesn’t include the OCR option.
For MODI, things are a little different, as it was discontinued long ago. MS Office 2007 was the last version to feature it. However, you don’t necessarily need MS Office 2007, as MODI can be installed separately and used with newer versions of MS Office.
What You’ll Need
- OneNote (any version from Office 2007-2016)
- For the standalone OCR function, you’ll need an installer that includes MODI. You can get it from two sources:
  - The standalone SharePoint Designer 2007, which Microsoft provided as a free download. MS pulled the link from their download center; get it from Softpedia instead.
  - A licensed copy of MS Office 2007. If you already have one, you can use it instead of downloading SharePoint Designer 2007.
- An image to OCR
- A scanner, if you want to OCR during the scanning process.
1. How to OCR with OneNote
- Launch OneNote and start by creating a New Note.
- In the ribbon, go to the Insert tab and insert the image to OCR.
- Inside the note, right-click the inserted image and select Copy Text from Picture.
- Open MS Word or a text editor and paste the text that has been recognized.
- You can alternatively search the text within OneNote instead of copy-pasting it elsewhere. To do that, right-click the inserted image and select Make Text in Image Searchable, then select the language the text is in.
- You can then use Ctrl+F to search for text inside the image. If it finds a match, it will be highlighted.
NOTE: If you need a different language, see the language support section below for how to install additional language packs.
2. How to OCR with Microsoft Office Document Imaging (MODI)
Step 1: Installing MODI
- Run your SharePoint Designer 2007 or MS Office 2007 installer.
- Select the Customize installation option.
- Set all the available options to Not Available, then expand Office Tools and set Microsoft Office Document Imaging to Run all from My Computer.
- Now leave it to install.
Step 2: OCR with MODI
MODI can run OCR in either of two ways:
- By running the OCR on image files
- By connecting to your scanner and automatically running the OCR once the document has finished scanning
i. OCR an Image
MODI only OCRs images that are in TIFF (*.tif, *.tiff) format. If your picture is in another format (e.g. JPEG, PNG, GIF), you can use one of the many free image editors available online (XnView, IrfanView etc.) to convert it to TIFF.
You can even use Paint to do the conversion. Just open the image with Paint, choose Save as, then select Other Formats. In the save dialog, select the TIFF type and save the image.
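If you have a whole folder of scans, converting them one at a time gets tedious. Below is a minimal sketch of a batch conversion in Python. It assumes the third-party Pillow library is installed, which is not something the steps in this article require; the function names and the folder name in the usage note are made up for illustration.

```python
from pathlib import Path

# Formats the article mentions as needing conversion before MODI can read them.
CONVERTIBLE = {".jpg", ".jpeg", ".png", ".gif"}


def tiff_target(source: Path) -> Path:
    """Return the .tif path that sits next to the source image."""
    return source.with_suffix(".tif")


def convert_folder(folder: Path) -> list[Path]:
    """Convert every JPEG/PNG/GIF in `folder` to TIFF and return the new paths."""
    # Assumption: Pillow is installed. It is imported here so that the
    # path helper above still works on a machine without it.
    from PIL import Image

    written = []
    for source in sorted(folder.iterdir()):
        if source.suffix.lower() not in CONVERTIBLE:
            continue
        target = tiff_target(source)
        with Image.open(source) as img:
            # Convert to RGB so palette-based GIFs and transparent PNGs
            # are saved in a plain, predictable color mode.
            img.convert("RGB").save(target, format="TIFF")
        written.append(target)
    return written
```

Calling `convert_folder(Path("scans"))` would write a `.tif` file next to each supported image in a hypothetical `scans` folder, leaving the originals untouched.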
Once you have your images in this format, do the following:
- In the Start menu, go to Programs and, inside Microsoft Office Tools, open Microsoft Office Document Imaging.
- Inside MODI, click the Open icon and select your TIFF image from the dialog.
- Once the image is loaded inside MODI, click the Recognize Text Using OCR button.
- Give it time to do the OCR. Once it’s done, click the Send Text to Word button.
- A dialog will pop up with options to send the text. If the TIFF had multiple pages, make sure to select the All Pages option. If the image had pictures/diagrams inside it that you’d wish to export too, check the option to Maintain pictures in output. Click the OK button.
- The recognized text and any pictures it may have found will be exported to an HTML file, which opens in whichever version of Word you have installed.
ii. OCR directly from the Scanner
- Connect your scanner and load the item to scan.
- In the Start menu, go to Programs and, inside Microsoft Office Tools, open Microsoft Office Document Scanning.
- In the scanning window, click the Scanner button and select your scanner.
- Depending on the nature of the item you’re scanning, select a suitable color preset for it: Color, Grayscale or Black and White.
- Click the Scanning button. Your item will be scanned, and after it’s done, OCR will be run automatically.
- The recognized text will then be opened in MODI. Finish by clicking the Send Text to Word button to transfer the recognized text and any pictures to Word.
NOTE:
In the default save folder, you’ll find the HTML file containing the OCR information. Check inside the corresponding HTML folder for any pictures such as diagrams that MODI will have exported.
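If you only want the plain text out of that HTML file, for instance to paste it into a text editor, a small script can strip the markup. The following is a rough sketch using Python’s standard library only. The class and function names are my own, and I’m assuming nothing about how MODI structures its HTML beyond it being ordinary markup.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than zero while inside script/style
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def html_to_text(html: str) -> str:
    """Return the text runs found in `html`, one per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

Each run of text between tags ends up on its own line, so a sentence containing inline formatting will be split; treat the output as a starting point for cleanup rather than a finished document.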
Language Support for MODI and OneNote
The OCR feature in OneNote and MODI comes with built-in support for only three languages: English, French and Spanish. By default, it will use the language that your installed MS Office is using.
You can change the language MODI uses for the OCR by doing the following:
- Open Microsoft Office Document Imaging.
- Go to Tools > Options…
- Select the OCR tab and then select the OCR Language dropdown.
For other languages, you’ll have to install the corresponding Language Pack before OneNote or MODI can OCR them. This applies particularly to languages that use a different alphabet from English, such as Greek, Korean, Chinese, Japanese, Arabic and the Cyrillic-script languages (Russian, Bulgarian, Serbian, Ukrainian etc.).
1. Installing OCR Language Packs for OneNote
To install an OCR language pack for OneNote, do the following:
- Open OneNote and go to: Options > Language.
- Add the language from the drop-down menu, then when it appears inside the languages box, click the Not Installed link in the Proofing column.
Doing that will take you to the Microsoft Office support site, where you can download the free language packs.
Make sure to download the correct language pack for the version of MS Office you’re using, i.e. either the 32-bit or the 64-bit package for your MS Office version.
2. Installing OCR Language Packs for MODI
For MODI, the process is a little more complicated, but there’s an excellent guide on how to go about installing the language packs here.
If all this sounds like a lot of work, you can opt to use Tesseract, which supports a wide range of languages. Tesseract, however, is a command-line tool, but you can find a couple of GUIs (versions with a user interface) for it online.
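To give an idea of what the command-line route looks like, here is a small sketch that wraps Tesseract from Python. It assumes the `tesseract` program is installed and on your PATH, along with the data for whichever language code you pass (such as `eng`, `fra` or `spa`). The function names are my own and not part of Tesseract.

```python
import subprocess
from pathlib import Path


def tesseract_command(image: Path, language: str = "eng") -> list[str]:
    """Build the argument list for: tesseract <image> <output base> -l <language>."""
    # Tesseract appends ".txt" to the output base on its own,
    # so the base is the image path without its extension.
    output_base = image.with_suffix("")
    return ["tesseract", str(image), str(output_base), "-l", language]


def run_ocr(image: Path, language: str = "eng") -> Path:
    """Run Tesseract on one image and return the path of the text file it writes."""
    # check=True raises an error if Tesseract exits with a failure code.
    subprocess.run(tesseract_command(image, language), check=True)
    return image.with_suffix(".txt")
```

For example, `run_ocr(Path("page1.tif"), "fra")` would run `tesseract page1.tif page1 -l fra` and, if it succeeds, leave the recognized French text in `page1.txt`.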