The other day a high school teacher gave me some past examination papers to type up for a book she was compiling. There were quite a lot of them, a whole file’s worth to be exact.
Knowing this would be a challenge in terms of the time and effort I’d have to set aside, I decided to look for a much quicker solution. The first thing that came to mind was OCR (Optical Character Recognition).
OCR is basically the identification of text from image files. In layman’s terms, think of it as converting images to text.
OCR can thus save you time and money that you’d otherwise spend typing or outsourcing to professionals. In my case, I was able to reduce the workload of this particular job by about 70-80%, and it would have been higher were it not for a few wrongly identified characters and the touching up of some diagrams.
For my OCR needs I went with MS Office. I had considered other options before settling on it, such as the feature-rich PDF-XChange Editor, which bundles OCR with its PDF viewer. Ultimately, however, they all proved less capable than the OCR engine in MS Office, which was more accurate and quicker.
I suppose that could be attributed to them using the free Tesseract OCR engine, which, while powerful in its own right, tends to be outperformed by commercial alternatives.
Getting Started: Microsoft Office OCR Options
MS Office does OCR in two ways:
- Using OneNote
- Using Microsoft Office Document Imaging (MODI)
Any version of OneNote (2007-2016) will do for this purpose. For MODI, however, things are a little different as it was discontinued. MS Office 2007 was the last version to feature it.
However, you don’t necessarily need MS Office 2007 to use it, as it can be installed separately and used alongside newer versions of MS Office.
What You’ll Need
- First things first, you’ll need MS Office installed. Any version will do: Office 2007, 2010, 2013 or 2016.
- SharePoint Designer 2007 to install MODI. It is provided as a free download by Microsoft. Get it from Microsoft’s download centre.
- Alternatively, MS Office 2007 to install MODI. If you already have a licensed copy of MS Office 2007, you can use it instead of downloading SharePoint Designer 2007.
- Image to OCR
- A scanner if you want to OCR during the scanning process.
1. OCR with OneNote
1. Launch OneNote and start by creating a New Note.
2. In the ribbon, go to the Insert tab and insert the image to OCR.
|Insert Image to OCR|
3. Inside the note, right-click the inserted image and select Copy Text from Picture.
|Copy Text from Image|
4. Open MS Word or a text editor and paste the text that has been recognized.
5. You can alternatively search the text within OneNote instead of copy-pasting it elsewhere. To do that, right-click the inserted image and select Make Text in Image Searchable then select the language the text is in.
|Make Text in Image Searchable|
You can then use Ctrl+F to search for text inside the image. If it finds a match, it will be highlighted.
NOTE: If you need a different language, check the bottom of this post on how to install additional language packs.
2. OCR with Microsoft Office Document Imaging (MODI)
Step 1. Installing MODI
1. Run your SharePoint Designer 2007 or MS Office 2007 setup.
2. Select the Customize installation option.
3. Set all the available options to Not Available, then expand Office Tools and set Microsoft Office Document Imaging to Run all from my Computer.
4. Now leave it to install.
Step 2. OCR with MODI
MODI OCRs in two ways:
- OCRs Image Files
- Connects with your scanner and automatically OCRs after the scanning is complete
i. OCR an Image
MODI only OCRs images that are in TIFF (*.tif, *.tiff) format. If your picture is in another format (e.g. JPEG, PNG, GIF), you can use one of the many free image editors available online (XnView, IrfanView etc.) to convert it to TIFF.
You can even use Paint to do the conversion. Just open the image with Paint, choose Save as, then select Other Formats. In the save dialog, select the TIFF type and save the image.
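If you have many pictures to convert, doing it by hand gets tedious. Below is a minimal Python sketch of a scripted alternative. It is not part of the workflow I used: it assumes the third-party Pillow library is installed, and the function names are made up for illustration.

```python
from pathlib import Path

# Extensions MODI can already open, as noted above.
TIFF_EXTENSIONS = {".tif", ".tiff"}


def needs_conversion(image_path):
    """Return True when the file is not already a TIFF (judged by extension only)."""
    return Path(image_path).suffix.lower() not in TIFF_EXTENSIONS


def tiff_target(image_path):
    """Return where the converted copy would be saved (same name, .tif extension)."""
    return Path(image_path).with_suffix(".tif")


def convert_to_tiff(image_path):
    """Save a TIFF copy next to the original and return its path."""
    image_path = Path(image_path)
    if not needs_conversion(image_path):
        return image_path  # Already a TIFF, nothing to do.

    # Assumption: Pillow is installed. It is imported here so the
    # helpers above still work on a machine without it.
    from PIL import Image

    target = tiff_target(image_path)
    with Image.open(image_path) as img:
        img.save(target, format="TIFF")
    return target
```

Calling convert_to_tiff on each JPEG or PNG in a folder should leave a .tif copy beside every original, ready to open in MODI.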
Once you have your images in this format, do the following:
1. In the Start menu programs, open Microsoft Office Document Imaging from inside Microsoft Office Tools.
2. Inside MODI, click the Open icon and select your TIFF image from the dialog.
3. Once the image is loaded inside MODI, click the Recognize Text Using OCR button.
4. Give it time to do the OCR. Once it’s done, click the Send Text to Word button.
|Send Text to Word|
5. A dialog will pop up with options to send the text. If the TIFF had multiple pages, make sure to select the All Pages option. If the image had pictures/diagrams inside it that you’d wish to export too, check the option to Maintain pictures in output. Click the OK button.
|Send Text to Word|
6. The recognized text and any pictures it may have found will be exported to an HTML file opened by whichever version of Word you have installed.
ii. OCR directly from the Scanner
1. Connect your scanner and load the item to scan.
2. In the Start menu programs, open Microsoft Office Document Scanning from inside Microsoft Office Tools.
3. In the scanning window, click the Scanner button and select your scanner.
4. Depending on the nature of the item you’re scanning, you can select a suitable color preset for it: Color, Grayscale or Black and White.
5. Click the Scanning button. Your item will be scanned, and once that’s done OCR will run automatically.
6. The recognized text will then be opened in MODI. Finish by clicking the Send Text to Word button to transfer the recognized text and any pictures to Word.
In the default save folder, you’ll find the HTML file containing the OCR information. Check inside the corresponding HTML folder for any pictures such as diagrams that MODI will have exported.
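If you would rather have plain text than an HTML file, the text can be pulled out with a short script. The following is a rough Python sketch using only the standard library. The class and function names are my own, and it has only been checked against the small sample shown, not against real Word output, which may carry extra markup.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # Greater than 0 while inside <script> or <style>.
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def html_to_text(html):
    """Return the text chunks of an HTML string, one per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


# Made-up sample standing in for an exported file.
sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><p>Question 1</p><p>Solve for x.</p></body></html>"
)
print(html_to_text(sample))
```

For a real export you would read the HTML file into a string first and pass that to html_to_text. Which encoding the file uses is something to check on your own machine.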
Language Support for MODI and OneNote
The OCR feature in OneNote and MODI comes with built-in support for only three languages: English, French and Spanish. By default it will use the language that your installed MS Office is using. To change the language MODI uses for OCR, do the following:
- Open Microsoft Office Document Imaging
- Go to Tools > Options…
- Select OCR and then choose OCR Language
|Select OCR Language|
For other languages, particularly those written in a different script from English, such as Greek, Korean, Chinese, Japanese, Arabic or Cyrillic (Slavic languages: Russian, Bulgarian, Serbian, Ukrainian), you’ll have to install the corresponding Language Pack for it to work with OneNote or MODI.
1. Installing OCR Language Packs for OneNote
To install a language pack for OneNote to OCR with:
- Open OneNote and go to: Options > Language.
- Add the language from the drop-down menu, then when it appears inside the languages box, click the Not Installed link under the Proofing column.
That will take you to the Microsoft Office support site where you can download the free language packs. Make sure to download the correct language pack for the version of MS Office you’re using, i.e. the 32-bit or 64-bit edition of MS Office 2010, 2013 or 2016.
|Download Language Pack|
2. Installing OCR Language Packs for MODI
For MODI, the process is a little more complicated, but there’s a really good guide on how to go about installing the language packs here.
If all this sounds like a lot of work, you can opt to use Tesseract, which supports a wide range of languages. Tesseract runs from the command line, however, but you can find a couple of GUIs (versions with a user interface) for it online such as this one.
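To give an idea of what the Tesseract command line looks like, here is a small Python sketch that builds and runs the basic command: tesseract, the image, an output name, then -l and a language code. It assumes Tesseract is installed and on your PATH, the file names are placeholders, and the helper names are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image, output_base, lang="eng"):
    """Build the argument list: tesseract <image> <output base> -l <lang>."""
    return ["tesseract", str(image), str(output_base), "-l", lang]


def run_tesseract(image, output_base, lang="eng"):
    """Run Tesseract and return the path of the text file it writes."""
    # Assumption: the tesseract executable is installed and on PATH.
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract was not found on PATH")
    subprocess.run(build_tesseract_command(image, output_base, lang), check=True)
    # For plain-text output, Tesseract appends .txt to the output base.
    return Path(str(output_base) + ".txt")


# Placeholder file names; "fra" is Tesseract's code for French.
print(build_tesseract_command("page1.tif", "page1", lang="fra"))
```

The language codes are three letters (eng, fra, spa and so on), and the matching language data has to be installed for Tesseract to use one.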