Use OCR to Create Searchable PDFs

Anyone who works with paperless technology every day quickly gets accustomed to the jargon, and there’s a lot of it! Newcomers can find this a little off-putting, so it is useful to go over some of the more important concepts, such as scanning, PDF and OCR.

Scan To PDF

A scanner is a device attached to a computer that optically scans images, documents, handwriting etc. and converts them to digital images. Scanners vary in their construction: they can be flatbed, or they can work with an Automatic Document Feeder. A flatbed scanner can scan just one side of one sheet per scan. A scanner with an automatic document feeder can scan multiple sheets in one go, and some models can also scan both sides of the paper, in what we call duplex scanning.

For a computer user, the benefits of converting documents to digital images are obvious: once converted, they can be stored, backed up, edited, emailed, copied etc.

Scanners can produce digital images in a number of different formats. The most common file types among document scanners are TIFF and PDF. Unsurprisingly, PDF Scan Pro allows you to create PDF files from your paperwork. PDF is a file format created by Adobe in 1993. Although it began as a proprietary format, today it is an open standard and the de facto file format for document archival. As we will see later, it also has some very interesting properties that make it ideal for scanning.

What is OCR?

OCR stands for Optical Character Recognition. Put simply, OCR is the conversion of the printed word into machine-readable text. It is OCR that allows us to copy text out of a searchable PDF, or to search within a document. It is also OCR that allows technologies like Windows Search and Google Desktop to “read” our scanned documents.

What is a Searchable PDF?

PDFs come in one of two flavours: searchable and non-searchable. In a non-searchable PDF, you are essentially dealing with only an image, a picture of a document. You are not able to search through it, or interact with the text in any way. For example, when I try to select text from a non-searchable PDF, here’s what happens:

You see how we haven’t actually selected the text, we have simply selected a region of an image. We couldn’t paste this as text into an email for example.

Now let’s compare this to a searchable version of the same document:

You see the difference? In this document, we have actually selected the text from the document, and we could paste this into an email or text file – or web page! Let’s try it:

“THE SELF-ASSERTION OF THE GERMAN UNIVERSITY:
ADDRESS, DELIVERED ON THE SOLEMN ASSUMPTION
OF THE RECTORATE OF THE UNIVERSITY FREIBURG
THE RECTORATE 1933/34: FACTS AND THOUGHTS”

So what’s happening here? The OCR engine has located each word and worked out where it is on the scanned image. It then adds invisible text directly into the PDF at the right location, and it is this invisible text that we are selecting. This keeps the document looking exactly like the original scan while making its text machine-readable.
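
To make the idea of "invisible text at the right location" a little more concrete, here is a rough Python sketch. It is only an illustration and not necessarily how PDF Scan Pro or any particular OCR engine does it. The Word class, the function names, the 300 dpi scan resolution and the /F1 font name are all assumptions made up for the example. What does come from the PDF format itself is the "3 Tr" operator, which selects text rendering mode 3 (text that is neither filled nor stroked, so it is invisible), and the fact that PDF pages measure from the bottom-left corner in units of 1/72 inch, while scanned images measure in pixels from the top-left.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One word found by a (hypothetical) OCR engine, in image pixels."""
    text: str
    left: int    # distance from the left edge of the scanned image
    top: int     # distance from the top edge of the scanned image
    width: int
    height: int


def escape_pdf_string(text: str) -> str:
    """Escape the characters that are special inside a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def invisible_text_ops(words, image_height_px, dpi=300, font="/F1"):
    """Build PDF content-stream operators that place each word invisibly.

    Assumes the scanned image fills the whole page and that `font` names a
    font resource defined elsewhere in the PDF (both are assumptions).
    """
    scale = 72.0 / dpi          # pixels -> PDF units (72 per inch)
    ops = ["BT", "3 Tr"]        # begin text; rendering mode 3 = invisible
    for w in words:
        x = w.left * scale
        # Flip the vertical axis: images count down from the top,
        # PDF pages count up from the bottom.
        y = (image_height_px - (w.top + w.height)) * scale
        size = w.height * scale  # crude font size taken from the box height
        ops.append(f"{font} {size:.2f} Tf")
        ops.append(f"1 0 0 1 {x:.2f} {y:.2f} Tm")
        ops.append(f"({escape_pdf_string(w.text)}) Tj")
    ops.append("ET")            # end text
    return "\n".join(ops)


if __name__ == "__main__":
    # One word, one inch in from the top-left of a US Letter page at 300 dpi.
    page_words = [Word("THE", left=300, top=300, width=150, height=50)]
    print(invisible_text_ops(page_words, image_height_px=3300))
```

The sketch only positions the start of each word and ignores its width, so a real tool would also stretch or squeeze the text to match the box. The scanned image itself would be drawn on the same page, with the invisible text on top of it, which is why selecting "the picture" selects real text.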

Have a go yourself: here are the non-searchable and searchable versions of that document.