Effects of Image Quality on OCR success

What determines OCR success rates?

There are so many questions that potential customers and business users have about OCR. The initial misconceptions with OCR start with the recognition rates, “does your software provide 98%+ recognition of invoice data?”. Without realising, they have asked an incredibly loaded question that is on a par with, “can you get 90%+ straight throughput?” We’ve never seen the invoices the organisation receives, we have no knowledge of whether they have a diligent procurement process and whether suppliers put the invoice data in accordance with the Purchase Order. What we’ll explain in this blog is what impacts OCR and what you can expect to gain from implementing such a solution.

The key areas are; What is a PDF?; Image quality; Image resolution; Supplier data; Invoice layout/labels.

What is a PDF?

There are many software solutions that can produce a PDF, which means there can be many variations in what a PDF is. However, there are two main formats, a PDF Image and a PDF/A. Visually, the difference between the two may not be obvious, the key way to check is can you highlight the text. A PDF Image to coin a phrase is exactly what is says it is, an image. An image file will rely on a true OCR, the system will need to scan and read every single character that is on the invoice. A PDF/A does not require OCR, because one of the requirements for a PDF/A is that all fonts must be embedded and also must be legally embeddable for unlimited, universal rendering. In simple terms, all the text is available, meaning that 100% of the characters are known. Therefore, if your business receives 100% PDF/A documents then the character recognition is 100%, the variable is when image files are received and the ability to identify the characters on these is dependent on the image quality.

Image Quality

Focussing on PDF image files, there is a correct approach and a wrong option that is too often seen. Starting with the incorrect approach, don’t ask/allow the supplier to send you a scanned invoice image. This can be unintentional but your communication must be clear. For example, “please send your invoices to us in PDF format” is not sufficient. Some suppliers will print a document out and scan it in to produce a PDF. When they do, we have seen supplier scanned images arrive and they look like this:

If the image is speckly, streaked, too dark or light then the pattern of dots seen by the OCR engine will not be correctly interpreted as characters. The value in grey is the OCR interpretation of the number above and has one correct character out of six, which is a 16.66% success rate. This is caused by the grainy image. If you are receiving documents like this, it is common to blame the OCR application for not reading the data, however, the image quality is not sufficient to enable it to read the characters. Therefore, if half of your invoices came in like this, the success rate for invoice number would be 50%, therefore the 98%+ recognition question would result in a false promise if we said “yes” because it is dependent on the records you receive.

Which PDF format do you want?

Request the right form of PDF document from your suppliers:

PDF directly from their Finance Application (ERP)
If the supplier is producing invoices with Excel or Word then please ensure to export and save the file as a PDF before sending
If they cannot do the above, ask for a paper copy

The final point above sounds like taking a step backwards but if a supplier cannot send you a PDF/A, then ask for a paper copy of the invoice. The first two methods will likely result in a PDF/A being received, the paper copy is then under your complete control and can still be converted to a PDF/A by a modern scanner.

Modern scanners and capture software have features to help create clean scans, even with difficult documents. Getting a clean image affects extraction rates more than anything else so it is worth investing in the right scanners and ensuring that suppliers send you clean images.

Above all DO NOT WRITE ON OR MARK THE INVOICE BEFORE IT IS SCANNED. Ticks and comments, date stamps and highlighter will ruin the OCR results, it doesn’t help the results by circling the invoice number because how will it know? They aren’t pre-programmed to look for an invoice number with a circle around it.

Image Resolution – the technical bit

This is how many dots per inch the page is been scanned and captured at. For good OCR, even with clean images, the resolution of the scan must be 300 Dots Per Inch (DPI) as opposed to 200 DPI. Higher resolutions, greater than 300dpi, are not necessarily better because few OCR engines can make use of them. The DPI is usually set by altering the scan configuration in the document capture software.

Invoice Layout and Labels

cloud-ap cropped

Invoice recognition software is, of necessity, designed to deal with varied layouts but it makes certain assumptions, which help to improve accuracy. It generally expects certain values to help index the invoice data, for example, Invoice number is located by using keywords to help identify its location. Variations such as ‘Invoice Num’ and ‘Inv Num’ are values the system will expect to find and will use the value that is correspondingly located to identify the invoice number.

Logic to index invoices is built into systems, what they cannot do, is index or link certain data without an appropriate indicator. A common example of this is invoice date because you might see an invoice date with the number in the top third of the page. A human can look at that invoice and state with a high degree of certainty that the value would be the invoice date, however, a machine cannot provide the same level of consideration if there is no tag or value to link it.

Other considerations are the specific words that can enable the automatic indexing of the invoice data, words such as ‘Our Reference’ could mean two things. One is the number for the document in question, the other could be an internal order reference that exists with the supplier. A machine cannot apply context in the same way as a human can, what can be achieved though is where invoice data is not read correctly, training of an invoice can be undertaken to derive the correct values.

Is OCR blamed for issues that are data-related? Supplier Data

In short, yes, we have seen a number of times by customers and businesses that seek help with OCR challenges not realising that the ‘OCR’ is not the underlying problem. A supplier hasn’t been identified from the invoice or the PO number hasn’t been read when its clearly obvious. When engaging with customers, we are always clear that supplier recognition is solely dependent on master data being correct. We’ll assume the document is a PDF/A, therefore 100% of the characters have been identified and it’s a matter of correlating the invoice data to the supplier master data to enable automatic indexing and then matching to a PO.

If an invoice address for the supplier is Nottingham and the data record in the ERP is in Birmingham (real example) then it will not identify the supplier site based solely on name, this is not sufficient for correct identification. Subsequently, if the supplier site isn’t confirmed, then the PO Number cannot be matched correctly and would also not be indexed against the record. The two values are linked, from a process perspective the supplier should not be identified from the PO number because the invoice is the legal record, not the PO.

In this type of scenario, a system can have a 100% character identification but only a 80% success rate in terms of indexing the document header. Invoice number, date, the billed to legal entity, net amount, tax amount and the total will have all been identified but the Supplier and PO Number have not been. This record cannot go straight through, which means for this example we have the following breakdown:

100% Characters identified
80% Header data correctly indexed
0% Straight throughput
100% blame placed by the user on OCR, when the sole cause is master data!

Once the master data has been updated and cleansed we always see fantastic results with the identification rate that takes place within an OCR/indexing process. If the supplier address had matched the breakdown in this scenario would be:

100% Characters identified
100% Header data correctly indexed
100% Straight through

As you can see, from the two initial questions on character recognition and straight throughput, whilst the right system can achieve those things, the system cannot be relied upon to achieve those results. Factors such as image quality and master data will factor into the ability to get the right results.

Before automated invoice processing has been implemented, contact all your suppliers explaining the new system and how they can ensure the best results and how their invoices will be paid faster if they do so. Review the invoices from your top twenty suppliers to see if there are any value identification issues. If so, it is well worth engaging with them directly to see if they can fix any of these issues. The main incentive to suppliers is rapid payment but it is possible to provide other incentives to encourage best practice. Our experience shows that extraction rates can be improved by suppliers making small changes.