Machine Learning for Receipts

Glenn · 9 January 2020 10:22

Automation is a key part of what we do: finding new ways to reduce the time spent on repetitive tasks allows you to spend more time running your business. Recently we have been looking at ways to streamline the receipt tagging process, and today we will start beta testing a new receipt analysis tool.

The Receipt Analyser uses Optical Character Recognition (OCR) coupled with Machine Learning (ML) algorithms to detect patterns in receipts and extract key data such as supplier name, date and total amount.

How does it work?

Once enabled, the Receipt Analyser adds a small wand icon to the Receipt Hub preview screen.

When you tap this icon the receipt will be submitted for review, and in a few seconds any matches will be returned. Currently the Receipt Analyser is optimised to work on thermal till receipts and looks for the following data points:

Supplier name
Date of purchase
Total amount

You can show the raw matches by clicking on the link in the top right of the analyser output box. A supplier name match will occur when the supplier name returned exactly matches the name of a supplier already saved on your account. If the text does not precisely match, the analyser will switch to “learning mode” and will record the supplier that you then manually match the receipt to. That way, when similar text is extracted in future, the analyser will know which supplier to match automatically.
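The matching behaviour described above can be sketched roughly as follows. This is an illustration only, not the Receipt Analyser's actual code: the `SupplierMatcher` class, its method names, the supplier ID 101 and the sample OCR text are all hypothetical.

```python
from typing import Dict, Optional


class SupplierMatcher:
    """Hypothetical sketch of exact matching with a 'learning mode' fallback."""

    def __init__(self, suppliers: Dict[str, int]) -> None:
        # Supplier names saved on the account, mapped to supplier IDs.
        self.suppliers = suppliers
        # Extracted text that the user has previously matched by hand.
        self.learned: Dict[str, int] = {}

    def match(self, extracted: str) -> Optional[int]:
        """Return a supplier ID, or None if the user must match manually."""
        if extracted in self.suppliers:
            return self.suppliers[extracted]
        return self.learned.get(extracted)

    def learn(self, extracted: str, supplier_id: int) -> None:
        """Record the supplier the user manually chose for this text."""
        self.learned[extracted] = supplier_id


matcher = SupplierMatcher({"Poundstretcher": 101})
first = matcher.match("POUNDSTRETCHER LTD")   # no exact match yet
matcher.learn("POUNDSTRETCHER LTD", 101)      # user matches by hand
second = matcher.match("POUNDSTRETCHER LTD")  # now matched automatically
```

In this sketch the first lookup returns nothing, and the same text resolves to the supplier once the manual match has been recorded.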

Receipt Analyser limitations

The Receipt Analyser is based on OCR and Machine Learning, and therefore accurate extraction depends on a number of variables.

1. Document type

The Machine Learning module is trained to work with thermal till receipts. Invoices and multi-page documents will see varying degrees of success. The analyser can extract text from images (png, jpg) and PDF files.

2. Image quality

High resolution, high contrast receipts with fewer distortions will make it easier for the OCR process to accurately extract the text from the receipt image. Reduction of background noise and cropping as close as possible to the actual receipt will deliver optimum results.

3. Supplier name matching

Supplier name matching looks for exact matches. If an exact match is not identified, the analyser will record the supplier to which the user manually matches the receipt and will apply that same rule automatically in future.

Beta programme requirements

Initially we are looking for a relatively small number of participants to test the Receipt Analyser.

If you are routinely processing receipts in QuickFile you can request access to the programme in Help >> Additional Services >> Beta Features and then enter the code "ML0001". This particular feature will require a power user subscription to activate.

alan_mcbrien · 14 January 2020 12:29

Hi @Glenn, I applied to join the Beta programme for this new feature of “Machine Learning for Receipts” which you actioned.

However I have come to realise that I will be unable to contribute much feedback on the subject. The main reason is that receipts are hardly ever raised using the “Receipt Hub” method. The usual method to raise receipts is via “Bank Tagging Rules” after uploading transactions, which is preferred as there is less chance of input error, so the “Receipt Hub” is only used for attaching a receipt document to an already issued receipt.

The only occasions on which the “Receipt Hub” is used for processing receipts are payments made by cash, and the thermal till receipts that the Receipt Analyser is optimised for are not commonly used for cash sales.

I have had a go at testing the facility on some PDQ thermal till receipts that did not require receipts raising, just to see how the module works, but these were in euros as I am based in France at present.

PDQ thermal receipts issued in France use a decimal comma rather than a decimal point, which is generally the decimal separator used throughout Europe. I have found the analyser hardly ever identified any data, not even the date. The date not being identified may be due to the format here being slightly different, “LE 14/01/20 A 13:16”, with the amount appearing as “13,50 EUR”.
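To illustrate the two formats quoted above, here is a minimal sketch of how such text could be normalised once it has been extracted. It is not part of the Receipt Analyser: the function names and regular expressions are assumptions built only from these two samples, and amounts with thousands separators are not handled.

```python
import re
from datetime import datetime
from decimal import Decimal
from typing import Optional

# Patterns are assumptions based on the two samples quoted above.
DATE_RE = re.compile(r"\b(\d{2}/\d{2}/\d{2})\s+A\s+(\d{2}:\d{2})\b")
AMOUNT_RE = re.compile(r"\b(\d+),(\d{2})\s*EUR\b")


def parse_french_date(text: str) -> Optional[datetime]:
    """Parse a 'LE dd/mm/yy A hh:mm' timestamp, or return None."""
    found = DATE_RE.search(text)
    if not found:
        return None
    stamp = f"{found.group(1)} {found.group(2)}"
    return datetime.strptime(stamp, "%d/%m/%y %H:%M")


def parse_comma_amount(text: str) -> Optional[Decimal]:
    """Parse an amount such as '13,50 EUR' that uses a decimal comma."""
    found = AMOUNT_RE.search(text)
    if not found:
        return None
    # Swap the decimal comma for a point before converting.
    return Decimal(f"{found.group(1)}.{found.group(2)}")


print(parse_french_date("LE 14/01/20 A 13:16"))  # 2020-01-14 13:16:00
print(parse_comma_amount("13,50 EUR"))           # 13.50
```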

Not very useful for my purposes, but hopefully this tool will find a use for someone.

Glenn · 14 January 2020 13:03

Thanks for your feedback, Alan.

Yes, I guess in that case it will be quite limited. The ML training was optimised for decimal point based separation between pounds and pence, so I think any other formatting will present a problem.

Regarding using the Receipt Hub just for matching to existing records, we plan eventually to start pre-processing receipts which may allow us to instantly suggest matching items based on the amount.

It’s very much in the early stages for now, but in time it should improve.

Scruffies · 14 January 2020 14:07

Hi @Glenn, happy to help test - I’ve a couple of months’ worth of uploads to go through and ‘match’ (circa 40 receipts) this week, so this is great timing!

Appreciate that it’s still early days/subject to change etc. but could you possibly expand the blog post a little on the detail of how you are technically implementing the OCR function?

Specifically I’m thinking of how the data processing and storing functions are currently working/planned to work in the future.

I’m assuming for example you’ve not hand coded your own OCR tool but (more sensibly!) are leveraging a 3rd party function and/or an API …Google’s [Tesseract] perhaps?

Is the initial processing being done locally on/within QuickFile’s infrastructure? (E.g. if an existing match is found, is it processed internally?)
What (if anything) is passed out to the API/3rd party? (E.g. assuming there isn’t an existing supplier entry for ‘someexampleservice’, does this then poll an external source for similar matches?)
Learning mode ‘auto matches’ - do you see this OCR vs Supplier matching list being per QuickFile account or a single QuickFile community list? (E.g. if I assign ‘The Bessemer Hotel’ to ‘a.n.othersupplier’, what effect would that have on my future matches and on other users who also use that supplier?)

To be clear I’m genuinely interested in the possibilities of additional automation - especially if the handful of widgets that get purchased on a regular basis could be auto matched with confidence to the correct supplier/invoice etc…saves me cringing at the thought of all those I’ve got to do this week.

I’m sure you’ve already got it in hand, but more clarity around visibility of sensitive purchases and GDPR’s offsite data processing elements is probably wise too before a wider release, just so users who may have strict(er) data controls can review before activation.

Personally I’ve no issue with Google/Amazon etc. having a clearer picture of the shampoos and clipper blades that we buy throughout the year…especially as they deliver most of them! But that may not be true for all QuickFile users.

Right off to hit the ‘hub’!

John.

Glenn · 14 January 2020 22:56

Right now we’re using Azure Form Recognizer which accepts the file as an input and returns a bunch of structured data. We then decide which parts are useful in terms of receipt processing. If we get a good match on the supplier name and the amount, the remaining fields can often be bound from supplier defaults.
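As a rough illustration of that last step, binding the remaining fields from supplier defaults might look something like the sketch below. Every name in it (`SupplierDefaults`, `ReceiptDraft`, `apply_defaults`, the nominal code and VAT rate values) is hypothetical and is not taken from QuickFile or from Form Recognizer.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional


@dataclass
class SupplierDefaults:
    # Hypothetical per-supplier defaults saved on the account.
    supplier_id: int
    nominal_code: str
    vat_rate: Decimal


@dataclass
class ReceiptDraft:
    # Fields extracted from the receipt; the optional ones may be missing.
    supplier_id: int
    total: Decimal
    nominal_code: Optional[str] = None
    vat_rate: Optional[Decimal] = None


def apply_defaults(draft: ReceiptDraft, defaults: SupplierDefaults) -> ReceiptDraft:
    """Fill any fields the extraction left empty from the supplier's defaults."""
    if draft.nominal_code is None:
        draft.nominal_code = defaults.nominal_code
    if draft.vat_rate is None:
        draft.vat_rate = defaults.vat_rate
    return draft


defaults = SupplierDefaults(supplier_id=101, nominal_code="7502", vat_rate=Decimal("20"))
draft = apply_defaults(ReceiptDraft(supplier_id=101, total=Decimal("13.50")), defaults)
```

Only the empty fields are filled, so anything extracted from the receipt itself takes priority over the defaults.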

We’ve coupled the above with some extra pattern recognition, so that if an exact supplier match is not possible, we instead record the blob of text suggested by Form Recognizer as the supplier name and link it to the supplier ID manually selected in the Receipt Hub. In essence it doesn’t need to be exact, just consistent, and then the matching will do its thing. In theory the matching efficiency should improve over time.

It’s not a silver bullet, and the huge divergence in receipt / invoice layouts limits the ability of current ML technology to work flawlessly. It should however prove to be useful, particularly if you tend to process high volumes of receipts from a small pool of suppliers.

gjwguk · 20 January 2020 11:24

Had a minor little problem with a Poundstretcher receipt - the receipt was able to be read but Poundstretcher has a sequence of lines:

Total to Pay
Cash tendered
Change
The machine learning picked up the change instead of the Total to Pay. Something to watch out for!
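One possible way to guard against this, shown purely as a sketch and not as how the Receipt Analyser works, is to pick the amount by its label rather than by its position. The label list, the `find_total` helper and the amounts below are all made up for illustration.

```python
import re
from decimal import Decimal
from typing import List, Optional

# Labels checked in priority order; this list is an assumption.
TOTAL_LABELS = ("total to pay", "total")
# An amount with two decimal places at the end of a line.
AMOUNT_RE = re.compile(r"(\d+\.\d{2})\s*$")


def find_total(lines: List[str]) -> Optional[Decimal]:
    """Return the amount on the first line starting with the best-ranked label."""
    for label in TOTAL_LABELS:
        for line in lines:
            if line.strip().lower().startswith(label):
                found = AMOUNT_RE.search(line)
                if found:
                    return Decimal(found.group(1))
    return None


# Sample lines in the order described above; the amounts are invented.
receipt = [
    "Total to Pay   7.00",
    "Cash tendered 10.00",
    "Change         3.00",
]
print(find_total(receipt))  # 7.00
```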

Poundstretcher Example.pdf (129.2 KB)

rhc · 27 January 2020 17:58

Hi Glenn,
I just want to give a little feedback about this feature. First of all, I really like it. The software still has problems catching everything correctly, but I am pretty sure you are still in the testing process. Just one little thing, because it sounds a bit funny. Today I renewed my power subscription and uploaded the receipt to QuickFile (I saved the receipt on my account as a PDF on my computer and uploaded it to QuickFile via email). I thought I had to test your software with your own invoice. Your scanner caught the invoice pretty well; the only thing that was wrong was the amount. For the amount I paid, your software took your VAT reg. number. I thought it might be interesting for you to know about it. It was not a big thing: I changed the amount and it was fine.
Oh, I forgot. I took a screenshot if you are interested, but I did not want to upload it here because it contains too much private data. I can private message you if you want the screenshot.

Glenn · 27 January 2020 22:29

Invoice type layouts are more of a challenge as the ML was trained primarily for receipt formats. That said I’ve seen reasonable accuracy when it comes to testing invoices.

It’s interesting that it picked up the VAT number. We’ll do a few tests to determine why that is.

drew4 · 16 February 2021 12:57

Thank you for this information Mathew. I’m not sure that I’m using this correctly. The post you sent states that you hit the wand and, after a period of time, it should start to recognise some of the suppliers. As yet no joy, so I’m thinking I might be doing something wrong.
My understanding is that you take the picture of the receipt and hit tag later. Then on the PC you hit the wand and it should find some details.

I’m wondering if this is wrong, as I am now thinking that I take the picture as above and wait until the bank feeds come in on the PC. Then when they are in, I hit the magic wand to see if the system finds it via the bank feed info and tags it to that.
Is this the correct way? I seem to be doubling the workload or, even worse, just deleting the bank feed info as it’s a duplicate.
Sorry to sound a bit daft, but as a sole trader this system is all new to me, although it’s great and I have upgraded to everything. I used Wave before and that was idiot proof.
Any light on the correct way to input this via the automation would be helpful.
Keep up the good work.
Drew

rhc · 16 February 2021 13:14

The “magic wand” is still in testing, I believe. But I can say that, after a while, the system recognises parts of the invoices. So far it has never been 100%, but it helps a lot, especially in saving time.

The magic wand will not find the related bank account entry. So you can and should create invoices from your receipts, but you shouldn’t tag them as paid. That will create duplicate entries in your bank account if you have auto feed enabled.
Instead, wait until your bank account in QuickFile shows this particular transaction (mostly the next day) and tag it from there to the already created invoice.