Accounting is an integral part of business finance management. Keeping track of bills manually is exhausting, and errors are likely to creep in when handling them in large numbers. Our client is an accounting company that wanted to automate parts of its bill reconciliation process, so that the accounting could be done faster and with fewer errors. The client's customers submit their bills for accounting. The client wanted to build an OCR system to convert expense receipt stubs, stored as scanned documents and images, into text. To achieve this, we were required to extract specific fields from each receipt stub, such as the date, total price, and tax.
This project involves automatic OCR conversion of receipt stubs into CSV data. We gathered the client's dataset of receipt scans and performed preliminary data cleanup and grouping. Since all the documents used standard fonts and languages, developing the OCR program was straightforward: no custom OCR model was required, so we used Tesseract's pre-trained LSTM model. The next goal was to add intelligence to the system by automatically identifying specific text fields in the OCR output. We used a natural language processing framework to model the context of the text produced by the OCR engine. We then packaged the solution as a library and integrated it into the client's existing desktop application. The user loads a collection of scanned receipts, and the OCR engine produces a set of CSV files corresponding to the input files.
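The pipeline described above (OCR, then field identification, then CSV output) can be illustrated with a minimal Python sketch. It is not the code delivered to the client. The project used an unnamed natural language processing framework for field identification, and simple regular expressions stand in for it here. The function names (`ocr_image`, `extract_fields`, `fields_to_csv`), the patterns, and the sample receipt are all hypothetical. The OCR step assumes the third-party `pytesseract` and Pillow packages and a local Tesseract install, with `--oem 1` selecting Tesseract's LSTM engine.

```python
import csv
import io
import re

# Hypothetical patterns standing in for the NLP framework; real receipts
# would need patterns tuned to their layouts.
DATE_RE = re.compile(r"\b(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})\b")
TOTAL_RE = re.compile(r"^\s*total\b.*?(\d+\.\d{2})", re.IGNORECASE | re.MULTILINE)
TAX_RE = re.compile(r"^\s*tax\b.*?(\d+\.\d{2})", re.IGNORECASE | re.MULTILINE)

FIELDNAMES = ["date", "total", "tax"]


def ocr_image(path):
    """Run Tesseract's pre-trained LSTM engine on one scanned receipt.

    Assumes pytesseract, Pillow, and a Tesseract binary are installed;
    they are imported lazily so the rest of the sketch runs without them.
    """
    import pytesseract
    from PIL import Image

    return pytesseract.image_to_string(Image.open(path), lang="eng", config="--oem 1")


def extract_fields(text):
    """Pull date, total, and tax out of OCR text; missing fields become ''."""
    def first(pattern):
        match = pattern.search(text)
        return match.group(1) if match else ""

    return {"date": first(DATE_RE), "total": first(TOTAL_RE), "tax": first(TAX_RE)}


def fields_to_csv(fields):
    """Serialise one receipt's fields as CSV text (header row plus one data row)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerow(fields)
    return buffer.getvalue()


# Made-up OCR output used in place of a real scan.
SAMPLE_TEXT = """ACME STORE
Date: 12/03/2021
Subtotal 10.50
Tax 1.00
Total: $11.50
"""
```

Calling `fields_to_csv(extract_fields(SAMPLE_TEXT))` gives the CSV text for the sample receipt. With a real scan, the text would come from `ocr_image(path)` instead. Anchoring the total pattern to the start of a line keeps the "Subtotal" line from being mistaken for the total.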
This project was developed over 15 weeks. The OCR engine was efficient and reduced the time spent on manual text conversion to zero. OCR accuracy was above 95%.