Multi-lingual document size estimation using Machine Learning
Document sizing, word counting, and character counting made simpler with machine learning.
The client was a document translation company that translates customer documents into a language of the customer's choice. Customers typically submitted scanned documents with varying content structure. The client wanted an automated system for determining the word count and line count of these documents, since automatic document sizing would let them estimate the price of their translation service. Because the documents were multi-lingual, the parser had to report length metrics separately for each language. Our task was to parse each document and calculate its line and word counts so that a translation price could be quoted.
This project involved OCR conversion of documents to determine the number of words and lines in any given document. Since the documents were scanned copies of paper originals, an OCR step was required to convert each image to text. Because all the documents used standard fonts and languages, no custom OCR model development was needed, and we used Tesseract's pre-trained LSTM models. As the documents were multi-lingual, OCR was run iteratively for each language group to improve accuracy. We then packaged the solution as a library that was integrated into the client's existing desktop application: any new document passed to the library goes through the OCR engine, which produces a report on the word and line count of each language group in the document.
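The per-language sizing loop described above can be sketched as follows. This is a minimal illustration, not the client's implementation: the `pytesseract` wrapper, the `languages` list, and the image path are assumptions, and the counting rules (non-empty lines, whitespace-separated words) are one reasonable choice among several.

```python
def size_text(text: str) -> dict:
    """Count words and non-empty lines in one language group's OCR output."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    words = text.split()
    return {"lines": len(lines), "words": len(words)}


def size_document(image_path: str, languages: list[str]) -> dict:
    """Run Tesseract once per language group and size each result.

    Hypothetical setup: requires the pytesseract wrapper and a Tesseract
    install with traineddata files for each requested language.
    """
    import pytesseract          # pip install pytesseract
    from PIL import Image       # pip install Pillow

    image = Image.open(image_path)
    report = {}
    for lang in languages:
        # One OCR pass per language group, as described above.
        text = pytesseract.image_to_string(image, lang=lang)
        report[lang] = size_text(text)
    return report
```

A caller would invoke something like `size_document("scan.png", ["eng", "fra"])` and receive per-language word and line counts from which a price can be computed.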
The project was delivered in eight weeks, and the multi-stage ML pipeline achieved more than 90% accuracy in the document conversion process.