Training Tesseract OCR for Typewritten Hindi Documents

  • Jaspreet Kaur, Dr. Vishal Goyal, Dr. Manish Kumar
Keywords: Error Rate, Optical Character Recognition, Tesseract.

Abstract

Optical Character Recognition (OCR) is a prominent area of research for Indian scripts. Tesseract is a popular open-source OCR engine that can recognize more than a hundred languages, including Hindi. Tesseract performs well on machine-printed Hindi documents, but its accuracy drops on Hindi documents produced on a typewriter. No available OCR system is capable of recognizing text from typewritten Hindi documents. Many academic documents, such as theses, research papers, and books, exist in typewritten form, so the functionality of existing OCR needs to be extended to handle them. This paper describes how Tesseract OCR works and reports the improvement in accuracy obtained by training it on typewritten Hindi documents.
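The keywords list "Error Rate", but the abstract does not say how it is measured. As an illustration only, the sketch below computes a character error rate as the Levenshtein edit distance between a reference transcription and the OCR output, divided by the reference length. This is a common convention and an assumption here, not necessarily the authors' metric; the function names are hypothetical.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: insertions, deletions, substitutions each cost 1."""
    # prev[j] holds the distance between the reference prefix processed so far
    # and the first j characters of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = prev[j - 1] + (0 if ref_char == hyp_char else 1)
            deletion = prev[j] + 1       # drop a reference character
            insertion = curr[j - 1] + 1  # extra character in the hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the number of characters in the reference."""
    if not reference:
        raise ValueError("reference text must not be empty")
    # Assumption: characters are Unicode code points. For Devanagari, a
    # consonant and its vowel sign count as separate characters; counting
    # grapheme clusters instead would give a different figure.
    return edit_distance(reference, hypothesis) / len(reference)
```

With this definition, an OCR output that drops one vowel sign or anusvara from a five-code-point Hindi word counts as one error out of five characters. A lower value means the output is closer to the reference.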

Published
2021-09-01
How to Cite
Kaur, J., Goyal, V., & Kumar, M. (2021). Training Tesseract OCR for Typewritten Hindi Documents. Design Engineering, 10471-10486. Retrieved from http://www.thedesignengineering.com/index.php/DE/article/view/3926
Section
Articles