Indexing of Khmer Documents through OCR
This project has ended
At a glance:
Partner:
Open Institute Cambodia
When?
February 2017 to January 2018
Where?
Cambodia
Thematic Area?
Transparency & Accountability
ICTs:
Optical Character Recognition Software (OCR)
Challenges to address
In Cambodia, many important documents, such as older laws and research, are still available only in printed form. Scans of printed documents are easy to share, but without text recognition software it is difficult to create summaries or indexes of their contents. The texts must be indexed by hand, a labour-intensive process, which means that a lot of information remains unavailable.
Activities
Open Institute, the National Institute of Telecommunication, Post and ICT (NIPTIC) and the Institute of Technology of Cambodia (ITC) will collaborate to improve the open-source Tesseract Optical Character Recognition (OCR) software for Khmer, raising its accuracy from the current 50-60% to 95%. From the second year onwards, the research will include testing the software at target group organisations.
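The project description does not say how accuracy is measured. One common choice for OCR is character-level accuracy: compare the OCR output with a hand-typed reference text and count the edits needed to turn one into the other. The sketch below is only an illustration under that assumption; the function names are hypothetical and are not part of Tesseract or of the project's tooling.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance between two strings, counted in Unicode code points."""
    # previous[j] holds the distance between the reference prefix processed
    # so far and the first j characters of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


def character_accuracy(reference: str, hypothesis: str) -> float:
    """Share of reference characters recognised correctly (assumed metric)."""
    if not reference:
        return 1.0 if not hypothesis else 0.0
    errors = edit_distance(reference, hypothesis)
    # Clamp at zero: a very noisy output can contain more errors than
    # the reference has characters.
    return max(0.0, 1.0 - errors / len(reference))


if __name__ == "__main__":
    reference = "កម្ពុជា"   # hand-typed ground truth
    hypothesis = "កម្ពជា"   # hypothetical OCR output with one vowel sign missing
    print(f"{character_accuracy(reference, hypothesis):.0%}")
```

Counting in code points is a simplification. Khmer syllables are built from several code points (consonants, subscript markers, vowel signs), so a real evaluation might count errors per syllable cluster or per word instead.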
OCR software development is a complex process. The software must be trained to recognise letters and words, and also to learn which words can occur in which contexts, so that it can accurately predict what a scanned text says. The work also includes pre- and post-processing, cleaning up the scanned images, and many other steps.
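To illustrate how context can help choose between visually similar words, here is a toy sketch that rescores recogniser candidates with bigram counts. It is not how Tesseract itself models language. The function names, the add-one scoring rule and the English placeholder words are all assumptions made for the example.

```python
from collections import Counter


def train_bigrams(sentences):
    """Count how often each word follows another in already-segmented text."""
    counts = Counter()
    for words in sentences:
        # "<s>" marks the start of a sentence (an assumed convention).
        for left, right in zip(["<s>"] + words, words):
            counts[(left, right)] += 1
    return counts


def pick_in_context(previous_word, candidates, bigram_counts):
    """Choose the candidate whose recogniser score, weighted by context, is highest.

    candidates is a list of (word, recogniser_score) pairs. Adding 1 to the
    bigram count keeps unseen word pairs from scoring zero.
    """
    def score(candidate):
        word, recogniser_score = candidate
        return recogniser_score * (bigram_counts[(previous_word, word)] + 1)

    return max(candidates, key=score)[0]


# Toy training text; a real system would use a large Khmer corpus.
corpus = [["the", "cat", "sat"], ["the", "cat", "ran"]]
bigrams = train_bigrams(corpus)

# The recogniser slightly prefers "cot", but the context "the" favours "cat".
candidates = [("cot", 0.55), ("cat", 0.45)]
best = pick_in_context("the", candidates, bigrams)
```

Khmer is written without spaces between words, so a real pipeline would also need a word segmentation step before counts like these could be collected.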