Indexing of Khmer Documents through OCR

This project has ended

At a glance:

Partner: Open Institute Cambodia

When: February 2017 to January 2018

Where: Cambodia

Thematic area: Transparency & Accountability

ICTs: Optical Character Recognition (OCR) software


Challenges to address

In Cambodia, much important information, such as older laws and research, is still available only in printed form. Scans of printed documents are easy to share, but without text recognition software it is difficult to summarise or index what they contain. The texts must instead be indexed by hand, a labour-intensive process, which means that a lot of information remains unavailable.

Activities

Open Institute, the National Institute of Telecommunication, Post and ICT (NIPTIC) and the Institute of Technology of Cambodia (ITC) will collaborate to improve the open-source Tesseract Optical Character Recognition software for Khmer, raising its accuracy from the current 50-60% to 95%. From the second year onwards, the research will include testing the software at target group organisations.
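The project description does not say how accuracy is measured. One common convention, assumed here purely for illustration, is character accuracy: one minus the edit distance between the OCR output and a hand-checked reference text, divided by the length of the reference. The function names below are hypothetical, and the sketch counts Unicode code points, so a Khmer base consonant and its dependent signs count as separate characters.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_accuracy(reference: str, ocr_output: str) -> float:
    """Assumed metric: 1 - (edit distance / reference length), floored at 0."""
    if not reference:
        return 1.0 if not ocr_output else 0.0
    errors = edit_distance(reference, ocr_output)
    return max(0.0, 1.0 - errors / len(reference))
```

Under this assumed definition, an accuracy of 95% would mean roughly one character error in every twenty characters of reference text.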

OCR software development is a complex process. The software must be trained to recognise letters and words, and also to learn which words can occur in which contexts, so that it can accurately predict what a scanned text says. The work also includes pre- and post-processing, such as cleaning up the scanned images, and many other steps.
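As a rough illustration of what pre- and post-processing can mean, the sketch below shows two deliberately simple steps: turning a grayscale scan into black and white with a fixed threshold, and snapping a recognised word to the closest entry in a word list. Both functions, their names and their parameters are assumptions made for this example. They are not the project's pipeline or Tesseract's API, and the word-list lookup is a much simpler stand-in for the context modelling described above.

```python
from difflib import get_close_matches


def binarize(gray, threshold=128):
    """Pre-processing: map grayscale pixels (0-255) to 0 (ink) or 1 (background)."""
    return [[0 if px < threshold else 1 for px in row] for row in gray]


def correct_with_lexicon(word, lexicon, cutoff=0.6):
    """Post-processing: replace a word with its closest lexicon entry, if any.

    If no entry is similar enough (below the cutoff), the word is returned unchanged.
    """
    matches = get_close_matches(word, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else word


# Latin placeholder words are used here; a real system would use a Khmer lexicon.
lexicon = ["recognition", "context"]
cleaned = binarize([[0, 200], [130, 90]])
fixed = correct_with_lexicon("recogn1tion", lexicon)
```

A fixed threshold is the simplest possible choice; uneven lighting or faded print in real scans usually calls for an adaptive method.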

Open Institute Cambodia

Open Institute Cambodia was founded in Phnom Penh in 2006. The organisation is committed to contributing to the development of a democratic and just society by facilitating and promoting information, communication and knowledge sharing through all means and tools.

http://open.org.kh/?q=en