Ph.D. thesis defense
Paper Form Classification for Information
Systems Strengthening in Developing Countries
Huguens Jean
1:00pm Friday, 19 December 2015, ITE 325b
In developing countries, people are now more likely to have access to a mobile phone than to clean water, making cellular-based technology the only viable medium for collecting, aggregating, and communicating local data so that it can be turned into useful information. While mobile phones have found broad application in reporting health, financial, and environmental data, many data collection methods still suffer from delays, inefficiency, and difficulty maintaining quality. In environments with insufficient IT support and infrastructure, and among populations with limited education and experience with technology, paper forms rather than electronic methods remain the predominant means of data collection.
To meet the digitization needs of paper-driven data collection practices in developing countries, SHREDDR proposes an end-to-end architecture that transforms paper form images into structured digital information on demand. To facilitate the automatic extraction of input regions in form images, this thesis extends the SHREDDR architecture with the capabilities needed to efficiently classify form images according to their template document. Specifically, it introduces a novel framework for visually identifying form templates by decomposing the template identification problem into three distinct tasks: retrieval, learning, and matching (RLM).
Given a query form instance, the retrieval component finds and ranks the h most similar templates. If h>1, the matching component uses full image registration to conduct a more rigorous assessment of the visual similarity between the query form instance and the candidate templates. After matching, the retrieval component’s preliminary ranking is adjusted, if necessary. The candidate template with the highest registration score that satisfies a global alignment threshold is taken as the input form’s template. Based on the answer obtained from matching, the learning component updates the retrieval component so that it can provide a better ranking in future searches. If h=1, the RLM bypasses matching and uses the retrieved template as the final classification.
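The control flow described above can be sketched in a few lines of Python. This is only an illustrative outline, not the thesis implementation: the function name classify_form and the callables retrieve, register, and learn are hypothetical stand-ins for the retrieval, matching, and learning components, and the abstract does not say how learning is informed when no candidate passes the threshold, so passing None in that case is an assumption.

```python
from typing import Callable, Hashable, List, Optional


def classify_form(
    query: object,
    retrieve: Callable[[object, int], List[Hashable]],   # hypothetical retrieval component
    register: Callable[[object, Hashable], float],       # hypothetical registration scorer
    learn: Callable[[object, List[Hashable], Optional[Hashable]], None],  # hypothetical learner
    h: int,
    threshold: float,
) -> Optional[Hashable]:
    """Return the template chosen for `query`, or None if no candidate qualifies."""
    # Retrieval: ranked list of at most h candidate templates, best first.
    candidates = retrieve(query, h)
    if not candidates:
        return None

    # h = 1: bypass matching and accept the retrieved template directly.
    if h == 1:
        return candidates[0]

    # Matching: score every candidate with full image registration.
    scored = [(register(query, template), template) for template in candidates]

    # Adjust the preliminary ranking by registration score.
    # Python's sort is stable, so ties keep their retrieval order.
    reranked = sorted(scored, key=lambda pair: pair[0], reverse=True)

    # Accept the best candidate only if it meets the global alignment threshold.
    best_score, best_template = reranked[0]
    answer = best_template if best_score >= threshold else None

    # Learning: feed the matching result back to the retrieval component.
    # Assumption: a None answer is also reported (e.g. a possible unknown template).
    learn(query, candidates, answer)
    return answer
```

In this sketch a None result stands for "no known template aligned well enough", which is one plausible way to surface the unknown-template case; the thesis may handle it differently.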
Based on the proposed framework, the present thesis investigates form classification under the conditions of known and unknown template classes. A pilot study integrating the RLM into the SHREDDR system demonstrates its classification accuracy and its impact on digitization efficiency.
Committee: Drs. Timothy Oates (Chair), Fow-Sen Choa, Janet Rutledge, Jesus Caban, Nilanjan Banerjee