Post-OCR Document Correction with Large Ensembles of Character Sequence-to-Sequence Models
DOI: https://doi.org/10.1609/aaai.v36i10.21369
Keywords:
Speech & Natural Language Processing (SNLP)
Abstract
In this paper, we propose a novel method, supported by thorough experimentation, that extends sequence-to-sequence models to accurately process sequences much longer than those seen during training while remaining sample- and resource-efficient. To investigate the effectiveness of our method, we apply it to the task of correcting documents already processed with Optical Character Recognition (OCR) systems, using character-based sequence-to-sequence models. We test our method on nine languages of the ICDAR 2019 competition on post-OCR text correction and achieve new state-of-the-art performance in five of them. The best-performing strategy splits the input document into character n-grams and combines their individual corrections into the final output using a voting scheme that is equivalent to an ensemble of a large number of sequence models. We further investigate how to weight the contribution of each member of this ensemble. Our code for post-OCR correction is shared at https://github.com/jarobyte91/post_ocr_correction.
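The overlapping n-gram and voting strategy described in the abstract can be sketched as follows. This is a simplified illustration, not the paper's implementation: `correct_ngram` is a hypothetical stand-in for the trained character sequence-to-sequence model, and the recombination assumes length-preserving corrections so that character positions align across windows (the paper additionally studies weighting each ensemble member's vote).

```python
from collections import Counter

def correct_ngram(ngram: str) -> str:
    # Hypothetical stand-in for the character seq2seq model;
    # here it only fixes a toy OCR confusion ("0" -> "o").
    return ngram.replace("0", "o")

def correct_document(text: str, n: int = 5) -> str:
    """Split text into overlapping character n-grams, correct each window,
    and recombine them by majority vote at every character position.

    Assumes len(text) >= n and that corrections preserve window length.
    """
    votes = [Counter() for _ in range(len(text))]
    for start in range(len(text) - n + 1):
        corrected = correct_ngram(text[start:start + n])
        for offset, ch in enumerate(corrected):
            votes[start + offset][ch] += 1
    # Each position takes the character most windows agreed on.
    return "".join(v.most_common(1)[0][0] for v in votes)
```

Because every character is covered by up to n overlapping windows, each position is effectively corrected by several model passes, which is what makes the scheme behave like a large ensemble.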
Published
2022-06-28
How to Cite
Ramirez-Orta, J. A., Xamena, E., Maguitman, A., Milios, E., & Soto, A. J. (2022). Post-OCR Document Correction with Large Ensembles of Character Sequence-to-Sequence Models. Proceedings of the AAAI Conference on Artificial Intelligence, 36(10), 11192–11199. https://doi.org/10.1609/aaai.v36i10.21369
Section
AAAI Technical Track on Speech and Natural Language Processing