Post-OCR Document Correction with Large Ensembles of Character Sequence-to-Sequence Models

Authors

  • Juan Antonio Ramirez-Orta, Department of Computer Science, Dalhousie University
  • Eduardo Xamena, Institute of Research in Social Sciences and Humanities (ICSOH), Universidad Nacional de Salta - CONICET
  • Ana Maguitman, Department of Computer Science and Engineering, Universidad Nacional del Sur; Institute for Computer Science and Engineering, UNS–CONICET
  • Evangelos Milios, Department of Computer Science, Dalhousie University
  • Axel J. Soto, Department of Computer Science and Engineering, Universidad Nacional del Sur; Institute for Computer Science and Engineering, UNS–CONICET

DOI:

https://doi.org/10.1609/aaai.v36i10.21369

Keywords:

Speech & Natural Language Processing (SNLP)

Abstract

In this paper, we propose a novel method to extend sequence-to-sequence models to accurately process sequences much longer than the ones used during training while being sample- and resource-efficient, supported by thorough experimentation. To investigate the effectiveness of our method, we apply it to the task of correcting documents already processed with Optical Character Recognition (OCR) systems using sequence-to-sequence models based on characters. We test our method on nine languages of the ICDAR 2019 competition on post-OCR text correction and achieve a new state-of-the-art performance in five of them. The strategy with the best performance involves splitting the input document into character n-grams and combining their individual corrections into the final output using a voting scheme that is equivalent to an ensemble of a large number of sequence models. We further investigate how to weigh the contributions from each of the members of this ensemble. Our code for post-OCR correction is shared at https://github.com/jarobyte91/post_ocr_correction.
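
The n-gram voting strategy described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (the linked repository holds that). The names `split_windows`, `vote_correct`, and `toy_corrector`, along with the optional `weight` parameter, are hypothetical. The sketch also assumes that the corrector returns a string of the same length as its input window, so that votes can be counted per character position; a real sequence-to-sequence corrector may insert or delete characters, which this sketch does not handle.

```python
from collections import Counter
from typing import Callable, List, Optional, Tuple


def split_windows(text: str, n: int) -> List[Tuple[int, str]]:
    """Return (start, window) pairs for every character n-gram, with stride 1."""
    if n < 1:
        raise ValueError("n must be at least 1")
    if len(text) <= n:
        # Short inputs yield a single window covering the whole text.
        return [(0, text)]
    return [(i, text[i:i + n]) for i in range(len(text) - n + 1)]


def vote_correct(
    text: str,
    n: int,
    correct_window: Callable[[str], str],
    weight: Optional[Callable[[int, int], float]] = None,
) -> str:
    """Correct each window independently, then vote per character position.

    `correct_window` stands in for a trained character-level model.
    `weight(offset, window_length)` is a hypothetical hook for weighting a
    vote by where the character sits inside its window; the default is a
    uniform weight of 1.0.
    """
    votes = [Counter() for _ in text]
    for start, window in split_windows(text, n):
        corrected = correct_window(window)
        if len(corrected) != len(window):
            # Assumption of this sketch only, not of the paper.
            raise ValueError("this sketch assumes length-preserving corrections")
        for offset, char in enumerate(corrected):
            w = 1.0 if weight is None else weight(offset, len(window))
            votes[start + offset][char] += w
    # Each position keeps the character with the largest total vote.
    return "".join(counter.most_common(1)[0][0] for counter in votes)


def toy_corrector(window: str) -> str:
    """Stand-in for a model: fixes one common OCR confusion (0 -> o)."""
    return window.replace("0", "o")
```

Calling `vote_correct("hell0 w0rld", 5, toy_corrector)` returns `"hello world"`. Because every position away from the document edges is covered by up to `n` overlapping windows, each output character is decided by several independent corrections, which is the sense in which the scheme behaves like an ensemble.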

Published

2022-06-28

How to Cite

Ramirez-Orta, J. A., Xamena, E., Maguitman, A., Milios, E., & Soto, A. J. (2022). Post-OCR Document Correction with Large Ensembles of Character Sequence-to-Sequence Models. Proceedings of the AAAI Conference on Artificial Intelligence, 36(10), 11192-11199. https://doi.org/10.1609/aaai.v36i10.21369

Section

AAAI Technical Track on Speech and Natural Language Processing