Table Header Detection and Classification

Authors

  • Jing Fang, Peking University
  • Prasenjit Mitra, The Pennsylvania State University
  • Zhi Tang, Peking University
  • C. Lee Giles, The Pennsylvania State University

DOI:

https://doi.org/10.1609/aaai.v26i1.8206

Keywords:

table header detection, table classification, table extraction

Abstract

In digital libraries, a table, as a specific document component and a condensed way to present structured and relational data, contains rich information and is often the only source of that information. To explore, retrieve, and reuse that data, tables should be identified and their data extracted. Table recognition is a long-established field of research. However, because table styles are so diverse, the results are still far from satisfactory, and no single algorithm performs well on all types of tables. In this paper, we take random samples from CiteSeerX to investigate diverse table styles for automatic table extraction. We find that table headers are one of the main characteristics of complex table styles. We identify a set of features that can be used to segregate headers from tabular data and build a classifier to detect table headers. Our empirical evaluation on PDF documents shows that a Random Forest classifier achieves an accuracy of 92%.
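
The abstract does not list the features used, so the following is only a minimal sketch of what row-level features for separating header rows from data rows might look like. The feature names (numeric_fraction, alpha_fraction, empty_fraction, relative_position), the helper functions, and the toy table are all assumptions made for illustration and are not taken from the paper.

```python
import re
from typing import Dict, List

# Matches plain numbers such as "92", "-0.5", "1e3" or "92%".
_NUMERIC = re.compile(r"^[+-]?(\d+(\.\d*)?|\.\d+)(e[+-]?\d+)?%?$", re.IGNORECASE)


def is_numeric(cell: str) -> bool:
    """Return True if the cell text looks like a number (thousands commas allowed)."""
    return bool(_NUMERIC.match(cell.strip().replace(",", "")))


def row_features(row: List[str], row_index: int, n_rows: int) -> Dict[str, float]:
    """Compute hypothetical features for one table row.

    These features are illustrative assumptions, not the paper's feature set.
    """
    cells = [c.strip() for c in row]
    non_empty = [c for c in cells if c]
    n = len(non_empty) or 1  # avoid division by zero on blank rows
    return {
        # Share of non-empty cells that parse as numbers.
        "numeric_fraction": sum(is_numeric(c) for c in non_empty) / n,
        # Share of non-empty cells containing at least one letter.
        "alpha_fraction": sum(any(ch.isalpha() for ch in c) for c in non_empty) / n,
        # Share of cells that are empty (spanning headers often leave gaps).
        "empty_fraction": (len(cells) - len(non_empty)) / (len(cells) or 1),
        # 0.0 for the first row of the table, 1.0 for the last.
        "relative_position": row_index / max(n_rows - 1, 1),
    }


# Invented toy table, used only to exercise the functions above.
table = [
    ["Year", "Papers", "Tables"],
    ["2010", "120", "45"],
    ["2011", "135", "52"],
]
features = [row_features(row, i, len(table)) for i, row in enumerate(table)]
```

In this toy table the first row is entirely alphabetic and the remaining rows are entirely numeric, which is the kind of contrast such features are meant to expose. Feature vectors of this shape could then be passed to a classifier such as a Random Forest, the model the abstract reports using.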

Published

2021-09-20

How to Cite

Fang, J., Mitra, P., Tang, Z., & Giles, C. L. (2021). Table Header Detection and Classification. Proceedings of the AAAI Conference on Artificial Intelligence, 26(1), 599-605. https://doi.org/10.1609/aaai.v26i1.8206