SheetPT: Spreadsheet Pre-training Based on Hierarchical Attention Network

Authors

  • Ran Jia, Microsoft Research Asia
  • Qiyu Li, Peking University
  • Zihan Xu, Peking University
  • Xiaoyuan Jin, Peking University
  • Lun Du, Microsoft Research Asia
  • Haoyu Dong, Microsoft Research Asia
  • Xiao Lv, Microsoft Research Asia
  • Shi Han, Microsoft Research Asia
  • Dongmei Zhang, Microsoft Research Asia

DOI:

https://doi.org/10.1609/aaai.v37i11.26522

Keywords:

SNLP: Applications, APP: Business/Marketing/Advertising/E-Commerce, ML: Unsupervised & Self-Supervised Learning

Abstract

Spreadsheets are an important and unique type of business document for data storage, analysis and presentation. What distinguishes spreadsheets from most other digital documents is the high flexibility they give users to organize data on the grid. Existing related techniques focus mainly on tabular data and fall short of understanding an entire sheet. On the one hand, spreadsheets have no explicit separation between tabular data and other information, leaving a gap for the deployment of such techniques. On the other hand, pervasive data dependence and semantic relations across a sheet require comprehensive modeling of all the information, not only the tables. In this paper, we propose SheetPT, the first pre-training technique for spreadsheets, enabling effective representation learning in this scenario. For computational effectiveness and efficiency, we propose the coherent chunk, an intermediate semantic unit of sheet structure, and accordingly devise a hierarchical attention-based architecture to capture contextual information across different structural granularities. Three pre-training objectives are also designed to ensure sufficient training on millions of spreadsheets. Two representative downstream tasks, formula prediction and sheet structure recognition, are used to evaluate its capability, and the strong results demonstrate its superiority over existing state-of-the-art methods.
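The hierarchical idea described in the abstract — aggregating cell-level representations into chunk embeddings, then chunk embeddings into a sheet representation — can be illustrated with a toy two-level attention pooling. This is a minimal sketch, not the paper's actual architecture: the dimensions, random weights, and the `attention_pool` helper are all illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(tokens, query):
    # tokens: (n, d); query: (d,) — a learned attention query
    # (random here, purely for illustration)
    scores = softmax(tokens @ query)   # (n,) attention weights
    return scores @ tokens             # (d,) weighted pooled embedding

rng = np.random.default_rng(0)
d = 8
# Hypothetical sheet: three "coherent chunks" with varying cell counts
chunks = [rng.normal(size=(n, d)) for n in (4, 6, 2)]
w_cell = rng.normal(size=d)   # cell-level attention query
w_chunk = rng.normal(size=d)  # chunk-level attention query

# Level 1: cells within each chunk -> one embedding per chunk
chunk_embs = np.stack([attention_pool(c, w_cell) for c in chunks])

# Level 2: chunk embeddings -> a single sheet-level embedding
sheet_emb = attention_pool(chunk_embs, w_chunk)
print(sheet_emb.shape)  # (8,)
```

The point of the two-level structure is that attention cost at each level scales with the number of cells per chunk and the number of chunks, rather than with all cells in the sheet at once.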

Published

2023-06-26

How to Cite

Jia, R., Li, Q., Xu, Z., Jin, X., Du, L., Dong, H., Lv, X., Han, S., & Zhang, D. (2023). SheetPT: Spreadsheet Pre-training Based on Hierarchical Attention Network. Proceedings of the AAAI Conference on Artificial Intelligence, 37(11), 12951-12958. https://doi.org/10.1609/aaai.v37i11.26522

Section

AAAI Technical Track on Speech & Natural Language Processing