SheetPT: Spreadsheet Pre-training Based on Hierarchical Attention Network

Authors

  • Ran Jia, Microsoft Research Asia
  • Qiyu Li, Peking University
  • Zihan Xu, Peking University
  • Xiaoyuan Jin, Peking University
  • Lun Du, Microsoft Research Asia
  • Haoyu Dong, Microsoft Research Asia
  • Xiao Lv, Microsoft Research Asia
  • Shi Han, Microsoft Research Asia
  • Dongmei Zhang, Microsoft Research Asia

DOI:

https://doi.org/10.1609/aaai.v37i11.26522

Keywords:

SNLP: Applications, APP: Business/Marketing/Advertising/E-Commerce, ML: Unsupervised & Self-Supervised Learning

Abstract

Spreadsheets are an important and unique type of business document for data storage, analysis and presentation. What distinguishes spreadsheets from most other digital documents is the high flexibility they give users to organize data on the grid. Existing related techniques focus mainly on tabular data and fall short of understanding an entire sheet. On the one hand, spreadsheets have no explicit separation between tabular data and other information, leaving a gap for the deployment of such techniques. On the other hand, pervasive data dependence and semantic relations across a sheet require comprehensive modeling of all the information, not only the tables. In this paper, we propose SheetPT, the first pre-training technique for spreadsheets, enabling effective representation learning in this scenario. For computational effectiveness and efficiency, we propose the coherent chunk, an intermediate semantic unit of sheet structure, and accordingly devise a hierarchical attention-based architecture to capture contextual information across different structural granularities. Three pre-training objectives are also designed to ensure sufficient training on millions of spreadsheets. Two representative downstream tasks, formula prediction and sheet structure recognition, are used to evaluate its capability, and the strong results demonstrate its superiority over existing state-of-the-art methods.
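The hierarchical idea described in the abstract — aggregating cell-level representations into chunk embeddings, then chunk embeddings into a sheet representation — can be illustrated with a toy two-level attention pooling. This is a minimal sketch, not the paper's actual architecture: the dimensions, random weights, and the `attention_pool` helper are all illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(tokens, query):
    # tokens: (n, d); query: (d,) — a learned attention query
    # (random here, purely for illustration)
    scores = softmax(tokens @ query)   # (n,) attention weights
    return scores @ tokens             # (d,) weighted pooled embedding

rng = np.random.default_rng(0)
d = 8
# Hypothetical sheet: three "coherent chunks" with varying cell counts
chunks = [rng.normal(size=(n, d)) for n in (4, 6, 2)]
w_cell = rng.normal(size=d)   # cell-level attention query
w_chunk = rng.normal(size=d)  # chunk-level attention query

# Level 1: cells within each chunk -> one embedding per chunk
chunk_embs = np.stack([attention_pool(c, w_cell) for c in chunks])

# Level 2: chunk embeddings -> a single sheet-level embedding
sheet_emb = attention_pool(chunk_embs, w_chunk)
print(sheet_emb.shape)  # (8,)
```

The point of the two-level structure is that attention cost at each level scales with the number of cells per chunk and the number of chunks, rather than with all cells in the sheet at once.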

Published

2023-06-26

How to Cite

Jia, R., Li, Q., Xu, Z., Jin, X., Du, L., Dong, H., Lv, X., Han, S., & Zhang, D. (2023). SheetPT: Spreadsheet Pre-training Based on Hierarchical Attention Network. Proceedings of the AAAI Conference on Artificial Intelligence, 37(11), 12951-12958. https://doi.org/10.1609/aaai.v37i11.26522

Section

AAAI Technical Track on Speech & Natural Language Processing