SheetPT: Spreadsheet Pre-training Based on Hierarchical Attention Network
Keywords: SNLP: Applications, APP: Business/Marketing/Advertising/E-Commerce, ML: Unsupervised & Self-Supervised Learning
Abstract
Spreadsheets are an important and unique type of business document for data storage, analysis and presentation. What distinguishes spreadsheets from most other types of digital documents is that they give users high flexibility in organizing data on the grid. Existing related techniques mainly focus on tabular data and fall short of understanding the entire sheet. On the one hand, spreadsheets have no explicit separation between tabular data and other information, which leaves a gap in deploying such techniques. On the other hand, pervasive data dependencies and semantic relations across the sheet require comprehensive modeling of all the information rather than only the tables. In this paper, we propose SheetPT, the first pre-training technique on spreadsheets to enable effective representation learning under this scenario. For computational effectiveness and efficiency, we propose the coherent chunk, an intermediate semantic unit of sheet structure, and accordingly devise a hierarchical attention-based architecture to capture contextual information across different structural granularities. Three pre-training objectives are also designed to ensure sufficient training on millions of spreadsheets. Two representative downstream tasks, formula prediction and sheet structure recognition, are used to evaluate its capability, and the prominent results reveal its superiority over existing state-of-the-art methods.
How to Cite
Jia, R., Li, Q., Xu, Z., Jin, X., Du, L., Dong, H., Lv, X., Han, S., & Zhang, D. (2023). SheetPT: Spreadsheet Pre-training Based on Hierarchical Attention Network. Proceedings of the AAAI Conference on Artificial Intelligence, 37(11), 12951-12958. https://doi.org/10.1609/aaai.v37i11.26522
AAAI Technical Track on Speech & Natural Language Processing