Multilingual Code Snippets Training for Program Translation

Authors

  • Ming Zhu, Virginia Tech
  • Karthik Suresh, Virginia Tech
  • Chandan K. Reddy, Virginia Tech

DOI:

https://doi.org/10.1609/aaai.v36i10.21434

Keywords:

Speech & Natural Language Processing (SNLP), Machine Learning (ML)

Abstract

Program translation aims to translate source code from one programming language to another. It is particularly useful in applications such as multi-platform adaptation and legacy code migration. Traditional rule-based program translation methods usually rely on meticulous manual rule-crafting, which is costly in both time and effort. Recently, neural-network-based methods have been developed to address this problem. However, the absence of high-quality parallel code data is one of the main bottlenecks impeding the development of program translation models. In this paper, we introduce CoST, a new multilingual Code Snippet Translation dataset that contains parallel data from seven commonly used programming languages. The dataset is parallel at the level of code snippets, which provides much more fine-grained alignment between languages than existing translation datasets. We also propose a new program translation model that leverages multilingual snippet denoising auto-encoding and Multilingual Snippet Translation (MuST) pre-training. Extensive experiments show that multilingual snippet training is effective in improving program translation performance, especially for low-resource languages. Moreover, our training method generalizes well and consistently improves the translation performance of a number of baseline models. The proposed model outperforms the baselines on both snippet-level and program-level translation, and achieves state-of-the-art performance on the CodeXGLUE translation task. The code, data, and appendix for this paper can be found at https://github.com/reddy-lab-code-research/MuST-CoST.
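
To make the two ideas in the abstract concrete, the following is a minimal illustrative sketch, not the authors' implementation. It shows one way snippet-level parallel data could be represented and how input tokens might be corrupted for a denoising auto-encoding objective. The names SnippetPair, align_snippets, and add_noise, the positional alignment rule, the "<mask>" token, and the 0.15 masking rate are all assumptions made for illustration; the paper and its repository define the actual data format and noise functions.

```python
import random
from dataclasses import dataclass


@dataclass
class SnippetPair:
    """One aligned pair of code snippets (hypothetical structure)."""
    src_lang: str
    tgt_lang: str
    src_code: str
    tgt_code: str


def align_snippets(src_snippets, tgt_snippets, src_lang, tgt_lang):
    """Pair snippets by position.

    Assumes both programs were already split into the same number of
    snippets, so that snippet i in one language corresponds to snippet i
    in the other.
    """
    if len(src_snippets) != len(tgt_snippets):
        raise ValueError("programs must contain the same number of snippets")
    return [
        SnippetPair(src_lang, tgt_lang, s, t)
        for s, t in zip(src_snippets, tgt_snippets)
    ]


def add_noise(tokens, mask_prob=0.15, mask_token="<mask>", rng=None):
    """Replace each token with mask_token with probability mask_prob.

    A denoising auto-encoder is trained to reconstruct the original
    tokens from this corrupted sequence. Token masking is only one
    possible noise function, chosen here for simplicity.
    """
    if rng is None:
        rng = random.Random()
    return [mask_token if rng.random() < mask_prob else tok for tok in tokens]


if __name__ == "__main__":
    java = ["int s = 0;", "for (int x : a) s += x;"]
    py = ["s = 0", "for x in a: s += x"]
    pairs = align_snippets(java, py, "Java", "Python")
    print(len(pairs), "aligned snippet pairs")
    print(add_noise(pairs[1].tgt_code.split(), rng=random.Random(0)))
```

In this sketch, each program-level pair yields several shorter training pairs, which is the sense in which snippet-level data gives finer-grained alignment than whole-program pairs.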

Published

2022-06-28

How to Cite

Zhu, M., Suresh, K., & Reddy, C. K. (2022). Multilingual Code Snippets Training for Program Translation. Proceedings of the AAAI Conference on Artificial Intelligence, 36(10), 11783-11790. https://doi.org/10.1609/aaai.v36i10.21434

Section

AAAI Technical Track on Speech and Natural Language Processing