Enhancing Portuguese Variety Identification with Cross-Domain Approaches

Authors

  • Hugo Sousa (University of Porto; INESC TEC)
  • Rúben Almeida (University of Porto; INESC TEC; Innovation Point - dst group)
  • Purificação Silvano (University of Porto; CLUP)
  • Inês Cantante (University of Porto; CLUP)
  • Ricardo Campos (INESC TEC; University of Beira Interior; Ci2 - Smart Cities Research Center)
  • Alipio Jorge (University of Porto; INESC TEC)

DOI:

https://doi.org/10.1609/aaai.v39i24.34705

Abstract

Recent advances in natural language processing have raised expectations for generative models to produce coherent text across diverse language varieties. In the particular case of the Portuguese language, the predominance of Brazilian Portuguese corpora online introduces linguistic biases in these models, limiting their applicability outside of Brazil. To address this gap and promote the creation of European Portuguese resources, we developed a cross-domain language variety identifier (LVI) to discriminate between European and Brazilian Portuguese. Motivated by the findings of our literature review, we compiled the PtBrVarId corpus, a cross-domain LVI dataset, and studied the effectiveness of transformer-based LVI classifiers in cross-domain scenarios. Although this research focuses on two Portuguese varieties, our contribution can be extended to other varieties and languages. We open-source the code, corpus, and models to foster further research on this task.

Published

2025-04-11

How to Cite

Sousa, H., Almeida, R., Silvano, P., Cantante, I., Campos, R., & Jorge, A. (2025). Enhancing Portuguese Variety Identification with Cross-Domain Approaches. Proceedings of the AAAI Conference on Artificial Intelligence, 39(24), 25192–25200. https://doi.org/10.1609/aaai.v39i24.34705

Section

AAAI Technical Track on Natural Language Processing III