Sentence-level Aggregation of Lexical Metrics Correlates Stronger with Human Judgements than Corpus-level Aggregation

Paulo Cavalin; Pedro H. Domingues; Claudio Pinhanez

doi:10.1609/aaai.v39i22.34522

Authors

Paulo Cavalin, International Business Machines
Pedro H. Domingues, Pontifícia Universidade Católica do Rio de Janeiro
Claudio Pinhanez, International Business Machines

DOI:

https://doi.org/10.1609/aaai.v39i22.34522

Abstract

In this paper we show that corpus-level aggregation considerably hinders the capability of lexical metrics to accurately evaluate machine translation (MT) systems. With empirical experiments we demonstrate that averaging individual segment-level scores can make metrics such as BLEU and chrF correlate much more strongly with human judgements and behave considerably more like neural metrics such as COMET and BLEURT. We show that this difference exists because corpus- and segment-level aggregation differ considerably owing to the classical average-of-ratios versus ratio-of-averages mathematical problem. Moreover, as we also show, this difference considerably affects the statistical robustness of corpus-level aggregation. Considering that neural metrics currently cover only a small set of sufficiently resourced languages, the results in this paper can help make the evaluation of MT systems for low-resource languages more trustworthy.
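
The average-of-ratios versus ratio-of-averages distinction mentioned in the abstract can be illustrated with a minimal sketch. It uses a simplified clipped unigram precision as a stand-in for a lexical metric; this is not BLEU or chrF, and it is not the authors' implementation. The function names `unigram_counts`, `corpus_level`, and `segment_level` and the toy sentence pairs are assumptions made for illustration only, and hypotheses are assumed to be non-empty.

```python
from collections import Counter


def unigram_counts(hypothesis, reference):
    """Return (matched, total) clipped unigram counts for one segment.

    Hypothetical helper: a simplified stand-in for a lexical metric.
    """
    hyp_tokens = hypothesis.split()
    ref_counts = Counter(reference.split())
    # Clip each hypothesis token count by its count in the reference.
    matched = sum(min(c, ref_counts[t]) for t, c in Counter(hyp_tokens).items())
    return matched, len(hyp_tokens)


def corpus_level(pairs):
    """Ratio of sums: pool the counts over all segments, then divide once."""
    counts = [unigram_counts(h, r) for h, r in pairs]
    return sum(m for m, _ in counts) / sum(n for _, n in counts)


def segment_level(pairs):
    """Average of ratios: score each segment, then take the plain mean."""
    counts = [unigram_counts(h, r) for h, r in pairs]
    return sum(m / n for m, n in counts) / len(counts)


# Toy data (assumed): one short perfect segment, one long poor segment.
pairs = [
    ("the cat", "the cat"),                  # 2 of 2 tokens match
    ("a b c d e f g h", "a b x y z w v u"),  # 2 of 8 tokens match
]

print(corpus_level(pairs))   # (2 + 2) / (2 + 8) = 0.4
print(segment_level(pairs))  # (1.0 + 0.25) / 2 = 0.625
```

In this toy case the pooled score is dominated by the longer segment, while the per-segment mean weights both segments equally, so the two aggregations give different values for the same system output.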

Published

2025-04-11

How to Cite

Cavalin, P., Domingues, P. H., & Pinhanez, C. (2025). Sentence-level Aggregation of Lexical Metrics Correlates Stronger with Human Judgements than Corpus-level Aggregation. Proceedings of the AAAI Conference on Artificial Intelligence, 39(22), 23532–23540. https://doi.org/10.1609/aaai.v39i22.34522

Issue

Vol. 39 No. 22: AAAI-25 Technical Tracks 22

Section

AAAI Technical Track on Natural Language Processing I
