Truth Behind the Scene: Designing Evaluations Benchmarks to Assess LLMs’ Task-Specific Understanding over Test-Taking Strategies
DOI: https://doi.org/10.1609/aaai.v39i28.35337
Abstract
Many existing benchmarks, such as MMLU, are limited in their ability to measure large language models' (LLMs') true task understanding because of the models' reliance on statistical patterns in the training data. We propose new approaches for designing benchmarks that better capture task-specific understanding in LLMs, revealing insights into their reasoning abilities.
Published
2025-04-11
How to Cite
Pham, T. (2025). Truth Behind the Scene: Designing Evaluations Benchmarks to Assess LLMs’ Task-Specific Understanding over Test-Taking Strategies. Proceedings of the AAAI Conference on Artificial Intelligence, 39(28), 29596-29598. https://doi.org/10.1609/aaai.v39i28.35337
Section
AAAI Undergraduate Consortium