Truth Behind the Scene: Designing Evaluations Benchmarks to Assess LLMs’ Task-Specific Understanding over Test-Taking Strategies

Authors

  • Thao Pham, Berea College, Berea, KY

DOI:

https://doi.org/10.1609/aaai.v39i28.35337

Abstract

Many existing benchmarks, such as MMLU, are limited in measuring large language models' (LLMs') true task understanding because models can exploit statistical patterns in their training data. We propose new approaches to designing benchmarks that better capture task-specific understanding in LLMs, revealing insights into their reasoning abilities.

Published

2025-04-11

How to Cite

Pham, T. (2025). Truth Behind the Scene: Designing Evaluations Benchmarks to Assess LLMs’ Task-Specific Understanding over Test-Taking Strategies. Proceedings of the AAAI Conference on Artificial Intelligence, 39(28), 29596-29598. https://doi.org/10.1609/aaai.v39i28.35337