TASE: Token Awareness and Structured Evaluation for Multilingual Language Models

Authors

  • Chenzhuo Zhao, Peking University
  • Xinda Wang, Peking University
  • Yue Huang, Peking University
  • Junting Lu, Peking University
  • Ziqian Liu, The University of Hong Kong

DOI:

https://doi.org/10.1609/aaai.v40i41.40795

Abstract

While large language models (LLMs) have demonstrated remarkable performance on high-level semantic tasks, they often struggle with fine-grained, token-level understanding and structural reasoning—capabilities that are essential for applications requiring precision and control. We introduce TASE, a comprehensive benchmark designed to evaluate LLMs' ability to perceive and reason about token-level information across languages. TASE covers 10 tasks under two core categories: token awareness and structural understanding, spanning Chinese, English, and Korean, with a 35,927-instance evaluation set and a scalable synthetic data generation pipeline for training. Tasks include character counting, token alignment, syntactic structure parsing, and length constraint satisfaction. We evaluate over 30 leading commercial and open-source LLMs, including O3, Claude 4, Gemini 2.5 Pro, and DeepSeek-R1, and train a custom Qwen2.5-14B model using the GRPO training method. Results show that human performance significantly outpaces current LLMs, revealing persistent weaknesses in token-level reasoning. TASE sheds light on these limitations and provides a new diagnostic lens for future improvements in low-level language understanding and cross-lingual generalization.
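To make the notion of a token-awareness task concrete, below is a minimal sketch of how a character-counting probe might be scored. The function names and exact-match answer format are illustrative assumptions, not TASE's actual schema.

```python
# Illustrative sketch of a token-awareness check in the spirit of TASE's
# character-counting task. The prompt/answer format here is hypothetical,
# not the benchmark's actual evaluation schema.

def count_char(text: str, char: str) -> int:
    """Ground-truth count of a character's occurrences in the text."""
    return text.count(char)

def score_answer(text: str, char: str, model_answer: str) -> bool:
    """Exact-match scoring of a model's numeric answer against ground truth."""
    try:
        return int(model_answer.strip()) == count_char(text, char)
    except ValueError:
        # Non-numeric answers score as incorrect.
        return False

# Example: a classic token-level probe that many LLMs get wrong.
print(count_char("strawberry", "r"))        # → 3
print(score_answer("strawberry", "r", "3"))  # → True
print(score_answer("strawberry", "r", "2"))  # → False
```

Tasks like this are trivial for humans but require the model to reason below the token level, since subword tokenization can split or merge the characters being counted.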

Published

2026-03-14

How to Cite

Zhao, C., Wang, X., Huang, Y., Lu, J., & Liu, Z. (2026). TASE: Token Awareness and Structured Evaluation for Multilingual Language Models. Proceedings of the AAAI Conference on Artificial Intelligence, 40(41), 34915–34922. https://doi.org/10.1609/aaai.v40i41.40795

Section

AAAI Technical Track on Natural Language Processing VI