FIXME: Towards End-to-End Benchmarking of LLM-Aided Design Verification

Authors

  • Gwok-Waa Wan (Southeast University, Nanjing, Jiangsu, China; National Center of Technology Innovation for EDA, Nanjing, Jiangsu, China)
  • SamZaak Wong (Southeast University, Nanjing, Jiangsu, China; National Center of Technology Innovation for EDA, Nanjing, Jiangsu, China)
  • Shengchu Su (Southeast University, Nanjing, Jiangsu, China)
  • Chenxu Niu (Texas Tech University, Lubbock, TX, USA)
  • Ning Wang (City University of Hong Kong, Kowloon, Hong Kong SAR, China)
  • Xinlai Wan (Southeast University, Nanjing, Jiangsu, China)
  • Qixiang Chen (National Center of Technology Innovation for EDA, Nanjing, Jiangsu, China)
  • Mengnv Xing (National Center of Technology Innovation for EDA, Nanjing, Jiangsu, China)
  • Jingyi Zhang (Southeast University, Nanjing, Jiangsu, China)
  • Jianmin Ye (Southeast University, Nanjing, Jiangsu, China)
  • Yubo Wang (National Center of Technology Innovation for EDA, Nanjing, Jiangsu, China)
  • Rongchang Song (National Center of Technology Innovation for EDA, Nanjing, Jiangsu, China)
  • Tao Ni (National Center of Technology Innovation for EDA, Nanjing, Jiangsu, China)
  • Qiang Xu (National Center of Technology Innovation for EDA, Nanjing, Jiangsu, China; The Chinese University of Hong Kong, Shatin, Hong Kong SAR, China)
  • Nan Guan (National Center of Technology Innovation for EDA, Nanjing, Jiangsu, China; City University of Hong Kong, Kowloon, Hong Kong SAR, China)
  • Zhe Jiang (Southeast University, Nanjing, Jiangsu, China; National Center of Technology Innovation for EDA, Nanjing, Jiangsu, China)
  • Xi Wang (Southeast University, Nanjing, Jiangsu, China; National Center of Technology Innovation for EDA, Nanjing, Jiangsu, China)
  • Yong Chen (Texas Tech University, Lubbock, TX, USA)
  • Jun Yang (Southeast University, Nanjing, Jiangsu, China)

DOI:

https://doi.org/10.1609/aaai.v40i2.37079

Abstract

We introduce FIXME, the first large-scale, end-to-end benchmark for evaluating Large Language Models (LLMs) in hardware design functional verification (FV). Comprising 747 tasks derived from real-world hardware designs, FIXME spans five core FV subsets: specification comprehension, reference model generation, testbench generation, assertion design, and RTL debugging. To ensure high data quality, we developed an AI-human collaborative framework for agile data curation and annotation. This process resulted in 25,000 lines of verified RTL, 35,000 lines of enhanced testbenches, and over 1,200 SystemVerilog Assertions. Furthermore, through expert-guided optimization within the multi-agent-aided flow, we achieved a 45.57% improvement in average functional coverage, underscoring the benchmark's robustness. Through evaluation of state-of-the-art LLMs such as GPT-4.1, FIXME identifies key limitations and provides actionable insights, advancing the potential of LLM-driven automation in hardware design functional verification.

Published

2026-03-14

How to Cite

Wan, G.-W., Wong, S., Su, S., Niu, C., Wang, N., Wan, X., … Yang, J. (2026). FIXME: Towards End-to-End Benchmarking of LLM-Aided Design Verification. Proceedings of the AAAI Conference on Artificial Intelligence, 40(2), 1087–1095. https://doi.org/10.1609/aaai.v40i2.37079

Section

AAAI Technical Track on Application Domains II