A Mathematical Philosophy of Explanations in Mechanistic Interpretability

Authors

  • Kola Ayonrinde, UK AI Security Institute
  • Louis Jaburi, Independent Researcher

DOI:

https://doi.org/10.1609/aies.v8i1.36547

Abstract

Mechanistic Interpretability aims to understand neural networks through causal explanations. We argue for the Explanatory View Hypothesis: that Mechanistic Interpretability research is a principled approach to understanding models because neural networks contain implicit explanations which can be extracted and understood. We hence show that Explanatory Faithfulness, an assessment of how well an explanation fits a model, is well-defined. We propose a definition of Mechanistic Interpretability (MI) as the practice of producing Model-level, Ontic, Causal-Mechanistic, and Falsifiable explanations of neural networks, allowing us to distinguish MI from other interpretability paradigms and detail MI’s inherent limits. We formulate the Principle of Explanatory Optimism, a conjecture which we argue is a necessary precondition for the success of Mechanistic Interpretability.

Published

2025-10-15

How to Cite

Ayonrinde, K., & Jaburi, L. (2025). A Mathematical Philosophy of Explanations in Mechanistic Interpretability. Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, 8(1), 265–278. https://doi.org/10.1609/aies.v8i1.36547