A Mathematical Philosophy of Explanations in Mechanistic Interpretability

Kola Ayonrinde; Louis Jaburi

doi:10.1609/aies.v8i1.36547

Authors

Kola Ayonrinde, UK AI Security Institute
Louis Jaburi, Independent Researcher

Abstract

Mechanistic Interpretability aims to understand neural networks through causal explanations. We argue for the Explanatory View Hypothesis: that Mechanistic Interpretability research is a principled approach to understanding models because neural networks contain implicit explanations which can be extracted and understood. We hence show that Explanatory Faithfulness, an assessment of how well an explanation fits a model, is well-defined. We propose a definition of Mechanistic Interpretability (MI) as the practice of producing Model-level, Ontic, Causal-Mechanistic, and Falsifiable explanations of neural networks, allowing us to distinguish MI from other interpretability paradigms and detail MI’s inherent limits. We formulate the Principle of Explanatory Optimism, a conjecture which we argue is a necessary precondition for the success of Mechanistic Interpretability.

Published

2025-10-15

How to Cite

Ayonrinde, K., & Jaburi, L. (2025). A Mathematical Philosophy of Explanations in Mechanistic Interpretability. Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, 8(1), 265–278. https://doi.org/10.1609/aies.v8i1.36547

Issue

Vol. 8 No. 1 (2025): Proceedings of the Eighth AAAI/ACM Conference on AI, Ethics, and Society (AIES-25) - Main Track I

Section

Main Track I