Sadiekh, S., Ericheva, E., & Agarwal, C. (2026). Polarity-Aware Probing for Quantifying Latent Alignment in Language Models. Proceedings of the AAAI Conference on Artificial Intelligence, 40(44), 37896–37903. https://doi.org/10.1609/aaai.v40i44.41126