Polarity-Aware Probing for Quantifying Latent Alignment in Language Models

Authors

  • Sabrina Sadiekh, Independent researcher
  • Elena Ericheva, Independent researcher
  • Chirag Agarwal, University of Virginia, Charlottesville

DOI:

https://doi.org/10.1609/aaai.v40i44.41126

Abstract

Advances in unsupervised probes like Contrast-Consistent Search (CCS), which reveal latent beliefs without token outputs, raise the question of whether they can reliably assess model alignment. We investigate this by examining CCS's sensitivity to harmful vs. safe statements and introducing Polarity-Aware CCS (PA-CCS), which evaluates whether a model's internal representations remain consistent under polarity inversion. We propose two alignment-oriented metrics, Polar-Consistency and Contradiction Index, to quantify the semantic robustness of a model's latent knowledge. To validate PA-CCS, we curate two main datasets and one control dataset containing matched harmful-safe sentence pairs formulated by different methods (concurrent and antagonistic statements), and apply PA-CCS to 16 language models. Our results demonstrate that PA-CCS reveals both architectural and layer-specific differences in the encoding of latent harmful knowledge. Interestingly, replacing the negation token with a meaningless marker degrades the PA-CCS scores of models with aligned representations. In contrast, models lacking robust internal calibration do not show this degradation.
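
For readers unfamiliar with CCS, the following is a minimal sketch of the standard CCS training objective as introduced in the original CCS work, not of the PA-CCS metrics, which the abstract does not define. It assumes a probe has already produced probabilities for a statement and its negated counterpart; the function name `ccs_loss` and the argument names `p_pos` and `p_neg` are hypothetical and chosen only for illustration.

```python
import numpy as np


def ccs_loss(p_pos, p_neg):
    """Standard CCS objective averaged over contrast pairs (illustrative sketch).

    p_pos: probe probabilities for the original statements (assumed input).
    p_neg: probe probabilities for the negated statements (assumed input).
    """
    p_pos = np.asarray(p_pos, dtype=float)
    p_neg = np.asarray(p_neg, dtype=float)
    # Consistency term: a statement and its negation should have
    # probabilities that sum to one.
    consistency = (p_pos - (1.0 - p_neg)) ** 2
    # Confidence term: penalizes the degenerate solution p_pos = p_neg = 0.5.
    confidence = np.minimum(p_pos, p_neg) ** 2
    return float(np.mean(consistency + confidence))


# A perfectly consistent and confident probe incurs zero loss.
print(ccs_loss([1.0, 0.0], [0.0, 1.0]))
```

A probe that assigns 0.5 to both members of a pair satisfies the consistency term but still pays the confidence penalty, which is why both terms appear in the objective.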

Published

2026-03-14

How to Cite

Sadiekh, S., Ericheva, E., & Agarwal, C. (2026). Polarity-Aware Probing for Quantifying Latent Alignment in Language Models. Proceedings of the AAAI Conference on Artificial Intelligence, 40(44), 37896–37903. https://doi.org/10.1609/aaai.v40i44.41126

Issue

Vol. 40 No. 44 (2026)

Section

AAAI Special Track on AI Alignment