[1] D. Feng, B. Qin, C. Huang, Y. Huang, Z. Zhang, and W. Lei, “LEGEND: Leveraging Representation Engineering to Annotate Safety Margin for Preference Datasets,” AAAI, vol. 39, no. 26, pp. 27277–27285, Apr. 2025.