Proceedings of the AAAI Symposium Series
https://ojs.aaai.org/index.php/AAAI-SS
The AAAI Symposium Series, previously published as AAAI Technical Reports, is held three times a year (Spring, Summer, Fall) and is designed to bring colleagues together to share ideas and learn from each other's artificial intelligence research. The series affords participants a smaller, more intimate setting; topics for the symposia change each year, and the limited seating capacity and relaxed atmosphere allow for workshop-like interaction. The format of the series allows participants to devote considerably more time to feedback and discussion than typical one-day workshops. It is an ideal venue for bringing together new communities in emerging fields.

The AAAI Spring Symposium Series is typically held during spring break (generally in March) on the west coast. The AAAI Summer Symposium Series is the newest of the annual meetings run in parallel at a common site; the inaugural 2023 Summer Symposium Series was held July 17-19, 2023, in Singapore. The AAAI Fall Symposium Series is usually held on the east coast during late October or early November.

AAAI Press | en-US | Proceedings of the AAAI Symposium Series | ISSN 2994-4317

Designing Safety Specifications for Clinical AI: A Case Study
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36898
Clinical AI models increasingly inform care decisions, yet implicit assumptions about data timing, label semantics, calibration, and operating thresholds are rarely specified or monitored, causing subtle failures that standard metrics miss. We present executable safety contracts: lightweight, task-level specifications enforced as runtime checks for hospital length-of-stay prediction. The specifications capture preconditions (data integrity, index-time alignment, censoring), postconditions (admissible outputs, alert-budget bounds), and invariants (coverage/calibration targets, subgroup equity). We implement these checks in a Python pipeline and evaluate them on a single-center MIMIC-IV cohort and a multi-center eICU-style cohort using simple baselines (logistic regression, gradient boosting) with conformal intervals and post-hoc calibration. The contracts exposed hazards that MAE (Mean Absolute Error), AUC (Area Under the ROC Curve), or ECE (Expected Calibration Error) alone missed, for example, acceptable point error with severe under-coverage in eICU, well-calibrated probabilities that nonetheless violated alert-rate constraints, and dataset-specific fairness gaps. Lightweight remedies such as conformal radius tuning, threshold/alert-scope selection, and calibration often restored compliance without degrading point performance, while clarifying when deeper modeling or policy changes were needed. Overall, the case study shows that Design by Contract principles extend beyond APIs to system-level specifications for clinical ML, providing a practical way to state safety expectations, check them with minimal compute, and make violations actionable.
Shibbir Ahmed
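A minimal sketch of what such a runtime contract check could look like, consistent with the postcondition and invariant clauses the abstract names. The clause names, thresholds, and toy data are hypothetical illustrations, not the authors' implementation.

```python
# Hypothetical executable safety contract for a length-of-stay model.
import numpy as np

def check_contract(y_pred, y_lower, y_upper, y_true, alert_threshold=7.0,
                   alert_budget=0.15, target_coverage=0.90):
    """Return a dict mapping contract clauses to pass/fail."""
    y_pred, y_true = np.asarray(y_pred, float), np.asarray(y_true, float)
    results = {}
    # Postcondition: admissible outputs (length of stay cannot be negative).
    results["admissible_outputs"] = bool((y_pred >= 0).all())
    # Postcondition: alert-budget bound (fraction of long-stay alerts raised).
    results["alert_budget"] = float((y_pred > alert_threshold).mean()) <= alert_budget
    # Invariant: empirical coverage of the conformal interval meets its target.
    covered = (y_true >= np.asarray(y_lower)) & (y_true <= np.asarray(y_upper))
    results["coverage"] = float(covered.mean()) >= target_coverage
    return results

# A violation is actionable: the failing clause is named explicitly.
checks = check_contract(y_pred=[3.2, 9.1, 5.0], y_lower=[1.0, 6.0, 3.0],
                        y_upper=[6.0, 12.0, 8.0], y_true=[4.0, 10.5, 2.5])
print([clause for clause, ok in checks.items() if not ok])
```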
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-23 | 7(1): 294-302 | DOI: 10.1609/aaaiss.v7i1.36898

EchoLeak: The First Real-World Zero-Click Prompt Injection Exploit in a Production LLM System
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36899
Large language model (LLM) assistants are increasingly integrated into enterprise workflows, raising new security concerns as they bridge internal and external data sources. This paper presents an in-depth case study of EchoLeak (CVE-2025-32711), a zero-click prompt injection vulnerability in Microsoft 365 Copilot that enabled remote, unauthenticated data exfiltration via a single crafted email. By chaining multiple bypasses (evading Microsoft's XPIA (Cross Prompt Injection Attempt) classifier, circumventing link redaction with reference-style Markdown, exploiting auto-fetched images, and abusing a Microsoft Teams proxy allowed by the content security policy), EchoLeak achieved full privilege escalation across LLM trust boundaries without user interaction. We analyze why existing defenses failed and outline a set of engineering mitigations including prompt partitioning, enhanced input/output filtering, provenance-based access control, and strict content security policies. Beyond the specific exploit, we derive generalizable lessons for building secure AI copilots, emphasizing the principle of least privilege, defense-in-depth architectures, and continuous adversarial testing. Our findings establish prompt injection as a practical, high-severity vulnerability class in production AI systems and provide a blueprint for defending against future AI-native threats.
Pavan Reddy, Aditya Sanjay Gujral
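To make the link-redaction bypass concrete: redaction that only matches inline [text](url) links misses reference-style definitions. Below is a simplified sketch of output filtering that covers both forms, with a host allow-list standing in for a CSP-style policy; the regexes and the allow-list are assumptions for demonstration, not Microsoft's actual redaction logic.

```python
# Illustrative output filter covering inline, image, and reference-style links.
import re

IMAGE       = re.compile(r"!\[[^\]]*\]\((https?://[^)\s]+)\)")       # ![alt](url)
INLINE_LINK = re.compile(r"\[([^\]]*)\]\((https?://[^)\s]+)\)")      # [text](url)
REF_LINK    = re.compile(r"^\s*\[[^\]]+\]:\s*(https?://\S+)", re.M)  # [id]: url

ALLOWED_HOSTS = {"teams.microsoft.com"}  # hypothetical allow-list

def redact_external_links(text: str) -> str:
    """Strip every URL whose host is not explicitly allowed."""
    def strip(match: re.Match) -> str:
        url = match.group(match.lastindex)
        host = re.sub(r"^https?://([^/]+).*$", r"\1", url)
        return match.group(0) if host in ALLOWED_HOSTS else "[link redacted]"
    for pattern in (IMAGE, INLINE_LINK, REF_LINK):  # images first, then links
        text = pattern.sub(strip, text)
    return text

print(redact_external_links("See [notes][1]\n\n[1]: https://attacker.example/exfil?d=secret"))
```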
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-23 | 7(1): 303-311 | DOI: 10.1609/aaaiss.v7i1.36899

Quantum Network Science: Linking Graph Structure to Entanglement Performance
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36900
Quantum communication networks, the substrate of a future quantum Internet, demand analytical tools that account for entanglement, fidelity, and quantum-specific constraints absent from classical models. In this paper, we introduce the Quantum Network Science (QNS) framework that adapts core network metrics to the quantum setting through fidelity-, success-probability-, and capacity-aware weighting. We formalize centrality (including Quantum PageRank and continuous-time quantum-walk variants), community structure on entanglement graphs, and robustness/percolation with fidelity thresholds. The framework is validated via analytic motifs and controlled simulations on Erdős–Rényi, scale-free, and small-world topologies, as well as satellite-assisted versus fiber-only designs. Our results show that (i) fidelity weighting reorders structural importance and can reconnect networks that appear fragmented classically; (ii) heavy-tailed degree patterns improve tolerance to random failures but heighten vulnerability to targeted hub attacks; (iii) small-world shortcuts induced by long-range quantum links shrink path lengths; and (iv) overlapping “connected components” emerge from entanglement swapping, motivating revised connectivity baselines. We also discuss design implications (degree caps and hub hardening, link-type diversity, multipath routing, and buffering policies) and outline extensions to temporal and multilayer modeling that couple the quantum plane with its classical control layer. QNS thus offers a principled, measurement-oriented foundation for analyzing, comparing, and engineering resilient, high-capacity quantum networks.
Rawan Almakinah, M Abdullah Canbaz
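One way to realize the fidelity weighting the abstract describes, assuming the standard -log(F) transform that turns multiplicative link fidelities into additive path weights; the toy graph and fidelity values are invented for illustration.

```python
# Fidelity-aware path weighting: the "shortest" path under w = -log(F)
# is the path with the highest end-to-end fidelity.
import math
import networkx as nx

G = nx.Graph()
G.add_edge("A", "B", fidelity=0.95)
G.add_edge("B", "C", fidelity=0.90)
G.add_edge("A", "C", fidelity=0.80)   # direct but noisy link

for u, v, data in G.edges(data=True):
    data["w"] = -math.log(data["fidelity"])

path = nx.shortest_path(G, "A", "C", weight="w")
end_to_end = math.exp(-nx.shortest_path_length(G, "A", "C", weight="w"))
print(path, round(end_to_end, 3))   # ['A', 'B', 'C'] 0.855 beats the direct 0.80 link
```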
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-23 | 7(1): 313-322 | DOI: 10.1609/aaaiss.v7i1.36900

Quantum Diffusion Model for Quark and Gluon Jet Generation
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36901
Diffusion models have demonstrated remarkable success in image generation, but they are computationally intensive and time-consuming to train. In this paper, we introduce a novel diffusion model that benefits from quantum computing techniques in order to mitigate computational challenges and enhance generative performance within high energy physics data. The fully quantum diffusion model replaces Gaussian noise with random unitary matrices in the forward process and incorporates a variational quantum circuit within the U-Net in the denoising architecture. We run evaluations on the structurally complex quark and gluon jets dataset from the Large Hadron Collider. The results demonstrate that the fully quantum and hybrid models are competitive with a similar classical model for jet generation, highlighting the potential of using quantum techniques for machine learning problems.
Mariia Baidachna, Rey Guadarrama, Gopal Ramesh Dahale, Tom Magorsch, Isabel Pedraza, Konstantin T. Matchev, Katia Matcheva, Kyoungchul Kong, Sergei Gleyzer
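The forward-process substitution is easy to state in code: apply Haar-random unitaries instead of adding Gaussian noise. A toy state-vector sketch using SciPy's unitary_group; the dimension and step count are illustrative, not the paper's configuration.

```python
# Forward diffusion via Haar-random unitaries on a 3-qubit state vector.
import numpy as np
from scipy.stats import unitary_group

rng = np.random.default_rng(11)
state = np.zeros(8, dtype=complex)
state[0] = 1.0                                   # |000> basis state

for _ in range(5):                               # five "noising" steps
    U = unitary_group.rvs(8, random_state=rng)   # Haar-random 8x8 unitary
    state = U @ state

# Norm is preserved exactly, unlike additive Gaussian noise.
print(np.round(np.abs(state) ** 2, 3), np.isclose(np.linalg.norm(state), 1.0))
```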
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-23 | 7(1): 323-329 | DOI: 10.1609/aaaiss.v7i1.36901

Quantum Variational Rewinding for Time Series Anomaly Detection
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36902
We explore a new quantum computing approach to time series anomaly detection (TAD). Our approach, Quantum Variational Rewinding (QVR), trains a family of parameterized unitary time-devolution operators to cluster normal time series instances encoded within quantum states. Unseen time series are assigned an anomaly score based upon their distance from the cluster center; scores beyond a given threshold classify behaviour as anomalous. We apply QVR to identify anomalous trading activity in cryptocurrency market and blockchain data. Finally, we study our algorithm on IBM's Falcon r5.11H family of superconducting transmon QPUs, where anomaly score errors resulting from hardware noise are shown to be reducible by as much as 14% on average using advanced error mitigation techniques.
Jack S. Baker, Haim Horowitz, Santosh Kumar Radha, Stenio Fernandes, Colin Jones, Noorain Noorani, Vladimir Skavysh, Philippe Lamontagne, Barry C. Sanders
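The scoring rule itself can be sketched classically: distance from the cluster centre of normal instances, thresholded at a quantile. Plain vectors stand in for quantum-state encodings here, and all values are synthetic.

```python
# Schematic of QVR's scoring step only: centre distance plus a threshold.
import numpy as np

rng = np.random.default_rng(0)
normal = rng.normal(0.0, 1.0, size=(200, 8))   # stand-in for encoded normal series
centre = normal.mean(axis=0)

def anomaly_score(x):
    return float(np.linalg.norm(x - centre))

# Threshold at the 99th percentile of scores on normal data.
threshold = np.quantile([anomaly_score(x) for x in normal], 0.99)
suspect = rng.normal(4.0, 1.0, size=8)          # shifted, i.e. anomalous, instance
print(anomaly_score(suspect) > threshold)       # True: flagged as anomalous
```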
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-23 | 7(1): 330-338 | DOI: 10.1609/aaaiss.v7i1.36902

Towards Practical Quantum Kernels for Network Intrusion Detection
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36903
With cyber-attacks becoming increasingly sophisticated, modern network intrusion detection systems (NIDSs) are relying on machine learning (ML) methods for their flexibility in detecting subtle anomalous patterns in huge amounts of network data. However, classical ML methods such as support vector machines (SVMs) often rely on the conversion of low-dimensional data into a high-dimensional space, creating complex linear systems that are time-consuming to evaluate on large data inputs such as network flow logs. We propose addressing this limitation by employing a hybrid quantum-classical ML model to leverage quantum computing's (QC's) superiority in high-dimensional areas. We constructed a quantum kernel with an SVM model and evaluated it on four different network attacks from a modern intrusion detection dataset. Results reveal an average hardware accuracy rate of 85% with noticeably small deviations between runs, suggesting that quantum kernels may be a noise-resistant solution. We evaluated these results alongside classical and noiseless quantum simulator benchmarks.
Mary L. Cotrupi, Brian R. Callahan
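scikit-learn's precomputed-kernel interface shows where a quantum kernel plugs into an SVM pipeline: the Gram matrix is computed externally (on a QPU or simulator) and handed to the classifier. Below, a classical RBF kernel stands in for the quantum fidelity kernel, and the data are synthetic.

```python
# SVM over an externally computed Gram matrix; swap in a quantum kernel here.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X_train = rng.normal(size=(60, 5)); y_train = (X_train.sum(axis=1) > 0).astype(int)
X_test  = rng.normal(size=(20, 5)); y_test  = (X_test.sum(axis=1) > 0).astype(int)

K_train = rbf_kernel(X_train, X_train)   # stand-in for the quantum fidelity kernel
K_test  = rbf_kernel(X_test,  X_train)   # rows: test points, cols: training points

clf = SVC(kernel="precomputed").fit(K_train, y_train)
print("accuracy:", (clf.predict(K_test) == y_test).mean())
```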
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-23 | 7(1): 339-342 | DOI: 10.1609/aaaiss.v7i1.36903

Prediction of Stocks Index Price Using Quantum GANs
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36904
This paper investigates the application of Quantum Generative Adversarial Networks (QGANs) for stock price prediction. Financial markets are inherently complex, marked by high volatility and intricate patterns that traditional models often fail to capture. QGANs, leveraging the power of quantum computing, offer a novel approach by combining the strengths of generative models with quantum machine learning techniques. We implement a QGAN model tailored for stock price prediction and evaluate its performance using historical market data. Results demonstrate that QGANs can generate synthetic data closely resembling actual market behavior, leading to enhanced prediction accuracy. The experiment was conducted using stock index price data and the AWS Braket SV1 simulator for training QGAN circuits. The quantum-enhanced model outperforms classical LSTM and GAN models in both convergence speed and prediction accuracy. This research marks a key step toward integrating quantum computing in financial forecasting, offering potential advantages in speed and precision over traditional methods. These findings hold promising implications for traders, financial analysts, and researchers.
Sangram Deshpande, Gopal Ramesh Dahale, Sai Nandan Morapakula, Uday Wad
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-23 | 7(1): 343-349 | DOI: 10.1609/aaaiss.v7i1.36904

Vectorized Attention with Learnable Encoding for Quantum Transformer
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36905
Vectorized quantum block encoding provides a way to embed classical data into Hilbert space, offering a pathway for quantum models, such as Quantum Transformers (QT), that replace classical self-attention with quantum circuit simulations to operate more efficiently. Current QTs rely on deep parameterized quantum circuits (PQCs), rendering them vulnerable to QPU noise and thus hindering their practical performance. In this paper, we propose the Vectorized Quantum Transformer (VQT), a model that supports ideal masked-attention matrix computation through quantum approximation simulation and efficient training via a vectorized nonlinear quantum encoder, yielding shot-efficient and gradient-free quantum circuit simulation (QCS) and reduced classical sampling overhead. In addition, we demonstrate an accuracy comparison for IBM and IonQ in quantum circuit simulation and competitive results in benchmarking natural language processing tasks on IBM's state-of-the-art, high-fidelity Kingston QPU. Our noisy intermediate-scale quantum (NISQ)-friendly VQT approach unlocks a novel architecture for end-to-end machine learning in quantum computing.
Ziqing Guo, Ziwen Pan, Alex Khan, Jan Balewski
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-23 | 7(1): 350-357 | DOI: 10.1609/aaaiss.v7i1.36905

BenchRL-QAS: Benchmarking Reinforcement Learning Algorithms for Quantum Architecture Search
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36906
We present BenchRL-QAS, a unified benchmarking framework for reinforcement learning (RL) in quantum architecture search (QAS) across a spectrum of variational quantum algorithm tasks on 2- to 8-qubit systems. Our study systematically evaluates 9 different RL agents, including both value-based and policy-gradient methods, on quantum problems such as the variational eigensolver, quantum state diagonalization, variational quantum classification (VQC), and state preparation, under both noiseless and noisy execution settings. To ensure fair comparison, we propose a weighted ranking metric that integrates accuracy, circuit depth, gate count, and training time. Results demonstrate that no single RL method dominates universally; performance depends on task type, qubit count, and noise conditions, providing strong evidence of a no-free-lunch principle in RL-QAS. As a byproduct, we observe that a carefully chosen RL algorithm in RL-based VQC outperforms baseline VQCs. BenchRL-QAS establishes the most extensive benchmark for RL-based QAS to date; code and experiments are made publicly available for reproducibility and future advances.
Azhar Ikhtiarudin, Aditi Das, Param Thakkar, Akash Kundu
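One plausible reading of such a weighted ranking metric: min-max normalise each criterion so lower is better, then combine with weights. The agent names, numbers, and weights below are fabricated for illustration; the paper's exact formula may differ.

```python
# Hypothetical weighted ranking over accuracy (as error), depth, gates, time.
import numpy as np

agents = {           # err_rate, depth, gate_count, train_time_s (all made up)
    "DQN": [0.02, 14, 38, 310.0],
    "PPO": [0.01, 18, 45, 540.0],
    "A2C": [0.05, 11, 30, 250.0],
}
M = np.array(list(agents.values()), dtype=float)
span = M.max(axis=0) - M.min(axis=0)
norm = (M - M.min(axis=0)) / np.where(span == 0, 1.0, span)  # 0 = best, 1 = worst
weights = np.array([0.4, 0.2, 0.2, 0.2])                     # hypothetical emphasis
scores = norm @ weights                                      # lower composite = better
for name, s in sorted(zip(agents, scores), key=lambda t: t[1]):
    print(f"{name}: {s:.3f}")
```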
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-23 | 7(1): 358-367 | DOI: 10.1609/aaaiss.v7i1.36906

Quantum-Classical Hybrid Molecular Autoencoder for Advancing Classical Decoding
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36907
Although recent advances in quantum machine learning (QML) offer significant potential for enhancing generative models, particularly in molecular design, a large array of classical approaches still face challenges in achieving high fidelity and validity. In particular, the integration of QML with sequence-based tasks, such as Simplified Molecular Input Line Entry System (SMILES) string reconstruction, remains underexplored and usually suffers from fidelity degradation. In this work, we propose a hybrid quantum-classical architecture for SMILES reconstruction that integrates quantum encoding with classical sequence modeling to improve quantum fidelity and classical similarity. Our approach achieves a quantum fidelity of approximately 84% and a classical reconstruction similarity of 60%, surpassing existing quantum baselines. Our work lays a promising foundation for future QML applications, striking a balance between expressive quantum representations and classical sequence models and catalyzing broader research on quantum-aware sequence models for molecular and drug discovery.
Afrar Jahin, Yi Pan, Yingfeng Wang, Tianming Liu, Wei Zhang
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-23 | 7(1): 368-373 | DOI: 10.1609/aaaiss.v7i1.36907

A Hybrid Classical-Quantum Fine-Tuned BERT for Text Classification
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36908
Fine-tuning BERT for text classification can be computationally challenging and requires careful hyper-parameter tuning. Recent studies have highlighted the potential of quantum algorithms to outperform conventional methods in machine learning and text classification tasks. In this work, we propose a hybrid approach that integrates an n-qubit quantum circuit with a classical BERT model for text classification. We evaluate the performance of the fine-tuned classical-quantum BERT and demonstrate its feasibility as well as its potential in advancing this research area. Our experimental results show that the proposed hybrid model achieves performance that is competitive with, and in some cases better than, the classical baselines on standard benchmark datasets. Furthermore, our approach demonstrates the adaptability of classical–quantum models for fine-tuning pre-trained models across diverse datasets. Overall, the hybrid model highlights the promise of quantum computing in achieving improved performance for text classification tasks.
Abu Kaisar Mohammad Masum, Naveed Mahmud, M. Hassan Najafi, Sercan Aygun
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-23 | 7(1): 374-380 | DOI: 10.1609/aaaiss.v7i1.36908

Bridging Classical and Quantum Computing for Next-Generation Language Models
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36909
The remarkable success of Transformer architectures in Large Language Models (LLMs) has revolutionized natural language processing, yet the transition to quantum computing for next-generation language models remains an open challenge. While quantum computing promises exponential advantages, a fundamental gap exists between classical deep learning and quantum computing paradigms, particularly given the severe constraints of Noisy Intermediate-Scale Quantum (NISQ) devices, including barren plateaus, limited qubit coherence, and circuit depth restrictions. We present Adaptive Quantum-Classical Fusion (AQCF), the first framework to bridge classical and quantum computing for language models by reimagining Transformer architectures through quantum-classical co-design. Our key insight is that effective bridging requires dynamic adaptation rather than static translation: the framework analyzes input complexity in real time to orchestrate seamless transitions between classical and quantum processing. AQCF introduces entropy-driven adaptive circuits that circumvent barren plateaus, quantum memory banks that unify classical attention with quantum state-based similarity retrieval, and intelligent fusion controllers that ensure each computational paradigm handles tasks where it naturally excels. This bridging architecture maintains full compatibility with existing classical Transformers while progressively incorporating quantum advantages as they become accessible. Experiments on sentiment analysis demonstrate that AQCF achieves competitive performance while significantly improving quantum resource efficiency, operating successfully within typical NISQ constraints. By establishing a seamless integration pathway from today's classical LLMs to tomorrow's quantum-enhanced models, our framework provides both immediate practical value on current quantum hardware and a clear evolution path toward full Quantum LLMs as technology matures.
Yi Pan, Hanqi Jiang, Junhao Chen, Yiwei Li, Huaqin Zhao, Lin Zhao, Yohannes Abate, Yingfeng Wang, Tianming Liu
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-23 | 7(1): 381-389 | DOI: 10.1609/aaaiss.v7i1.36909

Parametric Quantum Feature Selection Methods for Fraud and Default Detection
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36910
Feature selection plays an important role in improving the efficiency of machine learning models for credit card fraud and default detection. We formulate the feature selection problem as a Quadratic Unconstrained Binary Optimization (QUBO) problem, which we solve using quantum annealers. We propose three new formulations based on this framework that improve the efficiency and flexibility of machine learning models. We benchmark the proposed methods, existing approaches from the literature, and also compare with classical feature-selection methods such as Random Forest feature importance and a combination of mutual information and Spearman correlation. Extensive experiments show that feature selection using quantum computers consistently performs better than the classical methods. Our experiments show the promise of using quantum computers in machine learning tasks in financial risk assessment applications.
Sutapa Samanta, Dagen Wang, Todd Hodges, Andras Ferenczi
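The QUBO pattern for feature selection can be sketched directly: diagonal terms reward label relevance (e.g., mutual information) and off-diagonal terms penalise redundancy (e.g., absolute Spearman correlation). The numbers below are invented, and the brute-force minimiser simply stands in for a quantum annealer at this toy size; the paper's three formulations may differ.

```python
# Toy QUBO for feature selection: minimise x^T Q x over binary x.
import itertools
import numpy as np

relevance  = np.array([0.9, 0.8, 0.3, 0.7])            # e.g. MI with the label
redundancy = np.array([[0, .6, .1, .5],
                       [.6, 0, .2, .4],
                       [.1, .2, 0, .1],
                       [.5, .4, .1, 0]], dtype=float)  # e.g. |Spearman| pairs
alpha = 0.5                                            # redundancy penalty weight
Q = alpha * redundancy - np.diag(relevance)

best = min((np.array(x) for x in itertools.product([0, 1], repeat=4)),
           key=lambda x: x @ Q @ x)
print("selected features:", np.flatnonzero(best))      # drops the redundant feature
```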
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-23 | 7(1): 390-397 | DOI: 10.1609/aaaiss.v7i1.36910

Monitoring and Evaluating Quantum Generative Models Using Spark and MLflow
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36911
Quantum generative models (QGMs), including Variational Quantum Circuits (VQCs) and Quantum GANs, hold significant potential in generating complex data distributions beyond the capabilities of classical generative approaches. However, robust monitoring and evaluation of QGMs remain underdeveloped due to hardware constraints, stochastic quantum behavior, and reproducibility limitations. This paper proposes a scalable and modular framework using Apache Spark and MLflow to monitor, evaluate, and track the performance of QGMs. The framework enables ingestion of quantum-generated data, distributed computation of performance metrics such as fidelity, entanglement entropy, and distributional divergence, and experiment tracking via MLflow. We validate the methodology using Qiskit-based simulated QGMs and demonstrate the effectiveness of classical big data tools in bridging the evaluation gap in quantum ML research.
Saman Siadati
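A barebones version of the tracking idea, using MLflow's standard API (start_run, log_param, log_metric): compute a distributional-divergence metric for samples from a simulated generator and log it. The target distribution, metric choice, and run layout are illustrative assumptions.

```python
# Log a KL divergence between sampled and target output distributions to MLflow.
import numpy as np
import mlflow
from scipy.stats import entropy

target = np.array([0.5, 0.25, 0.15, 0.10])    # ideal 2-qubit output distribution
samples = np.random.default_rng(7).choice(4, size=2000, p=[0.48, 0.27, 0.14, 0.11])
empirical = np.bincount(samples, minlength=4) / samples.size

with mlflow.start_run(run_name="qgm-eval"):
    mlflow.log_param("n_shots", samples.size)
    mlflow.log_metric("kl_divergence", float(entropy(empirical, target)))
```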
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-23 | 7(1): 398-403 | DOI: 10.1609/aaaiss.v7i1.36911

AI Methods for Permutation Circuit Synthesis Across Generic Topologies
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36912
This paper investigates artificial intelligence (AI) methodologies for the synthesis and transpilation of permutation circuits across generic topologies. Our approach uses Reinforcement Learning (RL) techniques to achieve near-optimal synthesis of permutation circuits up to 25 qubits. Rather than developing specialized models for individual topologies, we train a foundational model on a generic rectangular lattice and employ masking mechanisms to dynamically select subsets of topologies during synthesis. This enables the synthesis of permutation circuits on any topology that can be embedded within the rectangular lattice, without the need to re-train the model. We show results for a 5x5 lattice and compare them to previous topology-oriented AI models and classical methods, finding that our model outperforms classical heuristics, matches previous specialized AI models, and performs synthesis even for topologies that were not seen during training. We further show that the model can be fine-tuned to strengthen performance for selected topologies of interest. This methodology allows a single trained model to efficiently synthesize circuits across diverse topologies, allowing its practical integration into transpilation workflows.
Victor Villar, Juan Cruz-Benito, Ismael Faro, David Kremer
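The masking mechanism can be isolated in a few lines: actions (SWAPs) whose edge is absent from the target sub-topology get -inf logits before the softmax, so a single lattice-trained policy serves any embeddable topology. The edge lists and logits below are toy stand-ins, not the trained model.

```python
# Action masking: zero out SWAPs whose edge is missing from the sub-topology.
import numpy as np

lattice_edges = [(0, 1), (1, 2), (3, 4), (4, 5), (0, 3), (1, 4), (2, 5)]  # 2x3 grid
target_edges  = {(0, 1), (1, 2), (0, 3)}        # path-like sub-topology to serve

mask = np.array([e in target_edges for e in lattice_edges])
policy_logits = np.random.default_rng(2).normal(size=len(lattice_edges))
masked_logits = np.where(mask, policy_logits, -np.inf)

probs = np.exp(masked_logits - masked_logits.max())   # softmax over valid actions
probs /= probs.sum()
print(dict(zip(lattice_edges, np.round(probs, 2))))   # invalid SWAPs get 0 probability
```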
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-23 | 7(1): 404-410 | DOI: 10.1609/aaaiss.v7i1.36912

LLM-QUBO: An End-to-End Framework for Automated QUBO Transformation from Natural Language Problem Descriptions
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36913
Quantum annealing offers a promising paradigm for solving NP-hard combinatorial optimization problems, but its practical application is severely hindered by two challenges: the complex, manual process of translating problem descriptions into the requisite Quadratic Unconstrained Binary Optimization (QUBO) format and the scalability limitations of current quantum hardware. To address these obstacles, we propose a novel end-to-end framework, LLM-QUBO, that automates this entire formulation-to-solution pipeline. Our system leverages a Large Language Model (LLM) to parse natural language, automatically generating a structured mathematical representation. To overcome hardware limitations, we integrate a hybrid quantum-classical Benders' decomposition method. This approach partitions the problem, compiling the combinatorially complex master problem into a compact QUBO format while delegating linearly structured sub-problems to classical solvers. The correctness of the generated QUBO and the scalability of the hybrid approach are validated using classical solvers, establishing a robust performance baseline and demonstrating the framework's readiness for quantum hardware. Our primary contribution is a synergistic computing paradigm that bridges classical AI and quantum computing, addressing key challenges in the practical application of optimization problems. This automated workflow significantly reduces the barrier to entry, providing a viable pathway to transform quantum devices into accessible accelerators for large-scale, real-world optimization challenges.
Huixiang Zhang, Mahzabeen Emu, Salimur Choudhury
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-23 | 7(1): 411-418 | DOI: 10.1609/aaaiss.v7i1.36913

From AI Principles to AI Assurance: an Online Safety Case Study
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36860
Principles-based frameworks for AI assurance have been proposed for various AI/ML use cases, focusing on aspects such as ethical design, trustworthiness, and safety. However, translating these high-level principles into actionable, objective criteria for auditing, particularly by third parties, remains challenging. Our analysis shows this is due to the inherent subjectivity of principles, the need for vertical frameworks tailored to specific AI/ML applications, and the unreliability of information gathered during the assurance process. In this paper, we present a case study on how to develop and operationalise a principles-based framework for AI assurance aimed at assessing the ‘accuracy’ of child sexual exploitation (CSEA) and terrorism detection technologies in the context of online safety. The proposed assurance framework addresses a requirement in the UK's 2023 Online Safety Act to create an 'accreditation' scheme specifically for CSEA and terrorism detection technologies. We discuss the critical challenges of operationalising such principles-based frameworks for assurance, particularly in relation to ensuring transparency, reliability, and consistency in audits. We also map the issues that remain for effectively assessing and auditing AI/ML technologies, informing future research agendas that further the development of robust standards for assurance, particularly in sociotechnical contexts.
Miranda Cross, Andreas Gutmann, Ismini Psychoula, Pedro Friere
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-23 | 7(1): 2-10 | DOI: 10.1609/aaaiss.v7i1.36860

Shaping Sustainability: Public Perception Towards Water Consumption
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36861
Understanding public perception towards water consumption is crucial for promoting sustainable water management, shaping effective policies, and enhancing the awareness of water conservation in diverse communities. Although previous works have studied water consumption from self-reported data and motivations, the topic has not been analyzed from the perspective of cognition from social media. Given the significance of social media's broad reach, real-time engagement, and the diverse demographic representation it offers, understanding how public perception is reflected in online discussions can provide valuable insights into societal attitudes, concerns, and behavioral trends related to water consumption. In this work, we performed a cognitive analysis, based on Reddit discussions about water consumption. Our approach includes both sentiment analysis, representing conscious attitudes, and concept mapping analysis, which captures subconscious cognitive frameworks. Sentiment analysis shows overall positive polarity on Reddit with key aspects of water consumption, while concept mapping reveals cognitive frameworks shaping perceptions. Together, these insights inform communication strategies and policy on water conservation.
Mengshi Ge, Rui Mao, Wang Zhao, Xulang Zhang, Gemma Anne Calvert, Erik Cambria, Daniel E. O'Leary
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-23 | 7(1): 11-19 | DOI: 10.1609/aaaiss.v7i1.36861

Emerging Uses of AI-Generated Images for Equitable and Transparent Simulations
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36862
Despite the maturity of AI image generation, integration of generative AI in M&S has primarily been limited to text. This paper presents a vision for the use of AI-generated images in M&S with an emphasis on equity and transparency. We suggest several emerging use cases including AI-generated images acting as interfaces between agent-based models and physics-based simulations, encouraging empathetic decision-making by visualizing individual agents, and promoting transparency with symbolic representations that complement textual descriptions of abstract model processes. Finally, we discuss the mitigation of ethical issues related to the deployment of AI-generated images in M&S.
Philippe J. Giabbanelli, Kourosh Shoele, Megan A. Witherow
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-23 | 7(1): 20-25 | DOI: 10.1609/aaaiss.v7i1.36862

FairRide: A Cooperative-Game Approach to Fair Surge Pricing in Ridesharing
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36863
Dynamic pricing is the core mechanism that allows rideshare platforms to balance demand and supply. While today's surge strategies, like Uber's surge multiplier model, achieve market efficiency, they often raise fairness concerns, disproportionately burdening riders in low-income areas with little or no access to public transit and creating inconsistent earning opportunities for drivers. We introduce FairRide, a cooperative game-theoretic framework that prices trips via the Owen value to promote multi-sided fairness for both riders and drivers. We further propose two variants: FairRide+, which captures cross-zone demand interdependencies, and FairRide-Decay, which tempers volatility through temporal smoothing. Using a synthetic dataset of 10 zones (urban, suburban, rural), three vehicle categories, and 3,000 time steps, we compare our models against Uber-style surge and an additive-surge benchmark. FairRide-Decay reduces the incidence of extreme surges to below 8% while preserving rider equity and improving driver opportunity balance; all improvements are statistically significant (p < 0.001). These findings demonstrate that fairness-aware dynamic pricing is feasible at platform scale and establish a foundation for hybrid policies that jointly optimize efficiency, fairness, and driver incentives in real-world ridesharing systems.
Aditya Sanjay Gujral, Pavan Reddy, Anirudh Srikant, Sahil Sanjay Gujral
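Context for the pricing rule: the Owen value generalises the Shapley value by respecting an a priori coalition structure (here, rider and driver groups). As the building block, a brute-force Shapley computation for a three-player toy trip game; the characteristic function is invented for illustration and is not FairRide's model.

```python
# Shapley value by averaging marginal contributions over all player orderings.
from itertools import permutations

players = ["rider", "driver", "platform"]
v = {frozenset(): 0, frozenset({"rider"}): 0, frozenset({"driver"}): 0,
     frozenset({"platform"}): 0,
     frozenset({"rider", "driver"}): 8,    # a trip needs both sides
     frozenset({"rider", "platform"}): 0,
     frozenset({"driver", "platform"}): 0,
     frozenset(players): 10}               # platform matching adds value

shapley = {p: 0.0 for p in players}
orders = list(permutations(players))
for order in orders:
    seen = set()
    for p in order:
        shapley[p] += (v[frozenset(seen | {p})] - v[frozenset(seen)]) / len(orders)
        seen.add(p)

print({p: round(x, 2) for p, x in shapley.items()})  # shares sum to v(grand coalition)
```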
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-23 | 7(1): 26-32 | DOI: 10.1609/aaaiss.v7i1.36863

Safety is a Process, not a Score: A Symbol-Aware Safety Evaluation Methodology for GenAI for Social Good Tools in High-Emotion Contexts
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36864
Generative AI tools are increasingly being deployed in sensitive social contexts – from mental health to justice systems – yet current safety metrics remain largely quantitative, decontextualized, and technically narrow. This paper introduces a novel, survivor-informed framework for evaluating GenAI systems in high-emotion, high-risk, or public-facing use cases. Rooted in trauma-informed design and symbolic resonance theory, the “Safety is a Process, Not a Score” framework prioritizes co-regulation, narrative fidelity, and epistemic alignment over one-size-fits-all benchmarks. We describe a collaborative methodology developed with survivors of gender-based violence, including a safety rubric, qualitative risk-mapping protocol, and structured, participant-led test-a-thons. Drawing from a recent field test involving a public-facing GenAI tool, we reflect on what it means to build safety relationally, not just statistically. This approach expands both the evaluative vocabulary and participatory possibilities for AI ethics in real-world deployment.
Ashley Khor
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-23 | 7(1): 33-41 | DOI: 10.1609/aaaiss.v7i1.36864

Evolve-DGN: An Evolving Dynamic Graph Network for Adaptive and Equitable Resource Allocation in Disaster Response
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36865
The effective distribution of resources during and after a disaster is a problem of immense complexity and critical importance. As disaster situations unfold, the network of affected areas, available resources, and viable transportation routes changes dynamically, rendering static optimization models ineffective. Existing machine learning approaches often fail to capture the complex, evolving spatio-temporal dependencies or handle the frequent topological changes inherent in a crisis zone. This paper introduces Evolve-DGN, a novel framework for adaptive and equitable emergency resource allocation. Evolve-DGN models the disaster environment as a dynamic graph and leverages a unique combination of an evolving dynamic graph neural network and multi-agent reinforcement learning (MARL). The core of the framework is a GNN architecture that evolves its parameters over time, enabling it to adapt to real-time changes in the network topology, including the appearance and disappearance of nodes and edges. This GNN serves as a powerful state encoder for a cooperative MARL system where resource depots act as decentralized agents, learning to make coordinated dispatch decisions. A key contribution is the design of a multiobjective reward function that explicitly promotes efficiency, effectiveness, and equity in resource distribution, addressing a well-documented gap between academic models and practitioner needs. The efficacy of Evolve-DGN is demonstrated in a high-fidelity simulation environment, where it consistently outperforms other learning-based baselines in minimizing resource delivery time, a critical factor in saving lives, while maintaining competitive performance in overall resource distribution.
Sachin Kumar
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-23 | 7(1): 42-49 | DOI: 10.1609/aaaiss.v7i1.36865

A Framework for Ethical Data Removal from Language Resources: An Example from Low-Resource Language Communities
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36866
As data resources such as Common Crawl, Mozilla Common Voice, The Pile, and LAION increasingly serve as the raw material for foundational models, the ethical implications of data collection practices become more complex. This paper addresses some of the growing concerns regarding data removal from said resources. Further, this paper presents a framework for data resource hosts to follow when deciding when to remove data. It also presents a process for individuals and/or communities to follow when seeking to have their data removed. Finally, numerous technical challenges and societal trade-offs are addressed.
Sarah Luger, Rafael Mosquera-Gómez, Pedro Ortiz Suárez, Thom Vaughan
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-23 | 7(1): 50-53 | DOI: 10.1609/aaaiss.v7i1.36866

Do AI Chatbot Firms Practice What They Preach?
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36867
This study examines whether leading AI chatbot companies implement the responsible AI principles they publicly advocate. We used a mixed-methods approach analyzing four major chatbots (ChatGPT, Gemini, DeepSeek, and Grok) across company websites, technical documentation, and direct chatbot evaluations. We found significant gaps between corporate rhetoric and practice.
Michael Moreno, Susan Ariel Aaronson
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-23 | 7(1): 54-62 | DOI: 10.1609/aaaiss.v7i1.36867

Selected Bibliography of “AI (Artificial Intelligence) for (Social) Good”
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36868
This paper presents a short bibliography of papers in “AI for Good” and “AI for Social Good” based on a Google Scholar search on July 8, 2025. The purpose of this bibliography is to capture some of the different perspectives and views of AI for Good, to facilitate communication and discussion.
Daniel E. O'Leary
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-23 | 7(1): 63-64 | DOI: 10.1609/aaaiss.v7i1.36868

Generating Word Lists for Analyzing and Monitoring Social Good: The Case of Sustainability
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36869
This paper examines issues associated with the development of bags of words that can be used to analyze the extent to which descriptors of a concept are related to some dependent variable, and as an approach to continuously monitor the occurrence of those concepts in text. We focus on generating bags of words using Word2Vec, using two key sources of business text, Form 10-Ks and “earnings calls,” to support issues of concern in social good. As an experiment, we drill down on building bags of words describing independent variables for the concept of “sustainability,” which could be related to issues such as firm value measures (profitability), events (release of new products or mergers), or other dependent variables.
Daniel E. O'Leary, Yangin Ben Yoon
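The expansion step is a few lines with gensim's Word2Vec: train on business text and harvest the nearest neighbours of a seed concept as the candidate bag of words. The toy corpus below is far too small to yield meaningful vectors and only demonstrates the mechanics; parameters are arbitrary.

```python
# Seed-and-expand bag-of-words generation with Word2Vec.
from gensim.models import Word2Vec

corpus = [
    ["sustainability", "emissions", "renewable", "energy", "policy"],
    ["firm", "profitability", "merger", "value", "sustainability"],
    ["renewable", "energy", "carbon", "emissions", "reporting"],
] * 50  # replicated toy sentences; real use would ingest 10-K / earnings-call text

model = Word2Vec(corpus, vector_size=32, window=3, min_count=1, epochs=20, seed=1)
bag = [w for w, _ in model.wv.most_similar("sustainability", topn=5)]
print(bag)
```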
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-23 | 7(1): 65-70 | DOI: 10.1609/aaaiss.v7i1.36869

AI and Public Decentralized Networks for Voluntary Carbon Trading
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36870
This paper explores the use of artificial intelligence (AI) and public decentralized networks “for social good,” investigating their use in carbon trading. Carbon trading provides an important capability in support of the Kyoto Protocol, sponsored by the United Nations' efforts on climate change. We differentiate between voluntary and mandatory carbon markets. Unfortunately, there can be fraudulent trades in either type of market, such as reusing carbon credits multiple times, overstating the amount of a carbon credit, and falsely verifying carbon credits. We discuss the use of public decentralized networks and AI as approaches to facilitate carbon trading, focused primarily on voluntary markets. Our AI analysis includes the use of large language models, sentiment analysis, and GOFAI as we review recent potential approaches and developments in voluntary carbon trading.
Daniel E. O'Leary, Guido L. Geerts
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-23 | 7(1): 71-78 | DOI: 10.1609/aaaiss.v7i1.36870

Deciphering Trust: Multi-Modal Affective Analysis in Dynamic Decision-Making Games
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36871
Investigating how trust is built and maintained is especially important as technological advances make scams and fraud easier and quicker to enact. Fields such as neuroeconomics, psychology, and computer science have devoted considerable attention to the role that emotional expression plays in decision making, with many studies utilizing paradigms including trust games, negotiation games, and dilemma games to model real-world decision-making processes. Current research on player behavior and decision making typically isolates specific aspects, such as acts of betrayal by a trustee or the influence of emotional facial expressions. In contrast, the present study comprehensively examines both elements while incorporating automatic facial analysis, adding a source of multimodal affective data. This technology, which allows for real-time, objective, and non-intrusive data collection, has been piloted in a dynamic dyadic trust game environment, where setup and analysis were successful. The study builds a task framework based on current theories of emotion's role in decision-making, current models of predictive decision making, and the role that automatic facial analysis plays within both. We implement that framework to conduct a pilot study investigating human behavioral responses to affective expressions applied to a digital agent.
Darryl Roman, Haily Follese, Jordan Schotz, Shensheng Wang, Johnathan Mell, Nichole Lighthall
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-23 | 7(1): 79-86 | DOI: 10.1609/aaaiss.v7i1.36871

Revealing Abstract Arts with Feedback Induced Crowdsourcing to LLM Sourcing
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36872
Crowdsourcing has recently received major attention for solving creative tasks. Some creative problems are so abstract that they require repetitive interaction between the requester and crowd worker. Moreover, defining a ground truth for such creative tasks is a challenge. This paper aims to address the problem of revealing the content of abstract arts with a feedback-induced mechanism, both via crowdsourcing and LLM sourcing. As abstract arts are interpreted in different ways, it is interesting to elucidate their content through interaction. We propose an approach that employs a corrective feedback mechanism to enable the requester and crowd workers to interact. The effectiveness of this approach is demonstrated by annotating 30 abstract arts on a crowdsourcing platform. The results show that feedback motivates workers to provide detailed responses, interact further with the requester, and reveal more of the abstract content. Further sentiment analysis on the discussion data reflects the importance of corrective feedback in crowdsourcing. We further extend this by outsourcing the tasks to LLMs and observe better output. However, interesting challenges, such as hallucination and ethical participation by the LLMs, emerge through this.
Bijoly Saha Bhattacharya, Biswajit Mandal, Arindam Biswas, Malay Bhattacharyya
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-23 | 7(1): 87-94 | DOI: 10.1609/aaaiss.v7i1.36872

Auditing the Truth: A Pluralistic Framework for Disinformation Analysis
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36873
Disinformation costs the global economy an estimated $78 billion a year, fueling a frantic race to build AI fact-checkers. Yet, this arms race is creating a dangerous new problem: an unaccountable 'black box of truth' that delivers an authoritative answer without showing its work, further eroding public trust. The world is trying to build an AI referee to make the final call. This paper presents a radical alternative: instead of an AI referee, we need an AI auditor. This paper details the blueprint for a Pluralistic Framework that achieves this by integrating a community-driven Endorsement model with a Comprehensive Truth Verification engine powered by Dempster-Shafer theory. This approach synthesizes conflicting information from experts, officials, and the public to produce not an answer, but a transparent audit that makes the degree of consensus and conflict easy for anyone to understand. This is not just a better fact-checker; it is a framework for turning fact-checking from a private judgment into a public audit, a vital tool for rebuilding trust in our shared reality.
Dippu Kumar Singh, Praveen Chinapla Bharamappa
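Dempster's rule of combination, the core operation behind a Dempster-Shafer verification engine, over the two-hypothesis frame {true, false}: products of masses with empty intersections accumulate as conflict, and the remainder is renormalised. The mass assignments below are invented for illustration; "tf" is the ignorance mass on the whole frame.

```python
# Dempster's rule of combination for two evidence sources.
def dempster_combine(m1, m2):
    hypotheses = ["t", "f", "tf"]
    combined = {h: 0.0 for h in hypotheses}
    conflict = 0.0
    for a in hypotheses:
        for b in hypotheses:
            joint = m1[a] * m2[b]
            inter = set(a) & set(b)
            if not inter:
                conflict += joint                           # contradictory evidence
            else:
                key = "".join(sorted(inter, key="tf".index))
                combined[key] += joint
    return {h: v / (1 - conflict) for h, v in combined.items()}, conflict

expert = {"t": 0.7, "f": 0.1, "tf": 0.2}   # strong support for "true"
public = {"t": 0.4, "f": 0.3, "tf": 0.3}   # weaker, more conflicted source
masses, k = dempster_combine(expert, public)
print(masses, "conflict:", round(k, 3))    # conflict k is reported, not hidden
```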
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-23 | 7(1): 95-102 | DOI: 10.1609/aaaiss.v7i1.36873

Towards Fairer AI: Multi-Agent Debiasing of LLMs With Online Evidence Retrieval
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36874
Large Language Models (LLMs) routinely reproduce the social biases embedded in their training data. Existing mitigation techniques such as data augmentation, RLHF, and post hoc filtering often blunt model capabilities or overlook biased reasoning steps. We introduce MADERA (Multi-Agent Debiasing with External Retrieval and Assessment), a self-contained multi-agent framework that (i) diagnoses biased chains of thought, (ii) retrieves relevant web evidence through a search agent, and (iii) iteratively rewrites reasoning until bias is eliminated. We evaluate MADERA on the BBQ-Hard benchmark with four backbone LLMs: DeepSeek-R1, GPT-3.5-Turbo, GPT-4, and Claude-3 Haiku. Across ambiguous prompts it lifts accuracy by an average of +8 percentage points and cuts directional bias by −0.08, with GPT-4 showing the largest gain (0.71 → 0.96 ACC; −0.29 → −0.04 BIAS). Across disambiguated prompts, where models already perform near ceiling, the search agent produces only marginal changes in accuracy and bias. These findings confirm that external web grounding is a key driver of reasoning-level debiasing.
Mughees Ur Rehman, Saleha Muzammil
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-23 | 7(1): 103-108 | DOI: 10.1609/aaaiss.v7i1.36874

Influence of Gender-Specific Data Imbalance on scGPT Fine-Tuning for Single-Cell Genomics
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36914
The transformer-based foundation model scGPT has demonstrated strong capabilities in analyzing high-dimensional single-cell RNA sequencing data. However, the impact of demographic factors, particularly gender, on model performance remains insufficiently understood. Gender is known to influence cell type composition in the immune system. Here, using the gender-sensitive cell type composition of the immune system, we comprehensively evaluate how gender-specific imbalance in training data influences the performance of scGPT in cell type prediction. We fine-tune scGPT on male-only, female-only, and mixed-gender subsets from two large-scale datasets containing immune cells. We use a logit difference to measure the confidence gap between the true label and the actual model prediction; the confidence gap is zero for perfect classifications and negative for incorrect predictions. We find that training and testing configurations with aligned gender distributions generally show higher prediction confidence, while mismatched gender during training and testing, especially when training excludes one gender, leads to substantial confidence drops. We also find that training with mixed-gender data promotes more balanced generalization but does not eliminate all biases. We conclude that gender-specific data imbalance, represented by immune cell type subpopulation variation between women and men, can influence fine-tuning of scGPT and its performance in cell type classification, highlighting the importance of addressing such demographic biases in biomedical AI models.
Mohammad Aman Ullah Al Amin, Daniil Filienko, Hong Qin
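The logit-difference confidence gap described above takes two lines of NumPy: the true class's logit minus the maximum logit overall, giving 0 for correct predictions and a negative value for errors. The example logits are fabricated.

```python
# Confidence gap = logit(true class) - max logit.
import numpy as np

def confidence_gap(logits, true_idx):
    logits = np.asarray(logits, dtype=float)
    return float(logits[true_idx] - logits.max())

print(confidence_gap([2.1, 0.3, -1.0], true_idx=0))  #  0.0  (correct prediction)
print(confidence_gap([2.1, 0.3, -1.0], true_idx=1))  # -1.8  (misclassified)
```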
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-23 | 7(1): 420-427 | DOI: 10.1609/aaaiss.v7i1.36914

Predicting Variant Fitness of SARS-COV-2 from Full Viral Genome Sequences
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36915
Accurate prediction of the transmission fitness of emerging SARS-CoV-2 variants is vital for timely public health responses. In this study, we present a deep learning framework that predicts variant fitness from raw genomic sequences using a convolutional neural network (CNN) trained to regress Differential Population Growth Rate (DPGR) values. Our approach achieves a high predictive accuracy (R-squared of 0.92) on genomic sequences sampled from the USA and Europe. To interpret the model's predictions, we apply SHapley Additive exPlanations (SHAP) to identify nucleotide-level contributions to predicted fitness. Our analysis highlights key mutations in ORF9 (nucleocapsid), ORF2 (spike), ORF5 (membrane), and ORF8 that either enhance or reduce predicted DPGR. Notably, we identify amino acid-altering mutations such as D3L, E484K, N501Y, and V97I as strong positive contributors to fitness, while synonymous or non-coding mutations had more subtle or regulatory effects. These findings validate the potential of sequence-based modeling and interpretable AI to support early detection and prioritization of high-risk variants.
Richard Annan, Ursula Nkonu, Parisa Hatami, Md Jubair Pantho, Letu Qingge, Hong Qin
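The sequence-to-fitness setup can be sketched with a one-hot nucleotide encoder and a small 1D CNN regressing a scalar DPGR-like value. This PyTorch architecture and its dimensions are placeholders, not the paper's trained model.

```python
# One-hot DNA encoding plus a tiny Conv1D regressor producing one scalar.
import torch
import torch.nn as nn

NUC = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot(seq: str) -> torch.Tensor:
    x = torch.zeros(4, len(seq))
    for i, base in enumerate(seq):
        if base in NUC:                    # ambiguous bases stay all-zero
            x[NUC[base], i] = 1.0
    return x

model = nn.Sequential(
    nn.Conv1d(4, 16, kernel_size=9, padding=4), nn.ReLU(),
    nn.AdaptiveAvgPool1d(1), nn.Flatten(),
    nn.Linear(16, 1),                      # scalar fitness (DPGR-like) output
)
batch = torch.stack([one_hot("ACGTACGTAACCGGTT"), one_hot("TTGGCCAATACGTACG")])
print(model(batch).shape)                  # torch.Size([2, 1])
```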
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-23 | 7(1): 428-437 | DOI: 10.1609/aaaiss.v7i1.36915

MedPerturbing LLMs: A Comparative Study of Toxicity, Prompt Tuning, and Jailbreaks in Medical QA
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36916
Large Language Models (LLMs) are increasingly adopted across domains, including sensitive areas such as healthcare. However, their deployment raises significant safety concerns, particularly with respect to toxicity. In this paper, we evaluate the toxicity of widely used general-purpose LLMs in medical question-answering tasks. We investigate three complementary scenarios: (i) baseline querying, (ii) prompt guidelines designed to mitigate toxic outputs, and (iii) adversarial jailbreak prompting intended to elicit harmful content. To measure toxicity, we apply three established metrics to five LLMs ranging from 2B to 9B parameters, using MedPerturb, a dataset of medical questions systematically perturbed across gender, race, and age. Our results show that while carefully crafted guidelines can reduce toxic outputs and mitigate demographic biases, adversarial instructions are highly effective at bypassing safety mechanisms. Our evaluation reveals that all models exhibit limited resilience to jailbreak attacks, highlighting a critical vulnerability that restricts their safe deployment in clinical contexts. By answering three key questions ((1) what levels of toxicity these models exhibit in standard medical scenarios, (2) how far prompt tuning can reduce toxicity, and (3) how vulnerable they are to jailbreaks), our study provides a structured assessment of the risks and limitations of LLMs in healthcare and shows the importance of establishing robust guidelines and protections to promote their safe deployment and guard against harmful misuse.
Arash Asgari, Amirreza Naziri, Laleh Seyyed-Kalantari
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-23 | 7(1): 438-447 | DOI: 10.1609/aaaiss.v7i1.36916

Temporal Concept Tracing: Making Deep Learning Predictions Interpretable and Actionable for ICU Acute Kidney Injury Prevention
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36917
Deep learning models have demonstrated impressive accuracy in predicting acute kidney injury (AKI), a condition affecting up to 20% of ICU patients, yet their black-box nature prevents clinical adoption in high-stakes critical care settings. While existing interpretability methods like SHAP, LIME, and attention mechanisms can identify important features, they fail to capture the temporal dynamics essential for clinical decision-making and are unable to communicate when specific risk factors become critical in a patient's trajectory. This limitation is particularly problematic in the ICU, where the timing of interventions can significantly impact patient outcomes. We present a novel interpretable framework that brings temporal awareness to deep learning predictions for AKI. Our approach introduces three key innovations: (1) a latent convolutional concept bottleneck that learns clinically meaningful patterns from ICU time series without requiring manual concept annotation, leveraging Conv1D layers to capture localized temporal patterns like sudden physiological changes; (2) Temporal Concept Tracing (TCT), a gradient-based method that identifies not only which risk factors matter but precisely when they become critical, addressing the fundamental question of temporal relevance missing from current XAI techniques; and (3) integration with MedAlpaca to generate structured, time-aware clinical explanations that translate model insights into actionable bedside guidance. We evaluate our framework on MIMIC-IV data, demonstrating that our approach outperforms the existing explainability frameworks Occlusion and LIME in terms of comprehensiveness score, sufficiency score, and processing time. The proposed method also better captures risk-factor inflection points in patient timelines compared to conventional concept bottleneck methods, including dense-layer and attention-mechanism variants. This work represents the first comprehensive solution for interpretable temporal deep learning in critical care that addresses both the what and the when of clinical risk factors. By making AKI predictions transparent and temporally contextualized, our framework bridges the gap between model accuracy and clinical utility, offering a path toward trustworthy AI deployment in time-sensitive healthcare settings.
S M Saiful Islam Badhon, Serdar Bozdag, Mohammad Adibuzzaman, Ana D. Cleveland, Junhua Ding, KSM Tozammel Hossain
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-23 | 7(1): 448-455 | DOI: 10.1609/aaaiss.v7i1.36917

CORE-Coma: Deep Learning Framework for Coma Prognosis from Auditory Event-Related Potentials
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36918
Accurate prognosis of coma emergence is difficult because bedside behavioral scales can fail to detect residual consciousness. Auditory oddball event-related potentials (ERPs) offer a physiological readout, but single-component markers (e.g., MMN or P3) have limited sensitivity and generalizability. We present CORE-Coma, a deep learning framework for full-waveform ERP analysis, trained exclusively on healthy controls and evaluated zero-shot in coma patients. We analyzed ERPs from 39 healthy controls and 8 coma patients in the intensive care unit (ICU), segmenting EEG recordings into ~5-minute sub-blocks to capture temporal fluctuations. We define two complementary, model-derived metrics: a time-resolved ERP Separability Score (ESS) and a subject-level Global ERP Separability Index (GESI). Controls showed near-ceiling standard–deviant separability (ROC AUC=0.99), while separability was reduced in coma (ROC AUC=0.68). CORE-Coma identified all patients who emerged from coma (3/3; sensitivity 100%) and 4/5 patients who did not emerge (specificity 80%), yielding accuracy=87.5% (7/8). ESS revealed temporal fluctuations (waxing–waning) of responsiveness in coma at ~5-minute resolution, absent in controls. SHAP explanations localized influential features, including frontocentral electrodes and time windows consistent with canonical oddball components: 100–150 ms (N1/MMN) and 270–370 ms (P3a/P3b). By combining bedside-feasible scalp EEG with time-resolved and subject-level metrics, CORE-Coma offers an etiology-agnostic approach to coma prognosis. Prospective multicenter studies are needed to validate performance and support clinical deployment.
Elham Bagheri, Paniz Tavakoli, Adianes Herrera-Diaz, Rober Boshra, Richard Kolesar, Alison Fox-Robichaud, John F. Connolly, James Reilly
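A plausible reading of the time-resolved metric: per ~5-minute sub-block, the ROC AUC of a model score separating standard from deviant trials, with the subject-level index aggregated across blocks. Taking the mean as the aggregation is an assumption here, and all scores are simulated stand-ins for model outputs.

```python
# Per-block separability (ESS) and a simple subject-level aggregate (GESI).
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(3)

def ess(block_scores, block_labels):
    return roc_auc_score(block_labels, block_scores)

blocks = []
for _ in range(6):                                  # six sub-blocks of one recording
    labels = rng.integers(0, 2, size=80)            # 0 = standard, 1 = deviant tone
    scores = labels * 1.5 + rng.normal(size=80)     # separable but noisy scores
    blocks.append(ess(scores, labels))

print("ESS per block:", np.round(blocks, 2))        # waxing-waning shows up here
print("GESI (assumed mean):", round(float(np.mean(blocks)), 2))
```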
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-23 | 7(1): 456-465 | DOI: 10.1609/aaaiss.v7i1.36918

Category-Aware Fine-Tuning and Cross-Age Transferability in Image Memorability Prediction
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36919
Image memorability is highly consistent across observers, yet current vision models achieve only moderate accuracy and remain below human consistency. We study two questions: (i) whether making semantic category structure explicit during training improves prediction, and (ii) whether adult-trained predictors transfer to adolescents, and whether any gains from category-specific adaptation generalize across observers of different age. We compare a mixed-category model (All) with per-category fine-tuning (CatFT) for two pretrained backbones, MemNet (AlexNet-based CNN) and ViT-B/16 (Vision Transformer), each fine-tuned on MemCat under All and CatFT. Adult-trained models are evaluated on Memoir (adolescent labels) without additional training to assess transfer, and Grad-CAM is used to examine which regions drive predictions on the best model. On adults, category-aware training increases Spearman’s rho for both backbones (ViT-B/16: 0.548→0.592; MemNet: 0.429→0.477). Memorability prediction itself transfers across age even without category-specific fine-tuning (ViT-B/16: rho=0.456 with All), with a small additional adolescent gain from CatFT (to rho=0.471); MemNet remains stable on adolescents (rho=0.405 with or without CatFT). Grad-CAM highlights semantically meaningful regions for highly memorable images and more diffuse patterns for low-memorability images. Overall, incorporating category structure improves adult accuracy, cross-age generalization of memorability prediction is robust, and among the tested backbones, ViT-B/16 performs best, with CatFT providing modest transfer gains.
Elham Bagheri, Johann Cardenas, Yalda Mohsenzadeh
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-232025-11-237146647310.1609/aaaiss.v7i1.36919Embedding vs Image-Based AI: A Comparative Fairness Study in Chest X-ray Analysis
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36920
AI has shown remarkable potential in healthcare but faces accessibility challenges due to high computational and expertise demands, especially in medical image analysis. Vector embeddings, compact representations of medical images obtained from foundation models via zero-shot inference, offer a potential solution. Recently, an equivalent vector-embedding dataset of existing large, publicly available medical image collections has been released; training an AI model on it requires significantly less computing infrastructure and storage. Such datasets provide greater accessibility to AI in medical imaging for those without access to large computing resources. The burning question remains: what is gained or lost by using vector embeddings in place of medical images, particularly from a fairness and utility point of view? In this work, we compare AI models trained on vector embeddings (Emb) with models trained on raw chest radiograph images for disease diagnosis, focusing on both performance and fairness. Our results show that Emb-based models match or exceed image-based models in diagnostic performance while improving fairness. Crucially, Emb achieves this at far lower computational cost. These findings position Emb as a powerful, scalable alternative to image-based AI, especially valuable for low-resource settings where access to GPUs and expert infrastructure is limited.Gebreyowhans H. BahreHassan HamidiAndrew B. SellergrenLeo Anthony CeliFrancesco CalimeriLaleh Seyyed-Kalantari
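As a hedged illustration of why the Emb route is cheap: a linear classifier over precomputed embeddings, plus a simple subgroup probe, fits in a few lines. The embedding dimension, synthetic data, and TPR-gap probe below are assumptions, not the authors' pipeline:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 1376))   # stand-in for precomputed CXR embeddings
y = rng.integers(0, 2, size=1000)   # disease label (synthetic)
g = rng.integers(0, 2, size=1000)   # demographic group (synthetic)

clf = LogisticRegression(max_iter=1000).fit(X[:800], y[:800])
pred, yt, gt = clf.predict(X[800:]), y[800:], g[800:]

def tpr(mask):
    pos = (yt == 1) & mask
    return (pred[pos] == 1).mean() if pos.any() else float("nan")

print("AUC:", roc_auc_score(yt, clf.predict_proba(X[800:])[:, 1]))
print("TPR gap:", abs(tpr(gt == 0) - tpr(gt == 1)))   # one simple fairness probe
```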
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-232025-11-237147448010.1609/aaaiss.v7i1.36920Shard-Unlearn: A Sharded Elastic SGD Privacy Preserving Federated Unlearning Framework for 5G-assisted Healthcare
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36921
Smart healthcare systems are generating unprecedented volumes of sensitive data, making robust privacy preservation a critical requirement. Traditional machine unlearning (MU) techniques aim to excise specific data points and their statistical influence from trained machine learning (ML) models. However, they suffer from limited computational efficiency, poor scalability, and suboptimal model convergence when applied to large-scale, big-data (BD) healthcare environments. These limitations become even more significant in 5G-assisted settings, where real-time connectivity and rapid data processing are essential. To address these challenges, we introduce the concept of data sharding, which partitions healthcare datasets into manageable segments. In this paper, we introduce the Shard-Unlearn framework, which applies a federated unlearning (FU) process to the shards that contain sensitive data. This reduces the overall computational overhead and optimizes model convergence over 5G networks. Within the framework, we present an elastic stochastic gradient descent (SGD) optimization that effectively removes the targeted data and associated statistical perturbations from the local models. The framework is tested on the ADMISSIONS benchmark dataset, divided into 10 shards, and compared on computational efficiency, model robustness, and privacy-preservation metrics. Statistical findings reveal a 47.14% improvement in unlearning impact (as measured by recall) while striking a balanced trade-off between performance and data security. These results underscore the viability of the framework as a scalable and privacy-preserving solution for modern 5G-assisted healthcare systems.Sudip ChatterjeeSamyak JainPronaya BhattacharyaSandip RoySoumya BanerjeePratip RanaSachin Shetty
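The sharding idea can be illustrated with a SISA-style sketch: train one model per shard and honor a deletion request by retraining only the affected shard. This omits the elastic-SGD and federated/5G aspects of Shard-Unlearn and is an assumed simplification, not the paper's implementation:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

def train_shards(X, y, n_shards=10, seed=0):
    idx = np.array_split(np.random.default_rng(seed).permutation(len(X)), n_shards)
    return [SGDClassifier(loss="log_loss").fit(X[i], y[i]) for i in idx], idx

def unlearn(models, idx, X, y, forget_ids):
    """Retrain only the shards that contain points marked for deletion."""
    forget = set(forget_ids)
    for s, i in enumerate(idx):
        keep = np.array([j for j in i if j not in forget])
        if len(keep) < len(i):                     # shard touched by the request
            idx[s] = keep
            models[s] = SGDClassifier(loss="log_loss").fit(X[keep], y[keep])
    return models, idx                             # predict by voting over shard models
```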
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-232025-11-237148148710.1609/aaaiss.v7i1.36921Visual Gait Alignment for Sensorless Prostheses: Toward an Interpretable Digital Twin Framework
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36922
A safe and interpretable visual method for prosthetic alignment assessment is proposed, suitable for sensorless scenarios such as home rehabilitation and telemedicine. The method collects human skeletal data using a depth camera and extracts motion-difference characteristics of the left and right legs through gait symmetry analysis. Three clearly structured evaluation indicators are designed, namely differences in joint range of motion, differences in swing-phase duration, and angular-trajectory similarity, to construct an interpretable alignment scoring function. The system is designed as a front-end module of a digital twin system; its scoring results intuitively reflect differences in wearing status, facilitating real-time evaluation and adjustment of prosthetic alignment quality. Preliminary experiments have verified the stability and practicality of the method under visual recognition conditions, laying the foundation for personalized prosthetic optimization based on digital twins.Jingyang CuiFei HuGreg BerkeleyWeiqiang LyuXiangrong Shen
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-232025-11-237148849510.1609/aaaiss.v7i1.36922Fine-Tuning Large Language Models for Structured Clinical Report Generation Using GRPO
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36923
The generation of structured medical reports using large language models (LLMs) presents unique challenges, particularly in maintaining clinical relevance and adhering to strict formatting requirements. In this work, we investigate the effectiveness of fine-tuning LLMs for structured report generation using DeepSeek R1 models. We conduct experiments with two model variants: DeepSeek R1 8B and DeepSeek R1 14B. For both models, we apply Group Relative Policy Optimization (GRPO) using the Medical Information Mart for Intensive Care (MIMIC-IV) dataset, leveraging Low-Rank Adaptation (LoRA) for parameter-efficient fine-tuning. Our results show that the GRPO fine-tuned DeepSeek-R1 8B and 14B models outperformed all baseline models, including the larger 32B DeepSeek-R1 model, demonstrating the effectiveness of parameter-efficient tuning. These findings underscore the potential of reinforcement learning-based fine-tuning of LLMs for generating structured reports in the medical domain.Uday DevulapalliAarat SatsangiApurva Narayan
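The core of GRPO, as commonly formulated, is a group-relative advantage: sample several candidate reports per prompt, score them with a reward function (e.g., format and content checks for structured reports), and normalize rewards within the group. A minimal sketch of that step (not the authors' training code):

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """rewards: (n_prompts, group_size) -> advantages normalized within each group."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

rewards = torch.tensor([[0.2, 0.9, 0.5, 0.1]])  # one prompt, four sampled reports
print(grpo_advantages(rewards))                 # best completion gets the largest advantage
```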
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-232025-11-237149650010.1609/aaaiss.v7i1.36923Predicting Glucose Test Ordering in Hospitalized Patients Using Temporal Models of Clinical Context Embeddings
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36924
The overuse of laboratory tests is a persistent challenge in healthcare, driving unnecessary costs, patient discomfort, and low-value care. Glucose testing, one of the most common diagnostics, exemplifies this issue in hospital settings. We present a deep learning framework that integrates structured and unstructured electronic medical record data to predict whether a glucose test will be ordered in the next AM/PM time bin. Using multi-hospital data from the GEMINI dataset, we combine Long Short-Term Memory models with Clinical BioBERT embeddings to capture both the timing and clinical context of testing. On held-out test data, our best model achieved ROC-AUC of 0.92 and PR-AUC of 0.67, and generalized across sites in leave-one-hospital-out evaluation (ROC-AUC 0.84). Embedding-based models outperformed traditional feature representations, though adding more tests and vitals did not always yield further gains. By contrast, introducing a simple temporal recency cue (bin counter) improved performance. An exploratory regression task for predicting glucose values performed worse, likely due to class imbalance and reliance on forward-filled values; Random Forest achieved R^2 of 0.80 under masked evaluation, indicating a need for more frequent or diverse test data. Predicting laboratory test ordering is a first step toward evaluating the value of laboratory testing and establishes a foundation for future real-time decision support to reduce unnecessary lab use in hospitals.Joud El-ShawaElham BagheriAmol VermaYalda Mohsenzadeh
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-232025-11-237150150510.1609/aaaiss.v7i1.36924Towards Personalized Explanations for Health Simulations: A Mixed-Methods Framework for Stakeholder-Centric Summarization
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36925
Modeling & Simulation (M&S) approaches such as agent-based models hold significant potential to support decision-making activities in health, with recent examples including the adoption of vaccines and a vast literature on healthy eating and physical activity behaviors. These models are potentially usable by different stakeholder groups, as they support policy-makers in estimating the consequences of potential interventions and can guide individuals in making healthy choices in complex environments. However, this potential may not be fully realized because of the models' complexity, which makes them inaccessible to the stakeholders who could benefit the most. While Large Language Models (LLMs) can translate simulation outputs and the design of models into text, current approaches typically rely on one-size-fits-all summaries that fail to reflect the varied informational needs and stylistic preferences of clinicians, policymakers, patients, caregivers, and health advocates. This limitation stems from a fundamental gap: we lack a systematic understanding of what these stakeholders need from explanations and how to tailor them accordingly. To address this gap, we present a step-by-step framework to identify stakeholder needs and guide LLMs in generating tailored explanations of health simulations. Our procedure uses a mixed-methods design: it first elicits the explanation needs and stylistic preferences of diverse health stakeholders, then optimizes the ability of LLMs to generate tailored outputs (e.g., via controllable attribute tuning), and finally evaluates the results through a comprehensive range of metrics to further improve the tailored generation of summaries.Philippe J. GiabbanelliAmeeta Agrawal
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-232025-11-237150651510.1609/aaaiss.v7i1.36925Data-Aware Layer Assignment for Secure and Efficient Communication in Federated Learning for Medical Image Analysis
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36926
Cross-silo medical imaging federations must contend with strict privacy, limited bandwidth, and non-identically distributed (non-IID) data that destabilize training. Current federated learning (FL) architectures either carry the full model (e.g., FedAvg/FedProx) or use naive client/layer pruning and random sampling while ignoring both non-IID heterogeneity and per-layer utility. To address these limitations, this paper presents a data-aware, layer-wise protocol that aligns communication with expected loss descent while bounding per-round client leverage. Each round, the server estimates per-layer influence from a tiny root set, and clients expose lightweight metadata to form data-quality scores. A capacity-constrained entropic transport matches high-influence layers to high-quality clients under redundancy and temporal-coverage constraints. Clients train all layers but upload exactly one, following a train-all, send-one principle. The server then performs per-layer robust aggregation on masked updates via secure aggregation. On three cross-silo imaging benchmarks, Pneumonia CXR, Brain-Tumor MRI, and ISIC Skin Cancer, the protocol demonstrates strong threshold-free detection quality (AUROC/AUPRC: 0.925/0.935, 0.996/0.988, and 0.834/0.852, respectively) while reducing the per-round uplink to ≈ 1/n of FedAvg's (e.g., ≈ 10× smaller with 10 clients) by receiving only one layer per client, indicating its viability for deployment-grade secure aggregation in hospital networks.Sai Sriram GonthinaSandip RoyPronaya BhattacharyaPratip RanaSachin Shetty
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-232025-11-237151652310.1609/aaaiss.v7i1.36926One Pixel Can Change the Diagnosis: Adversarial and Non-Adversarial Robustness and Uncertainty in Breast Ultrasound Classification Model
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36927
Deep learning models have strong potential for automating breast ultrasound (BUS) image classification to support early cancer detection. However, their vulnerability to small input perturbations poses a challenge for clinical reliability. This study examines how minimal pixel-level changes affect classification performance and predictive uncertainty, using the BUSI dataset and a ResNet-50 classifier. Two perturbation types are evaluated: (1) adversarial perturbations via the One Pixel Attack and (2) non-adversarial, device-related noise simulated by setting a single pixel to black. Robustness is assessed alongside uncertainty estimation using Monte Carlo Dropout, with metrics including Expected Kullback–Leibler divergence (EKL), Predictive Variance (PV), and Mutual Information (MI) for epistemic uncertainty, and Maximum Class Probability (MP) for aleatoric uncertainty. Both perturbations reduced accuracy, producing 17 and 29 “fooled” test samples, defined as cases classified correctly before but incorrectly after perturbation, for the adversarial and non-adversarial settings, respectively. Samples that remained correct are referred to as “unfooled.” Across all metrics, uncertainty increased after perturbation for both groups, and fooled samples had higher uncertainty than unfooled samples even before perturbation. We also identify spatially localized “uncertainty-decreasing” regions, where individual single-pixel blackouts both flipped predictions and reduced uncertainty, creating overconfident errors. These regions represent high-risk vulnerabilities that could be exploited in adversarial attacks or addressed through targeted robustness training and uncertainty-aware safeguards. Overall, combining perturbation analysis with uncertainty quantification provides valuable insights into model weaknesses and can inform the design of safer, more reliable AI systems for BUS diagnosis.Kuan HuangNoorul SahelDikshya KarkiMeng XuYingfeng Wang
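Both probes are simple to state in code. Below is a hedged sketch of the non-adversarial single-pixel blackout and of Monte Carlo Dropout mutual information (predictive entropy minus expected entropy); `model` is any dropout-bearing classifier returning logits, an assumption rather than the paper's exact ResNet-50 setup:

```python
import torch
import torch.nn.functional as F

def blackout_pixel(img: torch.Tensor, y: int, x: int) -> torch.Tensor:
    out = img.clone()
    out[..., y, x] = 0.0            # simulate a dead/black pixel across channels
    return out

@torch.no_grad()
def mc_dropout_mi(model, img: torch.Tensor, T: int = 30) -> float:
    model.train()                   # keep dropout active at inference time
    probs = torch.stack([F.softmax(model(img.unsqueeze(0)), dim=-1)[0]
                         for _ in range(T)])
    mean_p = probs.mean(0)
    pred_entropy = -(mean_p * mean_p.clamp_min(1e-12).log()).sum()
    exp_entropy = -(probs * probs.clamp_min(1e-12).log()).sum(-1).mean()
    return (pred_entropy - exp_entropy).item()   # epistemic uncertainty (MI)
```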
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-232025-11-237152452910.1609/aaaiss.v7i1.36927Filtered-ViT: A Robust Defense Against Multiple Adversarial Patch Attacks
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36928
Deep learning vision systems are increasingly deployed in safety-critical domains such as healthcare, yet they remain vulnerable to small adversarial patches that can trigger misclassifications. Most existing defenses assume a single patch and fail when multiple localized disruptions occur, the type of scenario adversaries and real-world artifacts often exploit. We propose Filtered-ViT, a new vision transformer architecture that integrates SMART Vector Median Filtering (SMART-VMF), a spatially adaptive, multi-scale, robustness-aware mechanism that enables selective suppression of corrupted regions while preserving semantic detail. On ImageNet with LaVAN multi-patch attacks, Filtered-ViT achieves 79.8% clean accuracy and 46.3% robust accuracy under four simultaneous 1% patches, outperforming existing defenses. Beyond synthetic benchmarks, a real-world case study on radiographic medical imagery shows that Filtered-ViT mitigates natural artifacts such as occlusions and scanner noise without degrading diagnostic content. This establishes Filtered-ViT as the first transformer to demonstrate unified robustness against both adversarial and naturally occurring patch-like disruptions, charting a path toward reliable vision systems in truly high-stakes environments.Aja KhanalAhmed FaidApurva Narayan
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-232025-11-237153053810.1609/aaaiss.v7i1.36928Conformal Prediction and Verification of Large Language Model Extractions in EHR Data
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36929
While Electronic Health Records (EHRs) promise comprehensive documentation of patient care, in reality there are significant challenges in data reliability and utilization. EHRs contain vast amounts of unstructured clinical narratives that, despite containing critical and relevant medical information, remain difficult to systematically extract and verify. Recent advances in large language models (LLMs) offer steadily improving capabilities for extracting structured information from clinical notes, yet these approaches raise fundamental questions about output reliability and over-confident token predictions, and they provide no guarantees (statistical or otherwise) for downstream clinical applications. In this work, we present a conformal verification framework for unstructured EHR data extraction using generative AI. While LLMs have increasingly impressive capabilities, they are notoriously miscalibrated and overconfident in their predictions, necessitating rigorous verification methods that remove the need for blind trust in AI models. Our approach (i) employs LLMs to extract medical entities and concepts from clinical narratives with LLM-as-a-judge verification, (ii) implements probabilistic calibration to quantify extraction confidence, and (iii) applies conformal prediction to provide finite-sample guarantees on error rates for accepted extractions. We evaluate our framework on 10k clinical visits across 898 clinical practices utilizing three different EHR systems. Our conformal verification approach can provide assurances that the future expected proportion of accepted but incorrect extractions remains below a pre-specified risk level with rigorous statistical verification. It also maintains formal guarantees over clinical data quality, and illuminates the miscalibrations present in state-of-the-art LLM models, underscoring the need for additional validation before automated extraction systems can be safely deployed.Edward KimRichard FotyManil ShresthaVicki Seyfert-Margolis
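One way to realize the acceptance rule, sketched under assumptions (the paper's exact calibration procedure may differ): on held-out (confidence, correct) pairs, find the lowest confidence threshold whose finite-sample-adjusted error rate among accepted items stays below the risk level, then accept future extractions above it:

```python
import numpy as np

def calibrate_threshold(conf: np.ndarray, correct: np.ndarray, alpha: float = 0.05):
    """conf: model confidence per extraction; correct: boolean array."""
    order = np.argsort(-conf)                 # most confident first
    conf, correct = conf[order], correct[order]
    threshold = None
    for k in range(1, len(conf) + 1):
        errors = (~correct[:k]).sum()
        if (errors + 1) / (k + 1) <= alpha:   # conformal-style finite-sample adjustment
            threshold = conf[k - 1]           # lowest confidence still admissible
    return threshold

# accept an extraction iff its confidence clears the calibrated threshold
accept = lambda c, t: t is not None and c >= t
```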
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-232025-11-237153954610.1609/aaaiss.v7i1.36929Transfer Learning for Subject-Independent Sleep Deprivation Detection from Resting-State EEG
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36930
Sleep deprivation (SD) impairs cognition and heightens safety risks, yet reliable electroencephalography (EEG)-based detection remains challenging in low-data settings. We evaluated transfer learning with a compact Convolutional Neural Network (CNN) (EEGNetv4) to classify SD versus well-rested wakefulness using an open-source EEG dataset containing eyes-open resting-state data from 71 healthy young adults. EEGNetv4 was initialized with publicly available weights pretrained on an Event-Related Potential (ERP) dataset. Shape-compatible layers were transferred and frozen, with the remaining layers trained on the target data. Baselines comprised EEGNetv4, a bidirectional Long Short-Term Memory (LSTM), and a Transformer model, each trained without pretraining. Five-fold subject-independent cross-validation was used to evaluate model performance. EEGNetv4 with transfer learning achieved the highest mean accuracy (70.79% ± 4.17), outperforming EEGNetv4 trained from scratch (65.75% ± 5.48), the Transformer (63.35% ± 2.78), and the LSTM (61.70% ± 3.20). These findings suggest that leveraging pretrained EEG representations can enhance subject-generalizable SD classification in small-sample contexts, supporting transfer learning as a pragmatic strategy for neurophysiological applications.Daya KumarUday DevulapalliSaptharishi Lalgudi GanesanApurva Narayan
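The transfer recipe described (copy shape-compatible pretrained weights, freeze them, train the rest) is generic and can be sketched directly in PyTorch; `pretrained` and `target` stand for EEGNet-style modules and are assumptions:

```python
import torch

def transfer_and_freeze(pretrained: torch.nn.Module, target: torch.nn.Module):
    src, dst = pretrained.state_dict(), target.state_dict()
    copied = {k: v for k, v in src.items()
              if k in dst and v.shape == dst[k].shape}   # shape-compatible layers only
    dst.update(copied)
    target.load_state_dict(dst)
    for name, p in target.named_parameters():
        if name in copied:
            p.requires_grad = False                      # freeze transferred weights
    return target, sorted(copied)                        # remaining layers stay trainable
```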
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-232025-11-237154755210.1609/aaaiss.v7i1.36930From Bias to Breakdown: Benchmarking Failure Mode Analysis of Single-cell RNA Sequencing Foundation Models in Acute Myeloid Leukemia
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36931
Foundation models (FMs) trained on large-scale single-cell RNA-seq (scRNA-seq) data have shown strong performance across various biological tasks. Performance is often reported in aggregate over large benchmark suites and across all samples. However, the pretraining data of these models are often highly imbalanced across disease types, patient conditions, and demographics. For instance, disease samples are rarer and more challenging to collect, so pretraining sets contain many more healthy cells. Such imbalances can hurt performance on underrepresented disease cases and the equity of model outcomes. To evaluate this hypothesis, we benchmark off-the-shelf scRNA-seq foundation models for cell-type classification in acute myeloid leukemia (AML), a rare but clinically important disease that represents low-prevalence settings. Beyond overall performance, we conduct subgroup analyses of the outcomes across cell types and disease conditions (clinical timepoints). Our results suggest that despite high overall F1 scores in cell-type classification, performance drops under disease conditions and varies across cell types. These findings highlight a limitation of current scRNA-seq foundation models and motivate more balanced pretraining and failure mode analysis rather than overall performance reporting alone.Amirreza NaziriArash AsgariAijun AnEleftherios SachlosLaleh Seyyed-Kalantari
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-232025-11-237155355710.1609/aaaiss.v7i1.36931Towards Reliable Lung Cancer Prediction: A Hybrid Framework for Noise Reduction and Uncertainty Control
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36932
Uncertainty remains a critical challenge in healthcare AI, since predictive errors can directly compromise patient safety and undermine trust. Structured clinical datasets in healthcare are frequently characterized by heterogeneous acquisition protocols, incomplete records, and inconsistent or noisy encodings. This inflates aleatoric uncertainty and weakens calibration. These challenges are exemplified in lung cancer risk modeling, where small cohorts, variable collection practices, and limited feature quality make the problem especially acute. Significant advances in uncertainty quantification (UQ) have been achieved in imaging and signal processing through Bayesian inference, evidential learning, and robust architectural designs. In contrast, tabular clinical datasets remain a critical yet underexplored domain. Addressing this gap requires methods that are lightweight, certifiable, and effective on noisy datasets without relying on large models or data. Considering these challenges, we propose a frequency-aware hybrid representation that combines Principal Component Analysis (PCA) with the Discrete Cosine Transform (DCT). Using mutual information (MI)-based feature ordering, the framework suppresses high-frequency artifacts while preserving discriminative structure. Applied to a publicly available lung cancer dataset, the framework improved accuracy from 98.1% to 99.7%, reduced Negative Log-Likelihood (NLL) by 82%, from 5.25% to 0.94%, lowered aleatoric uncertainty from 10.50% to 3.35% (a 68% reduction), and preserved AUROC at 99%. We further evaluated the framework across three publicly available lung cancer datasets, where it reduced aleatoric uncertainty by 7% on average, confirming generalizability. The Wilcoxon signed-rank test confirms that the results are statistically significant. This work shows that part of the ‘irreducible’ variability is actually compressible noise, thereby facilitating more reliable and uncertainty-aware AI for healthcare.Sourojit PalSandip RoyPratip RanaAvishek BanerjeeKoushik MajumderSachin Shetty
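A hedged sketch of the frequency-aware hybrid representation as described: order features by mutual information, apply a DCT and suppress the high-frequency tail, and concatenate with PCA components. The cut-off fraction and component count are illustrative assumptions:

```python
import numpy as np
from scipy.fft import dct, idct
from sklearn.decomposition import PCA
from sklearn.feature_selection import mutual_info_classif

def hybrid_features(X, y, n_pca=8, keep_frac=0.5):
    order = np.argsort(-mutual_info_classif(X, y, random_state=0))
    C = dct(X[:, order], axis=1, norm="ortho")   # DCT over MI-ordered features
    C[:, int(C.shape[1] * keep_frac):] = 0.0     # suppress high-frequency artifacts
    X_dct = idct(C, axis=1, norm="ortho")
    X_pca = PCA(n_components=n_pca).fit_transform(X)
    return np.hstack([X_pca, X_dct])             # hybrid PCA + filtered-DCT features
```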
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-232025-11-237155856510.1609/aaaiss.v7i1.36932Efficient Context Retention in LLMs: Enhancing In-Context Memorization as an Alternative
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36933
Large Language Models (LLMs) are widely utilized for tasks requiring contextual understanding; however, their reliance on large context windows introduces significant computational overhead due to the transformer's quadratic complexity. This inefficiency is a critical barrier to their deployment in resource-constrained settings like rural healthcare, where processing longitudinal patient data from Electronic Health Records (EHRs) is essential. To address this, our research investigates an alternative paradigm: training lightweight, specialized models for complete knowledge internalization, enabling them to function as persistent and efficient knowledge bases on local hardware. Our methodology involves training a 12-layer, 124-million-parameter nanoGPT model de novo on specialized subsets of the MMLU benchmark, including domains relevant to healthcare. The training objective was explicitly data internalization, not generalization. The entire domain-specific corpus, consisting of over 250,000 tokens formatted for a question-and-answer recall task, was used for training until the model achieved near-zero training loss. Performance was then evaluated on the model's ability to perfectly reproduce answers from a "seen" validation set, with recall certainty quantified via softmax probabilities. The resulting models successfully internalized their respective knowledge domains, achieving near-100% accuracy on recall tasks with high confidence scores. This outcome validates that targeted training for memorization can produce reliable and computationally efficient expert agents. For rural health, this approach offers a practical alternative to large context windows, enabling the deployment of a fleet of specialized models on local hardware for tasks like patient history recall or clinical guideline retrieval. This drastically reduces computational costs and latency, providing a scalable solution without requiring continuous, high-bandwidth cloud access.Bansari PatelEdward Kim
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-232025-11-237156656610.1609/aaaiss.v7i1.36933C2BM: Causal Concept Disentanglement for Fair Multimodal COVID-19 Detection
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36934
Algorithmic bias in COVID-19 detection systems poses a serious threat to equitable pandemic response, as demographic disparities in model performance risk worsening health outcomes across vulnerable populations. We present an adapted Causal Concept Bottleneck Model (C2BM) framework that systematically addresses fairness in multimodal COVID-19 detection by learning interpretable concepts from chest CT scans and patient metadata. Our approach targets the Country → Institution → COVID causal pathway through principled interventions, achieving substantial bias reduction: age and gender demographic parity differences decrease from 51.15% to 18.50% (64% reduction), gender disparate impact improves from 0.6475 to 0.9812 (51% improvement), while preserving 98.45% diagnostic F1-score. Through comprehensive evaluation across four model variants, we demonstrate that causal interventions enable stable and reproducible fairness improvements without compromising clinical utility. Our work establishes that principled causal reasoning can achieve practical fairness-accuracy trade-offs in COVID-19 detection systems, providing actionable guidance for equitable healthcare AI deployment.Letu QinggeHailemicael Lulseged YimerMaxwell SamRichard AnnanRobert NewmanHong Qin
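The two fairness quantities reported above are standard and easy to compute; a minimal sketch (synthetic inputs, not the study's data):

```python
import numpy as np

def demographic_parity_difference(pred, group):
    rates = [pred[group == g].mean() for g in np.unique(group)]
    return max(rates) - min(rates)          # 0.0 is perfectly fair

def disparate_impact(pred, group):
    rates = [pred[group == g].mean() for g in np.unique(group)]
    return min(rates) / max(rates)          # 1.0 is perfectly fair

pred = np.array([1, 0, 1, 1, 0, 1])         # binary predictions
group = np.array([0, 0, 0, 1, 1, 1])        # demographic attribute
print(demographic_parity_difference(pred, group), disparate_impact(pred, group))
```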
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-232025-11-237156757510.1609/aaaiss.v7i1.36934Preventing Another Tessa: Modular Safety Middleware for Health-Adjacent AI Assistants
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36935
In 2023, the National Eating Disorders Association’s (NEDA) chatbot Tessa was suspended after providing harmful weight-loss advice to vulnerable users—an avoidable failure that underscores the risks of unsafe AI in healthcare contexts. This paper examines Tessa as a case study in absent safety engineering and demonstrates how a lightweight, modular safeguard could have prevented the incident. We propose a hybrid safety middleware that combines deterministic lexical gates with an in-line large language model (LLM) policy filter, enforcing fail-closed verdicts and escalation pathways within a single model call. Using synthetic evaluations, we show that this design achieves perfect interception of unsafe prompts at baseline cost and latency, outperforming traditional multi-stage pipelines. Beyond technical remedies, we map Tessa’s failure patterns to established frameworks (OWASP LLM Top 10; NIST SP 800-53), connecting practical safeguards to actionable governance controls. The results highlight that robust, auditable safety in health-adjacent AI does not require heavyweight infrastructure: explicit, testable checks at the last mile are sufficient to prevent “another Tessa,” while governance and escalation ensure sustainability in real-world deployment.Pavan ReddyNithin Reddy
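The fail-closed control flow is the essential point and can be sketched in a few lines; the blocklist terms and the `llm_policy_verdict` stub below are illustrative assumptions standing in for the single in-line model call described:

```python
BLOCKLIST = ("calorie deficit", "lose weight fast", "restrict intake")  # illustrative terms

def llm_policy_verdict(text: str) -> str:
    """Stub for the in-line LLM policy filter; returns 'safe' or 'unsafe'."""
    raise NotImplementedError

def middleware(user_msg: str, draft_reply: str) -> str:
    if any(term in draft_reply.lower() for term in BLOCKLIST):
        return "ESCALATE"                    # deterministic lexical gate fires first
    try:
        verdict = llm_policy_verdict(user_msg + "\n---\n" + draft_reply)
    except Exception:
        return "ESCALATE"                    # any failure or ambiguity fails closed
    return "ALLOW" if verdict == "safe" else "ESCALATE"
```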
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-232025-11-237157658310.1609/aaaiss.v7i1.36935Hermes: A Modular Multi-Agent System for Structuring Clinical Text
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36936
Unstructured information can be overwhelming and difficult to interpret, particularly in safety-critical domains such as healthcare, where voluminous and complex textual notes must be made interpretable, insightful, and amenable to automated processing. This paper introduces Hermes, a modular agentic system that transforms unstructured clinical text into a modified version of the Subjective-Objective-Assessment-Plan (SOAP) format and generates a knowledge graph offering a high-level, distilled view that facilitates downstream clinical reasoning and decision-making. Hermes employs a multi-agent architecture consisting of four specialized components: Hermes-R (report generation), Hermes-G (knowledge graph generation), Hermes-Q (question-answer pair generation), and Hermes-A (answer generation). These agents operate sequentially with validation to generate structured medical information using iterative refinement. Preliminary evaluations on a small set of samples demonstrate that Hermes generates structured clinical reports and knowledge graphs from unstructured discharge summaries according to the provided specifications, with good consistency, accuracy, and reward scores. Hermes offers a unified framework that advances clinical natural language processing, bridging structured representation, question answering, and semantic validation.Aarat SatsangiJoud El-ShawaUday DevulapalliApurva Narayan
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-232025-11-237158458910.1609/aaaiss.v7i1.36936Confidence Calibration in Large Language Models for Uncertainty Quantification: Affecting Calibration with Conditional Weight Updates
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36937
In any medical application of Large Language Models (LLMs), it is critical to have accurate uncertainty quantification, as well as control over the over- and under-confidence of the model. Current fine-tuning (FT) methods lack this control, partly because they fail to account for the fact that repeated exposure to a fact does not make it more correct. We propose a revised FT method that updates model weights only when the model does not sufficiently “know” an answer. We fine-tuned Meta's Llama 3.2 1B-parameter model on the MMLU multiple-choice dataset using traditional FT for a Control Model and Conditional Update FT for an Experimental Model. The tuned models showed different results, with the Control Model showing greater overconfidence and the Experimental Model showing greater under-confidence compared to the Base Model. Additionally, the Experimental Model showed a more even distribution of confidence scores, which is advantageous for post-calibration. This method for affecting confidence calibration while fine-tuning LLMs may help with the broader challenge of creating reliable and trustworthy LLMs.Sophia SomersEdward Kim
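A hedged sketch of the conditional update: compute the model's probability of the gold answer token and zero the loss wherever that probability already exceeds a "knowing" threshold, so repeated exposure to known facts stops sharpening overconfidence. The threshold value and single-token framing are assumptions:

```python
import torch
import torch.nn.functional as F

def conditional_update_loss(logits, targets, know_threshold=0.9):
    """logits: (batch, vocab) at the answer position; targets: (batch,) token ids."""
    p_gold = F.softmax(logits, dim=-1).gather(1, targets.unsqueeze(1)).squeeze(1)
    per_example = F.cross_entropy(logits, targets, reduction="none")
    mask = (p_gold < know_threshold).float()    # update only insufficiently known answers
    return (per_example * mask).sum() / mask.sum().clamp_min(1.0)
```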
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-232025-11-237159059310.1609/aaaiss.v7i1.36937Duty of Care: A Call for Open and Responsible AI Innovation in Healthcare
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36938
Recent advances in AI, especially those of LLMs, bring the prospect of increased adoption of AI in medicine and medical education. In particular, many institutions responsible for medical treatment and education are rapidly aiming to increase AI use in practice and curricula. However, the potential downsides of overuse of AI in these fields are under-discussed. In the rush to AI adoption, sources of healthcare risk such as LLM reliability, patient privacy, financial and environmental costs, vendor dependencies, and AI over-reliance are often not deeply considered. This paper discusses these recent trends and makes recommendations for healthcare institutions considering further adoption of AI.Jonathan S. Takeshita
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-232025-11-237159459810.1609/aaaiss.v7i1.36938Conformal Risk Control for Semantic Uncertainty Quantification in Computed Tomography
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36939
Uncertainty quantification is necessary for developers, physicians, and regulatory agencies to build trust in machine learning predictors and improve patient care. Beyond measuring uncertainty, it is crucial to express it in clinically meaningful terms that provide actionable insights. This work introduces a conformal risk control (CRC) procedure for organ-dependent uncertainty estimation, ensuring high-probability coverage of the ground-truth image. We first present a high-dimensional CRC procedure that leverages recent ideas of length minimization. We make this procedure semantically adaptive to each patient's anatomy and positioning of organs. Our method, semCRC, provides tighter uncertainty intervals with valid coverage on real-world computed tomography data while communicating uncertainty with clinically relevant features.Jacopo TeneggiJ. Webster StaymanJeremias Sulam
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-232025-11-237159960510.1609/aaaiss.v7i1.36939Adaptive Explanations via Direct Preference Optimization
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36940
Machine learning explainability aims to make the decision-making process of black-box models more transparent by finding the most important input features for a given prediction task. Recent works have proposed composing explanations from semantic concepts (e.g., colors, patterns, shapes) that are inherently interpretable to the user of a model. However, these methods generally ignore the communicative context of explanation, that is, the ability of the user to understand the prediction of the model from the explanation. For example, while a medical doctor might understand an explanation in terms of clinical markers, a patient may need a more accessible explanation to make sense of the same diagnosis. In this work, we address this gap with listener-adaptive explanations. We propose an iterative procedure grounded in principles of pragmatic reasoning and the Rational Speech Act framework to generate explanations that maximize communicative utility, and we evaluate our method on classification of lung X-rays. Our procedure only needs access to pairwise preferences between candidate explanations, which is relevant in real-world scenarios where a listener model may not be available.Jacopo TeneggiZhenzhen WangPaul H. YiTianmin ShuJeremias Sulam
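The pairwise preference signal feeds the standard Direct Preference Optimization objective named in the title; a sketch with its usual form (log-probabilities of the listener-preferred explanation w and the dispreferred one l, under the policy and a frozen reference model):

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # push the policy toward the preferred candidate, regularized by the reference
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -F.logsigmoid(margin).mean()

# batch of two preference pairs (illustrative numbers)
print(dpo_loss(torch.tensor([-4.0, -3.5]), torch.tensor([-5.0, -3.0]),
               torch.tensor([-4.2, -3.6]), torch.tensor([-4.8, -3.1])))
```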
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-232025-11-237160661110.1609/aaaiss.v7i1.36940How Missing Medication Data Contributes to Bias in Alzheimer’s Disease Machine Learning Models
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36941
Alzheimer's disease (AD) is the most common cause of dementia, yet many cases go undiagnosed due to limited access to expensive brain scans and lab tests. This study investigated whether medication data could help identify AD. Using data from 1,785 participants in the US-representative National Health and Nutrition Examination Survey 2013–2014, we identified 105 individuals (5.9%) with memory test scores suggesting possible AD. We evaluated seven machine learning models using medication features. Models that incorporated contextual prescription information, including the reasons for medication use and conditions being treated, achieved the best performance (area under the receiver operating characteristic curve [AUC] 0.61–0.63). In contrast, models using only basic drug names or provider information performed poorly (AUC 0.46–0.51). This performance difference was statistically significant (t = 14.98, p < 0.0001). Our findings suggest that medication data, when analyzed with attention to clinical context, could serve as a low-cost tool for identifying individuals at risk of AD. This approach may help address diagnostic disparities in settings with limited access to advanced testing.Earlel ThiyagaratnamHawwa KhanApurva Narayan
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-232025-11-237161261910.1609/aaaiss.v7i1.36941Toward Preventive Alzheimer’s Risk Screening with Cell-Subtype-Aware Brain-Blood Graph Neural Network
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36942
Early Alzheimer’s disease (AD) pathology begins decades before symptoms emerge, yet over 75% of the at-risk population lacks access to non-invasive screening methods. Current diagnostic tools like PET imaging and cerebrospinal fluid sampling are costly, invasive, and poorly suited for large-scale, proactive brain health monitoring. This research introduces a cell-subtype-aware brain-blood gene modeling framework that reframes AD assessment from reactive diagnosis to preventive risk evaluation for sustained cognitive health. Using graph neural networks, blood RNA-seq profiles are anchored to sex-specific, single-cell brain transcriptomics across neuronal layers, enhancing biological fidelity and interpretability. Explicit control of APOE4 genotype, age, sex, and education preserves meaningful variation while suppressing systemic noise. Gene set enrichment analysis confirmed pathways in neurodegeneration, inflammation, oxidative phosphorylation, and sensory function, with brain-derived signals reproducibly detected in blood. Sex-stratified analyses revealed female-specific signatures linked to addiction and mood regulation, pathogen-driven immune responses, and nutrient-based neuroprotection. This research identifies a blood-based gene panel for AD risk: GFAP, TREM2, C1QC, C1QB, PLCG2, TXNIP, CD163, CAMK1D, DAPK1, CCND3, LRP10, and COQ10A. By coupling fine-grained brain biology with interpretable AI, this work enables equitable, population-scale early risk identification, supporting proactive interventions to maintain cognitive function and delay disease onset.Claire Xu
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-232025-11-237162062710.1609/aaaiss.v7i1.36942Evaluating Uncertainty in Deep Q-Network Ensembles for Trustworthy Anomaly Detection in Medical Imaging
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36943
Reliable anomaly detection is crucial for safe AI deployment in clinical imaging, yet most systems offer limited insight into prediction uncertainty or failure modes—key factors in medical decision-making. We analyze the uncertainty characteristics of a patch-level Deep Q-Network anomaly detection framework (DQN_AD) for brain MRI, trained with few annotated abnormal cases and designed to generalize to highly imbalanced clinical datasets. Our study links uncertainty to model errors, calibration, anomaly scores, spatial correspondence with ground truth, and selective evaluation. Results show that high-uncertainty predictions consistently coincide with error-prone regions, providing a strong signal for identifying potential failures. This study establishes the foundation for uncertainty-aware, reinforcement learning–based anomaly detection models that enhance reliability, interpretability, and clinical usability in large-scale MRI analysis.Zeduo ZhangYalda Mohsenzadeh
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-232025-11-237162863210.1609/aaaiss.v7i1.36943Data Drift Detection and Assessment for AI-hybrid Models Applied on Electrical Energy Consumption
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36875
Data drift evaluation is crucial during the operational phase of industrial systems. In the real world, several drift types, which may arise from both input and output data distributions, usually contribute to detected drift. In addition, the application context and the interpretation of these drift types add complexity to drift analysis. In this work, we apply drift detection in the specific domain of electrical transmission network systems. Three drift types, covariate, label, and concept drift, are considered and implemented on systems based on Physics-Informed Neural Networks (PINNs). The experimental results show the impact of each drift type and the evolution of their contributions when drift occurs in the industrial system. A contextual interpretation of the obtained results is also developed in this specific application domain for the three drift types.Faouzi AdjedMilad Leyli-AbadiElies GherbiMartin Gonzalez
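For intuition, the three drift types map to three simple checks; a minimal sketch using two-sample Kolmogorov-Smirnov tests per signal (thresholds are illustrative, and a real system would test each input feature separately rather than one scalar):

```python
import numpy as np
from scipy.stats import ks_2samp

def drift_report(x_ref, x_new, y_ref, y_new, yhat_ref, yhat_new, p=0.01):
    return {
        "covariate": ks_2samp(x_ref, x_new).pvalue < p,    # input distribution shift
        "label":     ks_2samp(y_ref, y_new).pvalue < p,    # output distribution shift
        "concept":   ks_2samp(np.abs(y_ref - yhat_ref),    # input-output relation shift,
                              np.abs(y_new - yhat_new)).pvalue < p,  # seen via residuals
    }
```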
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-232025-11-237111011710.1609/aaaiss.v7i1.36875ZAAS: Zonal Aware Anomaly Score for Time Series
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36876
Time series anomaly detection plays a critical role across domains, from industrial monitoring to cybersecurity, but its evaluation remains challenging. The traditional window-level F1 score overweights long anomaly intervals, while heuristic "point-adjusted" variants introduce bias by extending a single detection across an entire zone. We propose a Zone Normalized F1, which treats each true and each predicted anomaly interval as a unit, macro-averaging precision and recall over intervals rather than windows. This eliminates length bias and yields a fairer comparison of detectors. We formalize the metric, illustrate its behavior on toy and real examples, and show how it complements existing protocols.Nabil Ait SaidElies GherbiFaouzi AdjedAchraf Kallel
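The metric admits a compact reference implementation: each ground-truth and each predicted interval counts once, so a single long anomaly zone no longer dominates the score. A sketch under that reading of the abstract (not the authors' code):

```python
def overlaps(a, b):
    return a[0] <= b[1] and b[0] <= a[1]      # closed intervals (start, end)

def zone_normalized_f1(true_ivals, pred_ivals):
    if not true_ivals or not pred_ivals:
        return 0.0
    recall = sum(any(overlaps(t, p) for p in pred_ivals)
                 for t in true_ivals) / len(true_ivals)
    precision = sum(any(overlaps(t, p) for t in true_ivals)
                    for p in pred_ivals) / len(pred_ivals)
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

print(zone_normalized_f1([(10, 20), (50, 90)], [(12, 14), (200, 210)]))  # 0.5
```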
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-232025-11-237111812110.1609/aaaiss.v7i1.36876Uncovering Systemic and Environment Errors in Autonomous Systems Using Differential Testing
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36877
Deploying autonomous agents in complex environments requires distinguishing between undesirable behaviors caused by the impreciseness of the agent's reasoning model or its policy (i.e. systemic agent error) and those due to inherently unsolvable tasks (environment error). We introduce AIProbe, a novel black-box differential testing framework to validate autonomous agents under varied and challenging environment configurations. We first describe how AIProbe generates diverse environmental configurations and tasks for testing the agent, by modifying configurable parameters using Latin Hypercube sampling. It then solves each generated task using a search-based planner, independent of the agent. By comparing the agent's performance to the planner's solution, AIProbe identifies whether failures are due to errors in the agent's model or policy, or due to unsolvable task conditions. We then demonstrate its broad applicability to both model-free and model-based agents operating in discrete and continuous domains. Our evaluation across multiple domains shows that AIProbe significantly outperforms state-of-the-art techniques in detecting unique errors, thereby contributing to a reliable deployment of autonomous agents.Yashwanthi AnandRahil P MehtaManish MotwaniSandhya Saisubramanian
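The environment-generation step can be illustrated with SciPy's Latin Hypercube engine; the parameter names and ranges below are illustrative, not AIProbe's actual configuration schema:

```python
from scipy.stats import qmc

sampler = qmc.LatinHypercube(d=3, seed=0)
unit = sampler.random(n=100)                 # 100 configurations in the unit cube
lo, hi = [5, 0.0, 1], [50, 0.5, 10]          # e.g., grid size, obstacle density, #goals
configs = qmc.scale(unit, lo, hi)            # each row: one environment configuration
print(configs[0])
```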
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-232025-11-237112213010.1609/aaaiss.v7i1.36877LLMs Need to Go Beyond Computational Confidence Metrics to Establish Trust
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36878
While Large Language Models (LLMs) have demonstrated impressive capabilities, their widespread deployment is hindered by the lack of trustworthiness of their responses. Although existing trust scores and confidence metrics attempt to quantify uncertainty and ensure the safety and reliability of LLM responses, they address only a single dimension of trust and fail to ensure trust holistically, in a user-centric manner. This lack of metric reliability and LLM trustworthiness poses significant risks in critical human-AI interaction applications. We posit that current confidence metrics and trust scores are insufficient to accurately measure trustworthiness and to ultimately inform how to establish calibrated user trust in these systems. We further argue that we need to move beyond computational assessments to enhance the measurement of trustworthiness of generative AI systems. We outline frameworks and approaches that can be incorporated into holistic trustworthy AI assessment and development in future research.Anil B MurthyLindsay Sanneman
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-232025-11-237113113610.1609/aaaiss.v7i1.36878Interactive Simulations of Backdoors in Neural Networks
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36879
This work addresses the problem of planting and defending against cryptography-based backdoors in artificial intelligence (AI) models. The motivation comes from our limited understanding of the implications of using cryptographic techniques to plant backdoors, undetectable under theoretical assumptions, in the large AI model systems deployed in practice. Our approach is based on designing a web-based simulation playground that enables planting, activating, and defending against cryptographic backdoors in neural networks (NN). Simulations of planting and activating backdoors are enabled for two scenarios: (a) in an extension of the NN model architecture to support digital signature verification, and (b) in a modified architectural block for non-linear operators. Simulations of defenses against backdoors, based on proximity analysis, are also available, providing an educational tool and a playground for a game of planting and defending against backdoors.Peter BajcsyMaxime Bros
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-232025-11-237113714410.1609/aaaiss.v7i1.36879Ethics2vec: Aligning Automatic Agents and Human Preferences
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36880
The interaction of humans and intelligent agents continues to grow and will be inevitable in the near future. Though intelligent agents are supposed to improve the human experience (or make it more efficient), it is hard from a human perspective to grasp the ethical values that are explicitly or implicitly embedded in an agent's behaviour. This is the well-known problem of alignment, which refers to the challenge of designing AI systems that align with human values, goals, and preferences. This problem is particularly challenging since most human ethical considerations refer to incommensurable (i.e., non-measurable and/or incomparable) values and criteria. Consider, for instance, a medical agent prescribing a treatment to a cancer patient. How could it take into account (and/or weigh) incommensurable aspects like the value of a human life and the cost of the treatment? Alignment between human and artificial values is possible only if we define a common space where a metric can be defined and used. This paper proposes to extend to ethics the conventional Anything2vec approach, which has been successful in plenty of similar and hard-to-quantify domains (ranging from natural language processing to recommendation systems and graph analysis). The paper proposes a way to map an automatic agent's decision-making (or control-law) strategy to a multivariate vector representation, which can be used to compare and assess alignment with human values. The rationale is that if an automatic agent implements a decision-making strategy, this strategy is optimal with respect to some loss function. At the same time, if the human agrees to adhere to the agent's strategy, this implicitly means that the strategy is also optimal with respect to a weighted sum of human criteria. Under this assumption, it is possible to recover some constraints on the weights of the human criteria that adoption of the agent's strategy implies. The Ethics2vec method is first introduced in the case of an automatic agent performing binary decision-making. Then, a vectorisation of an automatic control law (as in the case of a self-driving car) is discussed to show how the approach can be extended to automatic control settings.Gianluca Bontempi
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-232025-11-237114515210.1609/aaaiss.v7i1.36880Introducing RUM: A Methodological Contribution for Engineering Trustworthy AI Components in Industrial Systems
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36881
We introduce RUM, a unified, lifecycle-aware framework for facilitating the engineering and assessment of Trustworthy AI Components, software units embedding both pure and statistical functions within real-world systems. Unlike model-centric evaluations, RUM treats AI Components as indivisible units whose behavior must be understood across the specification, development, operation, and updating phases. This announcement paper presents a series of research articles that establish the foundation of RUM: (1) a formal argument for the atomic nature of Trustworthy AI Components; (2) a structured set of novel trust metrics, many of a non-aggregative nature, spanning the component's lifecycle; and (3) an operational framework introducing AI Blueprints to support runtime monitoring, human-in-the-loop usage, and temporal maintainability while facilitating the evolution of AI Components at different stages. RUM offers a coherent alternative to fragmented evaluation tools, aligning with the needs of AI deployment in industrial contexts.Martin GonzalezLoic CantatKevin Pasini
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-232025-11-237115316010.1609/aaaiss.v7i1.36881Assessing the Geolocation Capabilities, Limitations and Societal Risks of Generative Vision-Language Models
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36882
Geo-localization is the task of identifying the location of an image using visual cues alone. It has beneficial applications, such as improving disaster response, enhancing navigation, and supporting geography education. Recently, Vision-Language Models (VLMs) have increasingly demonstrated capabilities as accurate image geo-locators. This brings significant privacy risks, including those related to stalking and surveillance, considering the widespread use of AI models and the sharing of photos on social media. The precision of these models is likely to improve in the future. Despite these risks, there is little work on systematically evaluating the geolocation precision of Generative VLMs, their limits, and their potential for unintended inferences. To bridge this gap, we conduct a comprehensive assessment of the geolocation capabilities of 25 state-of-the-art VLMs on four benchmark image datasets captured in diverse environments. Our results offer insight into the internal reasoning of VLMs and highlight their strengths, limitations, and potential societal risks. Our findings indicate that current VLMs perform poorly on generic street-level images yet achieve notably high accuracy (61%) on images resembling social media content, raising significant and urgent privacy concerns.Oliver GraingeSania WaheedJack StilgoeMichael MilfordShoaib Ehsan
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-232025-11-237116116810.1609/aaaiss.v7i1.36882Continuous Monitoring of Large-Scale Generative AI via Deterministic Knowledge Graph Structures
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36883
Generative AI (GEN AI) models have revolutionized diverse application domains but present substantial challenges due to reliability concerns, including hallucinations, semantic drift, and inherent biases. These models typically operate as black-boxes, complicating transparent and objective evaluation. Current evaluation methods primarily depend on subjective human assessment, limiting scalability, transparency, and effectiveness. This research proposes a systematic methodology using deterministic and Large Language Model (LLM)-generated Knowledge Graphs (KGs) to continuously monitor and evaluate GEN AI reliability. We construct two parallel KGs: a deterministic KG built using explicit rule-based methods, predefined ontologies, domain-specific dictionaries, and structured entity-relation extraction rules; and an LLM-generated KG dynamically derived from real-time textual data streams such as live news articles. Utilizing real-time news streams ensures authenticity, mitigates biases from repetitive training, and prevents adaptive LLMs from bypassing predefined benchmarks through feedback memorization. To quantify structural deviations and semantic discrepancies, we employ several established KG metrics including Instantiated Class Ratio (ICR), Instantiated Property Ratio (IPR), and Class Instantiation (CI). These metrics systematically evaluate critical structural properties, including class and property instantiation ratios, class depth and complexity, and inheritance patterns. An automated real-time monitoring framework continuously computes deviations between deterministic and LLM-generated KGs. By establishing dynamic anomaly thresholds based on historical structural metric distributions, our method proactively identifies and flags significant deviations, thus promptly detecting semantic anomalies or hallucinations. This structured, metric-driven comparison between deterministic and dynamically generated KGs delivers a robust and scalable evaluation framework. A demo website is currently live at ( anonymous ).Kishor Datta GuptaMohd Ariful HaqueHasmot AliMarufa KamalSyed Bahauddin AlamMohammad Ashiqur Rahman
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-232025-11-237116917610.1609/aaaiss.v7i1.36883The Map of Misbelief: Tracing Intrinsic and Extrinsic Hallucinations Through Attention Patterns
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36884
Large Language Models (LLMs) are increasingly deployed in safety-critical domains, yet remain susceptible to hallucinations. While prior works have proposed confidence representation methods for hallucination detection, most of these approaches rely on computationally expensive sampling strategies and often disregard the distinction between hallucination types. In this work, we introduce a principled evaluation framework that differentiates between extrinsic and intrinsic hallucination categories and evaluates detection performance across a suite of curated benchmarks. In addition, we leverage a recent attention-based uncertainty quantification algorithm and propose novel attention aggregation strategies that improve both interpretability and hallucination detection performance. Our experimental findings reveal that sampling-based methods like Semantic Entropy are effective for detecting extrinsic hallucinations but generally fail on intrinsic ones. In contrast, our method, which aggregates attention over input tokens, is better suited for intrinsic hallucinations. These insights provide new directions for aligning detection strategies with the nature of hallucination and highlight attention as a rich signal for quantifying model uncertainty.Elyes HajjiAymen BouguerraFabio Arnez
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-23. Vol. 7, No. 1, pp. 177–184. DOI: 10.1609/aaaiss.v7i1.36884
Utilizing SBOM for Transparent AI Risk Communication
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36885
Value chains for AI systems are becoming increasingly complex and can consist of multiple actors that contribute services, tools, data, models, and code. Efficient risk management along this value chain requires all actors to communicate potential risk sources and recommendations for mitigation. The Software Bill of Materials (SBOM) is a method from cybersecurity that enables organizations to communicate information such as licenses, security vulnerabilities, and dependencies of software components. SBOM is attracting increasing interest in the AI community as a way to share information about AI components, like data and models. In this paper we discuss the suitability of SBOM for AI risk management along a value chain and show both the potential of and the gaps in current approaches.
Lennard Helmer, Lisa Fink, Maximilian Poretschkin
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-23. Vol. 7, No. 1, pp. 185–189. DOI: 10.1609/aaaiss.v7i1.36885
Enhancing Trustworthiness in VAD with Rule-Based VLM-LLM Explanations
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36886
Video Anomaly Detection (VAD) is a critical task for identifying unusual events in video streams, with applications ranging from public safety surveillance to industrial monitoring. Traditional VAD methods, often based on reconstruction or prediction errors, excel at detecting deviations but typically lack semantic understanding, failing to explain why an event is anomalous. The recent advent of Vision-Language Models (VLMs) and Large Language Models (LLMs) has introduced a new paradigm, enabling systems to interpret and reason about video content in natural language. However, existing VLM/LLM-based approaches often focus either on rich, open-ended description or on structured, rule-based reasoning, but rarely both. In this paper, we address this gap by proposing a novel hybrid framework that synergizes the strengths of descriptive and deductive models. Our approach first leverages a powerful VLM to generate detailed, contextual scene descriptions. These descriptions are then fed into a rule-driven LLM, which uses a pre-induced set of contextual rules to make a final anomaly judgment and provide a human-readable explanation grounded in the specific rule that was violated. We validate our approach on the large-scale UCF-Crime dataset and conduct an analysis of key hyperparameters, including the VLM's input prompt and the number of frames used for description. Our results demonstrate the effectiveness of the proposed architecture and offer insights into building more interpretable, reliable, and context-aware VAD systems.
Mohamed Ibn Khedher, Faouzi Adjed, Joseph Kattan
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-23. Vol. 7, No. 1, pp. 190–197. DOI: 10.1609/aaaiss.v7i1.36886
Empirical Evidence for Alignment Faking in a Small LLM and Prompt-Based Mitigation Techniques
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36887
Current literature suggests that alignment faking is an emergent property of large language models. We present the first empirical evidence that a small instruction-tuned model, specifically LLaMA 3 8B, can also exhibit alignment faking. We further show that prompt-only interventions, including deontological moral framing and scratchpad reasoning, significantly reduce this behavior without modifying model internals. This challenges the assumption that prompt-based interventions are trivial and that deceptive alignment requires scale. We introduce a taxonomy distinguishing shallow deception, shaped by context and suppressible through prompting, from deep deception, which reflects persistent, goal-driven misalignment. Our findings refine the understanding of deception in language models and underscore the need for deceptive alignment evaluations across model sizes and deployment settings.
J. Koorndijk
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-23. Vol. 7, No. 1, pp. 198–205. DOI: 10.1609/aaaiss.v7i1.36887
A Brief Overview of Key Quality Metrics for Knowledge Graph Solution Illustration on Digital NOTAMs
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36888
After a brief introduction to Knowledge Graph (KG) technology, a subfield of symbolic AI used to represent and manage semantic information, the article is devoted to quality assessment, emphasizing the importance of developing trustworthy AI in such knowledge-based systems, particularly in safety-critical applications. In this context, we review several metrics and methods that can be applied to KGs, along with examples of their implementation in the context of digital NOTAMs, illustrated by the HLIF2024 Hackathon.
Juliette Mattioli, Lucas Mattioli, Martin Gonzalez
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-23. Vol. 7, No. 1, pp. 206–213. DOI: 10.1609/aaaiss.v7i1.36888
Challenges and Choices when Evaluating Alignment in Human-AI Systems
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36889
Aligning AI to human values is a current research endeavor in which much of the focus goes to training AI systems to align with values, goals, and tasks. But evaluating whether those aligned systems are actually better and more trusted by human users is an essential part of improving such systems. We present three challenges encountered in the evaluation of aligned AI systems. We then propose possible solutions to these challenges, discuss our own and alternative design choices, and outline next steps for AI alignment research to flourish.
Jennifer C. McVay, Ewart J. de Visser
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-23. Vol. 7, No. 1, pp. 214–222. DOI: 10.1609/aaaiss.v7i1.36889
Grounded Instruction Understanding with Large Language Models: Toward Trustworthy Human-Robot Interaction
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36890
Understanding natural language as a representational bridge between perception and action is critical for deploying autonomous robots in complex, high-risk environments. This work investigates how large language models (LLMs) can support this bridge by interpreting unconstrained human instructions in urban disaster response scenarios. Leveraging the SCOUT corpus, a multimodal dataset capturing human-robot dialogue through Wizard-of-Oz experiments, we construct SCOUT++, aligning over 11,000 visual frames with language commands and robot actions. We evaluate three instruction classification approaches: a neural network trained on tokenized text, GPT-4 using text alone, and GPT-4 with synchronized visual input. Results show that while GPT-4 (text-only) outperforms traditional models in accuracy, its multimodal variant exhibits degraded performance, often producing vague or hallucinated outputs. These findings expose the challenges of reliably grounding language in visual context and raise questions about the trustworthiness of foundation models in safety-critical settings. We contribute SCOUT++, a reproducible multimodal pipeline, and benchmark results that shed light on the capabilities and current limitations of vision-language models for risk-sensitive human-robot interaction.
Ekele Ogbadu, Stephanie Lukin, Cynthia Matuszek
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-23. Vol. 7, No. 1, pp. 223–231. DOI: 10.1609/aaaiss.v7i1.36890
Identifying the Supply Chain of AI for Trustworthiness and Risk Management in Critical Applications
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36891
Risks associated with the use of AI, ranging from algorithmic bias to model hallucinations, have received much attention and extensive research across the AI community, from researchers to end-users. However, a gap exists in the systematic assessment of supply chain risks associated with the complex web of data sources, pre-trained models, agents, services, and other systems that contribute to the output of modern AI systems. This gap is particularly problematic when AI systems are used in critical applications, such as the food supply, healthcare, utilities, law, insurance, and transport. We survey the current state of AI risk assessment and management, with a focus on the supply chain of AI and risks relating to the behavior and outputs of the AI system. We then present a proposed taxonomy specifically for categorizing AI supply chain entities. This taxonomy helps stakeholders, especially those without extensive AI expertise, to “consider the right questions” and systematically inventory dependencies across their organization’s AI systems. Our contribution bridges a gap between the current state of AI governance and the urgent need for actionable risk assessment and management of AI use in critical applications.
Raymond K. Sheh, Karen Geappen
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-23. Vol. 7, No. 1, pp. 232–239. DOI: 10.1609/aaaiss.v7i1.36891
The Anatomy of a Trustworthy AI Answer: A Comparative Experiment for RAG Architectures
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36892
Retrieval-Augmented Generation (RAG) has become the go-to fix for LLM hallucinations. But its most common form, built on Vector Databases, is like a confident consultant who has only read the executive summaries. It's fluent, convincing, and adept at finding information that sounds right, but critically lacks the deep, verifiable connections between the facts. In high-stakes domains like medicine, this creates a dangerous new form of AI: one that is wrong with conviction. This paper provides a comparative experiment for distinguishing between answers that merely sound correct and those that are verifiably true. Our head-to-head evaluation of Vector-based versus Knowledge Graph-based RAG reveals a stark architectural choice. Our findings demonstrate that while Vector RAG produces a convincing but untraceable story, the Knowledge Graph approach delivers a factually correct answer with a verifiable evidence trail. This is the blueprint for building RAG systems that don't ask for your trust - they earn it by showing their work.
Dippu Kumar Singh, Praveen Chinapla Bharamappa
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-23. Vol. 7, No. 1, pp. 240–248. DOI: 10.1609/aaaiss.v7i1.36892
Rashomon in the Streets: Explanation Ambiguity in Scene Understanding
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36893
Explainable AI (XAI) is essential for validating and trusting models in safety-critical applications like autonomous driving. However, the reliability of XAI is challenged by the Rashomon effect, where multiple, equally accurate models can offer divergent explanations for the same prediction. This paper provides the first empirical quantification of this effect for the task of action prediction in real-world driving scenes. Using Qualitative Explainable Graphs (QXGs) as a symbolic scene representation, we train Rashomon sets of two distinct model classes: interpretable, pair-based gradient boosting models and complex, graph-based Graph Neural Networks (GNNs). Using feature attribution methods, we measure the agreement of explanations both within and between these classes. Our results reveal significant explanation disagreement. Our findings suggest that explanation ambiguity is an inherent property of the problem, not just a modeling artifact.
Helge Spieker, Jørn Eirik Betten, Arnaud Gotlieb, Nadjib Lazaar, Nassim Belmecheri
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-23. Vol. 7, No. 1, pp. 249–256. DOI: 10.1609/aaaiss.v7i1.36893
Bridging AI and Health on Time Series Analysis and Explainability Using the Case Study of EEG Channel Selection Problem
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36894
Time series (TS) analysis is an active application area for Artificial Intelligence (AI) methods, where the objective is to analyze numeric quantities indexed by time for tasks like classification, forecasting, and abnormality detection. In health, TS manifests as biosignals like the electroencephalogram (EEG), where electrical signals from the brain are analyzed. The AI and health communities can tremendously benefit each other in TS, with the former offering advanced analytical methods while the latter provides complex data sets and trust-sensitive use cases. But the communities also need to overcome confusing terminologies, hidden assumptions, and a lack of necessary domain context for result evaluation and interpretation. In this paper, we attempt to bridge the gap using the problem of channel selection in EEG. We outline challenges in working with EEG data, demonstrate via two experiments how simple explainable AI (XAI) methods can be quite effective for channel selection, irrespective of the EEG tasks/paradigms, and argue that recent TS trends in AI, like LLMs and XAI methods, can benefit health as well. We hope that this work will bring researchers working on TS problems at the intersection of AI and health closer to work in AI trustworthiness so that they can better leverage results from their respective areas to overcome common challenges. All code and resources are released on GitHub to help others replicate our work.
Vandana Srivastava, Biplav Srivastava
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-23. Vol. 7, No. 1, pp. 257–264. DOI: 10.1609/aaaiss.v7i1.36894
Query-Based Model Extraction Attack on GCN: A Surrogate Model Technique for Non-Euclidean Data
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36895
Machine learning (ML) models face serious threats from model extraction attacks, in which a black-box model owned by a private service provider can be cloned into a surrogate model by an attacker posing as a client, solely through query-based access. Unfortunately, most past studies focus only on ML models trained on Euclidean data such as images and text, while model extraction attacks on Graph Neural Network (GNN) models, which involve node features and graph structure, remain underexplored. This study investigates and develops a model extraction attack strategy against a Graph Convolutional Network (GCN) model by simulating more realistic conditions for the attacker. The study begins by formalizing threat modeling for GCN extraction attacks, categorizing potential threats according to the levels of background knowledge accessible to the attacker, such as node attributes and neighbor connections. It then presents a novel method that leverages a learnable feature synthesis module to infer missing attributes of unknown neighbor nodes, evaluated using fidelity (85–90%) and KL divergence (0.28–0.10) to assess behavioral similarity with the victim model, rather than exact parameter recovery. Results demonstrate that even with partial knowledge, the majority of inputs in the target domain yield predictions identical to the original model's.
Sibtain Syed, Alvi Ataur Khalil, Kishor Datta Gupta, Saima Jabeen, Mohammad Ashiqur Rahman
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-23. Vol. 7, No. 1, pp. 265–272. DOI: 10.1609/aaaiss.v7i1.36895
On Identifying Why and When Foundation Models Perform Well on Time-Series Forecasting Using Automated Explanations and Rating
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36896
Time-series forecasting models (TSFM) have evolved from classical statistical methods to sophisticated foundation models, yet understanding why and when these models succeed or fail remains challenging. Despite this known limitation, time-series forecasting models are increasingly used to generate information that informs real-world actions with equally real consequences. Understanding the complexity, performance variability, and opaque nature of these models then becomes a valuable endeavor to combat serious concerns about how users should interact with and rely on these models’ outputs. This work addresses these concerns by combining traditional explainable AI (XAI) methods with Rating Driven Explanations (RDE) to assess TSFM performance and interpretability across diverse domains and use cases. We evaluate four distinct model architectures: ARIMA, Gradient Boosting, Chronos (a time-series-specific foundation model), and Llama (general-purpose; both fine-tuned and base models) on four heterogeneous datasets spanning the finance, energy, transportation, and automotive sales domains. In doing so, we demonstrate that feature-engineered models (e.g., Gradient Boosting) consistently outperform foundation models (e.g., Chronos) in volatile or sparse domains (e.g., power, car parts) while providing more interpretable explanations, whereas foundation models excel only in stable or trend-driven contexts (e.g., finance).
Michael ’Xander’ Widener, Kausik Lakkaraju, John A. Aydin, Biplav Srivastava
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-23. Vol. 7, No. 1, pp. 273–282. DOI: 10.1609/aaaiss.v7i1.36896
Error Detection and Correction for Interpretable Mathematics in Large Language Models
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36897
Recent large language models (LLMs) have demonstrated the ability to perform explicit multi-step reasoning, such as chain-of-thought prompting. However, their intermediate steps often contain errors that can propagate, leading to inaccurate final predictions. Additionally, LLMs still struggle with hallucinations and often fail to adhere to prescribed output formats, which is particularly problematic for tasks like generating mathematical expressions or source code. This work introduces EDCIM (Error Detection and Correction for Interpretable Mathematics), a method for detecting and correcting these errors in interpretable mathematics tasks, where the model must generate the exact functional form that explicitly solves the problem (expressed in natural language) rather than a black-box solution. EDCIM uses LLMs to generate a system of equations for a given problem, followed by a symbolic error-detection framework that identifies errors and provides targeted feedback for LLM-based correction. To optimize efficiency, EDCIM integrates lightweight, open-source LLMs with more powerful proprietary models, balancing cost and accuracy. This balance is controlled by a single hyperparameter, allowing users to tune the trade-off based on their cost and accuracy requirements. Experimental results across different datasets show that EDCIM significantly reduces both computational and financial costs while maintaining, and even improving, prediction accuracy when the balance is properly configured.
Yijin Yang, Cristina Cornelio, Mario Leiva, Paulo Shakarian
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-23. Vol. 7, No. 1, pp. 283–292. DOI: 10.1609/aaaiss.v7i1.36897
Artificial Insurance: Exposing the Coverage, Controls, and Measurement Gaps of Insurance for AI Risks
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36957
This paper argues that the current insurance market is fundamentally misaligned with AI risk, creating significant coverage, control, and measurement gaps that threaten both organizations and insurers. Through analysis of insurance policy coverages, risk controls, and measurement approaches across 15 AI risk categories, we demonstrate that conventional insurance structures inadequately address the unique challenges presented by AI systems. This misalignment stems from AI’s autonomous nature, probabilistic operations, opacity, and rapid development cycles, which conflict with insurance assumptions about human control, causality, deterministic failures, and stable risk environments. While some argue that existing policies sufficiently cover AI risks, our evidence shows that even the most relevant cyber and technology liability insurance products leave organizations exposed to significant AI-specific harms. Without deliberate evolution in AI risk transfer mechanisms, organizations face a protection gap while insurers confront potentially catastrophic unpriced exposure, creating an urgent need for risk transfer enablement between insurance and organizations using and deploying AI solutions.
Erin Kenneally
Copyright (c) 2025 Proceedings of the AAAI Symposium Series
2026-03-06. Vol. 7, No. 1, pp. 690–699. DOI: 10.1609/aaaiss.v7i1.36957
PlanOwl: Automated PDDL Files Generation from OWL Ontologies and Visual Language Models
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36944
Automated task planning traditionally relies on manually generated domain models, creating bottlenecks in scalability and requiring extensive domain expertise. This paper presents a novel framework to automate the process of generating Planning Domain Definition Language (PDDL) domains and problem files by integrating Web Ontology Language (OWL) ontologies with Visual Language Models (VLMs). Our approach leverages the rich semantic structure of OWL ontologies to systematically define domain classes, predicates, and actions, while VLMs ground abstract ontological concepts in concrete visual observations, automating the generation of instance-specific planning problems. The proposed framework transforms ontological knowledge into PDDL domain files through a mapping algorithm that preserves semantic relationships and logical constraints. The VLM performs visual scene analysis to identify relevant objects, attributes, and spatial configurations for generating initial states, while natural language instructions are used to derive goal states. We evaluate the framework across multiple planning domains, demonstrating that it generates syntactically correct and semantically coherent PDDL domain and problem files directly from OWL ontologies, camera images, and natural language inputs. The resulting files are comparable in quality to those manually generated by planning experts and outperform previous automated systems in terms of semantic fidelity and adaptability.
Mark Adamik, Paolo Forte
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-23. Vol. 7, No. 1, pp. 634–643. DOI: 10.1609/aaaiss.v7i1.36944
Birds of a Different Feather Flock Together: Exploring Opportunities and Challenges in Animal-Human-Machine Teaming
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36945
Animal-Human-Machine (AHM) teams are a type of hybrid intelligence system wherein interactions between a human, AI-enabled machine, and animal members can result in unique capabilities greater than the sum of their parts. This paper calls for a systematic approach to studying the design of AHM team structures to optimize performance and overcome limitations in various applied settings. We consider the challenges and opportunities in investigating the synergistic potential of AHM team members by introducing a set of dimensions of AHM team functioning to effectively utilize each member’s strengths while compensating for individual weaknesses. Using three representative examples of such teams: security screening, search-and-rescue, and guide dogs, the paper illustrates how AHM teams can tackle complex tasks. We conclude with open research directions that this multidimensional approach presents for studying hybrid human-AI systems beyond AHM teams.
Myke C. Cohen, Xiaoyun Yin, David A. Grimm, Reuth Mirsky
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-23. Vol. 7, No. 1, pp. 644–649. DOI: 10.1609/aaaiss.v7i1.36945
Learning Multi-Modal Whole-Body Control for Real-World Humanoid Robots
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36946
A major challenge in humanoid robotics is designing a unified interface for commanding diverse whole-body behaviors, from precise footstep sequences to partial-body mimicry and joystick teleoperation. We introduce the Masked Humanoid Controller (MHC), a learned whole-body controller that exposes a simple yet expressive interface: the specification of masked target trajectories over selected subsets of the robot’s state variables. This unified abstraction allows high-level systems to issue commands in a flexible format that accommodates multi-modal inputs such as optimized trajectories, motion capture clips, re-targeted video, and real-time joystick signals. The MHC is trained in simulation using a curriculum that spans this full range of modalities, enabling robust execution of partially specified behaviors while maintaining balance and disturbance rejection. We demonstrate the MHC both in simulation and on the real-world Digit V3 humanoid, showing that a single learned controller is capable of executing such diverse whole-body commands in the real world through a common representational interface.
Pranay Dugar, Aayam Shrestha, Fangzhou Yu, Bart van Marum, Alan Fern
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-23. Vol. 7, No. 1, pp. 650–657. DOI: 10.1609/aaaiss.v7i1.36946
Language and Gesture in Virtual Reality: Is a Gesture Worth 1000 Words?
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36947
Robots are increasingly incorporating multimodal information and human signals to resolve ambiguity in embodied human-robot interaction. Harnessing signals such as gestures may expedite robot exploration in large, outdoor urban environments for supporting disaster recovery operations, where speech may be unclear due to noise or the challenges of a dynamic and dangerous environment. Despite this potential, capturing human gesture and properly grounding it to crowded, outdoor environments remains a challenge. In this work, we propose a method to model human gesture and ground it to spoken language instructions given to a robot for execution in large spaces. We implement our method in virtual reality to develop a workflow for faster future data collection. We present a series of proposed experiments that compare a language-only baseline to our proposed approach of language supplemented by gesture, and discuss how our approach has the potential to reinforce the human’s intent and detect discrepancies between gesture and spoken instructions in these large and crowded environments.
Padraig Higgins, Cory J. Hayes, Stephanie Lukin, Cynthia Matuszek
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-23. Vol. 7, No. 1, pp. 658–662. DOI: 10.1609/aaaiss.v7i1.36947
LiMPNet: Lightweight Multi-sensor Perception and DRL Navigation for Tiny Drones in Mapless Environments
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36948
Autonomous tiny drones face significant challenges in navigation due to strict constraints on size, weight, power, and onboard computational capacity. This paper presents a lightweight navigation framework that integrates basic multi-sensor perception with deep reinforcement learning (DRL) to enable safe, mapless flight in cluttered environments. We employ the Crazyflie 2.1 nano-drone, equipped with a grayscale camera and a multi-ranger deck (a laser-based distance sensor) for real-time obstacle detection and avoidance. A Proximal Policy Optimization (PPO) agent is trained within a ROS and Gazebo simulation environment to generate collision-free trajectories using fused visual and range data. The system is evaluated in two environments: a simple obstacle field, where the drone achieves a 100% success rate (112/112 episodes), and a densely cluttered map, where it reaches the target in 35% of trials (7/20). These results demonstrate that effective autonomous navigation is achievable using minimal sensing and low-computation models, making it well-suited for resource-constrained aerial platforms.
Omer Kurkutlu, Arman Roohi
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-23. Vol. 7, No. 1, pp. 663–669. DOI: 10.1609/aaaiss.v7i1.36948
A Conceptual Primitive Decomposition of the Sally-Anne Test
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36949
Although large language models (LLMs) have been observed to perform at a human level in theory of mind tasks, deeper examination and systematic testing of their performance in these domains are needed. Primitive decomposition representations show promise for building robotic systems with greater abilities for in-depth natural language understanding and generation. In this work we explore representations of theory of mind which are combinations of conceptual primitives, focusing on simulations of a Sally-Anne false-belief test. We demonstrate how primitive decompositions into the conceptual building blocks of image schemas and conceptual dependency can represent the attribution of false beliefs to intelligent agents. The exploration has consequences for generating controlled and linguistically varied tests posed in natural language as challenge problems for large language models and for cognitive representations more broadly.
Jamie C. Macbeth, Boming Zhang, Sharmin Badhan
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-23. Vol. 7, No. 1, pp. 670–672. DOI: 10.1609/aaaiss.v7i1.36949
Multi-Modal Perception and Behavior Adaptation Models for Human State Understanding and Interaction Improvement in Robotic Touch
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36950
Robots that can physically interact with humans in a safe, comfortable, and intuitive manner can help in a variety of settings. However, users' perceptions greatly affect the acceptability of such robots. The ability of a system to understand the user's perception of the physical interaction, as well as to adapt the robot's behaviors based on user perception and interaction context, can facilitate the acceptability of these robots. In this paper we propose a perception-based interaction adaptation framework. One main component of this framework is a multi-modal perception model which is grounded in the existing literature and is intended to provide a quantitative estimation of the human state, defined as the perceptions of the physical interaction, using human, robot, and context information. This model is intended to be comprehensive across many physical Human-Robot Interaction (pHRI) scenarios. The estimated human state is fed to a context-aware behavior adaptation framework which recommends robot behaviors to improve the human state using a learned behavior cost model and an optimization formulation. We show the potential and feasibility of such a human state estimation model by evaluating a reduced model with data collected through a user study. Additionally, through feature analysis, we aim to shed light on future interaction designs for pHRI.
Huy Quyen Ngo, Rana Soltani Zarrin
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-23. Vol. 7, No. 1, pp. 673–682. DOI: 10.1609/aaaiss.v7i1.36950
The TRADE Middleware for Advanced Robotic Architectures
https://ojs.aaai.org/index.php/AAAI-SS/article/view/36951
Over the last decade, the Robot Operating System (ROS) has become the de facto standard for robotic middleware, with the release of version 2 significantly addressing some of the shortcomings of version 1. Yet, while the focus of ROS 2 has been “downward” on the underlying communication layer, the interfaces “upward” to the robotic architectures implemented in ROS have received little attention. In this paper, we argue that robotic middleware can serve important roles for robotic architectures, in particular cognitive robotic architectures, if the right kinds of interfaces are provided that allow for a tight integration between architecture and middleware. We introduce the Thinking Robots Agent Development Environment, TRADE, an extension of the previous Agent Development Environment, ADE, that provides advanced features for architecture integration and interactions between cognitive robotic architectures and the middleware layer. We describe several features in TRADE that are missing in ROS, in particular system-wide locking mechanisms, service instrumentation, and middleware service calls, and discuss how they can support architecture developers in implementing advanced architectural features such as dialogue-based system debugging and configuration, or multi-effector multi-robot behavior coordination.
Matthias Scheutz
Copyright (c) 2025 Association for the Advancement of Artificial Intelligence
2025-11-23. Vol. 7, No. 1, pp. 683–690. DOI: 10.1609/aaaiss.v7i1.36951