Multi-Modal Hand-to-Mouth Gesture Recognition in Activity-Oriented RGB-Thermal Footage (Student Abstract)
DOI:
https://doi.org/10.1609/aaai.v39i28.35254
Abstract
Health-risk behaviors such as overeating and smoking have a profound impact on public health, making their monitoring and mitigation critical. Wearable RGB-Thermal cameras are being employed to monitor these behaviors by capturing hand-to-mouth (HTM) gestures, which are central to them. However, detection models relying on a single modality, either RGB or thermal, often struggle to distinguish these confounding gestures due to inherent sensor limitations, such as sensitivity to lighting conditions or thermal occlusions. We present a family of fusion models that integrate RGB and thermal video data using early-, decision-, and a novel mid-fusion architecture, the RGB-Thermal Fusion Video Network (RTFVNet), designed to enhance the recognition of HTM gestures associated with eating and smoking. Our evaluation shows that while decision fusion achieves the highest F1-score of 88% (0.44 TFLOPs), RTFVNet offers an optimal balance between performance (85% F1-score) and complexity (0.37 TFLOPs) for classifying eating, smoking, and non-gesture activities.
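The three fusion strategies named in the abstract differ in *where* the two modality streams are combined. A minimal, purely illustrative sketch of that distinction (toy weighted-sum "encoders" and "classifiers" standing in for the paper's actual video networks, which are not specified here):

```python
# Illustrative sketch only, not the paper's RTFVNet: toy versions of
# early, mid, and decision fusion for RGB and thermal feature vectors.
# The encoder/classifier functions below are hypothetical stand-ins.

def encode(features, weight):
    # Hypothetical per-modality encoder: scale each feature.
    return [weight * f for f in features]

def classify(features):
    # Hypothetical classifier head: one score from summed features.
    return sum(features)

def early_fusion(rgb, thermal):
    # Fuse at the input: concatenate raw features, then one shared model.
    return classify(encode(rgb + thermal, 0.5))

def mid_fusion(rgb, thermal):
    # Fuse mid-network (RTFVNet-style placement): modality-specific
    # encoders, then a single joint classification head.
    return classify(encode(rgb, 0.7) + encode(thermal, 0.3))

def decision_fusion(rgb, thermal):
    # Fuse at the output: two complete models, average their decisions.
    score_rgb = classify(encode(rgb, 0.7))
    score_thermal = classify(encode(thermal, 0.3))
    return (score_rgb + score_thermal) / 2

rgb_feats = [1.0, 2.0]
thermal_feats = [3.0, 4.0]
print(early_fusion(rgb_feats, thermal_feats))     # 5.0
print(mid_fusion(rgb_feats, thermal_feats))       # 4.2
print(decision_fusion(rgb_feats, thermal_feats))  # 2.1
```

The trade-off the abstract reports follows this structure: decision fusion runs two full networks (highest cost, highest F1), while mid fusion shares the classification head across modalities, reducing compute with a modest accuracy loss.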
Published
2025-04-11
How to Cite
Fernandes, G., Lu, M., Shahabi, F., Zheng, J., Katsaggelos, A., & Alshurafa, N. (2025). Multi-Modal Hand-to-Mouth Gesture Recognition in Activity-Oriented RGB-Thermal Footage (Student Abstract). Proceedings of the AAAI Conference on Artificial Intelligence, 39(28), 29368–29370. https://doi.org/10.1609/aaai.v39i28.35254
Section
AAAI Student Abstract and Poster Program