Multi-Modal Hand-to-Mouth Gesture Recognition in Activity-Oriented RGB-Thermal Footage (Student Abstract)

Authors

  • Glenn Fernandes, Northwestern University, Chicago, IL
  • Meixi Lu, Northwestern University, Chicago, IL
  • Farzad Shahabi, Northwestern University, Chicago, IL
  • Jiayi Zheng, Northwestern University, Chicago, IL
  • Aggelos Katsaggelos, Northwestern University, Chicago, IL
  • Nabil Alshurafa, Northwestern University, Chicago, IL

DOI:

https://doi.org/10.1609/aaai.v39i28.35254

Abstract

Health-risk behaviors such as overeating and smoking have a profound impact on public health, making their monitoring and mitigation critical. Wearable RGB-Thermal cameras are being employed to monitor these behaviors by capturing hand-to-mouth (HTM) gestures, which are central to them. However, detection models relying on a single modality—either RGB or thermal—often struggle to accurately distinguish these confounding gestures due to inherent sensor limitations, such as sensitivity to lighting conditions or thermal occlusions. We present a family of fusion models that integrate RGB and thermal video data using early-, decision-, and a novel mid-fusion architecture, RGB-Thermal Fusion Video Network (RTFVNet), designed to enhance the recognition of HTM gestures associated with eating and smoking. Our evaluation shows that while decision fusion achieves the highest F1-score of 88% (0.44 TFLOPs), RTFVNet offers an optimal balance between performance (85%) and complexity (0.37 TFLOPs) for gesture classification of eating, smoking, and non-gesture activities.
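The three fusion points named in the abstract differ only in *where* the two modality streams are combined. The sketch below illustrates this with hypothetical stub encoders and classifiers; it is not RTFVNet's actual architecture, only a minimal stdlib-Python illustration of early, mid, and decision fusion over a toy three-class problem (eating, smoking, non-gesture).

```python
import math

def softmax(z):
    # Numerically stable softmax over a list of logits.
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def early_fusion(rgb_input, thermal_input, model):
    # Fuse at the input: concatenate modalities, run a single network.
    # `model` is a hypothetical classifier over the joint input.
    return model(rgb_input + thermal_input)

def mid_fusion(rgb_input, thermal_input, rgb_encoder, thermal_encoder, head):
    # Fuse intermediate features: separate per-modality encoders,
    # concatenated features, one shared classification head
    # (the general shape of a mid-fusion design such as RTFVNet).
    feats = rgb_encoder(rgb_input) + thermal_encoder(thermal_input)
    return head(feats)

def decision_fusion(rgb_probs, thermal_probs):
    # Fuse at the output: average per-modality class probabilities.
    return [(a + b) / 2 for a, b in zip(rgb_probs, thermal_probs)]

# Toy decision-fusion example with made-up logits.
rgb_probs = softmax([2.0, 0.5, 0.1])      # RGB branch favors class 0
thermal_probs = softmax([1.5, 1.0, 0.2])  # thermal branch agrees
fused = decision_fusion(rgb_probs, thermal_probs)
predicted = max(range(3), key=lambda i: fused[i])
```

Decision fusion runs two full networks (hence its higher 0.44 TFLOPs cost in the paper's evaluation), whereas mid fusion shares the classification head, which is one way to trade a little accuracy for lower compute.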

Published

2025-04-11

How to Cite

Fernandes, G., Lu, M., Shahabi, F., Zheng, J., Katsaggelos, A., & Alshurafa, N. (2025). Multi-Modal Hand-to-Mouth Gesture Recognition in Activity-Oriented RGB-Thermal Footage (Student Abstract). Proceedings of the AAAI Conference on Artificial Intelligence, 39(28), 29368–29370. https://doi.org/10.1609/aaai.v39i28.35254