Humor Knowledge Enriched Transformer for Understanding Multimodal Humor
Keywords: Language Grounding & Multi-modal NLP
Abstract
Recognizing humor from a video utterance requires understanding its verbal and non-verbal components as well as incorporating the appropriate context and external knowledge. In this paper, we propose the Humor Knowledge enriched Transformer (HKT), which captures the gist of a multimodal humorous expression by integrating the preceding context and external knowledge. We incorporate humor-centric external knowledge into the model by capturing the ambiguity and sentiment present in the language. We encode the language, acoustic, vision, and humor-centric features separately using Transformer-based encoders, followed by a cross-attention layer that exchanges information among them. Our model achieves 77.36% and 79.41% accuracy in humorous punchline detection on the UR-FUNNY and MUStaRD datasets, setting a new state of the art on both datasets by margins of 4.93% and 2.94%, respectively. Furthermore, we demonstrate that our model can capture interpretable, humor-inducing patterns from all modalities.
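To illustrate the cross-attention step mentioned in the abstract, the following is a minimal sketch of scaled dot-product attention in which one modality's features act as queries over another modality's features. It is not the authors' implementation: the function names, the toy feature values, and the omission of learned projection matrices, multiple heads, and the separate per-modality encoders are all simplifying assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention.

    Each query row (from one modality) attends over the key/value rows
    (from another modality). Hypothetical helper: no learned projections.
    """
    dim = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(dim).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        weights = softmax(scores)
        # Weighted average of the value rows.
        mixed = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(mixed)
    return outputs


# Toy inputs (assumed values): 2 language tokens and 3 acoustic frames, dim 2.
language = [[1.0, 0.0], [0.0, 1.0]]
acoustic = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Language tokens query the acoustic frames; output has one row per token.
fused = cross_attention(language, acoustic, acoustic)
```

In this sketch the output keeps the query sequence length (one row per language token) while its content is a mixture of the acoustic frames, which is the sense in which cross attention lets one modality pull in information from another.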
How to Cite
Hasan, M. K., Lee, S., Rahman, W., Zadeh, A., Mihalcea, R., Morency, L.-P., & Hoque, E. (2021). Humor Knowledge Enriched Transformer for Understanding Multimodal Humor. Proceedings of the AAAI Conference on Artificial Intelligence, 35(14), 12972-12980. Retrieved from https://ojs.aaai.org/index.php/AAAI/article/view/17534
AAAI Technical Track on Speech and Natural Language Processing I