Dynamic Graph Representation Learning for Video Dialog via Multi-Modal Shuffled Transformers

Shijie Geng; Peng Gao; Moitreya Chatterjee; Chiori Hori; Jonathan Le Roux; Yongfeng Zhang; Hongsheng Li; Anoop Cherian

doi:10.1609/aaai.v35i2.16231

Authors

Shijie Geng, Rutgers University
Peng Gao, The Chinese University of Hong Kong
Moitreya Chatterjee, University of Illinois at Urbana-Champaign
Chiori Hori, Mitsubishi Electric Research Laboratories (MERL)
Jonathan Le Roux, Mitsubishi Electric Research Laboratories (MERL)
Yongfeng Zhang, Rutgers University
Hongsheng Li, The Chinese University of Hong Kong
Anoop Cherian, Mitsubishi Electric Research Laboratories (MERL)

Keywords:

Language and Vision, Video Understanding & Activity Analysis, Question Answering, Language Grounding & Multi-modal NLP

Abstract

Given an input video, its associated audio, and a brief caption, the audio-visual scene-aware dialog (AVSD) task requires an agent to engage in a question-answer dialog with a human about the audio-visual content. The task thus poses a challenging multi-modal representation learning and reasoning problem, and advances on it could influence several human-machine interaction applications. To solve this task, we introduce a semantics-controlled multi-modal shuffled Transformer reasoning framework, consisting of a sequence of Transformer modules, each taking a modality as input and producing representations conditioned on the input question. Our proposed Transformer variant applies a shuffling scheme to its multi-head outputs, which demonstrates better regularization. To encode fine-grained visual information, we present a novel dynamic scene graph representation learning pipeline that consists of an intra-frame reasoning layer, which produces spatio-semantic graph representations for every frame, and an inter-frame aggregation module, which captures temporal cues. The entire pipeline is trained end-to-end. We present experiments on the benchmark AVSD dataset for both the answer generation and answer selection tasks. Our results demonstrate state-of-the-art performance on all evaluation metrics.
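
The abstract does not spell out how the shuffling of multi-head outputs works, so the following is only a minimal sketch of one possible reading: during training, the order in which the per-head output vectors are concatenated is randomly permuted, and at inference the order is left fixed. The function name shuffle_and_concat, its parameters, and the choice to permute whole heads (rather than individual features) are assumptions made for illustration and are not taken from the paper.

```python
import random


def shuffle_and_concat(head_outputs, rng=None, training=True):
    """Concatenate per-head output vectors into one vector.

    head_outputs: list of equal-length lists of floats, one per attention head.
    During training the head order is randomly permuted before concatenation
    (an assumed interpretation of "shuffling multi-head outputs"); otherwise
    the original head order is kept.
    """
    order = list(range(len(head_outputs)))
    if training:
        # Use the supplied generator if given, else the module-level one.
        (rng or random).shuffle(order)
    merged = []
    for h in order:
        merged.extend(head_outputs[h])
    return merged


# Three hypothetical heads, each producing a 2-dimensional output.
heads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(shuffle_and_concat(heads, training=False))        # original head order
print(shuffle_and_concat(heads, rng=random.Random(0)))  # some permutation of the heads
```

Under this reading, the layer that consumes the concatenated vector cannot rely on a fixed position for any one head, which is one plausible way such a scheme could act as a regularizer.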
