Peng, M., Wang, C., Shi, Y., & Zhou, X.-D. (2023). Efficient End-to-End Video Question Answering with Pyramidal Multimodal Transformer. Proceedings of the AAAI Conference on Artificial Intelligence, 37(2), 2038–2046. https://doi.org/10.1609/aaai.v37i2.25296