Jia, M., Meng, W., Fu, Z., Li, Y., Zeng, Q., Zhang, Y., … Zhang, X. (2026). Explicit Temporal-Semantic Modeling for Dense Video Captioning via Context-Aware Cross-Modal Interaction. Proceedings of the AAAI Conference on Artificial Intelligence, 40(7), 5341–5349. https://doi.org/10.1609/aaai.v40i7.37450