(1) Yamazaki, K.; Vo, K.; Truong, Q. S.; Raj, B.; Le, N. VLTinT: Visual-Linguistic Transformer-in-Transformer for Coherent Video Paragraph Captioning. AAAI 2023, 37, 3081-3090.