[1] D. Wang and D. Xiong, “Efficient Object-Level Visual Context Modeling for Multimodal Machine Translation: Masking Irrelevant Objects Helps Grounding”, AAAI, vol. 35, no. 4, pp. 2720-2728, May 2021.