Ruan, L., Hu, A., Song, Y., Zhang, L., Zheng, S., & Jin, Q. (2023). Accommodating Audio Modality in CLIP for Multimodal Processing. Proceedings of the AAAI Conference on Artificial Intelligence, 37(8), 9641–9649. https://doi.org/10.1609/aaai.v37i8.26153