DDViT: Double-Level Fusion Domain Adapter Vision Transformer (Student Abstract)


  • Linpeng Sun Texas Tech University
  • Victor S. Sheng Texas Tech University




Computer Vision, Machine Learning, Applications Of AI


With the help of Vision Transformers (ViTs), medical image segmentation has achieved outstanding performance. In particular, ViTs overcome the limited local receptive fields of convolutional neural networks (CNNs) by using self-attention to model relationships among all image pixels or patches simultaneously. However, they require large datasets for training and perform poorly at capturing low-level features. To that end, we propose DDViT, a novel ViT model that unites a CNN with two multi-scale feature representations to alleviate data hunger in medical image segmentation. Significantly, our approach equips a ViT with a plug-in domain adapter (DA) and a Double-Level Fusion (DLF) technique, complemented by a mutual knowledge distillation paradigm that facilitates the seamless exchange of knowledge between a universal network and specialized domain-specific network branches. The DLF framework plays a pivotal role in our encoder-decoder architecture, combining the TransFuse module with a robust CNN-based encoder. Extensive experiments across diverse medical image segmentation datasets demonstrate the efficacy of DDViT compared to alternative CNN-based and Transformer-based approaches.
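The mutual knowledge distillation mentioned above can be illustrated with a minimal sketch. Note this is an assumption about the mechanism, not the authors' implementation: mutual distillation is commonly realized as a symmetric KL-divergence term that pushes the universal branch and a domain-specific branch toward each other's softened predictions. All function names below are hypothetical.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax; higher temperature softens the distribution.
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_div(p, q, eps=1e-12):
    # KL(p || q) for discrete distributions, with eps to avoid log(0).
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def mutual_distillation_loss(logits_universal, logits_domain, temperature=2.0):
    # Symmetric distillation: each branch learns from the other's soft targets.
    # KL(p || q) guides the domain branch; KL(q || p) guides the universal one.
    p = softmax(logits_universal, temperature)
    q = softmax(logits_domain, temperature)
    return kl_div(p, q) + kl_div(q, p)
```

In training, this term would be added to each branch's segmentation loss, so that knowledge flows in both directions rather than from a fixed teacher to a student.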



How to Cite

Sun, L., & Sheng, V. S. (2024). DDViT: Double-Level Fusion Domain Adapter Vision Transformer (Student Abstract). Proceedings of the AAAI Conference on Artificial Intelligence, 38(21), 23661-23663. https://doi.org/10.1609/aaai.v38i21.30516