Multi-Modal Instruction-Tuning Small-Scale Language-and-Vision Assistant for Semiconductor Electron Micrograph Analysis
DOI:
https://doi.org/10.1609/aaaiss.v3i1.31205
Keywords:
Large Multi-Modal Models (LMMs), Visual Question Answering (VQA), Vision-Language Instruction Tuning (VLIT), Parameter Efficient Fine-Tuning (PEFTs), Semiconductor Science
Abstract
We present a novel framework for analyzing and interpreting electron microscopy images in semiconductor manufacturing using vision-language instruction tuning. The framework employs a teacher-student approach: pretrained multimodal large language models such as GPT-4 generate instruction-following data for zero-shot visual question answering (VQA) and classification tasks, and this data is used to customize smaller multimodal models (SMMs) for microscopy image analysis, yielding an instruction-tuned language-and-vision assistant. The framework merges knowledge engineering with machine learning to transfer domain-specific expertise from larger to smaller multimodal models within this specialized field, greatly reducing the need for extensive human labeling. Our study presents a secure, cost-effective, and customizable approach for analyzing microscopy images, addressing the challenges of adopting proprietary models in semiconductor manufacturing.
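The data-generation half of the teacher-student approach described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and does not come from the paper: the `InstructionRecord` fields, the `build_instruction_data` helper, and the `stub_teacher` function, which stands in for a call to a large multimodal model such as GPT-4. The sketch only shows how teacher answers to a fixed set of questions might be packaged as instruction-following records for later fine-tuning of a smaller model.

```python
import json
from dataclasses import dataclass, asdict
from typing import Callable, List


@dataclass
class InstructionRecord:
    """One instruction-following example (hypothetical schema)."""
    image_path: str
    instruction: str
    response: str


def build_instruction_data(
    image_paths: List[str],
    questions: List[str],
    teacher: Callable[[str, str], str],
) -> List[InstructionRecord]:
    """Ask the teacher every question about every image and keep the answers."""
    records = []
    for path in image_paths:
        for question in questions:
            answer = teacher(path, question)
            records.append(InstructionRecord(path, question, answer))
    return records


def stub_teacher(image_path: str, question: str) -> str:
    """Placeholder for a large multimodal model; returns a dummy answer."""
    return f"[teacher answer about {image_path} for: {question}]"


def to_jsonl(records: List[InstructionRecord]) -> str:
    """Serialize records as JSON Lines, a common format for fine-tuning data."""
    return "\n".join(json.dumps(asdict(r)) for r in records)


if __name__ == "__main__":
    data = build_instruction_data(
        ["micrograph_001.png"],
        ["What nanostructure is shown?", "Describe the surface morphology."],
        stub_teacher,
    )
    print(to_jsonl(data))
```

In practice the stub would be replaced by a real teacher model, and the resulting records would feed a parameter-efficient fine-tuning step for the smaller model; neither of those steps is shown here.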
Published
2024-05-20
Issue
Section
Empowering Machine Learning and Large Language Models with Domain and Commonsense Knowledge