Pic2Prep: A Multimodal Conversational Agent for Cooking Assistance
DOI:
https://doi.org/10.1609/aaai.v39i28.35359
Abstract
As the demand for healthier, personalized culinary experiences grows, so does the need for advanced food computation models that offer more than basic nutritional insights. However, current food computation models lack the depth to provide actionable insights, such as ingredient substitutions or alternative cooking actions, that suit users' dietary goals. To address this, we introduce and demonstrate Pic2Prep, a multimodal conversational system that generates detailed cooking instructions, actions, and ingredient lists from both images and text provided by users. The system is developed using a novel dataset generated with Stable Diffusion, where recipe titles and ingredient lists from the Recipe1M dataset serve as input to synthesize food images with variations. This dataset is used to fine-tune the Bootstrapping Language-Image Pre-training (BLIP) model to extract cooking instructions and ingredients from food images. Pic2Prep also employs CookGen, a small-scale custom generative model, to derive specific cooking actions from cooking instructions. A custom mapper, trained on the Mistral model, links these actions to the corresponding ingredients, creating a comprehensive understanding of the cooking process. The system features an interactive user interface that allows users to input images and ask targeted questions, receiving real-time responses.
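The abstract describes a three-stage pipeline: a fine-tuned BLIP model extracts instructions and ingredients from a food image, CookGen derives cooking actions from the instructions, and a Mistral-based mapper links actions to ingredients. The sketch below illustrates that data flow with hypothetical stub functions; the function names, return formats, and heuristics are assumptions for illustration only, not the authors' actual interfaces or models.

```python
# Hypothetical sketch of the Pic2Prep data flow. Each stage is a stub:
# the real system uses a fine-tuned BLIP model, the CookGen model, and
# a Mistral-based action-ingredient mapper instead of these heuristics.

def extract_from_image(image_path: str) -> dict:
    """Stand-in for the fine-tuned BLIP stage: image -> instructions + ingredients."""
    # A real implementation would run BLIP inference on the image here.
    return {
        "instructions": ["Whisk eggs", "Fry in butter until set"],
        "ingredients": ["eggs", "butter"],
    }

def derive_actions(instructions: list) -> list:
    """Stand-in for CookGen: extract a cooking action from each instruction step."""
    # Naive heuristic: treat the first word of each step as the action verb.
    return [step.split()[0].lower() for step in instructions]

def map_actions_to_ingredients(actions: list, ingredients: list) -> dict:
    """Stand-in for the Mistral-based mapper linking actions to ingredients."""
    # Naive positional pairing, cycling through the ingredient list.
    return {a: ingredients[i % len(ingredients)] for i, a in enumerate(actions)}

def pic2prep(image_path: str) -> dict:
    """Run the full (stubbed) pipeline on one image."""
    parsed = extract_from_image(image_path)
    actions = derive_actions(parsed["instructions"])
    mapping = map_actions_to_ingredients(actions, parsed["ingredients"])
    return {**parsed, "actions": actions, "action_ingredient_map": mapping}

result = pic2prep("omelette.jpg")
print(result["actions"])                # ['whisk', 'fry']
print(result["action_ingredient_map"])  # {'whisk': 'eggs', 'fry': 'butter'}
```

In the deployed system, each stub would be replaced by the corresponding model's inference call, and the conversational interface would query the resulting action-ingredient map to answer user questions.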
Published
2025-04-11
How to Cite
Mana, R. P. K., Shyalika, C., Venkataramanan, R., Eswaramoorthi, D. L., & Sheth, A. P. (2025). Pic2Prep: A Multimodal Conversational Agent for Cooking Assistance. Proceedings of the AAAI Conference on Artificial Intelligence, 39(28), 29661-29663. https://doi.org/10.1609/aaai.v39i28.35359
Section
AAAI Demonstration Track