Boosting Consistency in Story Visualization with Rich-Contextual Conditional Diffusion Models

Authors

  • Fei Shen, Nanjing University of Science and Technology; Tencent AI Lab
  • Hu Ye, Tencent AI Lab
  • Sibo Liu, Tencent AI Lab
  • Jun Zhang, Tencent AI Lab
  • Cong Wang, Tencent AI Lab
  • Xiao Han, Tencent AI Lab
  • Yang Wei, Tencent AI Lab

DOI:

https://doi.org/10.1609/aaai.v39i7.32728

Abstract

Recent research has shown the considerable potential of conditional diffusion models for generating consistent stories. However, current methods, which primarily generate stories in a caption-dependent manner, often overlook contextual consistency and the relevance between frames during sequential generation. To address this, we propose Rich-contextual Conditional Diffusion Models (RCDMs), a two-stage approach designed to enhance the semantic and temporal consistency of story generation. In the first stage, a frame-prior transformer diffusion model predicts the frame semantic embedding of the unknown clip by aligning the semantic correlations between the captions and frames of the known clip. The second stage establishes a robust model with rich contextual conditions, including reference images of the known clip, the predicted frame semantic embedding of the unknown clip, and text embeddings of all captions. By jointly injecting these rich contextual conditions at the image and feature levels, RCDMs can generate semantically and temporally consistent stories. Moreover, unlike autoregressive models, RCDMs can generate consistent stories with a single forward inference. Our qualitative and quantitative results demonstrate that the proposed RCDMs outperform existing methods in challenging scenarios.
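
The two-stage data flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: every name below (`StoryContext`, `stage1_predict_frame_embeddings`, `stage2_generate_frames`) is hypothetical, embeddings are plain lists of floats, and simple averaging stands in for both diffusion models. It only shows which inputs feed which stage, and that the unknown frames are produced in one pass without feeding generated frames back in.

```python
from dataclasses import dataclass
from typing import List

Vector = List[float]


@dataclass
class StoryContext:
    """Hypothetical container for the conditions named in the abstract."""
    known_frames: List[Vector]      # stand-ins for reference images of the known clip
    known_captions: List[Vector]    # text embeddings of the known clip's captions
    unknown_captions: List[Vector]  # text embeddings of the unknown clip's captions


def mean_vector(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def stage1_predict_frame_embeddings(ctx: StoryContext) -> List[Vector]:
    """Toy stand-in for the frame-prior model (assumption, not the paper's method).

    Estimates the average caption-to-frame offset on the known clip and
    applies it to each caption of the unknown clip.
    """
    offsets = [
        [f - c for f, c in zip(frame, caption)]
        for frame, caption in zip(ctx.known_frames, ctx.known_captions)
    ]
    mean_offset = mean_vector(offsets)
    return [[c + o for c, o in zip(caption, mean_offset)]
            for caption in ctx.unknown_captions]


def stage2_generate_frames(ctx: StoryContext,
                           predicted: List[Vector]) -> List[Vector]:
    """Toy stand-in for the second-stage model (assumption, not the paper's method).

    Combines the three conditions listed in the abstract: reference frames,
    predicted frame embeddings, and text embeddings of all captions.
    Each output depends only on these conditions, never on another output.
    """
    reference = mean_vector(ctx.known_frames)
    caption_context = mean_vector(ctx.known_captions + ctx.unknown_captions)
    return [mean_vector([reference, embedding, caption_context])
            for embedding in predicted]


def generate_story(ctx: StoryContext) -> List[Vector]:
    """Run stage 1, then stage 2, once each."""
    predicted = stage1_predict_frame_embeddings(ctx)
    return stage2_generate_frames(ctx, predicted)
```

Calling `generate_story` on a `StoryContext` with two known frames and two unknown captions returns two output vectors, one per unknown caption. In an autoregressive design, by contrast, each generated frame would become an input to the next step.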

Published

2025-04-11

How to Cite

Shen, F., Ye, H., Liu, S., Zhang, J., Wang, C., Han, X., & Wei, Y. (2025). Boosting Consistency in Story Visualization with Rich-Contextual Conditional Diffusion Models. Proceedings of the AAAI Conference on Artificial Intelligence, 39(7), 6785–6794. https://doi.org/10.1609/aaai.v39i7.32728

Section

AAAI Technical Track on Computer Vision VI