FreeMem: Enhancing Consistency in Long Video Generation via Tuning-Free Memory
DOI:
https://doi.org/10.1609/aaai.v40i10.37783
Abstract
Text-to-Video (T2V) generation has advanced greatly, yet maintaining consistency remains challenging, especially for tuning-free long video generation. We attribute the consistency problem to cumulative deviations that arise at three levels in long video generation: random noise lacking correlation introduces initial deviation between frames; discrepancies in semantic feature tokens between denoising network blocks accumulate as the frame count grows, amplifying deviation; and attention mechanisms struggle to capture global relationships across distant frames in long videos. To address these, we propose FreeMem, a tuning-free framework leveraging hierarchical memory update and injection: the noise memory stabilizes consistency by manipulating low- and high-frequency components in the initial noise space; the token memory combats inconsistency through adaptive fusion of historical and current semantic feature tokens between denoising network blocks; and the attention memory establishes a persistent cache to model long-range relationships within self-attention layers. Evaluated on VBench, FreeMem improves subject and background consistency metrics across various methods, offering a practical solution for low-cost, high-consistency long video generation.
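The noise-memory idea described above — reusing low-frequency structure from memorized noise while refreshing high-frequency detail — can be illustrated with a minimal NumPy sketch. This is not the paper's implementation; the function name, the `cutoff` parameter, and the circular low-pass mask are assumptions introduced here for illustration:

```python
import numpy as np

def blend_noise_frequencies(prev_noise, fresh_noise, cutoff=0.25):
    """Illustrative sketch (not FreeMem's actual method): keep low-frequency
    components from a memorized noise map and high-frequency components from
    freshly sampled noise, to correlate initial noise across frames."""
    # Move both noise maps to the frequency domain, DC component centered.
    prev_f = np.fft.fftshift(np.fft.fft2(prev_noise))
    fresh_f = np.fft.fftshift(np.fft.fft2(fresh_noise))
    h, w = prev_noise.shape[-2:]
    yy, xx = np.mgrid[0:h, 0:w]
    # Circular low-pass mask around the center; `cutoff` is a hypothetical
    # fraction of the spatial bandwidth, not a value from the paper.
    radius = cutoff * min(h, w) / 2
    mask = ((yy - h // 2) ** 2 + (xx - w // 2) ** 2) <= radius ** 2
    # Low frequencies from the memorized noise, high frequencies from fresh noise.
    blended_f = np.where(mask, prev_f, fresh_f)
    return np.fft.ifft2(np.fft.ifftshift(blended_f)).real
```

In this sketch, each new frame's initial noise shares coarse spatial structure with its predecessor (reducing frame-to-frame deviation) while retaining independent fine-grained randomness.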
Published
2026-03-14
How to Cite
Peng, J., Lin, D., Xu, Z., Lu, H., Liu, R., Xie, W., Wang, M., Liang, L., Wang, Y., & Guo, Q. (2026). FreeMem: Enhancing Consistency in Long Video Generation via Tuning-Free Memory. Proceedings of the AAAI Conference on Artificial Intelligence, 40(10), 8340-8348. https://doi.org/10.1609/aaai.v40i10.37783
Section
AAAI Technical Track on Computer Vision VII