SceneDecorator: Towards Scene-Oriented Story Generation with Scene Planning and Scene Consistency

1Monash University 2The Chinese University of Hong Kong
3National University of Singapore 4South China University of Technology
*Equal contribution. Project lead. Corresponding authors.
Accepted at NeurIPS 2025

TL;DR: We propose SceneDecorator, a training-free framework for scene-oriented story generation. It integrates VLM-Guided Scene Planning and Long-Term Scene-Sharing Attention to tackle two key challenges: scene planning and scene consistency, which have been largely overlooked by previous work.


[Teaser figure]

Abstract

Recent text-to-image models have revolutionized image generation, but they still struggle to maintain concept consistency across generated images. While existing works focus on character consistency, they often overlook the crucial role of scenes in storytelling, which restricts their creativity in practice. This paper introduces scene-oriented story generation, addressing two key challenges: (i) scene planning, where current methods rely solely on text descriptions and thus fail to ensure scene-level narrative coherence, and (ii) scene consistency, i.e., keeping scenes consistent across multiple stories, which remains largely unexplored. We propose SceneDecorator, a training-free framework that employs VLM-Guided Scene Planning to ensure narrative coherence across different scenes in a "global-to-local" manner, and Long-Term Scene-Sharing Attention to maintain long-term scene consistency and subject style diversity across generated stories. Extensive experiments demonstrate the superior performance of SceneDecorator, highlighting its potential to unleash creativity in the fields of arts, films, and games.

Methodology

In this work, we design SceneDecorator, a training-free framework that addresses two key challenges in story generation: scene planning and scene consistency. SceneDecorator comprises two core techniques: (i) VLM-Guided Scene Planning, which leverages a powerful Vision-Language Model (VLM) as a director to decompose user-provided themes into local scenes and story sub-prompts in a "global-to-local" manner; and (ii) Long-Term Scene-Sharing Attention, which integrates mask-guided scene injection, scene-sharing attention, and extrapolable noise blending to maintain subject style diversity and long-term scene consistency across generated stories. The overall framework is shown below; minimal illustrative sketches of both components follow the figure.


[Framework overview figure]
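To make the "global-to-local" planning concrete, the sketch below shows one way a VLM could decompose a user theme into scenes and per-scene story sub-prompts. The `query_vlm` helper, the prompt wording, and the JSON schema are illustrative assumptions, not the exact prompts or interfaces used in the paper.

```python
# Minimal sketch of "global-to-local" scene planning (assumed interface).
# `query_vlm` is a hypothetical callable standing in for any VLM/LLM chat call.
import json
from typing import Callable, Dict, List

def plan_scenes(theme: str, num_scenes: int, query_vlm: Callable[[str], str]) -> List[Dict]:
    # Global step: ask the VLM (acting as a "director") for a coherent outline
    # that splits the theme into a fixed number of scenes.
    global_prompt = (
        f"You are a story director. Decompose the theme '{theme}' into "
        f"{num_scenes} narratively coherent scenes. Return a JSON list of "
        "objects with keys 'scene' (environment description) and 'summary'."
    )
    scenes = json.loads(query_vlm(global_prompt))

    # Local step: expand each scene into concrete story sub-prompts for the
    # text-to-image model, conditioned on the global outline for coherence.
    outline = [s["summary"] for s in scenes]
    for scene in scenes:
        local_prompt = (
            f"Given the overall outline {json.dumps(outline)}, write 3 short "
            f"image-generation sub-prompts set in this scene: {scene['scene']}. "
            "Return a JSON list of strings."
        )
        scene["sub_prompts"] = json.loads(query_vlm(local_prompt))
    return scenes
```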
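Similarly, the sketch below illustrates one way a scene-sharing attention layer could be realized inside a diffusion U-Net: each story frame attends to its own tokens plus cached key/value tokens projected from the shared scene, optionally restricted by a mask (mask-guided scene injection). The function name, tensor layout, and masking scheme are assumptions for illustration; extrapolable noise blending is omitted here.

```python
# A minimal sketch of scene-sharing attention with mask-guided scene injection.
# Inputs are assumed to be post-projection q/k/v tensors of a U-Net attention layer.
import torch
import torch.nn.functional as F

def scene_sharing_attention(q, k, v, scene_k, scene_v, scene_mask=None, num_heads=8):
    """
    q, k, v:           (batch, tokens, dim) projections of the current story frame.
    scene_k, scene_v:  (batch, scene_tokens, dim) cached projections of the shared
                       scene, reused across frames/stories for long-term consistency.
    scene_mask:        optional (batch, scene_tokens) bool mask selecting which
                       scene tokens may be injected (mask-guided scene injection).
    """
    b, n, d = q.shape
    h = num_heads

    # Scene-sharing: let each frame attend to its own tokens plus the scene tokens.
    k_cat = torch.cat([k, scene_k], dim=1)
    v_cat = torch.cat([v, scene_v], dim=1)

    attn_mask = None
    if scene_mask is not None:
        # Queries always see their own tokens, but only the masked scene tokens.
        own = torch.ones(b, n, k.shape[1], dtype=torch.bool, device=q.device)
        shared = scene_mask[:, None, :].expand(b, n, scene_k.shape[1])
        attn_mask = torch.cat([own, shared], dim=-1)[:, None]  # (b, 1, n, n_kv)

    # Split heads and run standard scaled dot-product attention.
    q = q.reshape(b, n, h, d // h).transpose(1, 2)
    k_cat = k_cat.reshape(b, -1, h, d // h).transpose(1, 2)
    v_cat = v_cat.reshape(b, -1, h, d // h).transpose(1, 2)
    out = F.scaled_dot_product_attention(q, k_cat, v_cat, attn_mask=attn_mask)
    return out.transpose(1, 2).reshape(b, n, d)
```

In practice, such a layer would replace the self-attention calls at selected U-Net blocks, with the scene key/value cache computed once from the scene image and reused across frames and stories.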

Qualitative Comparisons

[Qualitative comparisons figure]

More Applications

Manual Scene Input and Consistent Character

[Figure: manual scene input and consistent character]

Generation with Generative Tools

[Figure: generation with generative tools]

Generation with Evolving Scenes

[Figure: generation with evolving scenes]

BibTeX

XXX