WorldWander: Bridging Egocentric and Exocentric Worlds in Video Generation

¹Show Lab, National University of Singapore  ²First Intelligence
*Equal contribution. Corresponding author.

TL;DR: We propose WorldWander, an in-context learning framework for translating between egocentric and exocentric worlds in video generation. We also release EgoExo-8K, a large-scale dataset containing synchronized egocentric–exocentric triplets.

Teaser

Abstract

Video diffusion models have recently achieved remarkable progress in realism and controllability. However, seamless video translation across different perspectives, such as first-person (egocentric) and third-person (exocentric), remains underexplored. Bridging these perspectives is crucial for filmmaking, embodied AI, and world models. Motivated by this, we present WorldWander, an in-context learning framework tailored for translating between egocentric and exocentric worlds in video generation. Building upon advanced video diffusion transformers, WorldWander integrates (i) In-Context Perspective Alignment and (ii) Collaborative Position Encoding to efficiently model cross-view synchronization. To further support our task, we curate EgoExo-8K, a large-scale dataset containing synchronized egocentric–exocentric triplets from both synthetic and real-world scenarios. Experiments demonstrate that WorldWander achieves superior perspective synchronization, character consistency, and generalization, setting a new benchmark for egocentric–exocentric video translation. Our code is available on GitHub.

Method

In this work, we propose WorldWander, an in-context learning framework that bridges first-person (egocentric) and third-person (exocentric) perspectives in video generation, enabling immersive and character-centric exploration of virtual worlds. Building upon advanced video diffusion transformers, WorldWander integrates the In-Context Perspective Alignment paradigm and the Collaborative Position Encoding strategy, which jointly model perspective correspondence without relying on auxiliary networks. The overall pipeline is shown below:

framework
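The page does not spell out how Collaborative Position Encoding is implemented. As one illustrative, hypothetical reading (not the authors' actual code): when the egocentric and exocentric token sequences are concatenated for in-context attention, synchronized frames from the two views could be assigned the same temporal position index, so attention treats corresponding frames as co-occurring rather than sequential. A minimal sketch of that idea, with all names (`collaborative_position_ids`, `tokens_per_frame`) being assumptions:

```python
import torch

def collaborative_position_ids(num_frames: int, tokens_per_frame: int) -> torch.Tensor:
    """Hypothetical sketch of a collaborative temporal position encoding.

    Each frame's tokens get that frame's temporal index; the exocentric
    sequence is concatenated after the egocentric one but REUSES the same
    temporal indices, so synchronized ego/exo frames share positions.
    """
    # temporal index for every token of one view: [0,0,...,1,1,...,N-1,...]
    frame_ids = torch.arange(num_frames).repeat_interleave(tokens_per_frame)
    # ego tokens first, then exo tokens, sharing identical temporal indices
    return torch.cat([frame_ids, frame_ids], dim=0)

ids = collaborative_position_ids(num_frames=4, tokens_per_frame=2)
# the ego half and the exo half carry the same indices: [0,0,1,1,2,2,3,3]
```

Under this reading, the shared indices are what let a single diffusion transformer align the two views "in context", without an auxiliary cross-view network.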


EgoExo-8K

Visual Gallery

Exocentric-to-Egocentric

Egocentric-to-Exocentric

Qualitative Comparison

Exocentric-to-Egocentric

Egocentric-to-Exocentric

Ablation Study

BibTeX

@article{song2025worldwander,
    title={WorldWander: Bridging Egocentric and Exocentric Worlds in Video Generation},
    author={Song, Quanjian and Song, Yiren and Peng, Kelly and Gao, Yuan and Shou, Mike Zheng},
    journal={arXiv preprint arXiv:2511.22098},
    year={2025}
}