Video diffusion models have recently achieved remarkable progress in realism and controllability. However, seamless video translation across different perspectives, such as first-person (egocentric) and third-person (exocentric), remains underexplored. Bridging these perspectives is crucial for filmmaking, embodied AI, and world models. Motivated by this, we present WorldWander, an in-context learning framework tailored for translating between egocentric and exocentric worlds in video generation. Building upon advanced video diffusion transformers, WorldWander integrates (i) In-Context Perspective Alignment and (ii) Collaborative Position Encoding to efficiently model cross-view synchronization. To support this task, we curate EgoExo-8K, a large-scale dataset containing synchronized egocentric–exocentric triplets from both synthetic and real-world scenarios. Experiments demonstrate that WorldWander achieves superior perspective synchronization, character consistency, and generalization, setting a new benchmark for egocentric–exocentric video translation. Our code is available on GitHub.
In this work, we propose WorldWander, an in-context learning framework that bridges first-person (egocentric) and third-person (exocentric) perspectives in video generation, enabling immersive and character-centric exploration of virtual worlds. Building upon advanced video diffusion transformers, WorldWander integrates the In-Context Perspective Alignment paradigm and the Collaborative Position Encoding strategy, which jointly model perspective correspondence without relying on auxiliary networks. The overall pipeline is illustrated in the figure below.
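To make the idea concrete, here is a minimal sketch of how in-context perspective alignment with collaborative positions could be set up. This is our own illustrative assumption, not the released WorldWander implementation: names such as `build_collab_positions` and `InContextCrossViewBlock` are hypothetical, and a simple learned position projection stands in for whatever position encoding the model actually uses. The sketch concatenates egocentric and exocentric video tokens along the sequence axis so one transformer block attends across both views, while the two views share temporal position indices and differ only by a spatial/view offset.

```python
# Sketch (assumption, not the official code): in-context concatenation of ego/exo
# tokens plus "collaborative" positions that share the time axis across views.
import torch
import torch.nn as nn


def build_collab_positions(num_frames: int, tokens_per_frame: int, view_offset: int = 0):
    """Hypothetical collaborative position ids: the temporal index is shared
    across views, while `view_offset` separates ego and exo tokens spatially."""
    t = torch.arange(num_frames).repeat_interleave(tokens_per_frame)      # shared time axis
    s = torch.arange(tokens_per_frame).repeat(num_frames) + view_offset   # view-specific spatial axis
    return torch.stack([t, s], dim=-1)                                    # (num_frames * tokens_per_frame, 2)


class InContextCrossViewBlock(nn.Module):
    """Toy transformer block that attends jointly over concatenated ego + exo tokens."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.pos_embed = nn.Linear(2, dim)  # stand-in for RoPE / learned 2D positions
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, ego_tokens, exo_tokens, ego_pos, exo_pos):
        x = torch.cat([ego_tokens, exo_tokens], dim=1)             # in-context concatenation
        pos = torch.cat([ego_pos, exo_pos], dim=0).float()         # collaborative positions
        x = x + self.pos_embed(pos)                                # inject positions
        out, _ = self.attn(self.norm(x), self.norm(x), self.norm(x))  # joint cross-view attention
        return x + out


if __name__ == "__main__":
    # Toy shapes: 4 frames, 16 tokens per frame, 64-dim features.
    B, F, P, D = 1, 4, 16, 64
    ego = torch.randn(B, F * P, D)
    exo = torch.randn(B, F * P, D)
    ego_pos = build_collab_positions(F, P, view_offset=0)
    exo_pos = build_collab_positions(F, P, view_offset=P)  # shift exo spatial ids past ego's
    block = InContextCrossViewBlock(D)
    print(block(ego, exo, ego_pos, exo_pos).shape)  # torch.Size([1, 128, 64])
```

Because both views sit in one token sequence, no auxiliary cross-view network is needed in this sketch; alignment emerges from attention over the shared sequence, which is the intuition behind the in-context formulation described above.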
If you find our work useful, please cite:

@article{song2025worldwander,
  title={WorldWander: Bridging Egocentric and Exocentric Worlds in Video Generation},
  author={Song, Quanjian and Song, Yiren and Peng, Kelly and Gao, Yuan and Shou, Mike Zheng},
  journal={arXiv preprint arXiv:2511.22098},
  year={2025}
}