HunyuanWorld-Voyager
Here’s a tight executive brief on HunyuanWorld‑Voyager, followed by a deeper dive you can scan or cherry‑pick.
TL;DR
- What it is: an open‑weights RGB‑D video diffusion model from Tencent that turns a single image + a user‑defined camera path into 3D‑consistent, explorable scene videos and aligned depth for direct 3D reconstruction (point clouds). (Hugging Face)
- Why it matters: gives you camera‑controllable “world roams” without a full 3D reconstruction pipeline; you can auto‑extend clips for minute‑long world exploration while preserving spatial layout. (ar5iv)
- Release & code: public weights + code Sept 2, 2025 on GitHub & Hugging Face. (GitHub, Hugging Face)
- Under the hood: builds on HunyuanVideo (DiT‑based) with geometry‑injected conditioning (partial RGB + depth), a world cache + point culling, and a data engine that auto‑recovers camera + metric depth to scale training. (ar5iv)
- How good: tops Stanford’s WorldScore benchmark overall (77.62) vs WonderWorld (72.69) and CogVideoX‑I2V (62.15). (GitHub)
- Caveats: heavy compute (≥60 GB VRAM for 540p), outputs aren’t true meshes (you reconstruct point clouds from RGB‑D), and a community license with territory limits (no EU/UK/KR) + no model‑training using Outputs. (GitHub, Hugging Face)
1) What Voyager actually does
Voyager takes one input image and a camera trajectory (e.g., forward, turn‑left/right) and generates a video where the viewpoint moves through a coherent 3D scene. It emits RGB frames and depth per frame, so you can reconstruct a 3D point‑cloud sequence or fuse into a scene representation. You can run it autoregressively to extend exploration while keeping geometry consistent. (Hugging Face, ar5iv)
Typical demos use ~49 frames (~2 s) per segment and chain segments for multi‑minute walks. (TechSpot)
2) How it works (nutshell)
Paper: Voyager: Long‑Range and World‑Consistent Video Diffusion for Explorable 3D Scene Generation (Huang et al., 2025). Core ideas: (ar5iv)
- Geometry‑injected conditioning: The model is conditioned on partial RGB and partial depth rendered from an evolving 3D cache of the world; adding depth reduces hallucinated occlusions. (ar5iv)
- World‑consistent video diffusion: A DiT‑based video model (from Hunyuan‑Video) jointly generates RGB + depth aligned to the camera path and prior world state. (ar5iv)
- World cache + point culling: Each generated frame updates a 3D point‑cloud cache; redundant points are culled to control memory; the cache conditions the next clip for long‑range consistency (a minimal sketch follows this list). (ar5iv)
- Smooth video sampling (autoregressive): Generates long sequences by stitching clips while smoothing transitions to avoid flicker/drift. (ar5iv)
- Scalable data engine: A pipeline estimates camera poses and metric depth for arbitrary videos (incl. Unreal Engine renders), yielding 100k+ training pairs without manual 3D labels. (ar5iv)
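A minimal NumPy sketch of the world‑cache‑plus‑culling idea, using a simple voxel grid to drop redundant points. This illustrates the mechanism the paper describes, not Tencent’s implementation; the function names are made up.

```python
import numpy as np

def cull_redundant_points(points, voxel=0.05):
    """Keep at most one point per voxel cell (a simple stand-in for Voyager's point culling)."""
    keys = np.floor(points / voxel).astype(np.int64)
    _, keep = np.unique(keys, axis=0, return_index=True)
    return points[np.sort(keep)]

def update_world_cache(cache, new_points, voxel=0.05):
    """Merge points back-projected from a newly generated RGB-D clip, then cull."""
    merged = np.vstack([cache, new_points])
    return cull_redundant_points(merged, voxel)

# Toy usage: each generated "clip" contributes points; culling keeps the cache bounded.
cache = np.empty((0, 3))
for _ in range(3):
    clip_points = np.random.rand(10_000, 3) * 5.0   # stand-in for back-projected RGB-D
    cache = update_world_cache(cache, clip_points)
print(cache.shape)
```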
3) Training data & backbone
- 100k+ curated video clips (real + Unreal Engine synthetic), each with estimated pose + metric depth to align scales (see the scale‑alignment sketch after this list). (ar5iv)
- The video diffusion backbone is derived from Hunyuan‑Video (3D‑VAE + DiT), adapted for image‑to‑video with geometry‑aware conditioning. (ar5iv)
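On “align scales”: one common way to put up‑to‑scale (relative) depth on a metric footing is a least‑squares scale/shift fit against sparse metric anchors. A minimal sketch of that generic technique, not necessarily what Voyager’s data engine actually does:

```python
import numpy as np

def align_depth_scale(relative_depth, metric_depth, mask):
    """Fit metric ≈ s * relative + t over pixels where metric depth is known (mask)."""
    x = relative_depth[mask].ravel()
    y = metric_depth[mask].ravel()
    A = np.stack([x, np.ones_like(x)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, y, rcond=None)
    return s * relative_depth + t

# Toy usage with synthetic data.
rel = np.random.rand(480, 640)
metric = 2.5 * rel + 0.3                       # pretend ground-truth metric depth
mask = np.random.rand(480, 640) < 0.01         # sparse metric anchors (~1% of pixels)
aligned = align_depth_scale(rel, metric, mask)
```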
4) Results & benchmarks
- On WorldScore (Stanford’s unified benchmark for “world generation”), Voyager scores 77.62 overall; highlights include style consistency and subjective quality, with camera control second only to WonderWorld. (GitHub, arXiv)
- The Voyager README shows the comparative table (vs WonderWorld, WonderJourney, Gen‑3, CogVideoX‑I2V, etc.). (GitHub)
- About WorldScore: evaluates controllability (camera/object), quality (3D & photometric consistency, style), and dynamics across 3K test prompts. (Haoyi Duan, arXiv)
5) What you can build with it (examples)
- Camera‑controllable world roams from a single reference photo (forward/back/left/right/turn).
- Fast 3D reconstruction from RGB‑D video (point clouds) for previews, previz, or scene blocking (see the back‑projection sketch after this list).
- Depth estimation/video transfer use cases piggyback on the joint RGB‑D outputs. (GitHub, Hugging Face)
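Since Voyager emits per‑frame depth aligned with RGB, a quick point‑cloud preview is just pinhole back‑projection. A minimal NumPy sketch with placeholder intrinsics (fx, fy, cx, cy are assumptions), separate from any reconstruction code the repo may ship:

```python
import numpy as np

def rgbd_to_point_cloud(rgb, depth, fx, fy, cx, cy):
    """Back-project one RGB-D frame into an (N, 6) array of XYZ + RGB points."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    valid = z > 0                                   # drop invalid / zero-depth pixels
    xyz = np.stack([x[valid], y[valid], z[valid]], axis=1)
    return np.hstack([xyz, rgb[valid]])

# Toy usage: one synthetic 540p frame with made-up intrinsics.
rgb = np.random.rand(540, 960, 3)
depth = np.random.uniform(0.5, 10.0, (540, 960))
points = rgbd_to_point_cloud(rgb, depth, fx=700.0, fy=700.0, cx=480.0, cy=270.0)
print(points.shape)  # (N, 6)
```

Fusing several frames along a camera path additionally means transforming each frame’s points by its camera‑to‑world pose before merging them into one cloud.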
6) Limits & caveats (practical)
- Not a full mesh world: Voyager outputs RGB‑D video, not production‑ready meshes; you’ll reconstruct point clouds (and can then try meshing; a hedged sketch follows this list), so don’t expect game‑ready assets out of the box. (TechSpot)
- Compute‑heavy: For 540p generation, expect ≥60 GB VRAM (80 GB recommended). Multi‑GPU xDiT scaling is supported; Tencent reports ~6.7× latency reduction on 8× H20 vs single‑GPU for a 512×768, 49‑frame, 50‑step run. (GitHub)
- Long‑range drift: Autoregressive stitching holds for minutes, but very long or complex paths (e.g., continuous 360° rotations) can accumulate errors. (TechSpot)
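If you do want to push past point clouds, Poisson surface reconstruction is one common next step. A hedged sketch using Open3D (an external dependency, not part of the Voyager repo), with a toy point cloud standing in for one reconstructed from Voyager’s RGB‑D output; expect manual cleanup on real scenes:

```python
import numpy as np
import open3d as o3d

# Toy point cloud (unit sphere) standing in for points back-projected from Voyager frames.
pts = np.random.randn(20_000, 3)
pts /= np.linalg.norm(pts, axis=1, keepdims=True)

pcd = o3d.geometry.PointCloud()
pcd.points = o3d.utility.Vector3dVector(pts)
pcd.estimate_normals(o3d.geometry.KDTreeSearchParamHybrid(radius=0.1, max_nn=30))

# Poisson reconstruction; low-density vertices are typically trimmed afterwards.
mesh, densities = o3d.geometry.TriangleMesh.create_from_point_cloud_poisson(pcd, depth=8)
densities = np.asarray(densities)
mesh.remove_vertices_by_mask(densities < np.quantile(densities, 0.05))
o3d.io.write_triangle_mesh("preview_mesh.ply", mesh)
```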
7) How to get & run it
- Code & weights: GitHub repo + Hugging Face model card/weights. (GitHub, Hugging Face)
- Quick start (single‑GPU):
```bash
python3 sample_image2video.py \
    --model HYVideo-T/2 \
    --input-path examples/case1 \
    --prompt "An old-fashioned European village with thatched roofs on the houses." \
    --i2v-stability --infer-steps 50 --flow-reverse --flow-shift 7.0 \
    --embedded-cfg-scale 6.0 --seed 0 --use-cpu-offload \
    --save-path ./results
```
Camera paths you can generate with the helper script: forward | backward | left | right | turn_left | turn_right. (GitHub)
- Multi‑GPU (xDiT/USP): set --ulysses-degree × --ring-degree to the GPU count; e.g., 8‑GPU parallel inference. (GitHub)
- Local Gradio demo: python3 app.py (upload an image, pick camera direction, enter prompt). (GitHub)
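The camera paths above (forward, backward, turn_left, …) are ultimately just sequences of camera poses. A purely illustrative NumPy sketch of what such pose sequences could look like, with made‑up step sizes and axis conventions; the repo’s helper script is the actual source of the supported trajectories:

```python
import numpy as np

def forward_path(n_frames=49, step=0.05):
    """Camera-to-world poses translating along -Z (made-up convention and step size)."""
    poses = []
    for i in range(n_frames):
        T = np.eye(4)
        T[2, 3] = -step * i                    # move the camera forward each frame
        poses.append(T)
    return poses

def turn_left_path(n_frames=49, deg_per_frame=1.0):
    """Camera-to-world poses rotating about the Y (up) axis."""
    poses = []
    for i in range(n_frames):
        a = np.radians(deg_per_frame * i)
        T = np.eye(4)
        T[:3, :3] = np.array([[ np.cos(a), 0.0, np.sin(a)],
                              [ 0.0,       1.0, 0.0      ],
                              [-np.sin(a), 0.0, np.cos(a)]])
        poses.append(T)
    return poses

poses = forward_path()   # 49 poses, matching the typical ~49-frame segment length
```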
8) Licensing & usage constraints (read this)
License: Tencent HunyuanWorld‑Voyager Community License (posted Sept 2, 2025). Key points:
- Territory restriction: Not licensed for use in the EU, UK, or South Korea. (Hugging Face)
- Large‑scale use: If your products/services had >1 million MAU on the release date, you must request a separate license before exercising rights. (Hugging Face)
- No training other models: You must not use the Outputs (or the model) to improve other AI models (distillation/synthetic‑data training, etc.). (Hugging Face)
- Attribution encouragement: Tencent encourages a blogpost and marking products “Powered by Tencent Hunyuan” (not a hard requirement, but called out). (Hugging Face)
- The Hugging Face model card also marks the model as EU‑gated. (Hugging Face)
9) Voyager vs. HunyuanWorld 1.0 (how they fit)
- Voyager: fast image→RGB‑D video with camera control; great for roaming videos and quick 3D point‑cloud recon; built on video diffusion + world cache. (GitHub, ar5iv)
- HunyuanWorld 1.0: a complementary world‑generation framework (text/image→panorama proxies → layered 3D mesh) with mesh export and interactive potential. Voyager was announced alongside HW 1.0 and listed in its “News.” (GitHub)
10) Sources & where to explore more
- Primary: Voyager paper & HF card (model card + assets). (ar5iv, Hugging Face)
- Code & instructions:Voyager GitHub (requirements, CLI, xDiT scaling, demo). (GitHub)
- HW 1.0 repo & paper for the mesh‑based pipeline and context. (GitHub, arXiv)
- News / explainers: TechSpot, Gigazine, Techmeme roundups. (Good for caveats like clip length and compute needs.) (TechSpot, GIGAZINE, Techmeme)
11) Quick adoption notes (actionable)
- Best for: previz, cinematic roams from reference stills, rapid 3D exploration prototypes.
- Not ideal for: turnkey, production‑ready game worlds or clean mesh assets without additional reconstruction/cleanup.
- Hardware planning: target 80 GB GPUs for comfort; plan multi‑GPU if you need minutes‑long 720p content in reasonable time. (GitHub)
- Compliance: confirm your jurisdiction and MAU status vs. the license before shipping anything. (Hugging Face)