Genie 3 (Google DeepMind): the real-time world model generating interactive worlds
In August 2025, Google DeepMind introduced Genie 3, a “world model”: an AI system that can create an interactive 3D world from a simple description. In practice, you can move through it like in a video game, in real time (around 24 frames per second), at 720p, and the world stays coherent for a few minutes.
In late January 2026, Google made the idea more accessible through Project Genie, an online demo that lets you generate a world, explore it, and then modify it (“remix” it) to create variations.
The point goes beyond a flashy demo. Genie 3 signals a bigger shift: AI is no longer only producing content (text, images, video). It can simulate interactive environments, which opens doors in gaming, training, and simulation.
World model: a simple definition
A model that predicts “what happens next”
A world model is a system that learns the dynamics of an environment:
- what is likely to happen next,
- how the state of the world evolves,
- and how an agent’s actions change that state.
A classic video model generates a sequence “like a movie.” A world model has to stay consistent when you change direction, go back, or trigger an event.
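To make the contrast concrete, here is a tiny sketch, purely illustrative and not DeepMind's interface: a video model only extends a sequence of frames, while a world model has to take the user's action into account and show the same thing again when you return to a place.

```python
# Illustrative sketch only -- not Genie 3's actual interface.
from dataclasses import dataclass, field

@dataclass
class VideoModel:
    """A plain video model: the next frame depends only on the frames so far."""
    frames: list = field(default_factory=list)

    def next_frame(self) -> str:
        frame = f"frame_{len(self.frames)}"
        self.frames.append(frame)
        return frame

@dataclass
class WorldModel:
    """A world model: the next observation depends on the action taken,
    and revisiting a place must look the same as before."""
    position: int = 0

    def step(self, action: str) -> str:
        if action == "forward":
            self.position += 1
        elif action == "backward":
            self.position -= 1
        return f"view_from_{self.position}"

world = WorldModel()
print(world.step("forward"))    # view_from_1
print(world.step("backward"))   # back to view_from_0 -- the same view as the start
```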
The real challenge: coherence + interactivity + real time
DeepMind highlights three key properties:
- real time (smooth interaction, ~20–24 fps),
- interactive and controllable (navigation and events),
- photorealistic quality and stability when you revisit an area.
It’s exactly this combination that makes the problem both exciting and hard.
Genie 3: what DeepMind is actually claiming
Generating playable environments from a prompt
The idea is that a text prompt describes the setting, the viewpoint, and sometimes the “style” (realistic, FPV drone, ancient, etc.). Genie 3 then produces a world you can move through.
Real-time navigation and “promptable world events”
Genie 3 isn’t just “walking down a generated corridor.” DeepMind also describes promptable events: you can request a change (weather, appearance of objects/characters, and so on) and the world adapts.
This matters because it feels like a “what-if” control primitive. That’s useful for simulation, not just for entertainment.
A concrete use case: training embodied agents
DeepMind says it generated worlds to test an agent like SIMA (a generalist agent for 3D environments), to check that Genie 3 can support goal-oriented scenarios and longer action sequences.
Theoretical workings: what does a world model architecture like Genie 3 look like?
DeepMind does not publish every implementation detail of Genie 3. Still, you can infer a plausible architecture based on:
- what they describe (real time, coherence, events, action space),
- and the earlier “Genie” line of work, where the core idea is an action-driven generative world learned from videos.
1) A compact representation of the world (spatio-temporal tokenization)
To generate quickly, the model does not build the world pixel by pixel. It encodes the scene into a latent representation that is easier to model than raw frames (visual and spatio-temporal “tokens”), like many modern video models.
Why?
Because running a world at 24 fps comes with a tight compute budget. Compression plus incremental generation becomes essential.
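As a rough back-of-the-envelope illustration: the 24 fps and 720p figures come from DeepMind's announcement, while the number of tokens per frame is an assumption chosen only to show the scale of compression involved.

```python
# Back-of-the-envelope compute budget -- illustrative numbers only.
fps = 24
ms_per_frame = 1000 / fps                      # ~41.7 ms to produce each frame
raw_values = 1280 * 720 * 3                    # one 720p RGB frame: ~2.76M values
tokens_per_frame = 1024                        # hypothetical latent grid size
compression = raw_values / tokens_per_frame    # ~2700x fewer units to predict

print(f"{ms_per_frame:.1f} ms per frame, "
      f"{raw_values:,} raw values vs {tokens_per_frame} tokens "
      f"(~{compression:,.0f}x compression)")
```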
2) A dynamics model: predicting the next state conditioned on actions
At the core of a world model, there’s a predictor:
given the current state + the user’s action, what is the next observation?
In the Genie (2024) paper, DeepMind describes a world model built from three main components:
1) A video tokenizer: compress video into “units”
Raw video is huge (millions of pixels per frame). The tokenizer compresses each frame into a sequence of visual tokens, like small “chunks” of information that are easier for the model to handle.
Key idea: instead of predicting pixels, the model predicts tokens.
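Here is a minimal sketch of that idea: a toy VQ-style tokenizer in NumPy. The patch size, frame size, and random codebook are placeholders; the real tokenizer is a learned neural network.

```python
import numpy as np

# Toy VQ-style tokenizer: split a frame into patches, then map each patch to
# the index of its nearest codebook entry. Sizes are illustrative, not Genie's.
rng = np.random.default_rng(0)

frame = rng.random((72, 128, 3))                  # tiny stand-in for a video frame
patch = 8                                         # 8x8 patches
codebook = rng.random((256, patch * patch * 3))   # 256 learned "visual words"

patches = (frame
           .reshape(72 // patch, patch, 128 // patch, patch, 3)
           .transpose(0, 2, 1, 3, 4)
           .reshape(-1, patch * patch * 3))       # (num_patches, patch_dim)

# The nearest codebook entry per patch is the discrete token the model predicts.
dists = ((patches[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
tokens = dists.argmin(axis=1)                     # shape (144,), values in [0, 255]

print(tokens.shape, tokens[:8])
```

Each frame thus becomes a short sequence of integers instead of millions of pixel values, which is what makes fast, incremental generation plausible.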
2) An autoregressive dynamics model: predicting what comes next
They train a model to answer:
“If I’m at this exact moment (world state), what’s the most likely next step?”
“Autoregressive” means it generates step by step: it predicts the next token, then the next, and so on. Chaining those predictions produces the next frame, then the next, which becomes a video.
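A toy version of that loop looks like this. The `predict_next_token` stub stands in for the learned model (in practice a large transformer over visual tokens), and the vocabulary and token counts are assumptions.

```python
import numpy as np
from typing import List

rng = np.random.default_rng(0)
VOCAB = 256              # size of the visual token vocabulary (assumption)
TOKENS_PER_FRAME = 144   # number of tokens that make up one frame (assumption)

def predict_next_token(context: List[int], action: int) -> int:
    """Stand-in for the learned dynamics model: return the most likely next
    visual token given everything generated so far plus the current action."""
    # A real model would run a transformer here; we fake it with a seeded draw
    # so the example runs end to end.
    return int(rng.integers(0, VOCAB))

def generate_next_frame(history: List[int], action: int) -> List[int]:
    """Autoregressive generation: predict one token, append it, repeat until
    a full frame's worth of tokens has been produced."""
    frame_tokens: List[int] = []
    for _ in range(TOKENS_PER_FRAME):
        token = predict_next_token(history + frame_tokens, action)
        frame_tokens.append(token)
    return frame_tokens

history: List[int] = []
for action in [0, 0, 1]:          # e.g. forward, forward, turn-left (latent ids)
    history.extend(generate_next_frame(history, action))
print(len(history), "tokens generated for 3 frames")
```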
3) A latent action model: learning “actions” without a controller
The big issue is that internet videos don’t come with action logs (you don’t know when the camera turns, when the character moves forward, etc.). So DeepMind learns latent actions: pseudo-actions discovered automatically from the changes observed in videos.
Intuitive examples of latent actions:
- move forward / backward
- turn left / right
- go up / down
- zoom in / zoom out
These actions are not labeled by humans. The model learns its own action representation that best explains how one frame transforms into the next.
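A minimal sketch of the principle, with random weights standing in for what is actually learned jointly with the rest of the system: the model looks at two consecutive frames and snaps the observed change onto one of a handful of discrete codes.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_LATENT_ACTIONS = 8    # small, discrete action vocabulary (assumption)
FRAME_DIM = 64            # flattened frame features (assumption)

# Stand-ins for learned weights: an encoder over a pair of consecutive frames
# and a codebook with one embedding per latent action.
encoder = rng.normal(size=(2 * FRAME_DIM, 16))
action_codebook = rng.normal(size=(NUM_LATENT_ACTIONS, 16))

def infer_latent_action(frame_t: np.ndarray, frame_t1: np.ndarray) -> int:
    """Map the change between two consecutive frames to one of a few discrete
    latent actions -- no human labels involved."""
    change = np.concatenate([frame_t, frame_t1]) @ encoder   # encode the pair
    dists = ((action_codebook - change) ** 2).sum(axis=1)    # nearest code
    return int(dists.argmin())

f0, f1 = rng.random(FRAME_DIM), rng.random(FRAME_DIM)
print("latent action id:", infer_latent_action(f0, f1))
```

Because the codes are discrete and few, they tend to line up with simple controls like "forward" or "turn left," which is what later lets a user drive the generated world.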
3) Visual coherence: memory + stable objects (the hard part)
DeepMind explains that coherence (“consistency”), meaning a place doesn’t change from one second to the next, is not achieved by reconstructing an explicit 3D scene (like NeRF or Gaussian Splatting). Instead, the model generates each frame on the fly, based on the prompt and the user’s behavior.
But for the world not to “reshuffle” as soon as you turn around, the model needs a form of memory. It reuses information from what it has already shown (objects, places, details) to stay stable when you return to the same area.
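One way to picture this, as an analogy only (the article above describes frames being generated on the fly rather than an explicit stored scene), is a rolling context of recently generated frames that conditions every new one:

```python
from collections import deque
from typing import List

class RollingWorldMemory:
    """Keep the most recent generated frames as context, so the next frame is
    conditioned on what the user has already been shown."""

    def __init__(self, max_frames: int = 240):   # e.g. ~10 s of context at 24 fps
        self.context = deque(maxlen=max_frames)

    def remember(self, frame_tokens: List[int]) -> None:
        self.context.append(frame_tokens)

    def conditioning(self) -> List[int]:
        # Everything the generator gets to "look back at" when it draws the
        # next frame -- revisited areas should match what is stored here.
        return [tok for frame in self.context for tok in frame]

memory = RollingWorldMemory()
memory.remember([1, 2, 3])
memory.remember([4, 5, 6])
print(len(memory.conditioning()))   # 6 tokens of context so far
```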
4) “World events”: text control on top of the simulation
Promptable events act like an extra control channel. Instead of only acting through navigation inputs, you inject an instruction that constrains what happens next (“it starts raining,” “a vehicle arrives,” etc.).
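One way to read this, sketched below with hypothetical names, is as one more field in the conditioning the generator receives for each new frame, alongside the navigation input:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class FrameRequest:
    """Everything the generator is conditioned on for the next frame."""
    history_tokens: List[int]
    navigation_action: str                 # e.g. "forward", "turn_left"
    world_event: Optional[str] = None      # e.g. "it starts raining"

def build_request(history: List[int], action: str,
                  event: Optional[str] = None) -> FrameRequest:
    # The event is just one more signal the model must respect when it
    # predicts the next tokens -- it constrains what happens, it does not
    # hand-place objects in a 3D scene.
    return FrameRequest(history_tokens=history, navigation_action=action,
                        world_event=event)

req = build_request(history=[1, 2, 3], action="forward",
                    event="a vehicle arrives")
print(req)
```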
From a product perspective, this is powerful. You move from a world you simply explore to a world you can actively direct, which opens up professional use cases (training, safety tests, scenario design).
Applications: where Genie 3 could matter
1) Rapid prototyping for games and interactive experiences
The obvious one is gaming: generate a playable scene from a prompt, test mood and “feel,” try movement and camera behaviors, without a heavy 3D pipeline.
It’s not a new game engine (not yet), but it’s a very fast iteration machine.
2) Simulation for robotics, logistics, and industry
DeepMind explicitly mentions robotics and the ability to simulate varied scenarios.
The value is generating environments that are realistic enough and, more importantly, highly diverse. That helps train control policies, test behaviors, and create rare edge cases.
Concrete industry examples:
- warehouse training (endless layout and obstacle variations),
- safety testing (unexpected behaviors),
- training “assistant” agents that navigate 3D environments.
3) Training, education, and professional practice
DeepMind also points to education and training: put learners in a situation, multiply scenarios, observe mistakes, try again.
That could include safety procedures, step-by-step interventions, or even history and geography through generated environments.
4) Creation: animation, fiction, and previs
DeepMind’s “Models” positioning also highlights animation and storytelling, plus simulation of natural phenomena (weather, water, lighting).
For creators, the key isn’t only generating a video. It’s composing a scene by manipulating it.
Important limitations to keep in mind
DeepMind lists several clear limitations:
- Multi-agent interactions: making multiple characters coexist realistically (each with goals, movements, reactions) is hard. The model has to keep positions, collisions, intentions, and social dynamics coherent, without odd glitches or “teleporting” behavior.
- Real-world geography: Genie 3 is not designed to reproduce a real place exactly. It can generate “something that looks like,” but not a meter-accurate copy of an existing street or building. So it’s not a precise “generative Google Earth.”
- Text rendering: readable text inside the world (signs, labels, posters) is still fragile for generative models. Letters can warp, words can drift, and the same sign may change from frame to frame. It works better when the text is explicitly provided and strongly constrained.
- Duration: coherence holds over a short window of a few minutes, during which objects, scenery, and general logic remain stable. Over longer periods, the world is more likely to drift (details changing, contradictions, forgetting), because persistent state over long horizons is much harder.
- Limited action space: today, the user (or an agent) mostly has “simple” actions like moving and controlling the camera. “Events” (for example, “it starts raining,” “a car arrives”) feel more like instructions imposed on the world than fine-grained physical interaction (grabbing, pushing, assembling, opening mechanisms, etc.).
In other words: Genie 3 is already impressive, but it doesn’t replace a AAA engine, a certified physics simulation, or a millimeter-accurate industrial digital twin.
What this changes for product and business
Teams will prototype differently
If your product involves 3D, simulation, interactive training, or immersive marketing, a world model changes the economics of experimentation:
- you move from “building a level” to “describing an intent,”
- you test earlier,
- you kill weak ideas faster.
Value shifts from production to orchestration
The near future often looks like this:
- AI generates the world,
- software layers frame the experience (rules, goals, scoring, tracking),
- data instrumentation captures what happens (telemetry, user feedback).
For teams that want to get started now
Even without direct access to the most advanced models, you can prepare:
- identify your use cases (training, game, simulator, 3D config),
- define your control constraints (required actions, duration, multi-agent needs),
- design the orchestration layer (rules, goals, UI, analytics),
- plan a human-in-the-loop strategy (validation, moderation, QA).
