Explained: Google DeepMind’s Genie, an AI model that creates virtual worlds from image prompts

This AI model could soon enable users to create their own video games. Here’s why this experimental model is revolutionary.

The Google DeepMind website homepage, with its logo seen through a magnifying glass. (Wikimedia Commons)

The biggest draw of video games is escapism: the fantasy of a world far removed from our immediate reality. Now, imagine being able to create your own world. Researchers at Google DeepMind have come up with something that could let you do just that, generating fictional worlds similar to the outlandish landscapes seen in high-octane games.

Google DeepMind has just introduced Genie, a new model that can generate interactive video games from just a text or image prompt, and it does so without any prior training on game mechanics (the rules, elements, and processes that make up a game).

What is Genie?

According to the official Google DeepMind blog post, Genie is a foundation world model that is trained on videos sourced from the Internet. The model can “generate an endless variety of playable (action-controllable) worlds from synthetic images, photographs, and even sketches.”

The research paper ‘Genie: Generative Interactive Environments’ states that Genie is the first generative interactive environment that has been trained in an unsupervised manner from unlabelled internet videos. When it comes to size, Genie stands at 11B parameters and consists of a spatiotemporal video tokenizer, an autoregressive dynamics model, and a simple and scalable latent action model.

Together, these components let users act in the generated environments on a frame-by-frame basis, even though the model was trained without any action labels or other domain-specific requirements.
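To make the frame-by-frame idea concrete, here is a minimal toy sketch in Python. It is not Genie's code and uses no neural networks. The names `tokenize`, `dynamics_step` and `play`, the four numbered latent actions, and the tiny pixel grid are all assumptions made for illustration. The real dynamics model is learned and conditions on past frames, whereas this stand-in applies a fixed rule to the current frame only. The sketch only shows the loop the paper describes: turn a prompt image into tokens, let the user pick an unlabelled latent action, predict the next frame, and repeat.

```python
from typing import List

Frame = List[List[int]]  # toy "image": a grid of 0/1 pixels

# Hypothetical latent actions: plain integer ids with no human-given labels.
# The (row, column) shifts are an assumption of this toy, not learned values.
LATENT_ACTION_SHIFTS = {0: (0, 0), 1: (0, 1), 2: (0, -1), 3: (-1, 0)}


def tokenize(frame: Frame) -> List[int]:
    """Stand-in for the video tokenizer: flatten a frame into discrete tokens."""
    return [pixel for row in frame for pixel in row]


def detokenize(tokens: List[int], width: int) -> Frame:
    """Turn a flat token list back into a grid of pixels."""
    return [tokens[i:i + width] for i in range(0, len(tokens), width)]


def dynamics_step(tokens: List[int], width: int, action: int) -> List[int]:
    """Stand-in for the dynamics model: next-frame tokens from the current
    tokens and one latent action. Lit pixels move by the action's shift and
    are clamped at the edges of the grid."""
    height = len(tokens) // width
    d_row, d_col = LATENT_ACTION_SHIFTS[action]
    next_tokens = [0] * len(tokens)
    for index, token in enumerate(tokens):
        if token:
            row, col = divmod(index, width)
            row = min(max(row + d_row, 0), height - 1)
            col = min(max(col + d_col, 0), width - 1)
            next_tokens[row * width + col] = 1
    return next_tokens


def play(prompt_frame: Frame, actions: List[int]) -> List[Frame]:
    """Roll out one frame per chosen action, starting from a prompt image."""
    width = len(prompt_frame[0])
    tokens = tokenize(prompt_frame)
    frames = [prompt_frame]
    for action in actions:
        tokens = dynamics_step(tokens, width, action)
        frames.append(detokenize(tokens, width))
    return frames


if __name__ == "__main__":
    prompt = [[0, 0, 0],
              [1, 0, 0]]
    for frame in play(prompt, [1, 1, 3]):
        print(frame)
```

In this toy, action 1 always nudges the lit pixel one column to the right, whichever prompt image it starts from. That loosely mirrors the idea of latent actions that behave consistently across different prompts.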

What does Genie do?

According to the research paper, Genie is a new kind of generative AI that enables anyone – even children – to dream up and step into generated worlds similar to human-designed simulated environments. Genie can be prompted to generate a diverse set of interactive and controllable environments although it is trained on video-only data.

Put simply, we have seen numerous generative AI models that produce creative content in the form of language, images and even videos. Genie is a breakthrough because it creates playable environments from a single image prompt.

Try to remember the scene in Harry Potter and the Philosopher’s Stone when Harry and his friends enter Hogwarts Castle on their way to the Gryffindor common room. The young students see a wall full of paintings that come to life, with each character moving within its frame in fine detail. Genie essentially brings still images to life, giving them a world of their own.

According to Google DeepMind, Genie can be prompted with images it has never seen, including real-world photographs and sketches, allowing people to interact with their imagined virtual worlds. This is what is known as a foundation world model. When it comes to training, the research paper notes that the authors focused on videos of 2D platformer games and robotics. However, Genie is trained with a general method, which allows it to work on any type of domain and to scale to even larger Internet datasets.

Why is it important?

The standout aspect of Genie is its ability to learn and reproduce controls for in-game characters exclusively from internet videos. This is noteworthy because internet videos do not have labels about the action that is performed in the video, or even which part of the image should be controlled.

“Genie learns not only which parts of an observation are generally controllable, but also infers diverse latent actions that are consistent across the generated environments. Note here how the same latent actions yield similar behaviors across different prompt images,” said the blog post.

According to Google DeepMind, the most distinctive aspect of this model is that it allows you to create an entirely new interactive environment from a single image. This opens up many possibilities, especially new ways to create and step into virtual worlds. To demonstrate this, the researchers created an image using the text-to-image model Imagen 2 and then used it as a prompt to create virtual worlds. The same can be done with sketches.

With Genie, anyone will be able to create their own entirely imagined virtual worlds. Moreover, the model’s ability to learn and develop new world models signals a significant leap towards general AI agents (an agent being an independent programme or entity that interacts with its environment by perceiving its surroundings via sensors).

Bijin Jose
