VideoPoet combines multiple video generation capabilities into a unified language model. (Image: Google)

Just when Midjourney and DALL-E 3 were making remarkable progress in text-to-image generation, Google introduced a new multimodal large language model (LLM) that generates videos. The model comes with video generation capabilities that have not been seen in LLMs before.
Scientists at Google have introduced VideoPoet, which they describe as a robust LLM capable of processing multimodal inputs such as text, images, video, and audio to generate videos. VideoPoet uses a ‘decoder-only architecture’, which enables it to produce content for tasks it has not been specifically trained on. Reportedly, training VideoPoet involves two steps, as with other LLMs: pretraining and task-specific adaptation. According to the researchers, the pretrained LLM is essentially the base framework that can be customised for various video generation tasks.
“VideoPoet is a simple modeling method that can convert any autoregressive language model or large language model (LLM) into a high-quality video generator,” reads a post on the project’s website.
What makes VideoPoet different?
Unlike most prevailing video generators, which are diffusion models that add noise to training data and then learn to reconstruct it, VideoPoet combines multiple video generation capabilities within a unified language model. Where other systems rely on separately trained components for different tasks, VideoPoet integrates everything into a single LLM.
The scientists claim that the model excels at text-to-video, image-to-video, video inpainting and outpainting, video stylisation, and video-to-audio generation. VideoPoet is an autoregressive model, meaning it creates output by taking cues from what it has generated previously. It has been trained on video, audio, image, and text data, using tokenisers that convert input from each of these modalities into discrete tokens. In AI, tokenisation is the process of converting input, such as text, into smaller units known as tokens, which could be words or subwords. This step is critical for natural language processing because it allows a model to represent and analyse human language.
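To make those two ideas concrete, here is a minimal, illustrative Python sketch. It is a toy for intuition only, not VideoPoet's actual tokeniser or model, and every name and token in it is hypothetical: a toy tokeniser splits a text prompt into tokens, and a toy autoregressive loop appends one new token at a time while keeping everything generated so far as context.

```python
# Toy illustration of tokenisation and autoregressive generation.
# Not VideoPoet's code; all names below are hypothetical stand-ins.
import random

def tokenise(text):
    # Toy tokeniser: split text into word-level tokens. Real tokenisers
    # (including the video and audio tokenisers described for VideoPoet)
    # map inputs to discrete tokens drawn from a learned vocabulary.
    return text.lower().split()

def generate_autoregressively(prompt_tokens, vocabulary, num_new_tokens=5):
    # Autoregressive generation: choose each new token while "looking at"
    # the whole sequence produced so far, append it, and repeat.
    sequence = list(prompt_tokens)
    for _ in range(num_new_tokens):
        # A trained model would predict the next token from `sequence`;
        # here a random choice stands in for that prediction.
        next_token = random.choice(vocabulary)
        sequence.append(next_token)
    return sequence

tokens = tokenise("A panda playing a guitar")
print(generate_autoregressively(tokens, vocabulary=["<vid_1>", "<vid_2>", "<vid_3>"]))
```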
According to the scientists, the results are a testament to the promising potential of LLMs in video generation. They believe their framework could support ‘any-to-any’ generation in the future.
Interestingly, VideoPoet can also produce a short film by combining multiple video clips. The researchers asked Google Bard to write a short screenplay along with prompts, generated a video clip for each prompt, and then assembled the clips into a short film.
However, the model is not equipped to produce longer videos directly. Google has said that VideoPoet can overcome this limitation by conditioning on the last second of a generated video to predict the next second, repeating this process to extend the clip. The model can also take existing videos and change how the objects in them move; Google illustrates this with an example of the Mona Lisa being made to yawn.
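As a rough illustration of that extension trick, here is a hedged Python sketch. The generator below is a placeholder, not Google's API, and all function and variable names are assumptions: the last second of the clip is fed back in as context, the predicted next second is appended, and the loop repeats until the video reaches the desired length.

```python
# Toy illustration of extending a clip one second at a time by conditioning
# on the most recent second. All names are hypothetical stand-ins.

def predict_next_second(last_second_tokens):
    # Placeholder for the model call: given the tokens of the most recent
    # second of video, return tokens for the following second.
    return [f"frame_after_{t}" for t in last_second_tokens]

def extend_video(initial_tokens, tokens_per_second, extra_seconds):
    video = list(initial_tokens)
    for _ in range(extra_seconds):
        last_second = video[-tokens_per_second:]  # condition only on the final second
        video.extend(predict_next_second(last_second))
    return video

clip = ["t0", "t1", "t2", "t3"]  # toy tokens standing in for a 1-second clip
longer_clip = extend_video(clip, tokens_per_second=4, extra_seconds=3)
print(len(longer_clip), "tokens after extension")
```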