Meta unveils Voicebox, a generative AI model for speech generation

Meta claims that it has achieved a breakthrough in generative AI for speech with Voicebox.

While the usual generative AI models produce pictures from text prompts, Voicebox produces high-quality audio clips. (Image: Meta)

Generative AI is making strides in various aspects of content creation. Now, tech behemoth Meta has applied generative AI to speech-related tasks. The company has unveiled Voicebox, a tool that can help with audio editing, sampling and styling. Voicebox is the kind of technology that can aid content creators with an array of tasks, help visually impaired people hear written messages, and enable people to speak in foreign languages.

The company claims to have achieved a breakthrough in generative AI for speech. “We’ve developed Voicebox, the first model that can generalize to speech-generation tasks it was not specifically trained to accomplish with state-of-the-art performance,” the company wrote in a blog post.

Voicebox can create outputs in a variety of styles, and it can create them from scratch. Whereas many well-known generative AI models produce images from text prompts, Voicebox produces high-quality audio clips. At present, the model can process speech in six languages and perform tasks such as noise removal, content editing, diverse sample generation and style conversion.

Meta also said that multipurpose generative AI models like Voicebox could give natural-sounding voices to virtual assistants and non-player characters (NPCs) in the metaverse. The model supports in-context text-to-speech synthesis: given an audio sample as short as two seconds, Voicebox can match its style when generating speech from text.

The model can recreate a portion of speech that has been interrupted by noise, or replace misspoken words, without the need to re-record the entire speech. Given a sample of a person’s voice, Voicebox can also produce speech from text in English, French, Spanish, German, Polish and Portuguese. This capability is known as cross-lingual style transfer. “This capability could be used in the future to help people communicate in a natural, authentic way even if they don’t speak the same languages,” Meta said.

In addition, with its diverse speech sampling, the tool can generate speech that reflects how people talk in the real world.