It rained AI tools in 2022, and new ones are still popping up, astonishing people with their ability to create eerily convincing essays, artwork, and videos with nothing but a text prompt. But while AI that can generate text, images, and video has been in the limelight for a while, it’s almost as if speech hasn’t been shown much love.

Microsoft is changing that with Vall-E, a new AI text-to-speech (TTS) system that can take a three-second recording of someone’s voice and replicate it, turning written words into speech in that voice. The idea isn’t novel, of course. What’s new is how frighteningly good the AI is at convincing us that the output comes from an actual human being. Let’s take a look at how Vall-E works its magic, what it can do, and where it might be used.

But first, what is Vall-E?

Microsoft calls Vall-E a “neural codec language model.” It takes a different approach from the voice generators that came before it, which Microsoft says helps it produce more natural-sounding speech. One difference is scale: the training data was increased to 60,000 hours of English speech, which Microsoft claims is hundreds of times larger than what existing systems use. This allows the TTS system to produce “high-quality personalised speech” with nothing but a 3-second recording of a person as an “acoustic prompt.”

Despite the similar-sounding names, Vall-E does not seem to have anything to do with Dall-E, the deep learning model developed by OpenAI to generate images from natural language descriptions.

How Vall-E stands out

The larger training data mentioned above, along with other new methods, sets Vall-E apart from other TTS systems. Microsoft claims that this has allowed it to “significantly outperform” other products in its category in terms of speech naturalness and speaker similarity. Vall-E has also been built to work in “zero-shot situations,” meaning it does not need to be trained on a speaker’s voice beforehand: you feed it a 3-second audio clip and a text prompt, and that’s it.
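To make the zero-shot idea concrete, here is a minimal Python sketch of what such a request might contain. Vall-E has no public API, so every name here (`ZeroShotRequest`, `is_valid_request`) is hypothetical, and the 24,000 samples-per-second rate in the usage lines is an assumed value for illustration. The sketch only shows that the inputs are a text prompt and a short clip, with no speaker-specific training data.

```python
from dataclasses import dataclass


# Hypothetical illustration only: this is not Microsoft's code or API.
@dataclass
class ZeroShotRequest:
    text: str            # the words the system should speak
    prompt_samples: int  # number of audio samples in the voice clip
    sample_rate_hz: int  # samples per second of the recording (assumed)

    def prompt_seconds(self) -> float:
        """Length of the acoustic prompt in seconds."""
        return self.prompt_samples / self.sample_rate_hz


def is_valid_request(req: ZeroShotRequest, min_prompt_seconds: float = 3.0) -> bool:
    """A request needs some text and a clip of at least ~3 seconds.

    Nothing else is required: no extra recordings of the speaker and
    no per-speaker training step, which is what "zero-shot" means here.
    """
    return bool(req.text.strip()) and req.prompt_seconds() >= min_prompt_seconds


# Usage: a 3-second clip at the assumed sample rate passes, a 1-second clip does not.
ok = ZeroShotRequest("Hello there.", prompt_samples=72_000, sample_rate_hz=24_000)
too_short = ZeroShotRequest("Hello there.", prompt_samples=24_000, sample_rate_hz=24_000)
```

The 3-second threshold mirrors the clip length Microsoft describes; treating it as a hard minimum is an assumption of this sketch.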

How Vall-E works (Image: Microsoft)

But perhaps the coolest bit is Vall-E’s ability to preserve the speaker’s emotion. Microsoft has demonstrated this ability on the GitHub page for the TTS system. The 3-second audio can be uttered in any tone – angry, sleepy, neutral, amused, disgusted, etc. – and Vall-E will recite any text while preserving that tone.

Potential uses

One of the most obvious uses for this technology is giving a voice back to people who have lost the ability to speak. Even very short recordings of a person’s voice can be used to rebuild a natural-sounding artificial voice. It could also help people with speech disabilities: they can type what they want to say and Vall-E can convert it into speech.

Concerns

AI is seldom without concerns, and Vall-E comes with its own share. Microsoft acknowledges these in its paper on the system, saying that Vall-E carries the potential for misuse, such as spoofing voice identification or impersonating a specific speaker. We’ve already seen deepfakes spread misinformation by creating false narratives about people, so it will be interesting to see how things play out with Vall-E if and when it’s made public.