Google’s DeepMind, the company behind the AlphaGo program that beat a human world champion at the ancient Chinese game of Go, has now made a breakthrough in machine speech generation. DeepMind’s latest research paper shows that its new system, called WaveNet, narrows the gap between existing technology and human speech by more than 50 per cent, outperforming even Google’s own current text-to-speech (TTS) systems.
According to a DeepMind blog post, WaveNet is “a deep generative model of raw audio waveforms,” and can generate speech that mimics a “human voice” and sounds much more natural than any existing model. The post adds that the same network can even be used to create original pieces of music, straight from the machine itself.
DeepMind was bought by Google in 2014, and has been working on AI, neural networks and machine learning. In the last couple of years, we’ve seen computers get better at understanding human speech. For instance, we can now make direct requests to the Google Now voice assistant in natural, spoken language.
With WaveNet, the idea is to go the other way and have machines generate speech that sounds more ‘human-like’. DeepMind’s blog explains that this process, called “speech synthesis or text-to-speech (TTS),” is still largely dependent on a database where “short speech fragments are recorded from a single speaker and then recombined to form complete utterances.” The drawback is that the voice can’t easily be modified, for example to switch speakers or change the emphasis, without recording a whole new database.
WaveNet instead aims to directly “model the raw waveform of the audio signal, one sample at a time,” in order to produce more natural-sounding speech. DeepMind’s blog post outlines how the system was trained: “the input sequences are real waveforms recorded from human speakers. After training, we can sample the network to generate synthetic utterances.” In other words, the system learns to generate speech on its own.
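To give a feel for what generating audio “one sample at a time” means, here is a minimal Python sketch. It is not DeepMind’s code. The function `toy_next_sample_distribution` is a hypothetical stand-in for the trained network, and the choice of 256 possible values per sample is an assumption made for illustration. The point is only the loop: each new sample is drawn from a probability distribution that depends on everything generated so far, and is then fed back in as input.

```python
import math
import random

NUM_LEVELS = 256  # assumption: each audio sample takes one of 256 quantized values


def toy_next_sample_distribution(history):
    """Hypothetical stand-in for a trained network.

    Returns one probability per possible value of the next sample,
    given every sample generated so far.
    """
    # Centre the distribution on the previous sample (mid-scale at the start),
    # so the toy waveform drifts smoothly instead of jumping around.
    centre = history[-1] if history else NUM_LEVELS // 2
    weights = [math.exp(-((level - centre) ** 2) / 50.0)
               for level in range(NUM_LEVELS)]
    total = sum(weights)
    return [w / total for w in weights]


def generate(num_samples, seed=0):
    """Generate a waveform one sample at a time, feeding each output back in."""
    rng = random.Random(seed)
    waveform = []
    for _ in range(num_samples):
        probs = toy_next_sample_distribution(waveform)
        next_sample = rng.choices(range(NUM_LEVELS), weights=probs, k=1)[0]
        waveform.append(next_sample)
    return waveform


if __name__ == "__main__":
    print(generate(10))
```

Because every sample has to wait for the one before it, a second of audio at typical sampling rates means running this loop many thousands of times, which hints at why the approach is costly to run.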
DeepMind also notes that the system is computationally more expensive than existing ones, but that it produces more natural-sounding audio.
In DeepMind’s tests for US English and Mandarin, speech generated by WaveNet was much more natural-sounding than that of any earlier system, including Google’s own current system, which is considered one of the best in the world. The blog post also says WaveNet can reproduce the sounds of breathing and mouth movements, and can capture the characteristics of different voices, male and female, something current systems don’t do as well.
For now, the main challenge with WaveNet is that it requires a lot of computing power. DeepMind’s other big achievement has been AlphaGo, which beat a human world champion at the game of Go. Success at Go is hard for a computer to achieve because the game relies on intuition, something that can’t easily be taught to a machine.