The New Voice of Google

In the quest for the best voice synthesizer, Google claims to have built a system that rivals those using professional voice artists. Called Tacotron 2, this latest text-to-speech system “learns” aspects of speech simply from recordings and transcripts. It can then generate sounds from text entirely from scratch, even if it has never encountered some of the words before.

The neural network algorithm considers several characteristics of human speech, including punctuation, intonation and a feature called prosody, which can best be described as the “tune” of the voice. Unlike the purely mechanical aspects of speech, prosody conveys emotion, sarcasm, emphasis and contrast.

Although it’s a big improvement over the artificial-sounding robotic voices we’ve come to know (if not love), the developers admit that it still has its drawbacks, stumbling over complex words and periodically producing strange random noises. It also cannot yet be used to generate real-time audio or be programmed to sound happy or sad. Regardless, it’s safe to say that we’re entering a new realm of speech synthesis where the quality may soon be indistinguishable from human speech.