The birth of speech synthesis is here

Text-to-speech has phenomenal applications across industry sectors.

Turns out Baidu has been working on other projects alongside its self-driving car project at its AI center in Silicon Valley.

As reported by MIT Technology Review, the company has been busy creating a text-to-speech system dubbed Deep Voice, which it claims is more efficient than Google’s WaveNet.

According to Baidu, Deep Voice can be trained to speak in just a few hours with very little human interaction. It can also convey different emotions in its speech, and Baidu claims that it can quickly synthesize speech that sounds pretty realistic and very natural.

Google’s WaveNet also has the same capability, but in comparison to Deep Voice, it’s more computationally demanding and harder to use in real-world applications.

Baidu claims it has overcome the issues surrounding WaveNet by using deep-learning techniques that convert text to phonemes, the smallest units of sound in speech. It then converts those phonemes into sounds using its speech synthesis network.

For instance, Deep Voice converts the word “hello” into “(silence HH), (HH, EH), (EH, L), (L, OW), (OW, silence)” before its speech network pronounces it.
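To make that pairing concrete, here is a minimal Python sketch that reproduces the “hello” example. It is only an illustration, not Baidu’s code: the function name `phoneme_pairs` and the tiny `TOY_LEXICON` lookup are assumptions made up for this example, and a real system would predict the phonemes with a trained model rather than a hand-written table.

```python
def phoneme_pairs(phonemes, boundary="silence"):
    """Return adjacent phoneme pairs, padded with a boundary marker at both ends."""
    padded = [boundary, *phonemes, boundary]
    # Pair each entry with its right-hand neighbour.
    return list(zip(padded, padded[1:]))


# Hypothetical hand-written lookup, used here only to supply the example's phonemes.
TOY_LEXICON = {"hello": ["HH", "EH", "L", "OW"]}

pairs = phoneme_pairs(TOY_LEXICON["hello"])
print(pairs)
# [('silence', 'HH'), ('HH', 'EH'), ('EH', 'L'), ('L', 'OW'), ('OW', 'silence')]
```

Each pair describes a transition from one sound to the next, which is why the word is padded with silence at both ends.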

While both Google and Baidu use deep learning techniques that don’t need any human input, Baidu’s method can also vary the stress placed on individual phonemes or syllables, which lets it add an element of emotion to the pronunciation.

On running the system in real time, Baidu’s researchers write: “To perform inference at real-time, we must take great care to never recompute any results, store the entire model in the processor cache (as opposed to main memory), and optimally utilize the available computational units.”

The research from Baidu and Google shows that real-time speech synthesis is on the anvil.


