Deep Voice 3 teaches machines to speak by imitating thousands of human voices from people across the globe

By Jean-Jacques DeLisle, contributing writer




Baidu’s Deep Voice 3 software can clone anyone’s voice. Image source: Pixabay.


A breakthrough in digital voice emulation technology was recently released by
Baidu, China’s equivalent of Google. Baidu claims its new text-to-speech (TTS)
system, known as Deep Voice 3, can learn to accurately replicate any
human voice using less than one minute of audio. This advancement comes in the
midst of a tech race to build more reliable TTS software, with
heavy hitters like Google already in the running with its “WaveNet” TTS
project. Adobe is also in the race, having recently unveiled its prototype TTS
software, “Project VoCo,” which can learn to mimic a voice in 20 minutes.


Baidu’s researchers took a different approach to the
text-to-speech problem. The team built
two techniques into its design: speaker adaptation and speaker
encoding. The two can work in different ways on different devices, or can be
used together, but the bottom line is they get the job done faster than the


Speaker adaptation uses a backpropagation-based approach that fine-tunes the
multi-speaker generative model, or only its low-dimensional speaker embeddings,
on a few samples of the target voice. In
other words, the program forms a model based on the sound of your voice and
then runs text-to-speech through that model, simulating with
relative accuracy at least the frequency and tone of your voice. This could be
used with simpler devices and other programs, allowing you to set
your iHome or Siri to a custom voice.
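
To make the embedding-only idea concrete, here is a minimal toy sketch in Python. It is not Baidu's code: the frozen linear "model," the names `synthesize` and `adapt_embedding`, and all numbers are assumptions chosen for illustration. The point is only that the model's weights stay fixed while gradient descent adjusts a small speaker embedding until the output matches a few samples of the target voice.

```python
import random

def synthesize(weights, embedding):
    """Frozen toy 'generative model': maps a speaker embedding to acoustic features."""
    return [sum(w * e for w, e in zip(row, embedding)) for row in weights]

def adapt_embedding(weights, samples, dim, steps=500, lr=0.05):
    """Fit only the speaker embedding by gradient descent; the weights never change."""
    embedding = [0.0] * dim
    for _ in range(steps):
        pred = synthesize(weights, embedding)
        grad = [0.0] * dim
        for target in samples:
            for i, row in enumerate(weights):
                err = pred[i] - target[i]
                for j in range(dim):
                    # Gradient of the mean squared error w.r.t. embedding[j]
                    grad[j] += 2.0 * err * row[j] / len(samples)
        embedding = [e - lr * g for e, g in zip(embedding, grad)]
    return embedding

random.seed(0)
# Hypothetical frozen weights: 2-dim embedding -> 4 acoustic features
WEIGHTS = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [1.0, -1.0]]
TRUE_EMBEDDING = [0.8, -0.3]  # the "voice" we pretend to clone
# A handful of slightly noisy cloning samples from the target speaker
SAMPLES = [
    [f + random.gauss(0.0, 0.01) for f in synthesize(WEIGHTS, TRUE_EMBEDDING)]
    for _ in range(5)
]

learned = adapt_embedding(WEIGHTS, SAMPLES, dim=2)
print("learned embedding:", [round(x, 3) for x in learned])
```

Because only two numbers are being tuned, the fit needs very little data, which is the appeal of adapting the embedding alone rather than the whole model.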


Speaker encoding works differently and combines the multi-speaker generative model with
a separate model that generates a new speaker embedding from the cloning audio. This approach
dramatically reduces cloning time to just a few seconds and has very
few working parameters, meaning it could be achieved relatively cheaply and
then easily deployed to existing devices. Such a form of voice simulation can
replicate accents, tones, and subtle nuances in speech, creating a very
convincing replication.
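
The contrast with adaptation can be sketched in a few lines of Python. Again, this is a toy under stated assumptions, not Baidu's architecture: the function `encode_speaker`, the mean-pooling step, and the `PROJECTION` matrix (a stand-in for trained encoder weights) are all hypothetical. What it illustrates is that a separate encoder produces the speaker embedding in a single forward pass, with no training loop at cloning time.

```python
def encode_speaker(frames, projection):
    """One forward pass: mean-pool per-frame features, then project to an embedding."""
    n = len(frames)
    pooled = [sum(frame[i] for frame in frames) / n for i in range(len(frames[0]))]
    return [sum(w * x for w, x in zip(row, pooled)) for row in projection]

# Hypothetical stand-in for trained encoder weights: 3 features -> 2-dim embedding
PROJECTION = [[0.5, 0.5, 0.0], [0.0, 0.5, -0.5]]

# Two frames of made-up acoustic features from a short cloning clip
clip = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]

embedding = encode_speaker(clip, PROJECTION)
print("speaker embedding:", embedding)
```

The resulting embedding would then be handed to the multi-speaker generative model, which is why this route can be so much faster than fitting anything per speaker.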


What are the implications of this kind of voice cloning? Baidu hopes it will be
useful for all manner of devices, such as iHome or Siri, smartphones, GPS units,
and more. Hearing the voice of a loved one, or even your own, guide
you through traffic would be much more pleasing to the ear than the
computerized voice we hear now. But are the applications really that
innocent? Wouldn’t this technology significantly lower the effectiveness of
voice verification security? Could celebrities or politicians have their voices
“stolen” and then used for malicious broadcasts or spreading misinformation?
Could someone steal your voice and use it to threaten someone or commit some
other crime in your name? For every new technology we create, there’s a positive
and a negative application, and this new TTS technology is no different.