Voice assistants and the smart devices they inhabit have, for the most part, attuned our ears to synthesized voices. Thanks to them and the increasing adoption of text-to-speech (TTS) technologies, audio has changed the way many people consume content. This shift to listening wouldn’t have been possible without a voice at its center that genuinely sounds human.
Still, the technology carries a stigma of sounding robotic, monotone, and awkward in its pacing or its pronunciation of specific words. If you share some of these doubts about synthetic speech, I suggest you do two things, for starters:
Play the audio version of this post for a quick example of the voice quality, which is the end result of thousands of hours of human speech patterns;
Set aside a few minutes to read or listen to this post and allow me to explain the wonderful science behind natural-sounding, software-generated speech.
I promise I won’t be too techy with the lingo.
How synthetic speech is made
Synthetic speech has a long history, none of which I’m going to bore you with. I’ll just say that only in the past few years has the technology advanced far enough for us to comfortably say it’s close to being indistinguishable from the real thing.
And it’s largely thanks to neural networks.
The majority of artificially produced speech uses concatenative synthesis, a method that mainly consists of finding and stringing together phonemes (distinct units of sound in a given language) from recorded speech. Basically, it assembles raw audio of natural speech into something that sounds like a human talking. Take a listen:
Now listen to this:
That’s the neural version of the same voice. You should be able to hear differences in tone and inflection, some subtle and some slightly more obvious.
This is because the way concatenative synthesis segments waveforms limits the quality of the speech. It strings together short speech snippets stored in an audio database, and the joins between those snippets leave just enough room for tiny but audible (at least to trained ears) variations in speech.
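To make the idea of stringing snippets together concrete, here is a minimal sketch in Python. It is not how any production engine works: the unit database, the phoneme labels, and the sine tones standing in for recorded snippets are all assumptions made for illustration. Each join is smoothed with a short crossfade, which is exactly where the small audible variations tend to creep in.

```python
import math

SAMPLE_RATE = 16_000  # samples per second (assumed for this toy example)

def tone(freq_hz, duration_ms=50):
    """Stand-in for a recorded speech snippet: a short sine wave."""
    n = SAMPLE_RATE * duration_ms // 1000
    return [math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE) for i in range(n)]

# Hypothetical unit database: phoneme label -> pre-recorded waveform.
UNIT_DB = {"HH": tone(200), "AH": tone(300), "L": tone(250), "OW": tone(350)}

def concatenate(phonemes, crossfade=80):
    """String stored units together, blending `crossfade` samples at each join."""
    out = list(UNIT_DB[phonemes[0]])
    for label in phonemes[1:]:
        unit = UNIT_DB[label]
        for i in range(crossfade):
            w = i / crossfade  # weight of the incoming unit ramps from 0 towards 1
            out[-crossfade + i] = (1 - w) * out[-crossfade + i] + w * unit[i]
        out.extend(unit[crossfade:])
    return out

waveform = concatenate(["HH", "AH", "L", "OW"])
print(len(waveform))  # number of samples in the stitched result
```

Every unit is played back exactly as it was stored, so the only freedom the method has is which snippet to pick and how to blend the seams.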
On the other hand, a neural network converts a sequence of phonemes into a sequence of spectrograms – visual representations of the spectrum of frequency bands of a signal. This is important because the corresponding output takes into consideration the sequence of the elements of the input and how they work together. In other words – a neural network chooses the spectrograms with frequency bands that accentuate acoustic features the human brain uses when processing speech.
That is arguably where higher-quality, more life-like voices are made and what makes them “human”. From there, a neural vocoder (a form of voice codec that analyzes and synthesizes the signal) converts those spectrograms into speech waveforms (continuous audio signals). With that, the basis for the most natural-sounding speech possible is in place.
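If the word “spectrogram” feels abstract, the sketch below computes one by hand for a simple test tone. It only illustrates the representation itself: a neural TTS system predicts spectrogram frames from phonemes rather than measuring them from audio, and the frame size, hop, and sample rate here are arbitrary values chosen for the example.

```python
import cmath
import math

def spectrogram(signal, frame_size=64, hop=32):
    """Return one list of magnitudes (one per frequency bin) for each frame."""
    frames = []
    for start in range(0, len(signal) - frame_size + 1, hop):
        chunk = signal[start:start + frame_size]
        # A Hann window softens the frame edges before the transform.
        windowed = [
            s * 0.5 * (1 - math.cos(2 * math.pi * n / (frame_size - 1)))
            for n, s in enumerate(chunk)
        ]
        mags = []
        for k in range(frame_size // 2 + 1):
            # Naive discrete Fourier transform for bin k.
            acc = sum(
                x * cmath.exp(-2j * math.pi * k * n / frame_size)
                for n, x in enumerate(windowed)
            )
            mags.append(abs(acc))
        frames.append(mags)
    return frames

sample_rate = 8000
# A 1000 Hz tone. Each bin spans 8000 / 64 = 125 Hz, so the energy sits in bin 8.
signal = [math.sin(2 * math.pi * 1000 * n / sample_rate) for n in range(256)]
spec = spectrogram(signal)
peak_bins = [frame.index(max(frame)) for frame in spec]
print(len(spec), peak_bins)
```

Each row of the result is a snapshot of which frequency bands are active at that moment. The vocoder’s job is the reverse trip: turning a sequence of such snapshots back into a waveform.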
Building different speaking styles
The advantage of neural TTS (NTTS) is that it learns from training data: raw audio samples of actual humans speaking. This allows it to create smoother speech with proper rhythm and intonation and no audible joins between snippets. As a result, NTTS takes speech simulation a step further by producing different speaking styles for different use cases, just as we humans adapt our speech to context.
The approach is layered: a natural voice serves as the foundation and is perfectly fine as is, but it can also be trained for a specific speaking style, with the variations and inflections on syllables, phonemes, and words that are specific to that style.
That’s how custom text-to-speech is born. For example, we use a formal news-reading voice (for this post too) which delivers authentic speech by selectively emphasizing certain words in a sentence like a real newscaster would. Here is the female version of that voice:
Speech can be further enhanced and customized through Speech Synthesis Markup Language (SSML). SSML lets you mark up text with elements such as pauses and with instructions for how to read numbers, dates, times, acronyms, and other pronunciation-sensitive content. On top of that, pronunciation lexicons let you customize the pronunciation of individual words and take care of the tiniest details.
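As a rough illustration of what such markup looks like, the following Python sketch assembles a small SSML document with the standard library. The sentence and the attribute values are made up for the example, and support for individual elements and `say-as` values varies between TTS engines, so treat it as a starting point rather than something every service will accept as is.

```python
import xml.etree.ElementTree as ET

# Root element of an SSML document.
speak = ET.Element("speak", {"version": "1.1", "xml:lang": "en-US"})
speak.text = "The meeting is on "

# Ask the engine to read the digits as a date (month, day, year).
date = ET.SubElement(speak, "say-as", {"interpret-as": "date", "format": "mdy"})
date.text = "10/12/2021"
date.tail = "."

# Insert a half-second pause between the two sentences.
pause = ET.SubElement(speak, "break", {"time": "500ms"})
pause.tail = " It is hosted by the "

# Speak the acronym as its expanded form.
acronym = ET.SubElement(speak, "sub", {"alias": "World Wide Web Consortium"})
acronym.text = "W3C"
acronym.tail = "."

ssml = ET.tostring(speak, encoding="unicode")
print(ssml)
```

The visible text stays the same for readers, while the tags only change how it is spoken.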
If you really want to customize the synthetic speech to your liking, it’s possible to train a custom model based on your own audio recordings. This creates a unique natural-sounding voice for your company (as every voice is unique) with the option to adjust and make changes for various use cases (different voices for different content categories and/or subcategories, for instance) without the need to record new material.
As you can see and hear, a lot of effort is invested in making a computer sound human.
On that note, I highly suggest you give the BBC’s study Creating Synthetic Voices with Regional Accents a read if you’re curious about the creation of top-tier synthetic speech. It goes deep into the technical detail (it has to, as that is a very important aspect of the process) but is still a captivating read for those who aren’t technically savvy. The BBC recently introduced one of the smoothest AI voices I’ve ever heard, and it’s a regional one at that, which makes it all the more fascinating.
A few months ago, we decided to do a fun little experiment. We asked a small test group to try and identify which of the eight voices (male and female) provided were human and which weren’t. Here were the results:
Every participant recognized two voices as software-generated.
For four other voices, half of the testers said they were synthesized and half thought they were real.
For the last two voices, every participant identified the voices as those of a human.
Now for the fun part: the test group didn’t know that all of the voices were actually synthesized TTS voices.
We had our suspicions before the experiment, and its outcome reinforced them: when people have no bias, they can’t really tell the difference. They aren’t hunting for flaws a machine might have made, and they don’t even realize it’s a machine in the first place.
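A quick back-of-the-envelope calculation shows how close to a coin flip those results are. The sketch below assumes every voice was rated by the same number of listeners, which the results imply but do not state outright.

```python
# (number of voices, share of listeners who labeled them synthesized)
results = [(2, 1.0), (4, 0.5), (2, 0.0)]

total_voices = sum(count for count, _ in results)

# All eight voices were synthesized, so "synthesized" was always the right answer.
correct_rate = sum(count * share for count, share in results) / total_voices

print(f"{correct_rate:.0%}")  # 50%
```

Averaged over all eight voices, listeners were right half of the time, which is what pure guessing between two options would produce.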
This inevitably raises the question:
Can a synthetic voice be too real?
Unfortunately, it can. This is a subject deserving of a separate discussion because it automatically raises difficult questions about how “human” we want our artificial voices to sound. Plus, there’s the inevitable ethical aspect of people being fooled or misled through unethical use.
For now, let’s just say that AI-driven tech has advanced to the point where deepfake voices can convincingly imitate you, me, the Queen of the United Kingdom, and anyone else based on a few minutes of raw audio of our speech.
It’s a highly impressive feat and an equally scary one. I guess those are the dangers we’ll have to deal with in the 2020s: using technology that can create a 100% unique voice for a brand while making sure, through audio watermarking and other safeguards, that it isn’t used for malicious purposes. The key will be to create an environment where people trust that their data will not be shared or misused.
Generating natural-sounding, human-like speech has been a goal of scientists and speech enthusiasts for decades. Beyond its use in media, where it adds a new dimension of convenience (especially if you have your hands full) and forms the kind of personal connection that only audio can boast, synthetic speech has a noble purpose.
In most non-media cases, TTS tech is used by a huge segment of the global population: people who are visually impaired, people who cannot read, and individuals who lack verbal communication abilities. In these instances, the sound of the voice is extremely important (far more than text input) as it can drastically improve the quality of life for many.
Sure, not every synthesized voice is, or can be, the same when it comes to quality and ear-friendliness. However, I am happy to say that the voice industry is on the verge of creating speech that fully mimics the tone, delivery, pace, pitch, and inflection of human speech.
I see it as an entry point to an increasingly accessible medium that lets people consume content in the most natural and intuitive way. Just like it should be.