Why you should leverage custom text-to-speech in news
Updated: Nov 29, 2021
With news coverage being more or less equal across outlets, success in this particular media segment relies primarily on one thing:
creating and delivering an exemplary user experience.
For news publishers, the answer (other than the quality of the journalists, obviously) may very well be audio content.
In a nutshell, audio has completely changed the way people consume content these days, and much of this shift to the ears is thanks to the increasing adoption of text-to-speech (TTS) software. An embedded audio player offers the option to have each article read aloud in different languages.
But what about the quality? For some listeners, the narration is simply unlistenable due to expression-free, robotic voices. As much as smart speakers have attuned our ears to synthesized voices, the fact remains that not all TTS is equal in quality or equally ear-friendly. In fact, the technology still carries the stigma of being monotone, with awkward pacing and pronunciation, to say the least.
That’s where custom text-to-speech steps in. It can emulate a human professional (and also become a brand new revenue stream for premium publishers). The quickest example of the quality I can give you is the audio version of this article, which by default uses a Newscaster voice (more on that in a bit) called Matthew. It’s the end result of thousands of hours of human speech, whose patterns it convincingly imitates.
TTS adjusted for news experience
The beauty of custom text-to-speech is that it’s specifically designed for news content. That means reading out text with the cadence, inflection, and even emotional tone of a genuine human speaker, or at least what one would expect from one.
For example, we at Trinity Audio use a news-reading voice called Newscaster, which delivers authentic speech by selectively emphasizing certain words in a sentence, just like a real newscaster does. It is a long-form speaking style designed by Amazon Polly, the AWS text-to-speech service, and built on Neural Text-to-Speech (NTTS), which makes it possible to stress particular syllables instead of pronouncing them all equally.
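To make that concrete, here is a minimal sketch of how the Newscaster style can be requested from Amazon Polly. It relies on Polly's `<amazon:domain name="news">` SSML tag and the neural engine; the helper names `newscaster_ssml` and `newscaster_request` are made up for this illustration, and this is not necessarily how our own player calls the service.

```python
from xml.sax.saxutils import escape


def newscaster_ssml(text: str) -> str:
    """Wrap plain article text in Amazon Polly's news-domain SSML tag."""
    # escape() protects characters such as & and < that would break the XML.
    return (
        "<speak>"
        '<amazon:domain name="news">'
        f"{escape(text)}"
        "</amazon:domain>"
        "</speak>"
    )


def newscaster_request(text: str, voice_id: str = "Joanna") -> dict:
    """Build keyword arguments for Polly's SynthesizeSpeech operation."""
    return {
        "Engine": "neural",  # the Newscaster style requires the neural engine
        "VoiceId": voice_id,  # e.g. "Joanna" or "Matthew"
        "OutputFormat": "mp3",
        "TextType": "ssml",
        "Text": newscaster_ssml(text),
    }


params = newscaster_request("Markets rallied today as tech & energy stocks rose.")
print(params["Text"])

# With boto3 installed and AWS credentials configured (both assumptions here),
# the request could be sent like this:
#   import boto3
#   response = boto3.client("polly").synthesize_speech(**params)
#   audio_bytes = response["AudioStream"].read()
```

Keeping the request as a plain dictionary means the markup can be inspected and tested without touching the network or an AWS account.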
The advantage of NTTS is that it learns from training data, which results in smoother speech with no audible joins, along with rhythm and intonation that suit the intended use case. The transitions between sounds are seamless, as opposed to the standard concatenative approach, which mainly consists of finding and stringing together phonemes (distinct units of sound in a given language) from recorded speech to generate synthesized speech.
Here is the female version of the Newscaster voice, Joanna:
Now compare that to the standard TTS version of it:
and the neural style of the Joanna voice:
I’m positive you can tell the difference straight away.
It’s also possible to train a custom speech synthesis model based on your own audio recordings. What this does is create a unique natural-sounding voice for your company (as every voice is unique) with the option to tweak and make changes for various use cases (different news categories have different voices, for instance) without the need to record new phrases.
You can also further customize speech by tuning the pitch of the voice or through SSML tags (a markup language that allows marking up text for the generation of synthetic speech) that make possible the addition of elements such as pauses, numbers, date, time, and various other pronunciation instructions.
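As a rough illustration of those SSML tags, the sketch below assembles a pause, a date, a number, and a pitch change using standard SSML elements (`<break>`, `<say-as>`, `<prosody>`). The helper functions are hypothetical and written only for this example, and support differs between TTS engines: some voices ignore pitch changes, and the accepted `interpret-as` values vary by vendor.

```python
import xml.etree.ElementTree as ET
from typing import Optional
from xml.sax.saxutils import escape, quoteattr


def pause(ms: int) -> str:
    """A silent break of the given length in milliseconds."""
    return f'<break time="{ms}ms"/>'


def say_as(text: str, interpret_as: str, fmt: Optional[str] = None) -> str:
    """Tell the engine how to read a token, e.g. as a date or a cardinal number."""
    attrs = f"interpret-as={quoteattr(interpret_as)}"
    if fmt is not None:
        attrs += f" format={quoteattr(fmt)}"
    return f"<say-as {attrs}>{escape(text)}</say-as>"


def with_pitch(inner_ssml: str, pitch: str) -> str:
    """Raise or lower the pitch of already-built SSML, e.g. '+5%'."""
    return f"<prosody pitch={quoteattr(pitch)}>{inner_ssml}</prosody>"


def speak(*parts: str) -> str:
    """Join SSML fragments into one document."""
    return "<speak>" + "".join(parts) + "</speak>"


ssml = speak(
    escape("Updated on "),
    say_as("11/29/2021", "date", "mdy"),
    pause(400),
    with_pitch(
        escape("Time spent on the site rose by ")
        + say_as("168", "cardinal")
        + escape(" percent."),
        "+5%",
    ),
)

ET.fromstring(ssml)  # raises ParseError if the markup is not well-formed XML
print(ssml)
```

Building the markup from small functions keeps escaping in one place, so article text containing characters like & cannot break the document.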
All else aside, custom text-to-speech arguably adds a new dimension of convenience and a one-on-one connection that only audio can boast.
What’s in it for the publishing industry?
I suppose that’s the main question here, and the answer (in my mind) is this:
to do more on a cost-effective scale by using voice technology and the machine learning aspect of it.
Text-to-speech has progressed radically in the past few years. It no longer sounds the way it did when Amazon Alexa first burst onto the scene in 2014, and it’s a far cry from the Hawking-esque speech many associate it with.
These days, training voice models is easier and more accessible thanks to advances in processing power and compression. As a result, production costs are low, which opens up more ways to consume content faster and more easily. The news has already been written, and there is no need for production add-ons such as sound effects and music. More important, perhaps, is the fact that the quality is good enough that people have trouble telling the difference.
We actually did a little experiment a few months ago. Eight voices (male and female) were sent to a test group. We told the people involved that some of the voices were TTS voices while some were human and that they should discern which is which. The results were as follows:
Every participant recognized two voices as software-generated.
For the next four voices, 50% said they were synthesized and 50% thought they were real.
For the last two voices, every participant identified the voices as those of a human.
What the test group didn’t know was that all of the voices were in fact synthesized text-to-speech voices. So, the experiment proved to us what we suspected: when people have no bias, they don’t go looking for mistakes a machine might have made and don’t even realize it’s a machine in the first place.
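A quick back-of-the-envelope calculation puts those three results on one scale. It assumes, purely for illustration, that every voice was judged by the same number of participants, which the figures above don't state explicitly.

```python
# Share of participants who judged each voice to be synthetic, as reported above:
# two voices at 100%, four voices at 50%, two voices at 0%.
synthetic_share = [1.0, 1.0] + [0.5] * 4 + [0.0, 0.0]

# All eight voices were synthetic, so "synthetic" was always the correct verdict
# and the mean share equals the listeners' overall accuracy.
accuracy = sum(synthetic_share) / len(synthetic_share)
print(f"Overall accuracy: {accuracy:.0%}")
```

Under that assumption, the listeners were right about half the time overall, which is no better than a coin flip.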
In terms of actual application, I’ll use McClatchy as an example. Two of the media titan’s properties, The Sacramento Bee and Raleigh News & Observer, were used as a testing ground for adding TTS-generated audio versions of news. The results were as follows:
168% increase in time spent on the news site;
89% boost in story page views;
95% increase in sessions per user.
Soon after, McClatchy rolled out an audio player (spoiler: ours) on all 30 of its news sites across the US, including iconic publications such as the Miami Herald and its Spanish-language counterpart, El Nuevo Herald (because custom TTS doesn’t stop at English).
Here’s the Spanish Newscaster voice, Lupe:
More than just audio stories
Naturally, the Sacramento-based media house is just one of many news publications opting for text-to-speech technology. More and more media companies are relying on it to turn written articles into audio stories.
In addition, there are products from the same family tree that further enhance the user experience. Case in point: a content discovery unit that highlights audio content such as audio articles, podcasts, and radio shows from across the web. It provides readers with personalized audio content recommendations based on continuous learning of their behavior, which improves their overall experience.
In other words, this is content they likely want to hear, based on a variety of collected data. The idea is that busy users will spend more time on the site, exploring more content. For listeners, it’s a more flexible way to get up to date on the latest news while doing something else, like driving, cooking, or exercising.
For publishers, it’s a more alluring way to improve user experience, retain subscribers, increase readers’ (now listeners’) time on site, grow page views, increase sessions per user, and drive new subscriptions. That is, if they’re not monetizing it straight up through automated selling and insertion of ads.
You can preview the entire ecosystem here, from the way audio ads work through the audio player to the content aggregation and recommendation unit in the upper right corner.
Those would be my arguments for why you should leverage custom text-to-speech in news, and they are in line with why you should have a well-defined audio strategy in the first place.
While TTS technology has made huge leaps forward, there’s no denying that some improvements are needed to make the speech sound even more realistic. Still, audio articles, as the primary audio content in news, have been adopted with great success among users and are quickly becoming a brand new revenue stream, as evidenced by the offerings of major players in the publishing industry. It’s an entry point to an accessible medium that provides content consumption in the most natural and intuitive way, at a time and place listeners (former readers) choose. That kind of user experience is hard to beat.
Make sure you’re following me on Twitter for ongoing updates, tips, and industry takeaways!