Ron Jaworski

The challenge of audio deepfakes in digital publishing

“Breakthrough technologies always have unintended consequences.”

Sir Ken Robinson.

I’ll be the first to sing the praises of AI-generated voices and what they bring to the media ecosystem, but I can’t ignore that there’s a dark side to them.

Deep neural speech synthesis is also used to generate audio in the voice of different speakers, without their consent. Technology has advanced to the point where fake voices can convincingly imitate you, me, and anyone else, provided there are a few seconds of raw audio of our speech readily available.

Yes, you read that correctly.


The technology breaks any voice down into syllables or short sounds before rearranging them to create new sentences.

What I find even worse is that some of these tools are open-source and freely available on sites like GitHub. Case in point: SV2TTS, which you can hear in action below:

It’s a highly impressive feat and equally scary because even though the tone, inflection, accent, and such aren’t 100% in some cases, they don’t really need to be.

What you just heard is more than good enough to fool both the imitated person and voice-first systems where the deepfake can be applied. An odd trill in the tone or sporadic lack of emotion and emphasis in certain lines doesn’t change the fact that what you hear is unmistakably you.

Plus, it’s easy to explain away any imperfection or glitch as a result of a poor connection, loud environment, and so on.

How potent is this tech, for better or worse?

Oh, it’s potent.

The biggest issue here is that the technical requirements are no longer an obstacle for anyone who wants to create deepfakes. AI speech synthesis is fast, relatively easy, and sounds as close to the real deal as possible.

Freely available voice-mimicking software can deceive people and voice-activated tools like smart assistants, according to University of Chicago scientists.

The researchers used two deepfake voice synthesis systems from GitHub to mimic voices. The AutoVC tool requires up to five minutes of speech to generate a passable mimic, while the SV2TTS system needs just five seconds.

The researchers employed the software to unlock speaker recognition security systems used by Microsoft Azure, WeChat, and Amazon's Alexa system.

AutoVC fooled Azure about 15% of the time, compared to SV2TTS's 30%, and SV2TTS could spoof at least one of 10 common user-authentication trigger phrases Azure requires for 62.5% of the people the team tried.

SV2TTS further fooled both WeChat and Alexa about 63% of the time. The deepfakes were more successful at spoofing the voices of women and non-native English speakers, and they tricked 200 people into thinking they were real about half the time.

If you need further convincing, check out the Vocal Synthesis YouTube channel, which largely leverages Google’s Tacotron 2, a neural network architecture for speech synthesis directly from text. Here it is used for lighthearted stuff:

And it’s not easy to tell that it isn’t Chappelle, either.

Another study found that while an AI-based system generally outperforms humans in detecting audio deepfakes, it can still be fooled. Plus, humans are more accurate when it comes to certain attack types. It’s crucial to combine human and machine knowledge to improve audio deepfake detection, which shows just how difficult it is to distinguish what’s real from what isn’t.

Why audio deepfakes can be particularly convincing

Unlike their video counterparts, audio deepfakes have a single point of impact:

our ears.

Creating them is a similar yet different procedure: it’s based on the same principles of neural-network computation, and differs primarily in the way voice material is processed.

So, it stands to reason that it’s easier to manipulate people into believing what they’re hearing is real than it is to make them believe what they’re seeing is genuine too. Simple math: it’s more credible to believe something that was said in a dark room when cameras weren’t rolling.

But that’s only one, arguably less important reason why synthetically generated audio content is dangerous.

Much of the reason lies in the medium itself.

Audio is generally perceived as the most trusted and memorable media format that forges strong connections with listeners. It drives people to take action, regardless of the platform it delivers its message on.

Speaking of platforms, those of the audio variety are trusted more than other media sources, especially when it comes to local content and hosts where trust is the most influential driver of listening.

In the USA, in particular, nearly two-thirds say radio is either “very trustworthy” or “trustworthy”, topping every other media type except newspapers, and even then it’s a mere one percent difference.

That trust holds not just among the super fans of other media, but for the average Joe as well. Regardless of how much additional media people consume, the data is clear:

heavy users of TV, the internet, social media, and magazines all consider radio to be more trustworthy than the respective media they consume heavily.

Almost by default, audio content has a stronger pull than any other in terms of emotions.

What’s the first thing that comes to mind for how deepfake technology can be abused in the publishing landscape?

If your answer is fake news, then you and I think the same.

The good ol’ fact-checking methods just don’t cut it any longer as the entire concept of fake news is being stepped up and publishers need to adjust to this new reality.

There are plenty of possibilities when it comes to manipulating and exploiting media to get a specific message out in the world. Political misinformation seems like one of the obvious “use cases” that potentially poses the biggest threat as news does much more than just convey information.

Here is the former leader of the free world doing a Star Wars bit:

Funny stuff, but deepfakes can also easily be used for far more sinister stuff than a few good laughs.

Let’s not forget that fake news has a far stronger impact than regular news, with more layers to that impact due to its shocking and attention-grabbing nature.

I’m not naive - I know some publishers don’t care about journalism. They care about eyeballs. But when that’s the case, deepfake is merely a symptom, perhaps even good news to such publishers.

In this post, I’m taking a moral position and raising a red flag for those publishers who care about the authenticity of what they publish and don’t want to distribute fake news.

So, what can publishers do?

Publishers whose bread and butter is news and investigative journalism must incorporate more fact-checking in their efforts to make sure the content they're sending out is real, whether it’s an anonymous tip, something picked up from Twitter, or anything else.

Here are several examples that are already being implemented across prestigious publications:

  • Creating your own fact-checking units

  • Routinely asking for feedback from your audiences on your coverage and direction

  • Leveraging resources like PolitiFact, Media Bias/FactCheck, and similar

  • Arming yourself with tools

The tools thing calls for elaboration as it’s a new front, so I’ll dedicate an entire section to it:

Tools you can use in your fight against audio deepfakes

Recently, researchers developed a deepfake audio detection method to successfully spot increasingly realistic audio deepfakes.

Working with close to 200 hours and 118,000 samples of synthesized voice recordings in both English and Japanese, scientists at the Horst Görtz Institute for IT Security at Ruhr-Universität Bochum managed to find “subtle differences in the high frequencies between real and fake files,” which were enough to tell a real file from a fake one.

Their novel software is only the beginning, as they state that “these algorithms are designed as a starting point for other researchers to develop novel detection methods.”
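To make the high-frequency idea a bit more tangible, here is a minimal sketch in Python. To be clear, this is not the researchers' software. It is my own toy illustration, and the function name `high_band_energy_ratio` and the cutoff value are assumptions made up for the example. It simply measures what share of a clip's spectral energy sits above a chosen frequency, using a slow textbook Fourier transform so it runs with nothing but the standard library.

```python
import cmath
import math


def high_band_energy_ratio(samples, sample_rate, cutoff_hz):
    """Return the share of spectral energy at or above cutoff_hz.

    Hypothetical helper for illustration only. It uses a naive
    discrete Fourier transform, which is fine for a few hundred
    samples but far too slow for full-length recordings.
    """
    n = len(samples)
    total_energy = 0.0
    high_energy = 0.0
    # Only bins 0..n/2 are needed for a real-valued signal.
    for k in range(n // 2 + 1):
        coefficient = sum(
            x * cmath.exp(-2j * math.pi * k * i / n)
            for i, x in enumerate(samples)
        )
        power = abs(coefficient) ** 2
        total_energy += power
        # Bin k corresponds to the frequency k * sample_rate / n.
        if k * sample_rate / n >= cutoff_hz:
            high_energy += power
    if total_energy == 0.0:
        return 0.0  # silent clip: nothing to compare
    return high_energy / total_energy
```

A real detector would compare statistics like this across many genuine and synthetic recordings, most likely with a proper FFT library and a trained classifier. A single ratio on a single file proves nothing by itself.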

Another techy solution can be found in blockchain and other distributed ledger technologies that can be used for finding data provenance and tracing information. I find that a neat way of verifying the authenticity of digital files.
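The building block behind that kind of provenance check is a cryptographic fingerprint of the file. Here is a small Python sketch of the idea, assuming the original publisher has made a SHA-256 digest of the recording available somewhere you trust. The function names are hypothetical, and where the digest is stored (a ledger, a signed page, anything else) is outside the scope of this example.

```python
import hashlib


def sha256_digest(data: bytes) -> str:
    """Return the SHA-256 fingerprint of the raw file bytes as hex."""
    return hashlib.sha256(data).hexdigest()


def matches_published_digest(data: bytes, published_hex: str) -> bool:
    """Check a file's bytes against a digest the publisher shared.

    Hypothetical helper: a match means the bytes are identical to
    the ones that were fingerprinted, nothing more.
    """
    return sha256_digest(data) == published_hex.strip().lower()
```

Changing even a single byte of the audio produces a completely different digest, so a mismatch tells you the file is not the one that was originally registered. A match does not tell you the original recording was genuine in the first place, only that it has not been altered since.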

There is no shortage of audio spoofing detection methods, from uncovering cues that enable detection in the high-frequency band of the recordings to data augmentation methods and affective computing. Bottom line: researchers are working on the problem from multiple angles, and that’s good.

The regulators are also stepping in to help.

The United States passed a bill in 2019 that regulates the use of deepfakes with irremovable digital watermarks. On top of that, a similar act in California criminalizes the use of deepfakes in political campaign promotions and advertising.

That’s something, right?

Final thoughts

As numerous examples have shown, deepfakes are already within reach of anyone who wants to cause trouble on the internet.

Those are the dangers we’ll simply have to learn how to deal with in this decade:

how to leverage and enjoy the power of audio, but at the same time, make sure it isn’t being used for fraud at our expense.

As audiences, we need to be critical consumers of everything we see, hear, and read, which would be the social part of the larger equation in this fight.

From the publishing side, the key will be to stay aware and step up fact-checking, in the endless pursuit of an environment where audiences trust you.

Solving the problem of deepfakes will require more attention and some form of political input, as well as social actions that will supplement the technology.

So far, things are looking promising.

I’m optimistic.
