Audio AI glossary – all you need to know

Updated: Nov 17, 2021

Thanks to voice assistants and the increasing adoption of text-to-speech (TTS) tech, audio has changed the way many people consume content. This transition from eyeballs to eardrums would not have been possible, or as successful, without a human-like voice at the center of it.

Still, this is a rapidly growing landscape that’s been driven by advancements in artificial intelligence. The technology is cool but also fairly complex under the hood, making it easy to get lost in all the fancy lingo.

This is why I’ve created a glossary of all things audio AI: to help you better understand what’s going on here and humanize the technology as much as possible.

accelerated mobile pages (AMP) – a web component framework that facilitates the creation of faster-loading pages through modification and optimization of nearly all of their aspects (such as an embedded audio player).

acoustic feature – characteristics of speech sounds such as color, loudness, frequency, etc. that can be analyzed, recorded, and reproduced.

acoustic model – used in automatic speech recognition, an acoustic model represents the connection between audio signals and phonemes (or other parts of language) that comprise speech. It analyzes audio recordings and their transcripts to deliver statistical representations of sounds that constitute every word.

Alexa Skills – third-party voice capabilities of Amazon Alexa that allow users to explore different functionalities, such as playing games, listening to audio content, purchasing items and services, and so on.

alt text – short for ‘alternative text’, alt text appears when an image doesn’t load correctly and is read aloud by screen-reading tools to describe an image.

Amazon Alexa – a cloud-based voice assistant available on Amazon’s and third-party voice-enabled devices.

Amazon Polly – a cloud service that uses advanced deep learning technologies to synthesize natural sounding human speech from text. Offers both standard and neural text-to-speech lifelike voices for building speech-enabled applications.

audio ads – advertisements placed inside audio content.

audio article – a playable article that users can listen to; can be narrated by a human but is typically converted from text into audio using text-to-speech technology.

audio-friendly – capable of being turned into audio while sounding good. In terms of websites, an audio-friendly website is usually any website with content longer than one paragraph that can be audiofied. However, there are some minor things to avoid.

audiofy – to turn any textual content into audio.

Bixby – Samsung’s voice assistant integrated in specific models of Samsung phones.

Bixby Capsules – Samsung’s version of voice apps for its Bixby voice assistant.

cadence – a rhythmic flow of voice in a sequence of sounds or words; typically refers to a gentle falling in pitch when speaking or reading, such as at the end of a declarative statement.

concatenative synthesis – a speech synthesis method that generates speech by finding and stringing together phonemes taken from recorded speech. The short speech snippets are stored in an audio database, and joining them can leave small but audible variations in the resulting speech.
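
As a toy sketch of the stringing-together step, the Python below assumes a hypothetical unit database in which each phoneme label maps to a short list of made-up sample values. Real systems store recorded audio and choose among many candidate snippets per phoneme, so treat this as an illustration only.

```python
# Hypothetical unit database: phoneme label -> recorded snippet (list of samples).
# The numbers are invented for illustration; real units are recorded audio.
UNIT_DATABASE = {
    "HH": [0.1, 0.2],
    "AH": [0.5, 0.6, 0.5],
    "L": [0.3, 0.2],
    "OW": [0.7, 0.6, 0.4],
}


def concatenate_units(phonemes):
    """String together the stored snippet for each phoneme, in order."""
    samples = []
    for phoneme in phonemes:
        samples.extend(UNIT_DATABASE[phoneme])
    return samples


hello = concatenate_units(["HH", "AH", "L", "OW"])
print(len(hello))  # 10 samples: 2 + 3 + 2 + 3
```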

Cortana – Microsoft’s digital assistant that relies on the Bing search engine to perform tasks like answering users’ questions and allowing them to set reminders.

custom text-to-speech – a type of TTS which can convincingly emulate a genuine human through customization of its cadence, inflection, and emotional tone.

deep learning – a subset of machine learning that imitates the way in which the human brain gains knowledge by processing data and creating patterns for use in decision making. On a basic level, deep learning tries to simulate the natural human learning process by learning from examples. It uses neural networks that can learn, even without supervision, from unstructured or unlabeled data.

delivery – in terms of vocal communication, delivery comprises all components relating to voice, including articulation, pronunciation, pitch, volume, rate, fluency, and so on.

disfluency – the involuntary use of utterances that disrupt the smooth flow or ‘fluency’ of speech, such as crutch words like ‘uh,’ ‘um,’ ‘ah,’ and so on.

dynamic ad insertion (DAI) – a server-side ad technology that enables serving targeted audio ads into streaming audio such as live linear and on-demand audio content. This provides a more engaging and personalized listening experience.

earcon – the audio version of an ‘icon’, earcons are short, distinctive sounds that an operating system or application uses to convey certain events or other information, such as emotion or mood, during voice-first communication.

enunciation – the act of pronouncing words clearly and distinctly according to the rules governing a certain language.

flash briefing – a short, informative piece of pre-recorded audio in the form of a voice skill or action that is invoked with a voice command.

floating action button (FAB) – a button that triggers the primary action in an app’s UI. When pressed, it may reveal additional related actions.

frequency bands – intervals in a given frequency domain, each delimited by a lower frequency and an upper frequency.

Google Action – Google Actions refer to apps or applets for Google Assistant-equipped smart devices and speakers, similar to Alexa Skills.

Google Assistant – Google’s voice assistant, available on a variety of mobile and smart home devices.

Google WaveNet – a group of synthesized voices based on a deep generative model of raw audio waveforms. The WaveNet model creates raw audio waveforms from scratch using a neural network that has been trained on a large volume of speech samples.

inflection – a modulation in pitch and tone patterns in one’s speech, like the natural rise in pitch at the end of a question or any variation that adds to the liveliness of the voice.

intonation – the rise and fall in pitch of the voice in speech that conveys differences of expressive meanings and attitudes such as different emotions.

JavaScript Object Notation (JSON) – lightweight data-interchange format that uses human-readable text to store and transmit data.
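
To make the store-and-transmit idea concrete, here is a small Python sketch using the standard `json` module. The article metadata and its field names are hypothetical, chosen only for illustration.

```python
import json

# Hypothetical metadata for an audio article; the field names are illustrative.
article = {
    "title": "Audio AI glossary",
    "duration_seconds": 312,
    "voices": ["news", "story"],
}

encoded = json.dumps(article)  # Python dict -> JSON text
decoded = json.loads(encoded)  # JSON text -> Python dict

print(encoded)
print(decoded["voices"][0])  # news
```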

listen-through duration (LTD) – the average time users spend listening to audio content.

listen-through rate (LTR) – the percentage of content plays that were listened to in their entirety.
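
Both listening metrics reduce to simple arithmetic. The sketch below assumes a hypothetical list of play records, each holding the seconds listened and the total length of the content; how a real analytics dashboard collects or rounds these values may differ.

```python
# Hypothetical play records: (seconds listened, total content length in seconds).
plays = [(120, 120), (45, 120), (300, 300), (90, 300)]


def listen_through_duration(plays):
    """Average number of seconds listened per play."""
    return sum(listened for listened, _ in plays) / len(plays)


def listen_through_rate(plays):
    """Percentage of plays that reached the end of the content."""
    completed = sum(1 for listened, total in plays if listened >= total)
    return 100 * completed / len(plays)


ltd = listen_through_duration(plays)
ltr = listen_through_rate(plays)
print(ltd)  # 138.75
print(ltr)  # 50.0
```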

machine learning – a branch of artificial intelligence that focuses on the use of data and algorithms to emulate the way humans learn, gradually improving in accuracy. Classical machine learning algorithms typically rely on structured, labeled data to make predictions, with specific features defined from the input data for the model.

natural language processing (NLP) – a branch of artificial intelligence that enables computers to process human language in the form of text or spoken words like we humans do, and understand its full meaning, along with the speaker or writer’s intent and sentiment.

neural network – in a neural TTS system, the model that converts a sequence of phonemes into a sequence of spectrograms. The neural network chooses the spectrograms with frequency bands that accentuate acoustic features the human brain uses when processing speech.

neural text-to-speech (NTTS) – a type of TTS that learns from raw audio samples of actual humans speaking, allowing the creation of smoother speech with proper rhythm and intonation of the voice and no joins. As a result, NTTS takes the speech simulation a step further by producing different speaking styles for different use cases, just like humans do based on context.

neural vocoder – part of an NTTS system, a form of specific voice codec that uses deep learning networks to analyze and synthesize the human voice signal by converting the acoustic features into audible waves.

pace – the speed at which a person speaks.

phoneme – the basic unit of sound that differentiates one word from another in a particular language. A sequence of phonemes defines the pronunciation of a word.
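
A quick way to see how a single phoneme separates two words is to compare their phoneme sequences. The sketch below assumes a hypothetical two-word lexicon written with ARPAbet-style labels; it is not taken from any particular pronunciation dictionary.

```python
# Hypothetical mini pronunciation lexicon using ARPAbet-style phoneme labels.
LEXICON = {
    "pat": ["P", "AE", "T"],
    "bat": ["B", "AE", "T"],
}


def differing_positions(word_a, word_b):
    """Return (position, phoneme_a, phoneme_b) wherever the two pronunciations differ."""
    return [
        (i, a, b)
        for i, (a, b) in enumerate(zip(LEXICON[word_a], LEXICON[word_b]))
        if a != b
    ]


diff = differing_positions("pat", "bat")
print(diff)  # [(0, 'P', 'B')]
```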

pitch – the highness and lowness of tone or voice.

player load – the number of times the audio player was loaded on embedded pages.

single-page application (SPA) – a web app implementation that loads only a single web document, then updates the body content of that document when different content needs to be shown. As a result, the app feels more natural to use because the user does not have to wait for full page reloads.

Siri – a built-in, voice-controlled digital assistant that is part of Apple’s operating systems.

smart speaker – a voice-enabled speaker that is controlled by voice commands and is capable of streaming a variety of audio content, providing information, and controlling other devices through a voice assistant.

software development kit (SDK) – a set of tools that help software developers create applications for a specific platform, system, or programming language.

spectrograms – visual representations of the strength or loudness of a signal over time at a waveform’s various frequencies.
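
The numbers behind a spectrogram can be sketched in plain Python: split the signal into frames and measure how strong each frequency is in each frame. This is a minimal illustration under assumed values (a tiny sample rate, non-overlapping frames, a direct discrete Fourier transform); production tools typically use overlapping, windowed frames and a fast Fourier transform instead.

```python
import cmath
import math

SAMPLE_RATE = 64  # samples per second (kept tiny for illustration)
FRAME_SIZE = 16   # samples per frame, so each frequency bin spans 64 / 16 = 4 Hz


def tone(frequency_hz, num_samples):
    """Generate a pure sine tone as a list of samples."""
    return [
        math.sin(2 * math.pi * frequency_hz * n / SAMPLE_RATE)
        for n in range(num_samples)
    ]


def frame_magnitudes(frame):
    """Strength of each frequency bin (0 Hz up to half the sample rate) in one frame."""
    size = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * n / size) for n, x in enumerate(frame)))
        for k in range(size // 2 + 1)
    ]


def spectrogram(signal):
    """Split the signal into non-overlapping frames and analyze each one."""
    return [
        frame_magnitudes(signal[start:start + FRAME_SIZE])
        for start in range(0, len(signal) - FRAME_SIZE + 1, FRAME_SIZE)
    ]


# An 8 Hz tone followed by a 16 Hz tone: one frame each.
signal = tone(8, FRAME_SIZE) + tone(16, FRAME_SIZE)
spec = spectrogram(signal)
loudest_bins = [row.index(max(row)) for row in spec]
print([b * SAMPLE_RATE // FRAME_SIZE for b in loudest_bins])  # [8, 16]
```

Each row of `spec` is one moment in time and each column is one frequency bin, which is the grid a spectrogram image colors in.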

speech synthesis – artificial production of human speech based on written input.

speech-to-text (STT) – software that enables transcription of audio content into written words. Typically, STT breaks spoken words into short samples and associates those samples with simple phonemes or units of pronunciation, with complex algorithms sorting the results to try to predict what was said.

Speech Synthesis Markup Language (SSML) – a markup language that uses tags to mark up text elements for the generation of synthetic speech. As such, SSML represents a standard way to control specific aspects of speech such as pronunciation, volume, pitch, rate, and so on.
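
For a feel of what SSML looks like, the Python sketch below builds a short snippet with the standard library’s XML tools. The `speak`, `break`, and `prosody` elements come from the W3C SSML specification; the sentence text is made up, and individual TTS engines may support only a subset of tags and attribute values.

```python
import xml.etree.ElementTree as ET

# Root element of every SSML document.
speak = ET.Element("speak")
speak.text = "Welcome to the glossary."

# Insert a half-second pause, followed by more plain text.
pause = ET.SubElement(speak, "break", {"time": "500ms"})
pause.tail = "Here is the next term, "

# Slow the rate and raise the pitch for one phrase.
prosody = ET.SubElement(speak, "prosody", {"rate": "slow", "pitch": "high"})
prosody.text = "read slowly and at a higher pitch."

ssml = ET.tostring(speak, encoding="unicode")
print(ssml)
```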

synthetic voice/speech – voice/speech generated through speech synthesis.

speech pattern – a characteristic way or mode of verbal expression unique to each person.

text-to-speech (TTS) – generation of synthesized speech from text, so that digital text can be read aloud in a number of use cases. By leveraging deep learning, TTS now produces very natural-sounding speech that includes many characteristics of genuine human speech, such as changes to pitch, rate, pronunciation, and inflection, to name a few.

timbre – the speech component that characterizes the quality of a voice as distinct from its pitch and intensity. Often described as ‘color’, it helps listeners distinguish between two or more voices, since each speaker produces a unique sound wave when speaking.

tone – a variation in the pitch of the voice while speaking that portrays the way something is said, as well as the attitude of the speaker.

vocoder – a portmanteau of voice and encoder, this is a subset of voice codecs that analyzes acoustic features and synthesizes the human voice signal as an audible waveform. The vocoder is vital to the final audio quality.

voice user interface (VUI) – an invisible interface whose technology requires voice to interact with it, designed to simulate the feeling of conversations between users and devices, and offer a screen-free way of completing tasks.

voice assistant – a software app that uses voice recognition, NLP, and voice synthesis to listen to specific voice commands and return relevant information or perform specific functions as requested by the user.

voice-first – a term coined by Brian Roemmele to label any interface designed to have voice as the primary mode of interaction, both as input and output. Such an interface can be integrated with a screen display, but a screen may not be necessary for most use cases and applications.

voice style – a specifically made synthesized AI voice that is suitable for a particular type of content such as news or story-like content.

wake word – a special word or phrase that is meant to activate a voice-enabled device when spoken.

waveform – a continuously varying audio signal that is represented through changes in intensity over time.
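
In digital audio, a waveform is stored as a list of intensity values sampled at regular time steps. The sketch below generates a pure sine tone that way; the sample rate and the 440 Hz frequency are assumed values chosen for illustration.

```python
import math

SAMPLE_RATE = 8000  # samples per second (an assumed value for illustration)


def sine_wave(frequency_hz, duration_seconds, amplitude=1.0):
    """Sample a sine tone: one intensity value per time step."""
    num_samples = round(SAMPLE_RATE * duration_seconds)
    return [
        amplitude * math.sin(2 * math.pi * frequency_hz * n / SAMPLE_RATE)
        for n in range(num_samples)
    ]


wave = sine_wave(frequency_hz=440, duration_seconds=0.01)
print(len(wave))  # 80
```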

Those are the most important text-to-speech terms you should know. To get a taste of how the technology works, give our demo a spin and hear for yourself what the fuss is all about.

