The Innovators: AI voice technology for broadcast: A professional audio perspective

THE HAGUE, Netherlands — In the diverse world of artificial intelligence voice technology, much of the innovation is tailored toward e-learning platforms and social media content, leaving a gap in solutions specifically designed for high broadcast production standards.

This guide aims to bridge that gap for professionals in the audio industry, offering insights into the crucial aspects of AI voice technology that must be scrutinized when selecting a system for broadcast use. It highlights the need for pro audio users to delve deeper into the capabilities of AI voice solutions, ensuring that the technology they choose can rise to the challenges of professional broadcasting.

Sample rate: Pro audio standards and spectrum analysis

The sample rate in AI voice technology is critical for sound quality, especially in professional audio. Most AI voice systems generate audio at 16 kHz or 24 kHz, whereas pro audio standards call for 48 kHz. Some systems deliver 48 kHz files, but the underlying model may still generate at a lower rate and upsample the result. The internal sample rate can be found by analyzing the audio spectrum: a signal carries no content above half its sample rate, so audio generated at 16 kHz or 24 kHz shows a cutoff at 8 kHz or 12 kHz, respectively.
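
As a rough illustration of that spectrum check, the following Python sketch estimates where the content of a frame stops and maps the cutoff to the smallest likely internal rate. It is a minimal sketch under stated assumptions: the function names, the -60 dB floor and the synthetic test frame are all hypothetical, and a real measurement would use a windowed FFT over actual program material rather than a naive DFT on pure tones.

```python
import cmath
import math

COMMON_RATES_HZ = (16_000, 24_000, 48_000)  # the rates named in the text


def magnitude_spectrum(frame):
    """Naive DFT magnitudes for bins 0..N/2; adequate for a short frame."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame))) / n
        for k in range(n // 2 + 1)
    ]


def estimate_cutoff_hz(frame, file_rate_hz, floor_db=-60.0):
    """Highest frequency whose level is within floor_db of the spectral peak."""
    mags = magnitude_spectrum(frame)
    peak = max(mags)
    if peak == 0.0:
        return 0.0
    threshold = peak * 10 ** (floor_db / 20)
    top_bin = max(k for k, m in enumerate(mags) if m >= threshold)
    return top_bin * file_rate_hz / len(frame)


def likely_internal_rate_hz(cutoff_hz):
    """Smallest common rate whose Nyquist frequency covers the cutoff."""
    for rate in COMMON_RATES_HZ:
        if cutoff_hz <= rate / 2:
            return rate
    return None


# Synthetic stand-in for a "48 kHz" file whose content stops at 7.5 kHz.
FILE_RATE_HZ = 48_000
frame = [
    sum(math.sin(2 * math.pi * f * i / FILE_RATE_HZ) for f in (1_000, 4_000, 7_500))
    for i in range(480)
]
cutoff = estimate_cutoff_hz(frame, FILE_RATE_HZ)
print(cutoff, likely_internal_rate_hz(cutoff))
```

For this synthetic frame the sketch reports a cutoff of 7,500 Hz and suggests a 16 kHz internal rate, even though the file itself is nominally 48 kHz.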

SSML support: Tailoring AI voice for pro audio

Speech Synthesis Markup Language (SSML) is a crucial tool in professional audio for refining style, delivery and pronunciation. Many text-to-speech providers do not support this feature.

SSML features the following:

Prosody: Fine-tunes pitch, rate and volume for emotional depth and content emphasis.
Pause: Places pauses strategically to enhance speech rhythm and realism.
Emphasis: Strengthens expression by highlighting keywords or phrases.
Phoneme: Precisely dictates pronunciation, ensuring accuracy for complex terms.
Say-as: Specifies how specialized content, such as dates or numbers, should be spoken.
Voice and language: Adapts voice traits and supports multilingual projects — essential for diverse pro audio applications.
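
To make the list above concrete, here is a small SSML fragment held in a Python string and checked for well-formedness with the standard library. The element names follow the W3C SSML 1.1 vocabulary; the voice name is hypothetical, the `interpret-as` value is a common vendor convention rather than part of the standard, and support for each element varies by provider, so treat this as a sketch rather than something every engine will accept.

```python
import xml.etree.ElementTree as ET

# Hypothetical script: voice name and phone number are placeholders.
SSML = """<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-GB">
  <prosody rate="95%" pitch="-2st">Good evening, and welcome to the programme.</prosody>
  <break time="400ms"/>
  Tonight: the <emphasis level="strong">great</emphasis>
  <phoneme alphabet="ipa" ph="təˈmɑːtəʊ">tomato</phoneme> debate.
  Call the studio on <say-as interpret-as="telephone">555 0100</say-as>.
  <voice name="hypothetical-dutch-voice"><lang xml:lang="nl-NL">Goedenavond.</lang></voice>
</speak>"""


def element_names(ssml):
    """Parse the SSML (raises ParseError if malformed) and list the tags used."""
    root = ET.fromstring(ssml)
    # Strip the XML namespace so only the local element name remains.
    return sorted({el.tag.split("}")[-1] for el in root.iter()})


print(element_names(SSML))
```

Parsing the markup before sending it to a provider catches unbalanced tags early, although it does not confirm that the provider honors every element.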

Multistyle voice generation

AI voice systems with multistyle generation capabilities can produce various speech styles and emotions, such as cheerful, serious, formal, informal and enthusiastic. The degree of these styles can be adjusted, offering nuanced variations in tone and enhancing the versatility of the voice output.

Wedel Software and Adthos CEO Raoul Wedel

Lexicon: The foundation for accurate pronunciation

A comprehensive lexicon is crucial in AI voice technology for resolving pronunciation issues. The lexicon serves as a reference guide for the AI, ensuring it pronounces words correctly, especially those that are uncommon, technical or borrowed from other languages.

The need for a lexicon arises from the inherent limitations of text-to-speech systems in accurately predicting the pronunciation of every word.

Words with nonstandard pronunciations, industry-specific jargon, names and borrowed words often pose challenges. A well-developed lexicon addresses these challenges by providing specific pronunciation guides for these words.

When pronunciation errors are identified, corrections are applied directly to the lexicon. This involves specifying the phonetic representation of the word or phrase according to the International Phonetic Alphabet or a similar phonetic system.

Once updated, the AI model references this lexicon to produce the correct pronunciation.
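
The correction workflow described above can be sketched in Python as a lookup table that wraps known words in SSML `phoneme` tags before synthesis. This is an illustrative assumption about how a lexicon might be applied, not a description of any particular product; the `LEXICON` entries and the `apply_lexicon` helper are hypothetical, and many systems apply the lexicon inside the engine instead.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Hypothetical lexicon: lowercase word -> IPA transcription.
LEXICON = {
    "quay": "kiː",
    "worcester": "ˈwʊstə",
}


def apply_lexicon(text, lexicon):
    """Wrap every word found in the lexicon in an SSML phoneme element."""
    pieces = []
    # The capturing group keeps the words in the result, alternating
    # with the punctuation and whitespace between them.
    for token in re.split(r"([A-Za-z']+)", text):
        ipa = lexicon.get(token.lower())
        if ipa is None:
            pieces.append(escape(token))
        else:
            pieces.append(
                f'<phoneme alphabet="ipa" ph={quoteattr(ipa)}>{escape(token)}</phoneme>'
            )
    return "".join(pieces)


print(apply_lexicon("Meet at the quay.", LEXICON))
```

Fixing a mispronunciation then means adding or editing one dictionary entry, after which every script that contains the word picks up the correction.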

Voice conversion for targeted voice performances

Voice conversion technology in AI is particularly valuable for specific voice performance requirements, such as in advertising. It transforms one voice into another while preserving the source speaker’s accent and pronunciation.

Fixed-length voice outputs in advertising

AI voice technology can produce speech within exact time constraints, which is essential in advertising, where spots must fit fixed lengths.
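
The article does not say how a system reaches an exact length. One conceivable approach, sketched here in Python purely as an assumption, is to synthesize once at natural speed, measure the duration, and derive a speaking-rate factor that fits the target slot. The function names and the allowed rate range are hypothetical; a real system may instead edit pauses or regenerate the read.

```python
def fit_rate(natural_seconds, target_seconds, min_rate=0.9, max_rate=1.15):
    """Rate factor that makes the read fit the slot, or None if it would
    fall outside a range assumed to still sound natural."""
    if natural_seconds <= 0 or target_seconds <= 0:
        raise ValueError("durations must be positive")
    rate = natural_seconds / target_seconds
    if not min_rate <= rate <= max_rate:
        return None
    return rate


def prosody_rate_percent(rate):
    """Format a rate factor as a percentage string, e.g. for an SSML prosody rate."""
    return f"{round(rate * 100)}%"


# A read that runs 32.4 seconds at natural speed, to fit a 30-second spot.
rate = fit_rate(32.4, 30.0)
print(prosody_rate_percent(rate))
```

Here the read needs to run at about 108% speed. A 45-second read for the same slot returns `None`, which signals that the copy should be shortened rather than sped up.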


Training costs vs. operational expenses

The economics of AI voice technology vary significantly. Some systems are cheaper to train but cost more per character generated, because a low initial training cost does not necessarily translate into low operational costs. The cost per character depends on factors including the sophistication of the technology, the quality of the output and the efficiency of the AI algorithms.

Systems with lower training costs might use less sophisticated models, resulting in higher per-character costs due to less efficient processing or the need for more post-processing to reach a desired quality level. Conversely, systems with higher initial training costs often use more advanced models, leading to lower costs per character due to more efficient processing and higher-quality outputs that require less editing.
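
The trade-off can be made concrete with a little arithmetic. The Python sketch below compares two hypothetical systems and finds the volume at which their total costs match; every figure is invented for illustration and does not reflect real vendor pricing.

```python
def total_cost(training_cost, cost_per_million_chars, million_chars):
    """One-off training cost plus usage cost for a given volume."""
    return training_cost + cost_per_million_chars * million_chars


def break_even_million_chars(system_a, system_b):
    """Volume (in millions of characters) at which both systems cost the same.

    Each system is a (training_cost, cost_per_million_chars) pair.
    Returns None if the systems never cross at a positive volume.
    """
    (train_a, rate_a), (train_b, rate_b) = system_a, system_b
    if rate_a == rate_b:
        return None
    volume = (train_b - train_a) / (rate_a - rate_b)
    return volume if volume > 0 else None


# Hypothetical figures in an arbitrary currency.
budget = (1_000.0, 30.0)   # cheap to train, expensive per character
premium = (5_000.0, 10.0)  # expensive to train, cheap per character

volume = break_even_million_chars(budget, premium)
print(volume, total_cost(*budget, volume), total_cost(*premium, volume))
```

With these made-up numbers the two systems cost the same at 200 million characters. Below that volume the cheaper-to-train system wins, and above it the more expensive model pays for itself.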

Audio processing in AI voice technology

Audio processing is crucial for broadcast-quality voice content, involving level adjustments, equalization and compression to meet broadcasting standards.

However, using processed speech to train AI voice models can introduce artifacts, degrading the model’s performance and making the AI voice sound less natural. It’s advisable to apply audio processing after the text-to-speech output is generated, so that the model trains on clean speech while the final output still receives the processing needed to meet broadcast standards.
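
As a minimal illustration of processing applied after synthesis, the Python sketch below adjusts the level of generated samples to a target peak. It stands in for only the simplest step of a broadcast chain: the function name and the -1 dBFS default are assumptions, samples are taken to be floats between -1.0 and 1.0, and real delivery specifications usually target loudness rather than peak level and add equalization and compression.

```python
import math


def normalize_peak(samples, target_dbfs=-1.0):
    """Scale samples so the loudest one sits at target_dbfs (peak, not loudness)."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # silence: nothing to scale
    gain = 10 ** (target_dbfs / 20) / peak
    return [s * gain for s in samples]


# Stand-in for raw text-to-speech output; the model itself never sees
# the processed version, only clean training speech.
tts_output = [0.1, -0.5, 0.25]
processed = normalize_peak(tts_output, target_dbfs=-6.0)
print(max(abs(s) for s in processed))
```

Keeping this step outside the training data path matches the order recommended above: clean speech in, processing only on the way out.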

Summary

AI voice technology, with its diverse capabilities like SSML support, lexicon accuracy, multistyle voice generation and voice conversion, offers transformative potential for digital voice interaction.

Understanding these aspects, including the importance of audio processing and economic considerations, is crucial for effectively using this technology.

The author is CEO of Wedel Software and Adthos.
