THE HAGUE, Netherlands — In the diverse world of artificial intelligence voice technology, much of the innovation is tailored toward e-learning platforms and social media content, leaving a gap in solutions specifically designed for high broadcast production standards.
This guide aims to bridge that gap for professionals in the audio industry, offering insight into the aspects of AI voice technology that should be scrutinized when selecting a system for broadcast use, so that the technology chosen can rise to the challenges of professional broadcasting.
Sample rate: Pro audio standards and spectrum analysis
The sample rate in AI voice technology is critical for sound quality, especially in professional audio. Most AI voice systems generate speech at 16 kHz or 24 kHz, while pro audio standards call for 48 kHz. Some systems deliver 48 kHz files, but the underlying synthesis may still run at a lower rate, with the output simply upsampled. The true internal rate can be revealed by analyzing the audio spectrum: because a signal cannot contain energy above half its sample rate (the Nyquist limit), a 16 kHz model shows a cutoff near 8 kHz and a 24 kHz model near 12 kHz.
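As a rough check, the delivered file's spectrum can be inspected programmatically. The sketch below is a minimal illustration using NumPy, not any vendor's tool: it estimates the highest frequency with significant energy, and the demo fabricates a 48 kHz signal band-limited to 8 kHz to mimic a file synthesized internally at 16 kHz.

```python
import numpy as np

def estimate_cutoff_hz(signal, sample_rate, floor_db=-40.0):
    """Estimate the highest frequency with significant energy.

    A file delivered at 48 kHz but synthesized at 16 kHz will show
    a cutoff near 8 kHz, the Nyquist limit of the internal rate.
    """
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    # Magnitude in dB relative to the strongest bin.
    db = 20 * np.log10(spectrum / (spectrum.max() + 1e-12) + 1e-12)
    significant = np.where(db > floor_db)[0]
    return freqs[significant[-1]] if significant.size else 0.0

# Demo: white noise in a 48 kHz container, band-limited to 8 kHz,
# mimicking a 16 kHz internal synthesis rate.
rng = np.random.default_rng(0)
noise = rng.standard_normal(48_000)
spec = np.fft.rfft(noise)
freqs = np.fft.rfftfreq(len(noise), d=1.0 / 48_000)
spec[freqs > 8_000] = 0.0
band_limited = np.fft.irfft(spec)
print(round(estimate_cutoff_hz(band_limited, 48_000)))
```

A real workflow would load the vendor's WAV file instead of synthetic noise, but the cutoff test is the same.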
SSML support: Tailoring AI voice for pro audio
Speech Synthesis Markup Language (SSML) is a crucial tool in professional audio for refining style, delivery and pronunciation, yet many text-to-speech providers do not support it.
SSML features the following:
- Prosody: Fine-tunes pitch, rate and volume for emotional depth and content emphasis.
- Pause: Strategically placed pauses to enhance speech rhythm and realism.
- Emphasis: Strengthens expression by highlighting keywords or phrases.
- Phoneme: Precisely dictates pronunciation, ensuring accuracy for complex terms.
- Say-as: Instructions on the vocal interpretation of specialized content.
- Voice and language: Adapts voice traits and supports multilingual projects — essential for diverse pro audio applications.
- Multistyle voice generation: Systems with this capability can produce various speech styles and emotions, such as cheerful, serious, formal, informal and enthusiastic. The degree of each style can be adjusted, offering nuanced variations in tone and enhancing the versatility of the voice output.
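To illustrate how these elements combine, here is a hedged sketch of an SSML document. Tag and attribute support varies by provider, and the voice name, speaker name and IPA transcription below are hypothetical:

```xml
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <voice name="example-broadcast-voice"> <!-- hypothetical voice name -->
    <prosody rate="95%" pitch="-2st">
      Tonight's <emphasis level="strong">top story</emphasis>:
      <break time="400ms"/>
      rates fall to <say-as interpret-as="cardinal">3</say-as> percent, says
      <phoneme alphabet="ipa" ph="ˈsmɪθəsən">Smitheson</phoneme>.
    </prosody>
  </voice>
</speak>
```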
Lexicon: The foundation for accurate pronunciation
A comprehensive lexicon is crucial in AI voice technology for resolving pronunciation issues. The lexicon serves as a reference guide for the AI, ensuring it pronounces words correctly, especially those that are uncommon, technical or borrowed from other languages.
The need for a lexicon arises from the inherent limitations of text-to-speech systems in accurately predicting the pronunciation of every word.
Words with nonstandard pronunciations, industry-specific jargon, names and borrowed words often pose challenges. A well-developed lexicon addresses these challenges by providing specific pronunciation guides for these words.
When pronunciation errors are identified, corrections are applied directly to the lexicon.
This involves specifying the phonetic representation of the word or phrase according to the International Phonetic Alphabet or a similar phonetic system.
Once updated, the AI model references this lexicon to produce the correct pronunciation.
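In practice, lexicon corrections are often applied by wrapping flagged words in phoneme markup before synthesis. A minimal sketch, assuming a simple word-to-IPA dictionary; the entries and transcriptions are illustrative, not official pronunciations:

```python
import re

# Hypothetical lexicon mapping problem words to IPA transcriptions.
LEXICON = {
    "Wedel": "ˈveɪdəl",   # illustrative IPA, not an official transcription
    "codec": "ˈkoʊdɛk",
}

def apply_lexicon(text, lexicon):
    """Wrap lexicon words in SSML <phoneme> tags so the TTS engine
    uses the curated pronunciation instead of its own guess."""
    for word, ipa in lexicon.items():
        tag = f'<phoneme alphabet="ipa" ph="{ipa}">{word}</phoneme>'
        text = re.sub(rf"\b{re.escape(word)}\b", tag, text)
    return text

print(apply_lexicon("The Wedel codec update", LEXICON))
```

When a new mispronunciation is reported, only the dictionary is edited; the markup step and the model itself stay unchanged.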
Voice conversion for targeted voice performances
Voice conversion technology in AI is particularly valuable for specific voice performance requirements, such as in advertising. It enables one voice to be transformed into another while preserving the speaker's accent and pronunciation.
Fixed-length voice outputs in advertising
AI voice technology can produce speech within exact time constraints, which is essential for advertising where fixed-length ads are crucial.
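One common approach is to render the copy once, measure its duration, and re-render with an adjusted speaking rate. A minimal sketch of the rate calculation; the function name and the clamp range are illustrative assumptions, since the actual fitting mechanism is vendor-specific:

```python
def rate_for_target(duration_s, target_s, min_rate=0.8, max_rate=1.2):
    """Compute the SSML prosody rate that squeezes or stretches a
    rendered read into a fixed ad slot, clamped to a range that still
    sounds natural."""
    rate = duration_s / target_s
    return max(min_rate, min(max_rate, rate))

# A 31.5-second read must fit a 30-second spot: speak about 5% faster.
print(f"{rate_for_target(31.5, 30.0):.0%}")
```

If the required rate exceeds the clamp, the copy itself is too long for the slot and should be shortened rather than rushed.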
Training costs vs. operational expenses
The economics of AI voice technology can vary significantly. Some systems may be cheaper to train but come at a higher cost per character. This is because the initial training cost, while potentially lower, doesn’t always translate to lower operational costs. The cost per character generated depends on various factors, including the sophistication of the technology, the quality of the output and the efficiency of the AI algorithms.
Systems with lower training costs might use less sophisticated models, resulting in higher per-character costs due to less efficient processing or the need for more post-processing to reach a desired quality level. Conversely, systems with higher initial training costs often use more advanced models, leading to lower costs per character due to more efficient processing and higher-quality outputs that require less editing.
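The trade-off can be made concrete with a simple break-even calculation. The dollar figures below are hypothetical, chosen purely to show the arithmetic:

```python
def break_even_chars(train_a, per_char_a, train_b, per_char_b):
    """Characters generated at which system B (expensive to train,
    cheap per character) becomes cheaper overall than system A."""
    if per_char_a <= per_char_b:
        return None  # A is cheaper at any volume
    return (train_b - train_a) / (per_char_a - per_char_b)

# Hypothetical: A trains for $2,000 at $0.00030/char;
# B trains for $10,000 at $0.00010/char.
print(f"{break_even_chars(2_000, 0.0003, 10_000, 0.0001):,.0f} characters")
```

Below that volume the cheap-to-train system wins; above it, the more sophisticated model pays for itself.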
Audio processing in AI voice technology
Audio processing is crucial for broadcast-quality voice content, involving level adjustments, equalization and compression to meet broadcasting standards.
However, using processed speech to train AI voice models can introduce artifacts, degrading the model's performance and resulting in a less natural AI voice. It is advisable to apply audio processing after text-to-speech generation, so the model trains on clean speech while the final output still benefits from enhanced audio quality, maintaining the fidelity essential for broadcast standards.
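The ordering matters more than the specific processor. As a minimal sketch of the "process after synthesis" step, here is a simple peak normalization applied to raw TTS output; a real broadcast chain would add equalization and compression, but they belong at this same stage:

```python
import numpy as np

def normalize_peak(audio, target_dbfs=-1.0):
    """Peak-normalize synthesized speech after TTS generation, so the
    training data stays clean while delivery meets level targets."""
    peak = np.max(np.abs(audio))
    if peak == 0:
        return audio
    target = 10 ** (target_dbfs / 20)  # dBFS to linear amplitude
    return audio * (target / peak)

tts_output = np.array([0.05, -0.2, 0.1])  # toy synthesized samples
processed = normalize_peak(tts_output)
print(round(float(np.max(np.abs(processed))), 3))
```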
Summary
AI voice technology, with its diverse capabilities like SSML support, lexicon accuracy, multistyle voice generation and voice conversion, offers transformative potential for digital voice interaction.
Understanding these aspects, including the importance of audio processing and economic considerations, is crucial for effectively using this technology.
The author is CEO of Wedel Software and Adthos.