RIO DE JANEIRO — It began like many good stories: Two colleagues with entirely different backgrounds, fired by a common, genuine passion for hands-on research, unexpectedly collaborating on an abstract concept. One of them, Pedro Leite, a machine learning engineer and AI researcher at Grupo Globo, has spent years exploring generative audio systems. The other, Luiz Fernando Kruszielski, is an innovation technologies specialist at Globo and a veteran sound engineer with a reputation for a “golden ear” — the ability to instinctively hear what others miss.
Kruszielski and Leite were intrigued by early AI technology that enabled singers to create different voices and wondered whether it could produce something suitable for radio broadcasting.
The AI speech transformation process changes a performer’s voice into a target voice using information about timbre, pitch and spectrum — a sort of “voice DNA” — from a voice model. Emotions come from the performer’s voice — the source. This way, “the source plays with the target (voice), and the result is a very reliable voice,” Kruszielski said. “In broadcast applications, listeners shouldn’t be able to perceive the resultant voice as something unnatural.”
Not all sounds have a fundamental frequency, but in voiced speech, the vocal folds generate a fundamental frequency — the basic rate at which they vibrate — and this determines the perceived pitch of the voice. The harmonic structure built on this base frequency shapes the timbre, the tonal quality that makes one voice sound different from another. Because every speaker has a unique combination of fundamental frequency and harmonic patterns, AI-assisted speech-to-speech systems must analyze these elements to recreate a speaker’s characteristic timbre while preserving the timing, rhythm, melody and emotional contours of the original performance.
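For readers who want a feel for what that analysis involves, the short Python sketch below (using the open-source librosa library, not Globo’s own tooling; the file name is a placeholder) estimates a recording’s fundamental frequency and a rough measure of its spectral brightness, two of the ingredients that make up this “voice DNA.”

```python
# Minimal sketch (not Globo's pipeline): extracting two of the "voice DNA" features
# the article describes -- fundamental frequency and spectral brightness -- with librosa.
import librosa
import numpy as np

# Hypothetical input file; any mono speech recording works.
y, sr = librosa.load("source_performance.wav", sr=None, mono=True)

# Estimate the fundamental frequency (f0) frame by frame.
# pyin returns NaN for unvoiced frames, so we average only the voiced ones.
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr
)
mean_f0 = np.nanmean(f0)

# The spectral centroid is a rough proxy for the "brightness" of the timbre.
centroid = librosa.feature.spectral_centroid(y=y, sr=sr)

print(f"Mean fundamental frequency: {mean_f0:.1f} Hz")
print(f"Mean spectral centroid:     {float(centroid.mean()):.1f} Hz")
```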
Transforming into a wolf
In the beginning, Kruszielski and Leite explored the technology simply out of interest, running quick tests on open-source models with hardware available in their lab. It was interesting but not obviously useful. The opportunity to push their research a step further came when they encountered a particular creative challenge. A key scene in the Globo drama series “Vermelho Sangue” (“Blood Red”) required a girl to transform into a werewolf, and her voice had to transform accordingly. The writing team wanted accuracy — not a movie-style monster or a generic animal sound, but the vocalization of a specific species. They tried every traditional technique, including layering recordings, shifting pitch and formants (the resonant frequencies of the vocal tract that shape the characteristic timbre of a voice or vowel sound, independent of pitch), blending organic and synthetic tones, and using early voice-to-voice models. Yet nothing felt authentic enough.
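To illustrate why those conventional tools fell short, the sketch below shows a plain pitch shift in Python with librosa; it is not the team’s actual processing chain, and the file names are placeholders. A straight pitch shift moves the whole spectrum, formants included, which is one reason purely DSP-based creature voices tend to ring false.

```python
# Minimal sketch (illustrative only, not the production chain described in the article):
# a conventional pitch shift with librosa. Shifting pitch alone drags the formants
# along with it, flattening the character of the original voice.
import librosa
import soundfile as sf

y, sr = librosa.load("girl_line.wav", sr=None)  # hypothetical source clip

# Drop the pitch by five semitones to darken the voice.
shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=-5)

sf.write("girl_line_shifted.wav", shifted, sr)
```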

Kruszielski and Leite realized that if they wanted authenticity, they had to start with an authentic source: real wolf vocalizations captured in scientifically documented recordings. That realization would become the defining insight of the entire project.
So, their next meeting was with a wildlife biologist specializing in animal vocalizations who had an extensive archive of wolf recordings. Unfortunately, the first outputs based on those samples sounded synthetic and unrealistic. Kruszielski and Leite didn’t give up. The field recordings included layers of environmental noise, such as wind, rustling vegetation and distant birds. To train a model capable of producing realistic transformations, they needed isolated vocalizations. They carefully cleaned each file, separating harmonic content from wildlife ambience and removing contamination without damaging the integrity of the wolf’s “voice.”
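The article does not detail the exact cleanup chain, but the general idea, favoring the tonal content of a noisy field recording over broadband ambience, can be sketched in a few lines of Python with librosa. File names and parameter values here are illustrative assumptions only.

```python
# Minimal sketch, assuming librosa: one common way to favor the tonal (harmonic)
# content of a noisy field recording over broadband ambience. This only illustrates
# the general idea; it is not Globo's actual cleanup chain.
import librosa
import soundfile as sf

y, sr = librosa.load("wolf_field_recording.wav", sr=None)  # hypothetical file

# Split the signal into harmonic (tonal) and percussive/transient components;
# a larger margin rejects non-harmonic energy more aggressively.
harmonic, _ = librosa.effects.hpss(y, margin=3.0)

# Trim leading and trailing low-level ambience.
trimmed, _ = librosa.effects.trim(harmonic, top_db=30)

sf.write("wolf_vocalization_clean.wav", trimmed, sr)
```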
As soon as the refined dataset was fed into the AI model, everything changed. The voice of the transformed girl carried the texture, tension and resonance of a real animal. When they played the result for the director, he stood up from his chair. AI speech-to-speech was no longer an experiment. It was ready for production.
Built for sound engineers
The backbone of the system Kruszielski and Leite designed is the open-source RVC (Retrieval-based Voice Conversion) Project AI algorithm. Although powerful and flexible, it is not intended for everyday workflows in a sound department: it requires command-line interfaces, cryptic flags, hidden configuration files and robust IT skills.

Photo: Carlos Eduardo Rocha Miranda
Kruszielski and Leite designed a custom GUI specifically for audio specialists, with familiar controls. By reframing the AI system as studio-grade software rather than a technical experiment, they made it accessible to colleagues across the production team. All processing runs on a consumer-grade, gamer-level graphics card from the Nvidia RTX 40 family, which retails at a price well within the reach of any production studio and is capable of faster-than-real-time processing. What started as a two-person side project became part of Globo’s broader audio production workflow.
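The article does not show Globo’s interface, but the general pattern, a simple desktop front end that hides a command-line conversion step behind a familiar button, can be sketched as below. The converter command, file names and window text are placeholders, not the real RVC invocation.

```python
# Minimal sketch of the general pattern (not Globo's actual tool): a small desktop
# front end that wraps a command-line voice-conversion step in familiar controls.
# The converter command below is a placeholder, not a real RVC command.
import subprocess
import tkinter as tk
from tkinter import filedialog, messagebox

def run_conversion():
    source = filedialog.askopenfilename(title="Select source recording")
    if not source:
        return  # user cancelled the file dialog
    # Placeholder command; a real tool would invoke the trained model here.
    cmd = ["my_voice_converter", "--input", source, "--output", "converted.wav"]
    try:
        subprocess.run(cmd, check=True)
        messagebox.showinfo("Done", "Converted file written to converted.wav")
    except (OSError, subprocess.CalledProcessError) as err:
        messagebox.showerror("Conversion failed", str(err))

root = tk.Tk()
root.title("Voice conversion (sketch)")
tk.Button(root, text="Convert a recording...", command=run_conversion).pack(padx=40, pady=20)
root.mainloop()
```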
The team applied lessons learned from the wolf transformation to everyday production challenges, such as correcting minor dialog mistakes without recalling actors for costly retakes. They have also used the system to modify accents. In one case, an actress needed to perform with a Yiddish accent that proved challenging during the shoot. With a reference sample and a carefully tuned AI audio model, they were able to shift her accent in post while preserving the shape, emotion and timing of her original performance. The result was seamless and expressive.
The risk of collapsing an illusion
Perhaps the most impactful application for high-volume productions is converting low-quality recordings into studio-grade audio. Actors or talents who are traveling or unavailable for booth time can record a line on a phone, in a hotel room, or anywhere convenient. The AI system reperforms the lines using the actor’s or talent’s vocal identity, producing audio that sounds as if it were recorded in ideal studio conditions.
For teams producing large volumes of scripted content, this flexibility can be transformative, improving the quality of everyday productions while saving time.
Kruszielski believes the technology does not support the idea that AI might soon replace actors wholesale. While AI models can reproduce tone, timbre and certain expressive gestures, they cannot comprehend the subtle patterns that make a human performance unique. “An actor is defined not just by the sound of their voice but by their micropauses, breathing rhythm, nuanced hesitations, emotional timing and the way tension rises and releases across a line,” he explained.
Current AI models can approximate fragments of this but cannot sustain these characteristics across a long monolog without drifting into something increasingly unnatural. “As soon as a listener senses that something is off, the immersive experience breaks. The illusion collapses,” Kruszielski warned. For that reason, the team insists the technology is best understood not as a replacement for performers but as a tool that enhances their work, offers flexibility and preserves creative intent.
After receiving a Master of Science in Engineering, the author worked for Telecom Italia and the Italian public broadcaster, Rai. Based in Bergamo, Italy, he now spends his time as a broadcast consultant for radio stations and equipment manufacturers, specializing in project management, network design and field measurement.
This article first appeared in the January/February 2026 edition of RedTech Magazine.
