Automated Articulatory Speech Synthesis for Animation

In animation, aligning mouth movements with dialogue has traditionally been a painstaking process of working through tiny, frame-by-frame adjustments, and it is doubly time-consuming when you are animating in a language or dialect that is not your native tongue. But what if we could automate that process? That’s exactly what Zack Qattan’s Edge Impulse project sets out to do, on-device. The result has implications beyond animation, offering benefits for broader accessibility as well.

Quick aside: This is an interesting use case and a public project that Qattan (a co-founder of Brilliant Sole) originally shared on our forum. Like Qattan, I too used to work in animation in a language that I don’t speak, using the same software. This one brought back memories.

It was enjoyable to catch up with Qattan and learn about this project. The content in this blog post draws on our conversation and his views on where this work can be used. If you have an interesting project like this, please share it with us.

What Are Phonemes?

Before diving into the project at hand, let’s start with the basics: Phonemes are the smallest units of sound in a language. Think of them as the building blocks of speech, individual sounds like “p,” “b,” “t,” or “l.” When we speak, we combine phonemes into syllables and words, and each language has its own set. For animators, matching these phonemes to corresponding mouth shapes (called visemes) is crucial for realistic lip-sync. This is a problem Qattan and I are both acutely aware of.
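To make that concrete, here is a minimal, illustrative sketch (in TypeScript) of the kind of phoneme-to-viseme lookup an animation rig might use. The labels and viseme names are placeholders for this example, not taken from Qattan’s project:

```typescript
// Illustrative only: a tiny phoneme-to-viseme lookup. Real rigs define their
// own viseme sets, and phoneme inventories vary by language.
type Viseme = "MBP" | "FV" | "TH" | "L" | "WideOpen" | "Rounded";

const phonemeToViseme: Record<string, Viseme> = {
  p: "MBP", b: "MBP", m: "MBP", // lips pressed together
  f: "FV",  v: "FV",            // lower lip against upper teeth
  th: "TH",                     // tongue between the teeth
  l: "L",                       // tongue tip behind the upper teeth
  aa: "WideOpen",               // open jaw, as in "father"
  uw: "Rounded",                // rounded lips, as in "boot"
};

// Given a phoneme label, pick the mouth shape to pose on the character.
function visemeFor(phoneme: string): Viseme | undefined {
  return phonemeToViseme[phoneme.toLowerCase()];
}
```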

An interesting solution to this is Qattan’s Phoneme Classifier, which he uses to drive the Pink Trombone articulatory speech synthesizer. While we advise careful and ethically responsible use of any speech synthesis tool (see our Responsible AI License), this approach opens up real possibilities for avatar control in animation.
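As a rough illustration of what “driving” an articulatory synthesizer could look like, the sketch below maps classified phoneme labels to approximate tongue positions. The setTongue hook, the parameter names, and the numeric values are assumptions made for this example, not Pink Trombone’s actual API or Qattan’s implementation:

```typescript
interface TonguePose {
  index: number;    // constriction position along the simulated vocal tract
  diameter: number; // how open or closed that constriction is
}

// Rough, illustrative articulatory targets per classified phoneme label.
const poses: Record<string, TonguePose> = {
  aa: { index: 12, diameter: 2.4 }, // open vowel, as in "father"
  iy: { index: 27, diameter: 2.1 }, // close front vowel, as in "see"
  uw: { index: 22, diameter: 1.6 }, // rounded back vowel, as in "boot"
};

// Hypothetical hook into the synthesizer; the real control surface may differ.
declare function setTongue(pose: TonguePose): void;

function onPhoneme(label: string): void {
  const pose = poses[label];
  if (pose) setTongue(pose); // move the virtual tongue toward the target
}
```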

Check out a demonstration of the project in action here:

Controlling Pink Trombone with Phoneme Classifier made in Edge Impulse

Why this matters for animators (and everyone else)

Traditional pipelines involve:

  1. Recording dialogue
  2. Breaking down each sentence into phonemes
  3. Matching each phoneme to a mouth shape or blend shape
  4. Fine-tuning coarticulation, pacing, and emotion

A phoneme classifier paired with an articulatory speech synthesizer automates much of this. Not only does it save time, it also frees animators and sound designers to focus on creative storytelling rather than tedious technical work. Beyond animation, this technology could reshape how we engage with VR, gaming, and broader accessibility use cases.
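As a hedged sketch of what automating steps 2 and 3 might look like, the snippet below turns a sequence of classified phoneme segments into viseme keyframes, reusing the phoneme-to-viseme lookup sketched earlier. The data shapes are illustrative, not a real production format:

```typescript
interface PhonemeSegment { label: string; startMs: number; endMs: number; }
interface VisemeKeyframe { timeMs: number; viseme: string; }

// Collapse a classified phoneme timeline into viseme keyframes.
function toVisemeTrack(segments: PhonemeSegment[]): VisemeKeyframe[] {
  const track: VisemeKeyframe[] = [];
  let previous: string | undefined;
  for (const seg of segments) {
    const viseme = visemeFor(seg.label) ?? "Rest"; // neutral pose for unknown labels
    if (viseme !== previous) {
      track.push({ timeMs: seg.startMs, viseme }); // keyframe only when the shape changes
      previous = viseme;
    }
  }
  return track;
}
```

Each keyframe can then be mapped onto the rig’s blend shapes, leaving step 4 (coarticulation, pacing, and emotion) where the animator’s judgment still matters most.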

From Tedious Lip Sync to Automatic Control

Qattan trained a classification model in Edge Impulse to recognize phonemes as they are spoken and use them to drive Pink Trombone’s articulatory controls automatically, replacing the manual phoneme-by-phoneme matching described above.
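If the classifier runs continuously over incoming audio, as in the demo video, its raw predictions need a little smoothing before they are allowed to move the mouth. The sketch below assumes each classification window yields one confidence score per phoneme label (the exact output format of an Edge Impulse deployment may differ) and applies a simple threshold-and-hold rule:

```typescript
interface Prediction { label: string; value: number; } // value = confidence, 0..1

const CONFIDENCE_THRESHOLD = 0.6; // ignore uncertain frames
const HOLD_FRAMES = 3;            // require a few consistent frames before switching

let candidate: string | undefined;
let candidateCount = 0;
let current: string | undefined;

// Called once per classification window; returns the phoneme currently "held".
function onClassifierFrame(predictions: Prediction[]): string | undefined {
  if (predictions.length === 0) return current;
  const best = predictions.reduce((a, b) => (b.value > a.value ? b : a));
  if (best.value < CONFIDENCE_THRESHOLD) return current;

  if (best.label === candidate) {
    candidateCount += 1;
  } else {
    candidate = best.label;
    candidateCount = 1;
  }
  if (candidateCount >= HOLD_FRAMES) current = candidate;
  return current;
}
```

Without the hold, a single noisy frame could snap the mouth to the wrong shape for an instant, which reads as jitter on screen.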

Beyond Lip Sync

Looking forward

The possibilities are practically endless.

Future inputs might include in-mouth wearables (like the Augmental MouthPad) or EMG earbuds to capture subtle muscle movements around the mouth or neck.

From a production standpoint, think about how you can tweak a character’s performance after the fact without re-recording or re-animating everything. That’s a total game-changer for animation studios and indie creators alike.

From an animator’s perspective, I’m personally excited about how quickly this field is evolving. Not long ago, tweening (generating in-between frames) seemed like magic; now we have so much more. We’re moving toward a future where speech creation, modification, and lip sync can happen almost instantaneously, powered by machine learning models that do the heavy lifting. It’s a stark contrast to the manual workflows I used when first animating for a long-forgotten TV series!

It’s also worth noting that there are connected versions of this approach that rely on cloud APIs such as OpenAI’s, and we are aware of a Blender plugin that does so as well. However, doing this on an edge device is not something I have seen before, and a version that can run on an MCU has far more potential for accessibility than a connected one.

So, if you’ve ever dreaded staying up late to lip-sync a single line of dialogue, or if you’re simply curious about the future of accessibility and animation, take a look at these tools. They might just save you from a few more of those marathon animation sessions and unleash a whole new realm of creative possibilities.

Please share any project you are working on in our forum. We love to hear about the interesting ways our community members are using edge AI.


