“With the help of Edge Impulse, they were able to collect keyword data, train an industrial-grade ML model, and deploy it into their own custom workflow within just months.”
When one of the best-known brands in business and personal productivity hardware set out to build their next generation of wireless headphones, they looked for functionality that had never been done before. The result? A cutting-edge voice-control feature, built directly into the headphones, that lets users answer or decline phone calls from their connected devices.
Powered by a keyword-detecting AI algorithm running directly on-device, without any cloud interfacing at all, this feature is the culmination of months of collaboration between the device maker and Edge Impulse to train, build, and deploy, at scale, a novel function that not only responds to the two keywords but does so in eleven different languages.
These headphones are now available for sale globally, with the voice-command functionality fully accessible. Let’s take a deeper look at how this came together and made its way to the world.
Initially, the company looked to solve a clunky part of the traditional headset user experience: responding to an incoming call requires a physical button press on a hard-to-spot part of the headset, or on the phone itself. This is disruptive, especially when your hands are busy typing or writing.
They came up with a great solution for this — voice control. But to implement voice control, the team faced another challenge: How to efficiently enable the headset to respond to voice commands?
Implementing voice commands requires AI algorithms that detect specific spoken keywords.
The traditional solution for this type of use case is to record audio with an onboard microphone and send it to the cloud, where server-based algorithms detect whether the keyword was spoken, then send the response back to the device.
However, the company opted for a new solution: running a keyword-spotting (KWS) algorithm directly on the headset itself, without transmitting any audio. This onboard AI processing is known as edge AI, and it offers many benefits over the traditional method:
Low Latency — since the KWS algorithm can run directly on the headset, the voice command can be detected without needing to send the data to the cloud and back. This low latency is crucial for real-time applications like answering or rejecting calls.
Connectivity Independence — even if the headset is in a region where it can’t connect to the cloud, the KWS model can still function. Not needing to constantly be connected to the cloud can also provide power savings.
Increased Security — as this company works with both consumers and businesses, privacy and security are important, especially when it comes to human audio. Processing audio on the edge instead of transmitting it to the cloud minimizes the risk of a breach of sensitive voice data.
The company used Edge Impulse’s platform to give their engineers the tools to fast-track model development and deployment. In fact, with the help of Edge Impulse, they were able to collect keyword data, train an industrial-grade ML model, and deploy it into their own custom workflow within just months.
The first step in building an AI solution is collecting data. The company generated and labeled the KWS dataset themselves, using tools such as Edge Impulse’s Keyword Collector web application to harness the power of the crowd. This web application records each contributor’s keyword samples and automatically splits the full recording into one-second sub-samples, using audio signal processing techniques to identify where each keyword starts and ends. The tool can be easily distributed to crowds via a QR code, whose URL can encode parameters such as the desired keyword label, sample length, and audio frequency.
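To make the splitting step concrete, below is a minimal sketch of energy-based segmentation, a common technique for isolating one-second keyword clips from a longer recording. The function name, frame size, and threshold are illustrative assumptions, not Edge Impulse’s actual implementation.

```python
# Illustrative sketch (not Edge Impulse's implementation): find frames whose
# short-time energy rises well above the noise floor, then cut a fixed
# one-second clip starting at each detected keyword.
import numpy as np
from scipy.io import wavfile

def split_keywords(path, clip_seconds=1.0, frame_ms=20, threshold_ratio=4.0):
    rate, audio = wavfile.read(path)
    audio = audio.astype(np.float32)
    frame_len = int(rate * frame_ms / 1000)
    n_frames = len(audio) // frame_len
    frames = audio[: n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames ** 2).mean(axis=1)          # short-time energy per frame
    noise_floor = np.median(energy) + 1e-9       # robust noise-floor estimate
    active = energy > threshold_ratio * noise_floor
    clips, i, clip_len = [], 0, int(rate * clip_seconds)
    while i < n_frames:
        if active[i]:                            # keyword onset detected
            start = i * frame_len
            clips.append(audio[start:start + clip_len])
            i += clip_len // frame_len           # skip past this keyword
        else:
            i += 1
    return rate, clips
```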
With Edge Impulse’s tools and crowdsourced data collection, the company was able to gather many hours of labeled audio keyword data in more than ten languages.
This data was then stored in an AWS S3 bucket, which allows for direct integration into Edge Impulse. This lets users quickly and easily import data and update models whenever new data lands in the S3 account.
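As a hypothetical illustration of that pipeline, the sketch below pushes newly collected, labeled clips into such a bucket with boto3. The bucket name and key layout are assumptions made for this example; only the standard boto3 calls are real.

```python
# Hypothetical uploader: the language and label are encoded in the object key
# so the importer can map each clip to the right class and project.
import boto3

s3 = boto3.client("s3")

def upload_sample(local_path, label, language, bucket="kws-dataset"):
    key = f"{language}/{label}/{local_path.split('/')[-1]}"
    s3.upload_file(local_path, bucket, key)

upload_sample("clips/answer_0001.wav", label="answer", language="en")
```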
Once a dataset is collected, it needs to be processed so that useful features can be extracted. Edge Impulse offers pre-made processing blocks for feature extraction, such as the Audio MFE and Audio MFCC blocks. However, the company needed something more tailored, so they created a custom processing block optimized for their hardware. The block extracts time- and frequency-domain features from the signal and performs well for speech recognition in multiple languages. Once this custom processing block was imported into their Edge Impulse organization, anyone in the org could use it, either running it locally or in the cloud within Edge Impulse. This enabled easy collaboration between the engineers working on the projects for each KWS language.
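The company’s block itself is custom, but for a sense of what this kind of time/frequency feature extraction looks like, here is a minimal MFCC sketch using librosa. The sample rate, frame sizes, and normalization are illustrative assumptions.

```python
# Illustrative MFCC extraction, standing in for the custom processing block:
# compute per-frame cepstral coefficients and normalize each one.
import librosa
import numpy as np

def extract_features(path, n_mfcc=13, frame_ms=32, stride_ms=16):
    y, sr = librosa.load(path, sr=16000)   # 16 kHz is typical for KWS audio
    n_fft = int(sr * frame_ms / 1000)
    hop = int(sr * stride_ms / 1000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=n_fft, hop_length=hop)
    # Normalize per coefficient so the classifier sees a consistent scale
    mfcc = (mfcc - mfcc.mean(axis=1, keepdims=True)) / \
           (mfcc.std(axis=1, keepdims=True) + 1e-6)
    return mfcc.T                          # shape: (frames, n_mfcc)
```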
After the features were extracted from the dataset, the company needed a neural network classifier. A neural network classifier takes some input data and outputs a probability score indicating how likely it is that the input belongs to a particular class. In this use case, the input is audio data and the classes were “answer,” “ignore,” and “noise.” Note that the noise class includes both unknown sounds and background noise. The company used Edge Impulse infrastructure to train models on GPUs, quickly iterating through different sets of data, processing blocks, and neural network architectures. Within weeks, they generated models for all eleven languages, most of them with over 98% accuracy and all fitting comfortably within the latency, RAM, and flash usage requirements.
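For a sense of scale, here is a hedged sketch of a small three-class KWS classifier over MFCC frames in Keras. The architecture, layer sizes, and input shape are assumptions for illustration; the company’s actual models are not public.

```python
# Illustrative three-class KWS classifier ("answer", "ignore", "noise") over
# roughly one second of MFCC features. Sizes are assumptions, not the shipped model.
import tensorflow as tf

NUM_FRAMES, NUM_COEFFS, NUM_CLASSES = 61, 13, 3   # ~1 s of audio, 3 classes

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(NUM_FRAMES, NUM_COEFFS, 1)),
    tf.keras.layers.Conv2D(16, (3, 3), activation="relu"),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu"),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dropout(0.25),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```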
Edge Impulse’s Performance Calibration tool was also used to test, fine-tune, and simulate the KWS models against continuous real-world and synthetically generated streams of audio. This gave the company a specific post-processing configuration to tailor the model toward minimizing either false activations (False Alarm Rate, FAR) or false rejections (False Rejection Rate, FRR).
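The trade-off at the heart of that tuning can be shown in a few lines: sweep the detection threshold and measure both error rates. The scores, labels, and threshold grid below are toy values for illustration, not Performance Calibration’s internals.

```python
# Toy FAR/FRR sweep: a higher threshold lowers false activations (FAR) but
# raises false rejections (FRR), and vice versa.
import numpy as np

def far_frr(scores, labels, threshold):
    """scores: keyword-class confidences; labels: 1 if the keyword was
    actually spoken, 0 otherwise."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    fired = scores >= threshold
    far = np.mean(fired[labels == 0])    # activations on non-keywords
    frr = np.mean(~fired[labels == 1])   # missed real keywords
    return far, frr

for t in np.linspace(0.5, 0.95, 10):
    far, frr = far_frr([0.9, 0.2, 0.7, 0.95], [1, 0, 0, 1], t)
    print(f"threshold={t:.2f}  FAR={far:.2f}  FRR={frr:.2f}")
```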
Once the team was confident that the model’s performance metrics looked good on paper, it was time to try it out on-device. They generated a TensorFlow Lite model in Edge Impulse and integrated it into their own codebase. Once the firmware engineers had tested the model thoroughly, all of the language models were uploaded to the company’s public-facing app, available on the web, mobile, and desktop, which can update the headset’s firmware over-the-air (OTA).
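As a sketch of what that bring-up testing might look like on the host side, the snippet below runs an exported TensorFlow Lite classifier on a block of features. The model filename and class ordering are assumptions; the interpreter API is standard TensorFlow Lite.

```python
# Run the exported .tflite classifier once on a stand-in feature window.
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="kws_en.tflite")  # assumed name
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

features = np.zeros(inp["shape"], dtype=np.float32)  # stand-in MFCC window
interpreter.set_tensor(inp["index"], features)
interpreter.invoke()
probs = interpreter.get_tensor(out["index"])[0]
print(dict(zip(["answer", "ignore", "noise"], probs)))  # assumed class order
```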
When the feature is enabled in the app, the appropriate model is loaded onto the headset via OTA, based on the language set on the device. If the user ever switches languages, the new model is delivered to the headset the same way.
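The selection logic reduces to a simple mapping from the device’s language setting to the right model asset. The asset names and OTA call below are hypothetical, purely to illustrate the flow described above.

```python
# Hypothetical model selection: map the device language to a model asset and
# hand it to the (assumed) OTA client for delivery to the headset.
MODEL_ASSETS = {
    "en": "kws_en.tflite",
    "fr": "kws_fr.tflite",
    "de": "kws_de.tflite",
    # ...one entry per supported language
}

def on_language_changed(language_code, ota_client):
    asset = MODEL_ASSETS.get(language_code)
    if asset is None:
        raise ValueError(f"No KWS model for language '{language_code}'")
    ota_client.push_model(asset)   # hypothetical OTA API
```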
The integration of Edge Impulse's AI technology into these earbuds and headsets represents a significant step in embedded, hands-free voice control. By leveraging on-device keyword detection, this approach provides low latency, connectivity independence, and strong security for the interface. This innovation not only enhances the user experience but also highlights the value of incorporating edge AI into modern consumer electronics.
The successful deployment of this technology, enabled by Edge Impulse's robust platform, showcases the potential of AI to transform everyday devices. This collaboration sets a new standard for what is possible with voice control, promising even more advancements in the realm of smart, intuitive, and user-friendly solutions.