Bringing Large Language Models to the Edge with GPT-4o and NVIDIA TAO

The latest large language models (LLMs), like the newly released GPT-4o, are truly astonishing. They are multimodal, meaning they have multiple senses: they can interpret text, images, and audio, and respond in a seemingly human-like manner.

But how do they relate to edge AI, and what role does Edge Impulse play? These questions come up frequently in conversations.

Today we are sharing an exciting new Edge Impulse functionality that shows just one of the powerful ways that LLMs can be utilized on the edge: using GPT-4o to train a model that is 2,000,000x smaller and runs directly on-device.

The challenge of LLMs for real-time insights

Let's say a business has a camera pointed at their factory floor. A manager can ask an LLM, "Is there a person not wearing a hard hat standing close to the machine?" The LLM is capable of analyzing the frame and answering yes or no; that response can then be used to trigger actions according to predetermined safety protocols.
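
To make this concrete, here is a minimal sketch of such a zero-shot check, assuming the official openai Python package and an OPENAI_API_KEY in the environment; the prompt, file name, and trigger logic are illustrative, not a production safety system:

```python
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def check_frame(jpeg_path: str) -> str:
    """Ask GPT-4o a yes/no safety question about a single camera frame."""
    with open(jpeg_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Is there a person not wearing a hard hat "
                         "standing close to the machine? Answer yes or no."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content


# Hypothetical frame grabbed from the factory camera.
if check_frame("factory_frame.jpg").strip().lower().startswith("yes"):
    print("Trigger the safety protocol")
```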

This is incredibly powerful because (1) the model understands what it sees, and (2) you can use natural language to ask questions about what is happening. This approach is called zero-shot learning: you don't need to train a model beforehand.

However, there are downsides to deploying zero-shot models in real environments. State-of-the-art LLMs have hundreds of billions of parameters, making them huge and slow. They typically run in the cloud, which introduces latency. Even fast models like GPT-4o take more than a second to analyze a single image, which is too slow for many real-time applications.
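
You can measure this yourself. Here is a quick timing sketch reusing the hypothetical check_frame helper from above; actual latency varies with network conditions and model load:

```python
import time

start = time.perf_counter()
answer = check_frame("factory_frame.jpg")
elapsed = time.perf_counter() - start

# A single-image round trip to a cloud-hosted LLM is typically on the order
# of a second or more, far from the tens of milliseconds a real-time loop needs.
print(f"Answer: {answer!r} in {elapsed:.2f} s")
```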

One solution: Labeling data with LLMs and Edge Impulse

To address these issues, we need similar capabilities that can run closer to or even on the edge. Imagine the same factory floor but instead with a simple camera where everything runs locally. This setup offers low latency and low cost, without the need for cloud services.

Running an LLM locally on this type of system isn't feasible, given the models' massive size. However, with Edge Impulse, one can use an LLM's understanding to train a much smaller model. This is a form of knowledge distillation: transferring knowledge from a large model to a smaller, specialized one.

Edge Impulse specializes in bringing AI to the edge, with over 300,000 projects created to build AI applications that run self-contained on devices ranging from Arm Cortex-M (via open-CMSIS, reaching more than 10,000 devices) to NVIDIA-based GPUs and beyond.

The new Edge Impulse LLM functionality lets users bring LLM-based AI to the edge by automatically analyzing and labeling visual data, including video uploads, applying a specific understanding of what's in frame without anyone manually entering labels. By focusing the LLM's intelligence on one specific request, Edge Impulse can train a vision model from the resulting labels that is compact enough to run on-device. It's a powerful and exciting application.
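
Conceptually, the labeling pass works something like the sketch below. This is not the block's actual implementation, just an illustration under assumptions: a frames/ directory of JPEGs, a two-class toy / no toy task, and an "unknown" answer used to discard blurry or uncertain images:

```python
import base64
from pathlib import Path
from openai import OpenAI

client = OpenAI()

PROMPT = (
    "Respond with exactly one of: 'toy' if the image clearly shows a "
    "children's toy, 'no toy' if it clearly does not, or 'unknown' if "
    "the image is blurry or you are unsure."
)


def label_image(path: Path) -> str:
    """Have GPT-4o assign a single label to one image."""
    b64 = base64.b64encode(path.read_bytes()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": PROMPT},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content.strip().lower()


labels = {}
for image in sorted(Path("frames").glob("*.jpg")):
    label = label_image(image)
    if label == "unknown":
        continue  # discard blurry/uncertain frames, keeping the dataset clean
    labels[image.name] = label
```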

Project walkthrough

Let's build a project to see how this works. In the video above, I walked through my house, capturing a video of my living room filled with my kids' toys. Using ChatGPT, I can ask questions about this video, such as whether a children's toy is in view. Here's how we can create a much smaller model that performs the same task on-device:

  1. Data Collection: I recorded a video of my living room.
  2. Data Processing: I uploaded the video to Edge Impulse, where it was split into individual frames; the data was initially unlabeled. (A sketch of scripting this step follows the list.)
  3. Labeling with GPT-4o: Using the new transformation block "Label image data using GPT-4o," I asked GPT-4o to label the images. The block discards any blurry or uncertain images, leaving a clean dataset.
  4. Model Training: Splitting the video yielded about 500 labeled images. I used NVIDIA TAO to train a small model on them; at only 800,000 parameters, it is dramatically smaller than the original LLM.
  5. Deployment: I deployed the model to an Arduino Nicla Vision, where it runs at 10 frames per second. It accurately detected toys on-device, in real time, without requiring cloud services.
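
For steps 1 and 2, here is a hedged sketch of splitting the video into frames and pushing them to a project with the Edge Impulse ingestion API. The endpoint and headers follow my reading of the ingestion docs; the file names and one-frame-per-second sampling are my own assumptions:

```python
from pathlib import Path

import cv2
import requests

API_KEY = "ei_..."  # placeholder: your Edge Impulse project API key
INGEST_URL = "https://ingestion.edgeimpulse.com/api/training/files"

Path("frames").mkdir(exist_ok=True)
video = cv2.VideoCapture("living_room.mp4")
fps = int(video.get(cv2.CAP_PROP_FPS)) or 30

frame_idx = saved = 0
while True:
    ok, frame = video.read()
    if not ok:
        break
    if frame_idx % fps == 0:  # sample roughly one frame per second
        path = Path("frames") / f"frame_{saved:04d}.jpg"
        cv2.imwrite(str(path), frame)
        # Upload the still-unlabeled frame to the Edge Impulse project.
        with open(path, "rb") as f:
            requests.post(
                INGEST_URL,
                headers={"x-api-key": API_KEY, "x-no-label": "1"},
                files={"data": (path.name, f, "image/jpeg")},
            )
        saved += 1
    frame_idx += 1
video.release()
```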

Results

The small model performed exceptionally well, identifying toys in various scenes quickly and accurately. By distilling knowledge from the large LLM, we created a specialized, efficient model suitable for edge deployment. It ran at 50 frames per second in a browser on my iPhone, and even ran on a microcontroller with minimal loss of accuracy.
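
On the Nicla Vision the model runs through Edge Impulse's generated C++ library, but to illustrate the same distilled model running locally, here is a rough sketch using the Edge Impulse Linux Python SDK (edge_impulse_linux) on a Linux board with a webcam; the modelfile.eim path is a placeholder for the model exported from the project:

```python
import cv2
from edge_impulse_linux.image import ImageImpulseRunner

MODEL_PATH = "modelfile.eim"  # placeholder: .eim exported from Edge Impulse

with ImageImpulseRunner(MODEL_PATH) as runner:
    runner.init()
    capture = cv2.VideoCapture(0)  # local camera: no cloud round trip

    while True:
        ok, frame = capture.read()
        if not ok:
            break
        # Resize/crop the frame to the model's input and extract features.
        features, cropped = runner.get_features_from_image(
            cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        res = runner.classify(features)

        if "bounding_boxes" in res["result"]:  # object detection model
            for bb in res["result"]["bounding_boxes"]:
                print(f"{bb['label']} ({bb['value']:.2f}) "
                      f"at x={bb['x']}, y={bb['y']}")
        else:  # image classification model
            for label, score in res["result"]["classification"].items():
                print(f"{label}: {score:.2f}")

    capture.release()
```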


Get started

Our "Label image data using GPT-4o" block is available for enterprise customers, allowing you to experiment with this technology. Sign up for a free account at Edge Impulse and then use your company email to start an enterprise trial so you too can explore how you can leverage LLMs on the edge.

If you have any questions or need further assistance with your edge AI projects, feel free to reach out on our forum.
