Easy Data Cleanup with Generative AI

By Jim Bruges
By harnessing the power of LLMs inside Edge Impulse, you can now clean up massive object detection datasets quickly and with minimal effort.

Training an effective edge AI model requires high-quality data that is relevant to your use case. To train their first prototypes, engineers often reach for public datasets, but this comes with risks. These datasets are often large but flawed: they can contain mislabeled data, labels that are not relevant to your use case, or unwanted data augmentation. Manually reviewing every image and label in a large dataset is time-consuming and error-prone. The newest Transformation Block in Edge Impulse, however, lets you validate your data in minutes and create high-quality datasets using the power of multi-modal LLMs.

Check out this video that walks through the tool in a couple of scenarios.

How it works

Models such as OpenAI’s GPT-4o are becoming increasingly good at interpreting images combined with text prompts. We can use this to ask the LLM validation questions about the images in our dataset. This transformation block asks you for up to three validation prompts. Each prompt is a statement that, if the LLM finds it to be true for an image, results in that data sample being disabled. Some example validation prompts include:

These prompts are then passed to the LLM along with the bounding box label information for the current image. For example:

This extra information allows the LLM to check whether the labels are correct (if you have asked it to reference the given labels in one of the prompts).
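As a rough illustration of how validation prompts and bounding box labels might be combined into the text part of a request, here is a minimal sketch. It is not the block's actual implementation: the function name `build_validation_text`, the wording of the instruction, the sample prompt, and the label fields (`label`, `x`, `y`, `width`, `height`) are all assumptions made for this example.

```python
import json


def build_validation_text(prompts, boxes):
    """Compose the text part of a validation request.

    Hypothetical layout: the real block may phrase and structure this differently.
    """
    # The block accepts up to three validation prompts.
    if not 1 <= len(prompts) <= 3:
        raise ValueError("expected between one and three validation prompts")

    lines = ["Is any of the following statements true for this image?"]
    for number, prompt in enumerate(prompts, start=1):
        lines.append(f"{number}. {prompt}")

    # Attach the existing labels so the model can compare them with the image.
    lines.append("Bounding box labels for this image (JSON):")
    lines.append(json.dumps(boxes))
    return "\n".join(lines)


# Made-up prompt and label, for illustration only.
example_text = build_validation_text(
    ["The bounding box labels do not match the objects in the image."],
    [{"label": "truck", "x": 10, "y": 20, "width": 100, "height": 50}],
)
print(example_text)
```

The resulting text would be sent to the model together with the image itself; that call is left out here because it depends on the client library and API key you use.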

The LLM then responds in a structured JSON format that includes the following:

The result of running this process over a large object detection dataset will be a dataset where any “unclean” data is disabled and not used for training. You can then use the dataset filtering tools in Edge Impulse to delete all disabled samples, or review disabled samples to see if any need re-enabling.
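To make the disabling step concrete, the sketch below parses a JSON reply and marks a sample as disabled when the reply flags it as invalid, storing the reasoning in the sample's metadata. The reply fields `valid` and `reason`, the sample dictionary layout, and the `apply_verdict` helper are hypothetical names chosen for this example; the block's real response schema and the way it updates samples may differ.

```python
import json


def apply_verdict(sample, raw_reply):
    """Return a copy of `sample`, disabled if the reply marks it invalid.

    Assumes a reply shaped like {"valid": bool, "reason": str}; these field
    names are illustrative, not the block's documented schema.
    """
    reply = json.loads(raw_reply)

    # Work on copies so the caller's sample is left untouched.
    updated = dict(sample)
    metadata = dict(sample.get("metadata", {}))

    if not reply["valid"]:
        updated["enabled"] = False
        # Keep the model's reasoning so the sample can be reviewed later.
        metadata["validation_reason"] = reply.get("reason", "")

    updated["metadata"] = metadata
    return updated


# Made-up sample and reply, for illustration only.
sample = {"name": "image_001.jpg", "enabled": True}
reply = '{"valid": false, "reason": "Label does not match the object shown."}'
print(apply_verdict(sample, reply))
```

Keeping the reasoning next to each disabled sample is what makes the later review pass practical: you can decide quickly whether a sample should be re-enabled or deleted.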

How to use the block

This feature requires an enterprise subscription to Edge Impulse.

  1. Go to an Enterprise project and upload the object detection dataset you wish to validate. This can be a public dataset from a source such as Kaggle. We support a number of industry-standard dataset annotation formats.
  2. Choose Data acquisition->Data Sources->Add new data source.
  3. Select Transformation Block and the 'Validate Object Detection Datasets Using GPT-4o' block, fill in your prompts, and run the block. You can run it on a fixed number of samples or over your entire dataset. You need an OpenAI API key to run the GPT-4o model.
  4. Any items that are invalid will be disabled and viewable in the data acquisition view. Reasoning is provided in the metadata for each data item.

Check out the source code for this block on our GitHub if you want to explore how it works and try it out with an Enterprise Trial.

Here are the two example projects to which dataset validation has been applied and their dataset sources:

Industrial PPE Detection:

Truck Detection:
