Improving Camera Traps to Identify Unknown Species with GPT-4o

When deploying wildlife monitoring models to remote cameras in the field (known as camera traps), a common challenge is dealing with unexpected animal species that were not accounted for during training. For instance, the camera may misclassify a wolverine as a wolf: "wolverine" was not included in the dataset, but the animal shares enough features with a wolf to generate a false detection.

In this blog post we'll look at a way to overcome that, making use of:
- Synthetic data generation (recently released!)
- AI-assisted labeling in Edge Impulse's labeling queue
- Transformation blocks for labeling unknown bounding boxes with GPT-4o

A common challenge in deploying wildlife monitoring models is adapting to unaccounted-for species at the deployment site

Especially when developing a camera trap solution for inference at the edge, it makes sense to train the model only on the animal classes that actually inhabit the deployment area. This helps avoid confusing the model (for example, by not including "Eurasian lynx" as a class when deploying the camera in Canada, where only the "Canadian lynx" is expected) and keeps model complexity in check.

Many camera trap applications use the MegaDetector deep learning model for animal localization, and then run image classification on each region of interest (ROI). In this project we'll follow the same approach of having one model for detecting any animal, i.e. just identifying frames that contain "animal" objects and their bounding boxes. (This is the first step in the suggested approach below.)
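Conceptually, the two-stage flow looks like the sketch below. This is not MegaDetector's actual code; "detector" and "species_classifier" are placeholders for whatever localization and classification models end up being deployed.

```python
# Conceptual sketch of the two-stage approach: stage 1 localizes any animal,
# stage 2 classifies each detected region of interest (ROI).
# "detector" and "species_classifier" are hypothetical callables, not real APIs.
def classify_frame(frame, detector, species_classifier, threshold=0.5):
    detections = detector(frame)              # generic "animal" bounding boxes
    results = []
    for box in detections:
        if box.confidence < threshold:
            continue
        # Crop the ROI from the frame (assumes a NumPy-style image array)
        roi = frame[box.y:box.y + box.h, box.x:box.x + box.w]
        species = species_classifier(roi)     # fine-grained species label
        results.append((box, species))
    return results
```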

An active learning pipeline could be set up as follows:

Inspired by MegaDetector, I started gathering a dataset to train an object detection model with only one class: "animal". The actual MegaDetector includes the classes "human" and "vehicle" as well, which I'll likely add later — stay tuned! The training dataset for MegaDetector contains millions of images, most of which are available in open datasets from camera traps worldwide. For my proof-of-concept model, I chose to explore whether a smaller amount of synthetic data could be effective for training this type of model. The prompt used was:

"A realistic camera trap image of a nature scene featuring an animal which is normally present in Nordic forests. The setting is a forest and the image has the typical quality of a camera trap photo, with slight graininess and natural lighting. The animal is captured in its natural behaviors"
💡 To later enhance the model, training samples of smaller animals like birds and rodents could be included, along with a variety of different habitats. Additionally, using non-synthetic data could likely improve the model's accuracy on real data.
Creating images of various animals with Synthetic data generation

The generated synthetic images are initially unlabeled. Here, too, I made use of a larger model: I labeled them quickly with the "Label with YOLOv5" option in the labeling queue, accepted the suggested labels (such as "cat" for a lynx), and then relabeled them all at once as "animal".

Labeling with AI-Assisted labeling in the Edge Impulse labeling queue. Batch relabeling could then quickly update all "cat" labels to "lynx"

Using this small dataset of 116 items, I trained a MobileNetV2 SSD object detection model, achieving a training accuracy of 84.6%. I quickly validated the model's performance on real footage by connecting my phone in classification mode and pointing it at a camera trap video recording. It successfully detected the animals!

The next step was to deploy the model and expose it to a real-world environment. In this case, I envisioned my deployment site to be the Yukon Wildlife Preserve in Canada. To simulate this, I chose the Docker deployment in Edge Impulse, ran it on my laptop, and fed it snapshots from actual camera trap recordings from Yukon using a simple Python script. From approximately 30 minutes of compiled footage of interesting shots, I obtained 520 images with detected animals, which I saved as raw images along with their bounding boxes in YOLO TXT format (one of several formats supported in Edge Impulse).
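The script itself only needs a few lines. Below is a minimal sketch, assuming the Docker deployment exposes its HTTP inference server on port 1337 (the documented /api/features endpoint), and that the video path, sampling rate, and input resolution match your setup.

```python
# Sketch: feed snapshots from a camera trap recording to the Edge Impulse Docker
# deployment's HTTP inference server. Video path, sampling rate, and input size
# are assumptions for this example.
import cv2
import requests

VIDEO_PATH = "yukon_recording.mp4"               # hypothetical input recording
ENDPOINT = "http://localhost:1337/api/features"  # Docker deployment's HTTP API
INPUT_SIZE = 320                                 # assumed impulse image width/height

cap = cv2.VideoCapture(VIDEO_PATH)
frame_idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if frame_idx % 30 == 0:                      # roughly one snapshot per second
        resized = cv2.resize(frame, (INPUT_SIZE, INPUT_SIZE))
        rgb = cv2.cvtColor(resized, cv2.COLOR_BGR2RGB)
        # Edge Impulse image features are one packed 0xRRGGBB value per pixel
        features = [(int(r) << 16) | (int(g) << 8) | int(b)
                    for r, g, b in rgb.reshape(-1, 3)]
        result = requests.post(ENDPOINT, json={"features": features}).json()
        boxes = result.get("result", {}).get("bounding_boxes", [])
        if boxes:
            cv2.imwrite(f"detected_{frame_idx:06d}.jpg", frame)
            # bounding boxes can then be saved alongside the image in YOLO TXT format
    frame_idx += 1
cap.release()
```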

I then uploaded this data to a new project to develop a target model that can differentiate between species in this wildlife habitat. The bounding box labels were still all "animal," and manually relabeling them would be cumbersome. To address this, I created a custom transformation job. I was inspired by the public transformation block "Label image data using GPT-4o", which labels images for image classification, but as my project is for object detection, I created a new variant that sends each image to GPT-4o and uses the returned species name as the label for the image's bounding box.

(For now, I limited the complexity to supporting only images with one bounding box.) The prompt I used for this is:

"There's an animal in this picture, respond with only the name of the animal species (all in lowercase), or "unsure" if you're not sure. Keep in mind that the animal's habitat is Yukon wilderness in Canada, and you can be specific about the species"
Running a custom Transformation block from the project's Data sources
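The core of that variant is a single GPT-4o vision call per image. The sketch below shows what that call could look like; the file handling and helper name are illustrative, and writing the returned label back to the project's bounding box (done via the Edge Impulse API) is only hinted at.

```python
# Minimal sketch of the GPT-4o call inside such a transformation block.
# Assumes one bounding box per image; updating the label in the project
# is done separately via the Edge Impulse API.
import base64
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

PROMPT = (
    "There's an animal in this picture, respond with only the name of the animal "
    "species (all in lowercase), or \"unsure\" if you're not sure. Keep in mind that "
    "the animal's habitat is Yukon wilderness in Canada, and you can be specific "
    "about the species"
)

def label_species(image_path: str) -> str:
    """Ask GPT-4o for the species shown in the image and return it as a label."""
    with open(image_path, "rb") as f:
        b64_image = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": PROMPT},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64_image}"}},
            ],
        }],
    )
    return response.choices[0].message.content.strip().lower()

# Example: get the species label for one sample's single "animal" bounding box
print(label_species("sample_0001.jpg"))  # hypothetical file name
```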

The result indeed sped up the labeling process; I just had to join a couple of classes that represented the same species but got different spellings, e.g. "Alaska moose" and "Alaskan moose," which was done with a quick batch relabeling. Apart from that, very few samples were assigned the "unsure" label, and those were indeed tricky to classify, e.g. an animal exiting the frame. This process also reveals which classes to include in the final species model. For species with only a few samples, I disabled them until more data is available, to maintain a balanced class ratio. Seven species were considered to have enough data, including Alaska moose, American black bear, Canada lynx, coyote, grizzly bear, and Northwestern wolf.

Sample filtering in the Data acquisition view

I trained a YOLOv5 object detection model with an input resolution of 320x320. With all seven species included and using real footage only, the training precision score was 60.5% (COCO mAP).

Impulse design of the object detection model for seven species

I figured that more training data was likely needed to increase the model accuracy, so I turned to synthetic data generation once again. This time, I specifically requested images of the species present at my deployment site, generating 55 images per species and placing them in the training dataset. With this synthetic data added (300 images in total), the training precision increased to 77.3%.

Synthetically generated images of 'Northwestern wolf', labeled with AI-assisted labeling in the Studio labeling queue

The model performed comparably well on validation data from a separate camera trap recording, achieving an accuracy of 74.3% with the confidence score threshold set to the default 0.5.

Model testing with unseen data, achieving an accuracy of 74.3%
Classification of snapshot from camera trap in Yukon Wildlife Preserve

In conclusion, I'm pleased to see how foundation models can significantly speed up the labeling of camera trap data, and this gives a taste of how we could integrate them into an active learning pipeline. I'm also eager to continue working on the "edge variant" of MegaDetector, as it could be beneficial for many real-world applications.
