When deploying wildlife monitoring models to remote cameras in the field (known as camera traps), a common challenge is dealing with unexpected animal species that were not accounted for during training. For instance, the camera may misclassify a wolverine as a wolf because "wolverine" was not included in the dataset, but it shares enough features with a wolf to generate a false detection.
In this blog post we'll look at a way to overcome that, making use of:
- Synthetic data generation (recently released!)
- AI-assisted labeling in Edge Impulse's labeling queue
- Transformation blocks for labeling unknown bounding boxes with GPT-4o
Especially when developing a camera trap solution for inference at the edge, it makes sense to train the model only on the relevant animal classes that inhabit the specific area. This helps avoid confusing the model (for example, by not including the "Eurasian lynx" as a class when deploying the camera in Canada, where only the "Canada lynx" is expected) and also keeps model complexity in check.
Many camera trap applications use the MegaDetector deep learning model for animal localization, and then run image classification on each region of interest (ROI). In this project we'll follow their approach of having one model for detecting any animal, i.e. just identifying frames that contain "animal" objects and their bounding boxes. (This is the first step in the suggested approach below.)
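To make the two-stage idea concrete, here is a minimal Python sketch; `detect_animals` and `classify_species` are hypothetical stand-ins for the two models, not actual MegaDetector or Edge Impulse APIs:

```python
# Two-stage approach: a generic "animal" detector proposes regions of
# interest, and a species classifier runs on each crop.
# `detect_animals` and `classify_species` are hypothetical stand-ins.
from PIL import Image

def detect_then_classify(frame: Image.Image, detect_animals, classify_species):
    results = []
    for (x, y, w, h) in detect_animals(frame):   # boxes as (x, y, width, height) in pixels
        roi = frame.crop((x, y, x + w, y + h))   # cut out the region of interest
        species, confidence = classify_species(roi)
        results.append(((x, y, w, h), species, confidence))
    return results
```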
An active learning pipeline could be set up as follows (a code sketch follows the list):
- Deploy an object detection model that detects the general class "animal." For each detection, save the bounding box coordinates and the raw image.
- Run the collected images through a script that prompts an LLM for the animal species, given the image and information about the habitat, and replace the placeholder label with the species returned.
- From here we have two directions to go:
- Use this new dataset to train a new object detection model capable of differentiating the specific species. Replace the data-collecting deployment or run the two side by side. (This is the approach I explore further in this blog post.)
- Or, update the dataset for the object detection model so that it now has the identified species (e.g. "black bear" and "lynx") but still identifies the general "animal" class, to keep separating out any unknown animal. Note: in this step we'd also want to relabel any images with black bears or lynxes that were previously labeled "animal," so ideally that dataset should store the species as part of the filename or as metadata. Then retrain the model and update the deployment.
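As a rough sketch of that pipeline, with all names illustrative rather than an actual API (`run_detector` stands for the deployed "animal" model and `ask_llm_for_species` for the GPT-4o call shown later in this post):

```python
import json
from pathlib import Path

DATASET = Path("collected")
DATASET.mkdir(exist_ok=True)

def collect(frame_id, image_bytes, run_detector):
    """Step 1: save every frame where the generic model detects an 'animal'."""
    boxes = run_detector(image_bytes)            # list of bounding boxes, or empty
    if boxes:
        (DATASET / f"{frame_id}.jpg").write_bytes(image_bytes)
        (DATASET / f"{frame_id}.json").write_text(
            json.dumps({"label": "animal", "boxes": boxes}))

def relabel(ask_llm_for_species, habitat="Yukon wilderness in Canada"):
    """Step 2: replace the placeholder 'animal' label with an LLM-suggested species."""
    for meta_path in DATASET.glob("*.json"):
        meta = json.loads(meta_path.read_text())
        image_bytes = meta_path.with_suffix(".jpg").read_bytes()
        species = ask_llm_for_species(image_bytes, habitat)
        if species != "unsure":
            meta["label"] = species
            meta_path.write_text(json.dumps(meta))
```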
Inspired by MegaDetector, I started gathering a dataset to train an object detection model with only one class: "animal". The actual MegaDetector includes the classes "human" and "vehicle" as well, which I'll likely add later (stay tuned!). The training dataset for MegaDetector contains millions of images, most of which are available in open datasets from camera traps worldwide. For my proof of concept, I chose to explore whether a smaller amount of synthetic data could be effective for training this type of model. The prompt used was:
"A realistic camera trap image of a nature scene featuring an animal which is normally present in Nordic forests. The setting is a forest and the image has the typical quality of a camera trap photo, with slight graininess and natural lighting. The animal is captured in its natural behaviors"
The generated synthetic images are initially unlabeled. Here, too, I made use of a larger model: I labeled the images quickly using the "Label with YOLOv5" option in the labeling queue, accepted the suggested labels (such as "cat" for a lynx), and then relabeled everything at once as "animal".
Using this small dataset of 116 items, I trained a MobileNetV2 SSD object detection model, achieving a training accuracy of 84.6%. I quickly validated the model's performance on real footage by connecting my phone in classification mode and pointing it at a camera trap video recording. It successfully detected the animals!
The next step was to deploy the model and expose it to a real-world environment. In this case, I envisioned my deployment site to be the Yukon Wildlife Preserve in Canada. To simulate this, I chose Docker deployment in Edge Impulse and ran it on my laptop, feeding it snapshots from actual camera trap recordings from Yukon using a simple Python script. From approximately 30 minutes of compiled footage of interesting shots, I obtained 520 images with detected animals, which I saved as raw images along with their bounding boxes in YOLO TXT format (one of several formats supported in Edge Impulse).
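A trimmed-down sketch of that script is below. I'm assuming the container exposes the Edge Impulse HTTP inference server on port 1337 with an `/api/image` endpoint that returns `result.bounding_boxes`; verify the exact interface against your deployment's instructions. Note also that the bounding boxes may be reported in the model's input resolution rather than the original frame size, in which case they need rescaling:

```python
import cv2
import requests
from pathlib import Path

OUT = Path("detections")
OUT.mkdir(exist_ok=True)

video = cv2.VideoCapture("yukon_camera_trap.mp4")  # compiled camera trap footage
frame_idx = 0

while True:
    ok, frame = video.read()
    if not ok:
        break
    frame_idx += 1
    if frame_idx % 30:               # sample roughly one frame per second at 30 fps
        continue
    ok, jpeg = cv2.imencode(".jpg", frame)
    if not ok:
        continue
    resp = requests.post(
        "http://localhost:1337/api/image",   # assumed Edge Impulse HTTP inference endpoint
        files={"file": ("frame.jpg", jpeg.tobytes(), "image/jpeg")},
    )
    boxes = resp.json().get("result", {}).get("bounding_boxes", [])
    if not boxes:
        continue
    h, w = frame.shape[:2]
    cv2.imwrite(str(OUT / f"frame_{frame_idx:06d}.jpg"), frame)
    # YOLO TXT: one "class x_center y_center width height" line per box,
    # normalized to 0-1; class 0 is the single "animal" class.
    lines = [
        f"0 {(b['x'] + b['width'] / 2) / w:.6f} {(b['y'] + b['height'] / 2) / h:.6f} "
        f"{b['width'] / w:.6f} {b['height'] / h:.6f}"
        for b in boxes
    ]
    (OUT / f"frame_{frame_idx:06d}.txt").write_text("\n".join(lines))
```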
I then uploaded this data to a new project to develop a target model that can differentiate between species in this wildlife habitat. The bounding box labels were still all "animal," and manually relabeling them would be cumbersome. To address this, I created a custom transformation job, inspired by the public transformation job "Label image data using GPT-4o," which labels images for image classification. As my project uses object detection, my variant does the following:
- Select all samples with a certain label that represents "unlabeled" data (in this case "animal").
- Prompt GPT-4o to identify the species and attach the correct label to the bounding box generated by the model.
(For now, I limited the complexity to supporting only images with a single bounding box.) The prompt I used for this is:
"There's an animal in this picture, respond with only the name of the animal species (all in lowercase), or "unsure" if you're not sure. Keep in mind that the animal's habitat is Yukon wilderness in Canada, and you can be specific about the species"
This indeed sped up the labeling process; I just had to merge a couple of classes that represented the same species but were spelled differently, e.g. "Alaska moose" and "Alaskan moose," which was done with a quick batch relabel. Apart from that, very few samples were assigned the "unsure" label, and those were indeed tricky to classify, e.g. an animal exiting the frame. This process also tells us which classes to include in the final species model. I disabled species with only a few samples until more data becomes available, to maintain a balanced class ratio. The following species were considered to have enough data: Alaska moose, American black bear, Canada lynx, coyote, grizzly bear, and northwestern wolf.
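Outside the Studio UI, the same merge could be expressed as a small synonym map applied while post-processing labels; the mapping below simply mirrors the example above:

```python
# Map spelling variants onto one canonical class name.
SYNONYMS = {"alaskan moose": "alaska moose"}

def normalize_label(label: str) -> str:
    label = label.strip().lower()
    return SYNONYMS.get(label, label)

assert normalize_label("Alaskan moose") == "alaska moose"
```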
I trained a YOLOv5 object detection model with an input resolution of 320x320. With all of these species included and real footage only, the training precision was 60.5% (COCO mAP).
I figured that more training data was likely needed to increase the model accuracy, so I turned to synthetic data generation once again. This time, I specifically requested images of the species that are present at my deployment site, generating 55 images for each species and placing them in the training dataset. With the synthetic data added (300 images in total), the training precision increased to 77.3%.
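Reusing the earlier generation sketch, parameterized per species; the prompt wording here is my reconstruction rather than the exact one used, and `generate_images` is a hypothetical wrapper around the image API call shown earlier:

```python
# Per-species synthetic data: one prompt per target species.
SPECIES = ["alaska moose", "american black bear", "canada lynx",
           "coyote", "grizzly bear", "northwestern wolf"]

for species in SPECIES:
    prompt = (f"A realistic camera trap image of a {species} in the Yukon "
              "wilderness, with the typical graininess and natural lighting "
              "of a camera trap photo")
    generate_images(prompt, count=55)  # hypothetical wrapper; then upload to the dataset
```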
The model performed comparably well on validation data from a separate camera trap recording, achieving an accuracy of 74.3% with the confidence score threshold set to the default 0.5.
In conclusion, I'm pleased to see how foundation models can significantly speed up the labeling of camera trap data, and to get a taste of how we could integrate them into an active learning pipeline. I'm also eager to continue working on the "edge variant" of MegaDetector, as it could be beneficial for many real-world applications.