Smart home devices such as IoT security cameras, light bulbs, and voice assistants have become fixtures in many households. Hardware prices have dropped sharply in recent years, making it far easier for people to fill their homes with all sorts of connected gadgets. But the price on the box is only part of the story: bringing this equipment into your home also carries some hidden costs.
These devices frequently transmit (or could transmit) sensitive information from within your home to a cloud-based service provider. The developer may put a lot of sensible safeguards in place to protect your data, but when you are talking about live video and audio streams from your living room, is that enough to make you feel comfortable? With reports of exploited security vulnerabilities appearing in the headlines almost daily, it can be hard to fully trust anyone with your private data.
DIY AI
There is another option, of course: you can kick the cloud to the curb and build your own connected devices. But that would be really difficult, wouldn’t it? Maybe rolling the dice with a commercial service wouldn’t be so bad in comparison? Before you decide, you should consider the custom AI doorbell built by Roni Bandini. Video data never leaves his device, and the all-in-one hardware can be purchased right off the shelf, no degree in electrical engineering required.
Bandini’s plan was to use DFRobot’s ESP32-S3 AI Camera Module 1.0 to power the AI doorbell. It can be programmed to capture images at a regular interval, then analyze those images using a machine learning model developed with Edge Impulse. Once the model indicates that a face has been detected, the device will ask the visitor for their name. Their response is sent to ChatGPT, which makes a decision as to whether or not they may enter the home. If so, a relay unlocks the door.
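To make that flow concrete, here is a minimal Python sketch of the loop: capture, detect, ask, decide, unlock. It is not Bandini’s firmware (his project runs as Arduino code on the ESP32-S3). Every function name and timing value here is a hypothetical stand-in, passed in as a parameter so the logic can be tried on a desktop machine.

```python
import time
from typing import Callable, List, Optional


def run_doorbell(
    capture_frame: Callable[[], object],    # hypothetical: grabs one camera image
    detect_face: Callable[[object], bool],  # hypothetical: wraps the ML model
    ask_name: Callable[[], str],            # hypothetical: speaker prompt + transcription
    is_allowed: Callable[[str], bool],      # hypothetical: the access decision
    set_relay: Callable[[bool], None],      # hypothetical: True releases the lock
    interval_s: float = 1.0,                # assumed polling interval
    unlock_s: float = 5.0,                  # assumed time the door stays unlocked
    max_frames: Optional[int] = None,       # None means run forever
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Poll the camera and walk through detect -> ask -> decide -> unlock.

    Returns a log of events, which makes the flow easy to inspect.
    """
    events: List[str] = []
    frames = 0
    while max_frames is None or frames < max_frames:
        frames += 1
        frame = capture_frame()
        if detect_face(frame):
            events.append("face")
            name = ask_name()
            if is_allowed(name):
                events.append(f"unlock:{name}")
                set_relay(True)   # release the lock
                sleep(unlock_s)   # give the visitor time to open the door
                set_relay(False)  # always lock again afterwards
            else:
                events.append(f"deny:{name}")
        sleep(interval_s)
    return events
```

Passing the hardware-facing pieces in as functions keeps the decision logic separate from the camera, microphone, and relay, so each part can be swapped or stubbed out on its own.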

The inclusion of ChatGPT in the workflow does send a minimal amount of data to the cloud, but this step is optional. Bandini’s primary motivation for its use was to avoid issues with, for instance, improperly transcribed names that could inappropriately deny someone access. And for anyone concerned that this makes getting into the home too easy, the system can also send Telegram messages, which suits those who would like to take a more hands-on approach. When you build your own devices, everything can be customized.
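If you skip the optional ChatGPT step, the transcription problem still needs some kind of answer. One purely local possibility, which is a swapped-in technique and not what Bandini built, is fuzzy string matching with Python’s standard `difflib` module. The function name, the allow list, and the 0.75 cutoff below are all assumptions chosen for illustration.

```python
import difflib
from typing import Optional, Sequence


def match_visitor(
    transcript: str,
    allowed: Sequence[str],
    cutoff: float = 0.75,  # assumed similarity threshold; tune for your names
) -> Optional[str]:
    """Return the allowed name closest to the transcript, or None.

    Comparison is case-insensitive so "john" and "John" are treated alike.
    """
    cleaned = transcript.strip().lower()
    by_lower = {name.lower(): name for name in allowed}
    hits = difflib.get_close_matches(cleaned, list(by_lower), n=1, cutoff=cutoff)
    return by_lower[hits[0]] if hits else None


if __name__ == "__main__":
    guests = ["John", "Maria"]  # hypothetical allow list
    for heard in ("Jon", "mariah", "Bob"):
        print(heard, "->", match_visitor(heard, guests))
```

A lower cutoff forgives more transcription mistakes but also makes it easier for a similar-sounding name to slip through, so this trade-off deserves care on anything that controls a lock.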
The ESP32-S3 AI Camera Module 1.0 comes equipped with all the hardware needed for the project. The two-megapixel wide-angle infrared camera can capture images of visitors day or night, and an onboard microphone and speaker handle the voice interactions. The ESP32-S3 microcontroller may have modest computing power, but with a well-optimized machine learning model created with Edge Impulse, it will do just fine. Best of all, the AI Camera Module costs less than many commercial IoT security cameras on the market.
I love it when a plan comes together
To keep everything looking nice, Bandini designed a 3D-printed case to house the hardware. But before the hardware went into the case, it first needed to be programmed to do its job. The most important part of that job is face detection, a task Bandini knew an object detection model would be well suited for. These models must be trained on a set of representative data before they are ready for use, so Bandini channeled his inner Zoolander and snapped some photos for this purpose.

The images were uploaded to Edge Impulse using the Data Acquisition tool. Object detection models also need to know specifically which objects in the images they should recognize, so the Labeling Queue tool was used to annotate the images with a bounding box around each face. This can be tedious to do for a large dataset, but the tool provides AI assistance to greatly speed up the annotation work.
With the data in place, it was time for Bandini to build the impulse. An impulse defines exactly how data is processed, from the time it leaves the sensor until a prediction is made by the machine learning model. In this case, the images were reduced in size during preprocessing, which lowers the computational load of downstream steps. That is essential when working with an ESP32 microcontroller. Finally, the resized images were fed into FOMO, Edge Impulse’s in-house object detection model, which excels at making accurate predictions on severely resource-constrained hardware platforms.
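The two ideas in that pipeline, shrinking the image and then locating objects on a coarse grid, can be sketched in a few lines of plain Python. This is an illustration only, not Edge Impulse’s implementation. FOMO reports object centroids on a low-resolution grid instead of full bounding boxes. The 96×96 input size and 12×12 grid used here are assumed values, since the write-up does not state the settings Bandini chose.

```python
from typing import List, Tuple


def downscale(image: List[List[int]], out_w: int, out_h: int) -> List[List[int]]:
    """Nearest-neighbour resize of a 2-D list of pixel values."""
    in_h, in_w = len(image), len(image[0])
    return [
        [image[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]


def centroid_cell(
    box: Tuple[float, float, float, float],  # (x, y, width, height) in pixels
    image_size: int = 96,                    # assumed square model input
    cells_per_side: int = 12,                # assumed output grid (input / 8)
) -> Tuple[int, int]:
    """Map a labelled bounding box to the (row, col) grid cell holding its centre."""
    x, y, w, h = box
    cx, cy = x + w / 2, y + h / 2
    cell = image_size / cells_per_side
    # Clamp so a box touching the right or bottom edge stays inside the grid.
    col = min(int(cx // cell), cells_per_side - 1)
    row = min(int(cy // cell), cells_per_side - 1)
    return row, col


if __name__ == "__main__":
    tiny = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
    print(downscale(tiny, 2, 2))
    print(centroid_cell((40, 16, 16, 16)))
```

Fewer pixels in means fewer multiplications in every later layer, and a coarse centroid grid is much cheaper to produce than precise box coordinates. That combination is what makes this kind of model a practical fit for a microcontroller.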

Bandini trained this model using the dataset that was previously uploaded. When the process finished, an F1 score of 76.5% was reported. This is good enough to prove the concept, but if you need a higher degree of accuracy, that can be achieved by providing the model with more training data from which to learn.
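For readers unfamiliar with the metric, the F1 score is the harmonic mean of precision and recall. The short sketch below shows the arithmetic. The counts in the example are made up for illustration and are not Bandini’s results.

```python
def f1_score(true_pos: int, false_pos: int, false_neg: int) -> float:
    """Compute F1 from detection counts.

    precision = TP / (TP + FP): how many reported faces were real.
    recall    = TP / (TP + FN): how many real faces were found.
    """
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical counts: 3 faces found, 1 false alarm, 3 faces missed.
    print(round(f1_score(3, 1, 3), 3))
```

Because F1 penalizes both false alarms and missed faces, adding more varied training images tends to help on both fronts, which is why more data is the usual first remedy.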
Since Bandini had more work to do, like collecting audio samples and hooking into ChatGPT, he went with one of the most versatile options for model deployment: an Arduino library. With this option, any code the application needs can run alongside the machine learning model and act on its predictions as they are produced.
Opportunity? Is that you knocking?
Before installing this AI doorbell prototype at your front door, you can probably think of a few things you would like to tweak. Fortunately, Bandini has made all of his source code publicly available, so you can go grab it to get a head start on those changes. He has also written up a detailed description of the work to answer any other questions that you may have along the way.