Object Classification with ESP32-S3 AI Camera: Distinguishing Hand Cream from Pencils

Introduction:

In this article, I would like to take you on my little journey with the tiny ESP32-S3 AI Camera (DFR1154).

🔧 Step 1: Setting up the Arduino IDE for the ESP32-S3 AI Camera (DFR1154)

Before diving into machine learning or camera-based projects, I had to set up the ESP32-S3 AI camera with the Arduino IDE.

💡 It is best to follow this tutorial: https://wiki.dfrobot.com/SKU_DFR1154_ESP32_S3_AI_CAM

⚠ Please make sure to enable USB CDC On Boot (Tools → USB CDC On Boot → Enabled), otherwise you will not be able to use the serial monitor or obtain debug information from the board.
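A quick way to verify that the setting took effect is a minimal serial sketch; if the serial monitor stays silent while this is running, USB CDC is most likely still disabled:

```cpp
// Minimal check that USB CDC serial output is working.
void setup() {
  Serial.begin(115200);
}

void loop() {
  Serial.println("ESP32-S3 AI CAM is alive");
  delay(1000);  // print once per second
}
```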

🔍 Step 2: Testing the First Examples – Light Sensor & Web Server

I started simple by testing two built-in examples:

- Ambient light sensor reading (via I2C)
- Camera WebServer (serving a live stream via Wi-Fi)

Both examples helped me verify the hardware setup and explore the capabilities of the ESP32-S3 AI CAM.
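If the light-sensor example returns no readings, a generic I2C scanner is a quick sanity check that the sensor answers on the bus at all. A minimal sketch - note that I am assuming the default Wire pins here; the DFR1154 may route SDA/SCL to specific GPIOs, so check the wiki's pin map:

```cpp
#include <Wire.h>

void setup() {
  Serial.begin(115200);
  Wire.begin();  // assumption: default SDA/SCL pins; adjust to the DFR1154 pin map if needed
}

void loop() {
  Serial.println("Scanning I2C bus...");
  for (uint8_t addr = 1; addr < 127; addr++) {
    Wire.beginTransmission(addr);
    if (Wire.endTransmission() == 0) {
      Serial.printf("Device found at 0x%02X\n", addr);  // the light sensor should appear here
    }
  }
  delay(5000);  // rescan every five seconds
}
```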

The WebServer project was up and running very quickly, and it allowed me to start collecting pictures for the training and test datasets of my own AI project.
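💡 A tip for the data collection: besides the live stream, the stock Espressif CameraWebServer example also serves single still images over HTTP, which makes it easy to script the capture from a PC. Assuming DFRobot's variant keeps that handler, something like `curl http://<board-ip>/capture -o pencil_001.jpg` in a loop collects frames without clicking through the web UI.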

🚨 Step 3: Getting Stuck – The Edge Impulse Reboot Loop

I attempted to flash an Edge Impulse project (image classification) directly to the board using the exported firmware. Unfortunately, this resulted in a continuous reboot loop immediately after boot:

What happened:
I uploaded a sketch that crashes on boot (Guru Meditation Error).

Since then, the board continuously resets, and I can’t successfully start an upload anymore.

Serial output after reset: a Guru Meditation Error, repeating with every reboot.

What I tried:

- BOOT + RESET button combinations
- Holding BOOT, pressing RESET, then starting the upload → no success

This temporarily bricked the board (a soft brick), as I could no longer access it via serial.

💡 The solution (from https://esp32.com/viewtopic.php?t=46266):

Hold BOOT the whole time, press RESET once briefly, then start the upload (while still holding BOOT) → success.
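💡 If the button gymnastics ever fail completely, erasing the flash is another way out (assuming Python and esptool are installed): `esptool.py --chip esp32s3 --port <YOUR_PORT> erase_flash`. With the crashing sketch gone, a normal upload from the Arduino IDE works again.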

📷 Step 4: My First AI Project – Distinguishing Between Pencil and Hand Cream

In the next step, I set up a small AI project to enable the AI camera to distinguish between a pencil and hand cream. I know, an extremely important distinction to be able to make ;)


To do this, I collected over 117 photos of a pencil, a tube of hand cream, and the background in Edge Impulse. The model therefore has to distinguish between three classes: hand cream, pencil, and unknown. I took the pictures with my smartphone and the ESP32 Cam from different angles and against different backgrounds.

For this project I have followed this tutorial: https://wiki.dfrobot.com/EdgeImpulse_Object_Detection

💡 I would like to add two points that I believe were missing from the tutorial (at least when I used it) - both are settings within the Edge Impulse web app:

- the target must be set to ‘ESP-EYE’
- TensorFlow Lite must be selected as the compiler
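For reference, once the model is exported as an Arduino library, the inference loop on the board looks roughly like this. This is only a sketch: the header name below is hypothetical (Edge Impulse generates one per project), and filling the feature buffer with resized camera pixels is the part the DFRobot tutorial's camera code takes care of:

```cpp
#include <pencil_vs_handcream_inferencing.h>  // hypothetical name; use your project's generated header

// Feature buffer; for image models this holds the resized, packed camera frame.
static float features[EI_CLASSIFIER_DSP_INPUT_FRAME_SIZE];

// Callback the Edge Impulse SDK uses to pull feature data.
static int get_feature_data(size_t offset, size_t length, float *out_ptr) {
  memcpy(out_ptr, features + offset, length * sizeof(float));
  return 0;
}

void setup() {
  Serial.begin(115200);
}

void loop() {
  // (Filling `features` from the camera is omitted here.)
  signal_t signal;
  signal.total_length = EI_CLASSIFIER_DSP_INPUT_FRAME_SIZE;
  signal.get_data = &get_feature_data;

  ei_impulse_result_t result = { 0 };
  if (run_classifier(&signal, &result, false) != EI_IMPULSE_OK) {
    return;  // classification failed; try again next loop
  }

  // Print the score for each class: hand cream, pencil, unknown.
  for (size_t ix = 0; ix < EI_CLASSIFIER_LABEL_COUNT; ix++) {
    ei_printf("%s: %.3f\n", result.classification[ix].label,
              result.classification[ix].value);
  }
  delay(1000);
}
```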

Here we can see the dataset with the pencil and the hand cream:

After training, we evaluated the model using Edge Impulse's confusion matrix:

- Hand Cream: 87.5% correctly classified, but 12.5% misclassified as Pencil.
- Pencil: 83.3% accuracy, with 16.7% confused with Hand Cream.
- Unknown: correctly identified 80% of the time, but occasionally misclassified as Hand Cream.

➡️ Why is this important?

The confusion matrix reveals exactly which mistakes the model makes:

- Misclassifications between Hand Cream and Pencil suggest that the visual differences between these objects are not always clear to the model.
- The model relies heavily on shape and texture cues, which might overlap in some images (e.g., both objects being roughly cylindrical).

➡️ Other Metrics Observed:

- Accuracy: 84.2%
- Weighted Precision / Recall / F1: ~84–85%
- ROC AUC: 0.96 (indicating good class separability in general)
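For readers new to these metrics, the standard per-class definitions (with TP, FP, FN = true positives, false positives, false negatives) are:

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

The ‘weighted’ variants average these per-class values, weighted by the number of samples in each class.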

🖨 Results:

When pointing the camera at a pencil, we get the following results:

When we point the camera at a tube of hand cream, we get much better results:
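Since the model already has an ‘unknown’ class, one simple way to make the live results more robust (my own idea, not part of the tutorial) is to accept a prediction only when the top score clears a confidence threshold and to fall back to ‘unknown’ otherwise. A minimal, self-contained helper:

```cpp
#include <stddef.h>

// Return the best-scoring label, or "unknown" if even the best
// score stays below the confidence threshold.
const char *top_class(const char *const labels[], const float scores[],
                      size_t n, float threshold) {
  size_t best = 0;
  for (size_t i = 1; i < n; i++) {
    if (scores[i] > scores[best]) best = i;
  }
  return (scores[best] >= threshold) ? labels[best] : "unknown";
}

// Example with scores as they might come out of result.classification[]:
//   const char *labels[] = { "hand cream", "pencil", "unknown" };
//   float scores[]       = { 0.62f, 0.30f, 0.08f };
//   top_class(labels, scores, 3, 0.6f);  // -> "hand cream"
```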

Outlook:

I am currently working on a project for posture detection using AI. The goal is to distinguish between ‘good’ and ‘bad’ posture in photos. However, the first prototype of the model still has significant weaknesses. As can be seen in the confusion matrix, the model is currently unable to reliably distinguish between the two classes: the accuracy is only 25%, the precision 0.06, and the F1 score 0.10 – practically at chance level.

In the next step, I would therefore like to improve the quality and diversity of my training data and test alternative model approaches in order to significantly increase the ability to distinguish between the different poses.

License: All Rights Reserved