Speech and Vision with DFR1154 ESP32-S3 AI Camera

🧪 What if we could give senses to AI?

 

For the past few years, I’ve been experimenting with ESP32 modules, plugging them into my Alambic framework. The goal? Build a seamless voice ↔️ LLM ↔️ voice loop — in any language — enriched with vision and a wide array of sensors. In short, give senses to artificial intelligence.

 

🎉 Good news: I finally found the right device — for under $25!

✅ WebSocket communication with Node-RED
✅ Powerful enough for light Edge AI
✅ Compatible with dozens of sensors
✅ And yes… it talks! 🗣️ In any language!
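As a sketch of the WebSocket ↔ Node-RED link, here is a minimal ESP32 client using the common links2004 `arduinoWebSockets` library. The Wi-Fi credentials, the Node-RED host/port, and the `/ws/audio` path are placeholders — adapt them to your own flow (a `websocket in` node listening on that path):

```cpp
#include <WiFi.h>
#include <WebSocketsClient.h>  // links2004/arduinoWebSockets library

WebSocketsClient ws;

// Placeholder credentials and endpoint -- adapt to your network and flow
const char*    WIFI_SSID     = "my-ssid";
const char*    WIFI_PASS     = "my-pass";
const char*    NODE_RED_HOST = "192.168.1.10";  // hypothetical Node-RED host
const uint16_t NODE_RED_PORT = 1880;            // Node-RED default port
const char*    WS_PATH       = "/ws/audio";     // hypothetical websocket-in path

void onWsEvent(WStype_t type, uint8_t* payload, size_t len) {
  switch (type) {
    case WStype_CONNECTED:
      ws.sendTXT("hello from DFR1154");
      break;
    case WStype_TEXT:
      Serial.printf("Node-RED says: %.*s\n", (int)len, payload);
      break;
    default:
      break;
  }
}

void setup() {
  Serial.begin(115200);
  WiFi.begin(WIFI_SSID, WIFI_PASS);
  while (WiFi.status() != WL_CONNECTED) delay(250);

  ws.begin(NODE_RED_HOST, NODE_RED_PORT, WS_PATH);
  ws.onEvent(onWsEvent);
  ws.setReconnectInterval(2000);  // auto-reconnect if the link drops
}

void loop() {
  ws.loop();  // must run often to service the socket
}
```

Binary audio frames can go the same way with `ws.sendBIN()`, which is what makes the voice loop possible over a single socket.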

 

🎥 Real-time demo available (a bit slow for now, blame the MS API latency). With a real-time model, it’ll be much smoother (see the wiki).


 

I was really impressed that ChatGPT-4o was able to describe the dragon accurately, even though the image quality was very poor!

 

⚙️ The real challenge?
👉 Taming C++ in the real world: managing Wi-Fi power, avoiding CPU crashes, distributing tasks across cores… every detail matters.
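To illustrate the "distributing tasks across cores" point: on the dual-core ESP32-S3 the usual pattern is to pin the network work to core 0 (alongside the Wi-Fi stack) and the audio/camera work to core 1 with FreeRTOS. This is a minimal sketch — the task bodies are hypothetical placeholders, and the stack sizes are rough guesses you'd tune for your workload:

```cpp
#include <Arduino.h>

// Hypothetical workers: keep network I/O on core 0 (where the Wi-Fi
// stack runs) and heavier audio/camera work on core 1.
void networkTask(void* arg) {
  for (;;) {
    // service WebSocket / Wi-Fi here
    vTaskDelay(pdMS_TO_TICKS(10));  // yield so the task watchdog stays happy
  }
}

void mediaTask(void* arg) {
  for (;;) {
    // capture / encode audio and frames here
    vTaskDelay(pdMS_TO_TICKS(10));
  }
}

void setup() {
  // xTaskCreatePinnedToCore(fn, name, stack_words, arg, priority, handle, core)
  xTaskCreatePinnedToCore(networkTask, "net",   4096, nullptr, 1, nullptr, 0);
  xTaskCreatePinnedToCore(mediaTask,   "media", 8192, nullptr, 1, nullptr, 1);
}

void loop() { vTaskDelay(pdMS_TO_TICKS(1000)); }  // everything runs in the tasks
```

Forgetting the `vTaskDelay()` yields is one of the classic ways to trigger the watchdog resets mentioned above.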

🔧 But there's still a major frustration: how does the Gravity connector actually work?
 

Thanks to DFRobot for the DFR1154, a brilliant board. But I'm stuck:

 

- Is the Gravity port UART-only, not I2C? If so, why such a limitation?
- I tried connecting the SEN0539 (for KWS) and the DFR0997 (for the camera), with no success.
- There's no way to wire a button like the DFR0785, which could trigger speech events.
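One way to test the UART-vs-I2C question empirically: if the Gravity pins are ordinary GPIOs, the ESP32's I2C peripheral can be remapped onto them and a bus scan will show whether a device (like the SEN0539) ACKs its address. The pin numbers below are placeholders — check them against the DFR1154 schematic before wiring anything:

```cpp
#include <Wire.h>

// Hypothetical pin numbers -- replace with the actual Gravity connector
// GPIOs from the DFR1154 schematic (they may be dedicated UART TX/RX).
const int GRAVITY_SDA = 1;
const int GRAVITY_SCL = 2;

void setup() {
  Serial.begin(115200);
  Wire.begin(GRAVITY_SDA, GRAVITY_SCL);  // ESP32 allows I2C on most GPIOs

  // Classic I2C scan: any attached device should ACK its 7-bit address
  for (uint8_t addr = 1; addr < 127; addr++) {
    Wire.beginTransmission(addr);
    if (Wire.endTransmission() == 0) {
      Serial.printf("I2C device found at 0x%02X\n", addr);
    }
  }
}

void loop() {}
```

If the scan finds nothing on those pins, that would support the UART-only hypothesis.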

 

I posted a detailed question on the DFRobot forum, but I'm still waiting for answers.

🙏 If you have any tip, workaround, or code snippet to make the Gravity port more useful — I’m all ears.

License
All Rights Reserved