Gesture MIDI Musical Instrument Based on Mediapipe Hand Gesture Recognition
1. Project Introduction
1.1 Project Overview
Wave your hand to conduct the music. No touch required! This project transforms the UNIHIKER K10 into a gesture-controlled musical instrument. Using MediaPipe hand gesture recognition, it detects and interprets common hand gestures to control music playback in real time. With just a flick of your wrist, you can pause, play, or switch tracks, like a true digital conductor.

Communication between the computer and the UNIHIKER K10 is established via SIoT. The K10's onboard camera captures live video, which is streamed to the computer through the webcam interface. On the computer, MediaPipe performs real-time hand gesture recognition. Raising 1, 2, 3, or 4 fingers selects one of four music clips; an upward wave triggers Play; a downward wave triggers Stop; a left swipe plays the previous track; a right swipe plays the next track.
The recognition results are sent back to the UNIHIKER K10 via SIoT, and the K10 then uses its onboard speaker to play and switch music accordingly. In this way, everyone can become a true "music conductor," freely creating music through gestures.
1.2 Project Functional Diagrams

1.3 Project Video
2. Materials List

2.1 Hardware List

2.2 Software
Mind+ Graphical Programming Software (Minimum Version Requirement: V1.8.1 RC3.0)

2.3 Basic Mind+ Software Usage
(1) Double-click to open Mind+
The following screen will appear.

Click and switch to offline mode.

(2) Load UNIHIKER K10
Following the previous steps, click "Extensions", find the "UNIHIKER K10" module under the "Board" category, and click to add it. After clicking "Back", you will find the UNIHIKER K10 blocks in the Command Area, which completes the loading of the UNIHIKER K10.

Then, you need to use a USB cable to connect the UNIHIKER K10 to the computer.

Then, after clicking Connect Device, click COM7-UNIHIKER K10 to connect.

Note: The device name may vary between UNIHIKER K10 units, but it always ends with K10.
In Windows 10/11, the UNIHIKER K10 is driver-free. However, for Windows 7, manual driver installation is required: https://www.unihiker.com/wiki/K10/faq/#high-frequency-problem.
The next interface you see is the Mind+ programming interface. Let's see what this interface consists of.

Note: For a detailed description of each area of the Mind+ interface, see the Knowledge Hub section of this lesson.
3. Construction Steps
The project is divided into three main parts:
(1) Task 1: UNIHIKER K10 Networking and Webcam Activation
Connect UNIHIKER K10 through IoT communication and enable the webcam function to establish a visual communication channel with the computer for transmitting video data.
(2) Task 2: Visual Detection and Data Upload
The UNIHIKER K10 camera transmits the captured gesture video to the computer, where the live video stream is processed with the MediaPipe library. At the same time, the UNIHIKER K10 is connected to the MQTT platform, and the detection results are uploaded to the SIoT platform by the computer.
(3) Task 3: UNIHIKER K10 Receiving Results and Executing Control
The UNIHIKER K10 remotely retrieves the inference results from SIoT and then plays different music based on the detected hand gestures.
3.1 Task 1: UNIHIKER K10 Networking and Webcam Activation
(1) Hardware Setup
Confirm that the UNIHIKER K10 is connected to the computer via a USB cable.
(2) Software Preparation
Make sure that Mind+ is opened and the UNIHIKER board has been successfully loaded. Once confirmed, you can proceed to write the project program.

(3) Write the Program
UNIHIKER K10 Network Connection
To enable communication between the computer and UNIHIKER K10, first ensure both devices are connected to the same local network.

Note: For more information about the MQTT protocol and IoT components, refer to the Knowledge Hub
First, add the MQTT communication and Wi-Fi modules from the extension library. Refer to the diagram for the commands.

After the UNIHIKER K10 is connected to the network, its camera and networking functions are used to stream the camera feed to the local area network (LAN), so that any computer on the LAN can access it at any time. This lets the computer apply existing computer vision libraries to the images. Therefore, we next need to load the webcam library so that the video captured by the UNIHIKER K10 can be transmitted to the computer.
Click on "Extensions" in the Mind+ toolbar, enter the "User Ext" category. Input: https://gitee.com/yeezb/k10web-cam in the search bar, and click to load the K10 webcam extension library.
User library link: https://gitee.com/yeezb/k10web-cam


We need to use the "Wi-Fi connect to account (SSID, Password)" command in the network communication extension library to configure Wi-Fi for the UNIHIKER K10. Please ensure that the UNIHIKER K10 connects to the same Wi-Fi network as your computer. We also need to use the "Webcam On" function to enable the webcam feature, so that the video captured by the UNIHIKER K10 can be transmitted to the computer.

Once the network connection is successfully established, the "Webcam On" block can be used to transmit the video captured by the UNIHIKER K10 to the computer.

Then, we need to read the IP address of the UNIHIKER K10 from the serial monitor. Use a loop that repeats five times to print the IP address five times: drag in the serial content output block, select string output and enable line wrapping, then use the Wi-Fi configuration block to obtain the acquired IP address and display it once every second.

Click the Upload button. When the upload progress reaches 100%, the program has been successfully uploaded.

Open the serial monitor, and you can see the IP address of the UNIHIKER K10.
When you open IP/stream in your browser, you can view the camera feed. For example, with the IP address shown in the picture above, you would enter 192.168.1.12/stream in your browser.
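If you want a quick way to confirm from the computer that the stream is reachable before writing the full program, the short sketch below can help. It is optional and assumes the example address 192.168.1.12 from the screenshot; substitute the IP address of your own UNIHIKER K10.
# Optional sanity check: open the K10's MJPEG stream with OpenCV and show the frames.
# Replace 192.168.1.12 with the IP address shown in your serial monitor.
import cv2

cap = cv2.VideoCapture('http://192.168.1.12/stream')
if not cap.isOpened():
    print("Could not open the stream - check the IP address and that Webcam On ran on the K10.")
else:
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        cv2.imshow('K10 stream check', frame)
        if cv2.waitKey(1) & 0xFF == 27:  # press ESC to quit
            break
    cap.release()
cv2.destroyAllWindows()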


3.2 Task 2: Visual Detection and Data Upload
Next, the UNIHIKER K10 needs to connect to the MQTT platform. We will create an SIoT topic to store the results and use the computer to detect gestures. The detection results are then sent to SIoT for subsequent processing by the UNIHIKER K10.
(1) Prepare the Computer Environment
First, on our computer, we need to download the Windows version of SIoT_V2, extract it, and double-click start SIoT.bat to start SIoT. After starting, a black window will pop up to initialize the server. Critical note: DO NOT close this window during operation, as doing so immediately terminates the SIoT service.


Note: For details on downloading SIoT_V2, please refer to: https://drive.google.com/file/d/1qVhyUmvdmpD2AYl-2Cijl-2xeJgduJLL/view?usp=drive_link
After starting SIoT.bat on the computer, initialize the MQTT parameters on the UNIHIKER K10: set the IP address to the local computer's IP, the username to siot, and the password to dfrobot.

We need to install the required Python dependencies, which are used for recognizing and processing hand gesture information. Open a new Mind+ window, navigate to the Mode Switch section, and select "Python Mode".

In Python Mode, click Code in the toolbar, then navigate to Library Management; the Library Installation page will open for dependency management.

Click "PIP Mode" and run the following commands in sequence to install five libraries including the mediapipe library.
pip install mediapipe
pip install numpy
pip install requests
pip install opencv-python
pip install opencv-contrib-python

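To confirm the installation succeeded, you can optionally run a short import check in Python mode; this minimal sketch only prints the version numbers of the installed packages.
# Optional: verify that the installed libraries can be imported
import cv2
import numpy
import requests
import mediapipe

print("opencv:", cv2.__version__)
print("numpy:", numpy.__version__)
print("mediapipe:", mediapipe.__version__)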
(2) Write the Program
STEP One: Create Topics

In a web browser on your computer, go to "Your Computer IP Address:8080", for example "192.168.1.11:8080".

Enter the username 'siot' and password 'dfrobot' to log in to the SIoT IoT platform.

After logging into the SIoT platform, navigate to the Topic section and create a topic named 'Gestures' (used to store the control instructions for the UNIHIKER K10). Refer to the operations shown in the image below.
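If you want to confirm the topic works before writing the full detection program, you can optionally send one test message to it using the same siot library calls that appear in the detection code below. This is a minimal sketch; it assumes your computer's SIoT address is 192.168.1.11, as in the screenshots.
# Optional test: publish one message to siot/Gestures to confirm the topic receives data
import siot
import time
import threading

siot.init(client_id="topic_test", server="192.168.1.11", port=1883, user="siot", password="dfrobot")
siot.connect()
threading.Thread(target=siot.loop, daemon=True).start()  # keep the MQTT connection serviced

siot.publish_save(topic="siot/Gestures", data="one")
time.sleep(1)  # give the message time to go out before the script exits
print("Test message sent - check the Gestures topic page in the SIoT web interface.")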

Next, we will write the gesture detection code, whose main functions are detecting gestures and sending the detection results to the SIoT IoT platform.

In the 'Files in Project' directory of the right sidebar in Mind+, create a new Python file named "PC.py".


STEP Two: Gesture Recognition Code

We use the MediaPipe library to detect gestures; it can recognize both static and dynamic gestures. The resulting commands are then published to the 'Gestures' topic on the SIoT platform.
import cv2
import requests
import numpy as np
import mediapipe as mp
import siot
import time
from collections import deque
import threading

# Initialize SIoT
siot.init(client_id="5417528847255362", server="192.168.1.11", port=1883, user="siot", password="dfrobot")
siot.connect()

# Run siot.loop() in a thread to avoid blocking the main thread
def run_siot_loop():
    siot.loop()

loop_thread = threading.Thread(target=run_siot_loop, daemon=True)
loop_thread.start()

# Initialize MediaPipe
mp_hands = mp.solutions.hands
hands = mp_hands.Hands(static_image_mode=False,
                       max_num_hands=1,
                       min_detection_confidence=0.7,
                       min_tracking_confidence=0.7)
mp_draw = mp.solutions.drawing_utils

# Gesture utility functions
def count_fingers(landmarks):
    tips = [8, 12, 16, 20]
    mcps = [6, 10, 14, 18]
    count = 0
    for tip, mcp in zip(tips, mcps):
        if landmarks[tip].y < landmarks[mcp].y:
            count += 1
    if landmarks[4].x < landmarks[2].x:  # Detect thumb
        count += 1
    return count

history = deque(maxlen=15)

# Gesture state tracking
class GestureState:
    def __init__(self):
        self.last_gesture = None
        self.last_gesture_time = 0
        self.confirmed_gesture = None
        self.gesture_count = 0
        self.cooldown = 0.8  # Gesture cooldown time (seconds)
        self.mode = "static"  # Current mode: static/dynamic
        self.mode_lock_until = 0  # Mode lock expiration time

    def update(self, gesture, current_time):
        # Check mode lock
        if current_time < self.mode_lock_until:
            return False
        # Reset counter if gesture changed
        if gesture != self.last_gesture:
            self.gesture_count = 0
        # Increment counter
        self.gesture_count += 1
        # Check if confirmation condition is met (3 consecutive frames with same gesture)
        if self.gesture_count >= 3 and gesture != self.confirmed_gesture:
            # Check if within cooldown period
            if current_time - self.last_gesture_time > self.cooldown:
                self.confirmed_gesture = gesture
                self.last_gesture_time = current_time
                return True
        self.last_gesture = gesture
        return False

    def switch_mode(self, new_mode, current_time, lock_duration=2.0):
        """Switch mode and lock for a period of time"""
        if self.mode != new_mode:
            self.mode = new_mode
            self.mode_lock_until = current_time + lock_duration
            return True
        return False

gesture_state = GestureState()

# Read MJPEG stream
url = 'http://192.168.1.12/stream'  # Replace with actual accessible address
stream = requests.get(url, stream=True, timeout=10)
bytes_data = b''

# Independent timestamp variables
last_static_time = time.time()
last_dynamic_time = time.time()
last_mode_switch_time = 0

# Process MJPEG stream
for chunk in stream.iter_content(chunk_size=1024):
    if chunk:
        bytes_data += chunk
        a = bytes_data.find(b'\xff\xd8')
        b = bytes_data.find(b'\xff\xd9')
        if a != -1 and b != -1:
            jpg = bytes_data[a:b + 2]
            bytes_data = bytes_data[b + 2:]
            img = cv2.imdecode(np.frombuffer(jpg, dtype=np.uint8),
                               cv2.IMREAD_COLOR)
            if img is None:
                continue
            rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
            res = hands.process(rgb)
            current_time = time.time()
            gesture_detected = False  # Flag indicating whether a gesture was detected in current frame
            dynamic_gesture = None
            static_gesture = None
            if res.multi_hand_landmarks:
                for hand_landmarks in res.multi_hand_landmarks:
                    mp_draw.draw_landmarks(img, hand_landmarks,
                                           mp_hands.HAND_CONNECTIONS)
                    lm = hand_landmarks.landmark
                    # Static gesture: count fingers (don't show 5)
                    finger_num = count_fingers(lm)
                    if finger_num < 5:  # Only show 1-4 fingers
                        cv2.putText(img, f'Fingers: {finger_num}', (10, 40),
                                    cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)
                    # Dynamic gesture: sliding
                    tip = lm[8]
                    history.append((tip.x, tip.y))
                    if len(history) == 15:
                        dx = history[-1][0] - history[0][0]
                        dy = history[-1][1] - history[0][1]
                        # Detect movement direction
                        if abs(dx) > abs(dy) and abs(dx) > 0.15:
                            dynamic_gesture = "RIGHT" if dx > 0 else "LEFT"
                        elif abs(dy) > abs(dx) and abs(dy) > 0.15:
                            dynamic_gesture = "UP" if dy < 0 else "DOWN"
                    # Mode switch detection: rapid finger opening/closing
                    if len(history) > 5:
                        # Detect finger opening/closing changes in recent 5 frames
                        recent_changes = []
                        for i in range(1, 6):
                            index = -i
                            prev_tip = history[index - 1] if abs(index) < len(history) else None
                            curr_tip = history[index]
                            if prev_tip:
                                dist_change = abs(curr_tip[0] - prev_tip[0]) + abs(curr_tip[1] - prev_tip[1])
                                recent_changes.append(dist_change > 0.1)  # Threshold judgment
                        # If 3 or more frames in recent 5 frames have significant changes, trigger mode switch
                        if sum(recent_changes) >= 3 and current_time - last_mode_switch_time > 3.0:
                            new_mode = "dynamic" if gesture_state.mode == "static" else "static"
                            if gesture_state.switch_mode(new_mode, current_time):
                                print(f"Switched to {new_mode} mode")
                                last_mode_switch_time = current_time
                    # Process gestures based on current mode
                    if gesture_state.mode == "static":
                        # Static gesture mode
                        if 1 <= finger_num <= 4:  # Only process 1-4 fingers
                            static_gesture = finger_num
                            gesture_detected = True
                            # Send static gesture
                            if current_time - last_static_time > 0.5:
                                gesture_name = ["one", "two", "three", "four"][finger_num - 1]
                                siot.publish_save(topic="siot/Gestures", data=gesture_name)
                                last_static_time = current_time
                                print(f"Sent static gesture: {gesture_name}")
                    else:
                        # Dynamic gesture mode
                        if dynamic_gesture:
                            cv2.putText(img, f'Detected: {dynamic_gesture}', (10, 80),
                                        cv2.FONT_HERSHEY_SIMPLEX, 0.7,
                                        (200, 200, 0), 2)
                            # Use state machine to confirm gesture
                            if gesture_state.update(dynamic_gesture, current_time):
                                confirmed_gesture = gesture_state.confirmed_gesture
                                cv2.putText(img, f'Confirmed: {confirmed_gesture}', (10, 110),
                                            cv2.FONT_HERSHEY_SIMPLEX, 1.2,
                                            (0, 255, 0), 2)
                                print(f"Confirmed gesture: {confirmed_gesture}")
                                # Send confirmed gesture
                                if current_time - last_dynamic_time > 0.5:
                                    siot.publish_save(topic="siot/Gestures", data=confirmed_gesture.lower())
                                    last_dynamic_time = current_time
                                    print(f"Sent dynamic gesture: {confirmed_gesture.lower()}")
                                gesture_detected = True
            cv2.imshow('UNIHIKER Gesture', img)
            if cv2.waitKey(1) & 0xFF == 27:  # ESC to exit
                break

cv2.destroyAllWindows()
Since we need to obtain the camera feed from the UNIHIKER K10, replace the address in the line url = 'http://192.168.1.12/stream' with the IP address of your own UNIHIKER K10.

Copy the code into the "PC.py" file created in Mind+. (Note: The complete "visiondetect.py" file is provided as an attachment for reference.)
Click Run to start the program.

When your palm is fully visible in the window, the program will detect your gesture, and the recognition result will be displayed in the video frame.

You can also see real-time gesture status updates in the terminal window, as shown in the figure below.

3.3 Task 3: UNIHIKER K10 Receives Results and Executes Control
Next, we implement the final function: the UNIHIKER K10 plays different music clips according to the gesture information it receives from SIoT.

(1) Hardware Setup
Get an SD card and use an SD card reader to copy the 8 music clips onto it. If you want to replace the music with your own WAV files, note that the WAV audio must be uncompressed and dual-channel. Conversion tools can be found here (https://www.unihiker.com/wiki/K10/faq/#audio).
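If you are unsure whether a converted file meets the requirement, a quick check with Python's standard wave module can confirm it is uncompressed PCM with two channels. This is a minimal sketch that assumes the file is named 1.wav.
# Optional check: confirm a WAV file is uncompressed (PCM) and dual-channel
import wave

with wave.open("1.wav", "rb") as f:  # the wave module only opens uncompressed PCM files
    print("channels:", f.getnchannels())        # should be 2
    print("sample rate:", f.getframerate())
    print("sample width (bytes):", f.getsampwidth())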

Note: The full wav audio file is provided as an attachment for reference.
As shown in the figure below, insert the SD card with the copied wav audio file into the UNIHIKER K10 card slot.

(2) Software Preparation
Make sure that Mind+ is opened and the UNIHIKER board has been successfully loaded. Once confirmed, you can proceed to write the complete project program.
(3) Write the Program
STEP One: Subscribe to SIoT Topics on UNIHIKER K10
Find the MQTT initialization parameter block and enter "siot/Gestures" in Topic_0.

STEP Two: Control song playback based on gestures
Use the "When MQTT message received from topic_0" block to enable the smart terminal to process commands from siot/Gestures.
Operation tips: Note how the mode-switch detection is triggered: make a quick motion of extending your fingers and then retracting them, which produces a rapid change in fingertip position and triggers the switch. There is a 2-second lock after switching to prevent accidental triggering. Dynamic gestures are detected by tracking the movement of the index fingertip, calculating its displacement over the last 15 frames, and judging the gesture from the direction and magnitude of that displacement.

If the MQTT message is "one", 1.wav is played; "two" plays 2.wav; "three" plays 3.wav; "four" plays 4.wav. If the message is "left", the previous wav is played; "right" plays the next wav; "up" plays a random wav from 1-8; "down" stops playback. At the same time, the UNIHIKER K10 screen shows in real time which song is playing.
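The block program implements this dispatch; for reference, the same decision logic written as plain Python (an illustrative sketch only, with print statements standing in for the K10's playback blocks) looks like this:
# Illustrative only: the message-to-action logic of the K10 block program,
# with print() standing in for the actual play/stop blocks.
import random

tracks = ["1.wav", "2.wav", "3.wav", "4.wav", "5.wav", "6.wav", "7.wav", "8.wav"]
current = 0  # index of the track currently selected

def handle_message(msg):
    global current
    if msg in ("one", "two", "three", "four"):
        current = ("one", "two", "three", "four").index(msg)
        print("play", tracks[current])
    elif msg == "left":
        current = (current - 1) % len(tracks)
        print("play previous:", tracks[current])
    elif msg == "right":
        current = (current + 1) % len(tracks)
        print("play next:", tracks[current])
    elif msg == "up":
        current = random.randrange(len(tracks))
        print("play random:", tracks[current])
    elif msg == "down":
        print("stop playback")

handle_message("three")  # example: three fingers -> 3.wav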
Below is the reference for the complete program.

4. Upload the Program and Observe the Effect
Click the Upload button. When the upload progress reaches 100%, the program has been successfully uploaded.


STEP:
(1) First, run the program on the UNIHIKER K10.
(2) Click the Run button to start PC.py, and then point your palm at the camera:
- The UNIHIKER K10 streams video, and the recognized gesture is displayed in the video frame.
- Make different gestures, and the K10 plays different songs.
5. Knowledge Hub
5.1 What is Gesture Recognition and What are Its Applications?
Gesture recognition is a technology in the field of computer vision and human-computer interaction that identifies and interprets specific hand shapes, movements, and postures. The core is to extract meaningful information from hand visual data or sensor data to understand human intentions and commands. By analyzing the dynamic changes and static patterns of hands, it enables natural and intuitive interaction between humans and machines.

These technologies serve as the foundation for:
- Smart device control: Controlling devices like smart home systems and TVs through gestures (such as "swiping" to change channels or "rotating" to adjust volume).
- Sign language interpretation: Translating sign language gestures into text or speech in real-time, assisting communication for hearing-impaired individuals.
- Virtual and augmented reality: Using hand gestures as input methods in virtual environments (such as "grabbing" virtual objects or "gesturing" to manipulate interfaces).
- Security and identity verification: Using unique hand gesture patterns for biometric authentication and access control.
5.2 What are the Methods for Gesture Recognition? What are their Applications?
Gesture recognition is the process of detecting, analyzing, and interpreting hand movements and patterns through technical means. Based on technical principles and implementation methods, it can be divided into: (1) Vision-based recognition (non-contact), (2) Sensor-based recognition (contact/wearable), and (3) Depth camera-based recognition (3D perception).
In daily life, gesture recognition technologies have a wide range of applications, and they are the basis for the following practical applications:
- Automotive human-machine interaction: Controlling in-car systems (such as navigation, audio, and air conditioning) through gestures, reducing the need for physical buttons and improving driving safety.
- Sign Language Translation: Converts manual signs into text or speech, assisting deaf communities in communication.
- Medical assistance systems: Surgeons control medical imaging equipment through sterile gestures during operations, maintaining a sterile environment while manipulating images.
- Gaming and entertainment: Using body and hand movements as control inputs for games and interactive installations (such as Xbox Kinect or VR games).
- Industrial control: Operators control machinery and equipment through gestures in environments where physical contact is inconvenient (such as clean rooms or hazardous environments).
6. Appendix of Materials
Google drive: https://drive.google.com/file/d/1bUFmDwt_XYUdDcdfLauxqvYrS3OqEqTQ/view?usp=sharing









