Project Background
Making sound visible to the "deaf"! There are many people in the world who cannot speak or cannot hear. Deaf people, also known as the hearing-impaired, are people whose hearing is lost or reduced, whether congenital or acquired. According to recent Chinese census data, there are about 27 million hearing-impaired people in China, covering mild hearing loss, hardness of hearing, and age-related deafness. Everyday life is full of difficulties for them: if someone knocks at the door, if a faucet is left running, or if a child cries in the bedroom, a hearing-impaired person cannot hear it.
Project Design
This project uses AI (artificial intelligence) to train a model to recognize various sounds. UNIHIKER captures the sound and, through IoT (Internet of Things), sends the corresponding text message to an Arduino board, which shows it on a display and signals it with lights. In addition, a watch built around the micro:bit alerts the wearer with text, light, and vibration, so that a hearing-impaired person can both see and feel the sound. The project uses spectrograms so that the trained model can recognize more kinds of sounds with higher accuracy.
Audio Signal
Sound is represented in the form of an audio signal, which has parameters such as frequency, bandwidth, and decibel level, and can generally be expressed as a function of amplitude over time. Audio is stored in a variety of formats so that computers can read and analyze it, for example MP3, WMA (Windows Media Audio), and WAV (Waveform Audio File).
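As a quick illustration, a WAV file can be inspected with Python's built-in wave module. This is a minimal sketch rather than part of the project code; it assumes a file named output.wav, which is what the recording script in STEP1 produces.

import wave

# Print the basic parameters of a WAV file
with wave.open("output.wav", "rb") as wf:
    channels = wf.getnchannels()      # number of audio channels
    sample_width = wf.getsampwidth()  # bytes per sample (2 = 16-bit)
    rate = wf.getframerate()          # sampling rate in Hz
    frames = wf.getnframes()          # total number of frames
    print("channels:", channels)
    print("sample width (bytes):", sample_width)
    print("sampling rate (Hz):", rate)
    print("duration (s):", frames / rate)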
Spectrogram
The spectrogram is a type of speech spectrum plot invented during World War II; it is generally obtained by processing the received time-domain signal.
A spectrogram is a spectral-analysis view; when it is computed for speech data it is called a speech spectrogram. Its horizontal axis is time and its vertical axis is frequency, and the value at each coordinate point is the energy of the speech data. Since a two-dimensional plane is used to express three dimensions of information, the magnitude of the energy is indicated by color: the darker the color, the stronger the speech energy at that point.
Time-domain analysis and frequency-domain analysis are two important methods of speech analysis, but both have limitations: time-domain analysis gives no intuitive picture of the frequency characteristics of the signal, while frequency-domain analysis discards how the signal changes over time. The speech spectrogram combines the advantages of the time and frequency domains and clearly shows how the spectrum changes over time. Its horizontal axis is time and its vertical axis is frequency; the strength of any given frequency component at a given moment is indicated by the shade of color, with darker shades corresponding to larger spectral values and lighter shades to smaller ones. The patterns formed by the varying shades on the spectrogram are called voiceprints; they differ from speaker to speaker and can be used for voiceprint recognition.
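To make the time-frequency structure concrete, the short sketch below (an illustration only, using a synthetic 440 Hz tone rather than the project's recordings) computes a short-time Fourier transform with librosa and prints the resulting frequency-by-time matrix, the same kind of matrix that is later rendered as spectrogram images.

import numpy as np
import librosa

# Synthetic 1-second, 440 Hz tone sampled at 22050 Hz (illustration only)
sr = 22050
t = np.linspace(0, 1, sr, endpoint=False)
y = 0.5 * np.sin(2 * np.pi * 440 * t)

# Short-time Fourier transform: rows = frequency bins, columns = time frames
stft = librosa.stft(y, n_fft=1024, hop_length=512)
magnitude_db = librosa.amplitude_to_db(np.abs(stft), ref=np.max)

print("frequency bins x time frames:", magnitude_db.shape)
print("loudest frequency bin in the first frame:", magnitude_db[:, 0].argmax())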
STEP1 Record Audio
Use the PyAudio library to record audio and generate WAV files. PyAudio provides Python bindings for PortAudio, a cross-platform audio I/O library, so with PyAudio you can easily play and record audio in Python programs on a variety of platforms.
To test the program, use pyaudio to record a 5-second sound file "output.wav".
import pyaudio
import wave

CHUNK = 1024                 # frames per buffer
FORMAT = pyaudio.paInt16     # 16-bit samples
CHANNELS = 2
RATE = 44100                 # sampling rate in Hz
RECORD_SECONDS = 5
WAVE_OUTPUT_FILENAME = "output.wav"

p = pyaudio.PyAudio()

stream = p.open(format=FORMAT,
                channels=CHANNELS,
                rate=RATE,
                input=True,
                frames_per_buffer=CHUNK)

print("* recording")
frames = []
for i in range(0, int(RATE / CHUNK * RECORD_SECONDS)):
    data = stream.read(CHUNK)
    frames.append(data)
print("* done recording")

stream.stop_stream()
stream.close()
p.terminate()

# Write the recorded frames to a WAV file
wf = wave.open(WAVE_OUTPUT_FILENAME, 'wb')
wf.setnchannels(CHANNELS)
wf.setsampwidth(p.get_sample_size(FORMAT))
wf.setframerate(RATE)
wf.writeframes(b''.join(frames))
wf.close()
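To check that the recording worked, the file can be played back with PyAudio as well. The following is a small verification sketch (not part of the original write-up) that assumes output.wav was produced by the script above.

import pyaudio
import wave

CHUNK = 1024

# Open the recorded file and stream it to the default output device
wf = wave.open("output.wav", 'rb')
p = pyaudio.PyAudio()
stream = p.open(format=p.get_format_from_width(wf.getsampwidth()),
                channels=wf.getnchannels(),
                rate=wf.getframerate(),
                output=True)

data = wf.readframes(CHUNK)
while data:
    stream.write(data)
    data = wf.readframes(CHUNK)

stream.stop_stream()
stream.close()
p.terminate()
wf.close()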
STEP2 Batch Generation of Spectrograms
Use the Librosa library to batch generate spectrograms of all kinds of sounds, such as knocking at the door, running water from the faucet, babies crying, alarms, and more.
Librosa is a Python toolkit for audio and music analysis and processing. It provides common time-frequency processing, feature extraction, plotting of audio graphics, and many other features, and it is very powerful.
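As a quick check that Librosa can read the recordings from STEP1, a file can be loaded (and resampled) in a single call. A minimal sketch, assuming output.wav is the file recorded above:

import librosa

# librosa.load resamples to sr (22050 Hz by default) and returns mono float samples
y, sr = librosa.load("output.wav", sr=22050, mono=True)
print("samples:", y.shape[0])
print("sampling rate:", sr)
print("duration (s):", librosa.get_duration(y=y, sr=sr))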
Librosa Spectrogram
librosa.display.specshow(data, x_axis=None, y_axis=None, sr=22050, hop_length=512)
Parameters:
data: matrix to be displayed
sr: sampling rate
hop_length: frame shift
x_axis, y_axis: x-axis and y-axis range
Frequency type:
'linear', 'fft', 'hz': frequency range determined by FFT window and sampling rate
'log': the spectrum is displayed on a logarithmic scale
'mel': the frequency is determined by the mel scale
Time type:
time: markers are displayed in milliseconds, seconds, minutes or hours. Values are plotted in seconds.
s: marker is displayed in seconds.
ms: marker is displayed in milliseconds.
All frequency types are plotted in Hz.
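A minimal usage sketch of specshow with the parameters listed above (an illustration only; it reuses output.wav from STEP1 and plots a mel spectrogram rather than the log-frequency STFT used later in the project):

import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

y, sr = librosa.load("output.wav", sr=22050)

# Mel spectrogram in dB, displayed with labelled time and mel-frequency axes
hop_length = 512
S = librosa.feature.melspectrogram(y=y, sr=sr, hop_length=hop_length)
S_db = librosa.power_to_db(S, ref=np.max)

fig, ax = plt.subplots()
img = librosa.display.specshow(S_db, sr=sr, hop_length=hop_length,
                               x_axis='time', y_axis='mel', ax=ax)
fig.colorbar(img, ax=ax, format='%+2.0f dB')
fig.savefig('mel_spectrogram.png')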
import pyaudio
import numpy as np
import matplotlib.pyplot as plt
from tqdm import tqdm
from matplotlib.backends.backend_agg import FigureCanvasAgg as FigureCanvas
import librosa.display
import pandas as pd
import librosa

def record_audio(record_second):
    global wave
    CHUNK = 1024
    FORMAT = pyaudio.paInt16
    CHANNELS = 1
    RATE = 44100
    p = pyaudio.PyAudio()
    stream = p.open(format=FORMAT,
                    channels=CHANNELS,
                    rate=RATE,
                    input=True,
                    frames_per_buffer=CHUNK)
    audio_data = []
    print("* recording")
    for i in tqdm(range(0, int(RATE / CHUNK * record_second))):
        data = stream.read(CHUNK)
        audio_data.append(data)
    # Convert the raw bytes to int16 samples and scale them to floats
    audio_samples = np.frombuffer(b''.join(audio_data), dtype=np.int16)
    # Keep roughly one second of audio, skipping the first ~0.12 s
    wave = audio_samples[5500:5500 + int(1 * RATE)] / 2**16
    print("* done recording")
    stream.stop_stream()
    stream.close()
    p.terminate()

for i in range(50):
    record_audio(record_second=2)
    window_size = 1024
    window = np.hanning(window_size)
    stft = librosa.core.spectrum.stft(wave, n_fft=window_size, hop_length=512, window=window)
    out = 2 * np.abs(stft) / np.sum(window)
    # For plotting headlessly
    fig = plt.figure(figsize=(2.24, 2.24))
    ax = fig.add_subplot(111)
    ax.axes.xaxis.set_visible(False)
    ax.axes.yaxis.set_visible(False)
    p = librosa.display.specshow(librosa.amplitude_to_db(out, ref=np.max), ax=ax, y_axis='log', x_axis='time')
    fig.savefig('save' + str(i) + '.jpg')
    plt.close(fig)
Note: You need to create a folder for each class in the directory where the program is located, such as "door", and then copy the generated images to your computer for model training.
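For example, the spectrograms generated for one class can be gathered into its folder with a few lines of Python (a small helper sketch; the folder name door and the file pattern save*.jpg follow the script above):

import glob
import os
import shutil

# Move the generated spectrograms into a class folder, e.g. "door"
class_name = "door"
os.makedirs(class_name, exist_ok=True)
for path in glob.glob("save*.jpg"):
    shutil.move(path, os.path.join(class_name, os.path.basename(path)))
print("moved the images into", class_name)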
STEP3 Hardware Connection
1. UNIHIKER
Button---Pin 21 (for dismissing the reminder)
LED---Pin 22 (for the light reminder)
2. "mPython" Watch
Connect the vibration motor to the M2 connector and attach it to the strap. When the mPython board receives a message, it activates the vibration motor and starts to vibrate, reminding the hearing-impaired person to check the message on the screen.
STEP4 Model Training
Upload the images to an AI training platform for model training. The Aimaker.space AI training platform is used here. The classes are "background", "door", and "water".
STEP5 Inference Test on the Computer
Download the model and put it in the corresponding directory of the program.
import pyaudio
#from unihiker import Audio
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image, ImageOps  # Install pillow instead of PIL
from tqdm import tqdm
from matplotlib.backends.backend_agg import FigureCanvasAgg as FigureCanvas
import librosa.display
import pandas as pd
import librosa
from keras.models import load_model

np.set_printoptions(suppress=True)

# Load the model exported from the training platform and define the class labels
model = load_model('keras_model.h5', compile=False)
class_names = ['background','jianpan','desktop']
data = np.ndarray(shape=(1, 224, 224, 3), dtype=np.float32)

def record_audio(record_second):
    global wave
    CHUNK = 1024
    FORMAT = pyaudio.paInt16
    CHANNELS = 1
    RATE = 44100
    p = pyaudio.PyAudio()
    stream = p.open(format=FORMAT,
                    channels=CHANNELS,
                    rate=RATE,
                    input=True,
                    frames_per_buffer=CHUNK)
    audio_data = []
    print("* recording")
    for i in tqdm(range(0, int(RATE / CHUNK * record_second))):
        data = stream.read(CHUNK)
        audio_data.append(data)
    # Convert the raw bytes to int16 samples and scale them to floats
    audio_samples = np.frombuffer(b''.join(audio_data), dtype=np.int16)
    # Keep roughly one second of audio, skipping the first ~0.12 s
    wave = audio_samples[5500:5500 + int(1 * RATE)] / 2**16
    print("* done recording")
    stream.stop_stream()
    stream.close()
    p.terminate()

while True:
    record_audio(record_second=2)
    window_size = 1024
    window = np.hanning(window_size)
    stft = librosa.core.spectrum.stft(wave, n_fft=window_size, hop_length=512, window=window)
    out = 2 * np.abs(stft) / np.sum(window)
    fig = plt.figure(figsize=(2.24, 2.24))
    ax = fig.add_subplot(111)
    ax.axes.xaxis.set_visible(False)
    ax.axes.yaxis.set_visible(False)
    p = librosa.display.specshow(librosa.amplitude_to_db(out, ref=np.max), ax=ax, y_axis='log', x_axis='time')
    fig.savefig('wave.jpg')
    plt.close(fig)
    # Scale pixel values to the [-1, 1] range expected by the model, then predict
    image = Image.open('wave.jpg').convert('RGB')
    image_array = np.asarray(image)
    normalized_image_array = (image_array.astype(np.float32) / 127.0) - 1
    data[0] = normalized_image_array
    prediction = model.predict(data)
    index = np.argmax(prediction)
    class_name = class_names[index]
    confidence_score = prediction[0][index]
    print('Class:', class_name)
    print('Confidence score:', confidence_score)
STEP6 Inference on UNIHIKER
Download the model and put it into the corresponding directory of the UNIHIKER program.
Use the on-board microphone to collect sound, turn it into a picture with matplotlib, load the trained model "keras_model.h5" with Keras, and predict the sound type. Then light up the LED and send the corresponding command through IoT.
from unihiker import GUI

u_gui = GUI()
display = u_gui.draw_text(text="smart prompter", x=0, y=100, font_size=35, color="#0000FF")
content = u_gui.draw_text(text="loading libraries", x=20, y=180, font_size=35, color="#0000FF")

from pinpong.extension.unihiker import *
from pinpong.board import Board, Pin, NeoPixel
import pyaudio
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image, ImageOps
from tqdm import tqdm
from matplotlib.backends.backend_agg import FigureCanvasAgg as FigureCanvas
import librosa.display
import pandas as pd
import librosa
from keras.models import load_model
import siot

# Initialize the board, the button on P21 and the LED on P22
Board().begin()
p_p22_out = Pin(Pin.P22, Pin.OUT)
p_p21_in = Pin(Pin.P21, Pin.IN)
np1 = NeoPixel(p_p22_out, 1)
np1[0] = (0, 0, 0)

np.set_printoptions(suppress=True)
model = load_model('keras_model.h5', compile=False)
class_names = ['background', 'door', 'water']
data = np.ndarray(shape=(1, 224, 224, 3), dtype=np.float32)

content.config(text="Connecting the IoT")
siot.init(client_id="", server="iot.dfrobot.com.cn", port=1883, user="X8jykxFnR", password="u8jskbFngz")
siot.connect()
siot.loop()

def record_audio(record_second):
    global wave
    CHUNK = 1024
    FORMAT = pyaudio.paInt16
    CHANNELS = 1
    RATE = 44100
    p = pyaudio.PyAudio()
    stream = p.open(format=FORMAT,
                    channels=CHANNELS,
                    rate=RATE,
                    input=True,
                    frames_per_buffer=CHUNK)
    audio_data = []
    print("* recording")
    for i in tqdm(range(0, int(RATE / CHUNK * record_second))):
        data = stream.read(CHUNK)
        audio_data.append(data)
    # Convert the raw bytes to int16 samples and scale them to floats
    audio_samples = np.frombuffer(b''.join(audio_data), dtype=np.int16)
    # Keep roughly one second of audio, skipping the first ~0.12 s
    wave = audio_samples[5500:5500 + int(1 * RATE)] / 2**16
    print("* done recording")
    stream.stop_stream()
    stream.close()
    p.terminate()

while True:
    if p_p21_in.read_digital() == True:
        # Button pressed: send the "stop" command and turn off the LED
        siot.publish(topic="1DXAmWJ4g", data="S")
        np1[0] = (0, 0, 0)
    else:
        content.config(text="recording……")
        record_audio(record_second=2)
        content.config(text="recognizing……")
        window_size = 1024
        window = np.hanning(window_size)
        stft = librosa.core.spectrum.stft(wave, n_fft=window_size, hop_length=512, window=window)
        out = 2 * np.abs(stft) / np.sum(window)
        fig = plt.figure(figsize=(2.24, 2.24))
        ax = fig.add_subplot(111)
        ax.axes.xaxis.set_visible(False)
        ax.axes.yaxis.set_visible(False)
        p = librosa.display.specshow(librosa.amplitude_to_db(out, ref=np.max), ax=ax, y_axis='log', x_axis='time')
        fig.savefig('wave.jpg')
        plt.close(fig)
        # Scale pixel values to the [-1, 1] range expected by the model, then predict
        image = Image.open('wave.jpg').convert('RGB')
        image_array = np.asarray(image)
        normalized_image_array = (image_array.astype(np.float32) / 127.0) - 1
        data[0] = normalized_image_array
        prediction = model.predict(data)
        index = np.argmax(prediction)
        class_name = class_names[index]
        confidence_score = prediction[0][index]
        # Publish a one-letter command over IoT and update the LED and screen
        if class_name == 'background':
            siot.publish(topic="1DXAmWJ4g", data="B")
            np1[0] = (0, 0, 0)
            content.config(text='background music')
        elif class_name == 'door':
            siot.publish(topic="1DXAmWJ4g", data="D")
            np1[0] = (0, 255, 0)
            content.config(text='knock at the door')
        elif class_name == 'water':
            siot.publish(topic="1DXAmWJ4g", data="W")
            np1[0] = (0, 0, 255)
            content.config(text='water running')
        print('Class:', class_name)
        print('Confidence score:', confidence_score)
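The Arduino display board and the watch subscribe to the same SIoT topic and react to the one-letter commands ("B", "D", "W", "S"). Their firmware is not listed here, but the message flow can be checked from a computer with an ordinary MQTT client. Below is a test-listener sketch using the paho-mqtt package (1.x callback API); it is not part of the original project code, and the topic and credentials simply mirror the UNIHIKER script above.

import paho.mqtt.client as mqtt

TOPIC = "1DXAmWJ4g"
MESSAGES = {"B": "background sound", "D": "knock at the door",
            "W": "water running", "S": "reminder cleared"}

def on_connect(client, userdata, flags, rc):
    print("connected, subscribing to", TOPIC)
    client.subscribe(TOPIC)

def on_message(client, userdata, msg):
    command = msg.payload.decode()
    print("received:", command, "->", MESSAGES.get(command, "unknown command"))

client = mqtt.Client()
client.username_pw_set("X8jykxFnR", "u8jskbFngz")
client.on_connect = on_connect
client.on_message = on_message
client.connect("iot.dfrobot.com.cn", 1883)
client.loop_forever()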
Demo
1. Knocking Demo
2. Running Water Demo
Conclusion
This project brings visibility to sound for the hearing-impaired, improving their lives by using AI and IoT technologies. The model recognizes various sounds and transforms them into visual and tactile alerts. Spectrograms enhance sound recognition accuracy, and AI training platforms, PyAudio, and Librosa capture and process audio data efficiently.
Integrating this technology with hardware such as UNIHIKER, Arduino, and a micro:bit watch enables real-time notifications through visual, auditory, and tactile alerts. This innovative approach bridges the gap between the hearing-impaired and the auditory world, offering them greater safety and independence.
This article was first published on https://mc.dfrobot.com.cn/thread-315298-1-1.html on Feb 22, 2023
Author:云天
Feel free to join our UNIHIKER Discord community! You can engage in more discussions and share your insights!