Building an AI Assistant on Unihiker Single Board Computer with ChatGPT API and Azure Speech API

Afra, Nov 15, 2023


Recently, I built an AI desktop assistant using OpenAI's GPT (Generative Pre-trained Transformer) technology and the Azure Speech API. OpenAI's GPT, a cutting-edge natural language processing model, can understand and generate human-like text, which streamlines communication with computers. Coupled with the Azure Speech API's voice recognition capabilities, it lets the assistant hold a spoken conversation.

This project was inspired by a landmark event in the tech industry: the OpenAI developer conference held on November 6th. During the conference, OpenAI demonstrated cross-modal interaction and introduced the GPT Store (GPTs), a collection of applications based on GPT technology, each offering expert-level services in its field. These announcements opened up a new approach for my AI desktop assistant project, enabling it to perform more complex tasks through natural language processing and speech interaction.

To bring this project to life, I chose the Unihiker as the hardware platform. This single-board computer has a built-in touchscreen, Wi-Fi, Bluetooth, and sensors for light, motion, and gyroscopic measurements. Its co-processor is well suited to interacting with external sensors and actuators, and the provided Python library greatly simplifies device control. Below, I walk through the development of this AI desktop assistant, which integrates Microsoft Azure and OpenAI GPT, and then share an alternative method for building a voice desktop agent using only the OpenAI API.

Project code files can be downloaded from GitHub: https://github.com/zzzqww/DFRobot/tree/main/Unihiker%2BGPT

Part One: Preparing the Hardware

1. Hardware

For the hardware integration of this miniaturized desktop product, I chose a 10 cm power extension cable so that the charging port can be positioned more sensibly in the enclosure. I also paired an amplifier with a two-channel speaker set to ensure high-quality sound playback despite the small size.

The connection interface of the amplifier needs to be directly soldered onto the solder points on the Unihiker, as shown in the following figure.

HARDWARE LIST

- 1 x Unihiker
- 1 x 3W Mini Audio Stereo Amplifier
- 2 x Speaker
- 1 x Type-C Extension Cable

2. Model printing:

3D printing file link: https://www.thingiverse.com/thing:6307018

In terms of design, my model integrates the power port with the shape of the brand's IP, where the 'circle on top' serves as the power button.

During installation, a specific assembly sequence is required to make the most of the internal space. After fixing the speakers in place, lift them from the top so that the Unihiker can be inserted from the bottom. Once the Unihiker is secured at the fixed position for the screen, push the sound card in to complete the installation.

Part Two: Software Programming

I. The first method: GPT & Azure

To implement this feature, we need to combine multiple libraries and APIs for speech recognition, text-to-speech, and interaction with the GPT model.

First, register for Azure and OpenAI accounts so that the Azure speech key and the OpenAI API key are available.
For how to register with Azure and get the API key, see this tutorial: https://community.dfrobot.com/makelog-313501.html
For how to find your OpenAI API key, see: https://help.openai.com/en/articles/4936850-where-do-i-find-my-api-key

After registering on both platforms, note down Azure's 'speech_key' and 'service_region' and OpenAI's 'openai.api_key'. These values will be used later.

Once you have the keys, you can start writing the Python program.

These functions rely on network services, so the Unihiker must be connected to the network first. For how to connect to the Unihiker for programming, see: https://www.unihiker.com/wiki/connection

1. Import libraries and modules:

- unihiker.Audio: Provides audio-related functions.
- unihiker.GUI: Creates the graphical user interface (GUI).
- openai: Used to interact with OpenAI models.
- time: Provides time-related functions.
- os: Provides operating system related functions.
- azure.cognitiveservices.speech.SpeechConfig: Holds the configuration (key, region, language, voice) for the Azure Speech service.

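Before you upload the code, make sure the required libraries are installed on the Unihiker. In the terminal, run 'pip install openai' and then 'pip install azure-cognitiveservices-speech', and wait for each installation to complete, as shown in the following image. Note that the first method calls openai.ChatCompletion, which exists only in versions of the openai package before 1.0 (for example 'pip install openai==0.28'), while the second method uses the interface introduced in 1.0.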

CODE

from unihiker import Audio
from unihiker import GUI
import openai
import time
import os 
from azure.cognitiveservices.speech import SpeechConfig
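
Because the two methods in this article rely on different generations of the openai package, it can help to check which one is installed ('pip show openai' prints the version). The following is a minimal sketch, not part of the original project; the helper name `uses_legacy_interface` is my own.

```python
def uses_legacy_interface(version):
    """Return True for openai releases before 1.0 (the openai.ChatCompletion style)."""
    major = int(version.split(".")[0])
    return major < 1


# Example version strings; replace them with the one reported by 'pip show openai'
for candidate in ("0.28.1", "1.3.0"):
    if uses_legacy_interface(candidate):
        print(candidate, "-> openai.ChatCompletion.create (first method)")
    else:
        print(candidate, "-> openai.chat.completions.create (second method)")
```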

2. Set the key and region:

Set the key and location/region that will later be used to create the speech configuration.
- speech_key: Specifies the key for the Azure Speech service.
- service_region: Specifies the region/location of the Azure Speech service.

CODE

speech_key = "xxxxx" # Fill key 
service_region = "xxx" # Enter Location/Region

3. Set OpenAI API key:

- openai.api_key: Set the API key for interacting with the OpenAI GPT model.

CODE

openai.api_key = "xxxxxxxxxxx" #input OpenAI api key
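
Hard-coded keys are easy to leak when a script is shared. As an alternative, the keys could be read from environment variables. This is a minimal sketch under my own assumptions: the helper `load_setting` and the variable names `SPEECH_KEY` and `SPEECH_REGION` are illustrative and are not required by either service.

```python
import os


def load_setting(name, env=None):
    """Return a required setting from the environment, or raise a clear error."""
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError("Missing environment variable: " + name)
    return value


# Example with an in-memory mapping standing in for the real environment
sample_env = {"SPEECH_KEY": "xxxxx", "SPEECH_REGION": "xxx"}
speech_key = load_setting("SPEECH_KEY", sample_env)
service_region = load_setting("SPEECH_REGION", sample_env)
print("region:", service_region)
```

On the device, calling `load_setting("SPEECH_KEY")` without the second argument reads from the real environment.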

4. Import Azure Speech SDK:

- The code attempts to import the azure.cognitiveservices.speech module and prints an error message if the import fails.

CODE

try:
    import azure.cognitiveservices.speech as speechsdk
except ImportError:

    print("""
    Importing the Speech SDK for Python failed.
    Refer to
    https://docs.microsoft.com/azure/cognitive-services/speech-service/quickstart-python for
    installation instructions.
    """)
    import sys
    sys.exit(1)

5. Function: Recognize speech from default microphone

· Uses the default microphone to recognize speech.
· recognize_once_async: Performs recognition in a non-blocking (asynchronous) mode and recognizes a single utterance. The end of the utterance is determined by listening for silence at the end, or when a maximum of 15 seconds of audio has been processed.

CODE

# speech to text
def recognize_from_microphone():
    # Uses the global speech_config built from speech_key and service_region
    audio_config = speechsdk.audio.AudioConfig(use_default_microphone=True)
    speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)
    speech_recognition_result = speech_recognizer.recognize_once_async().get()

    # Exception reminder
    if speech_recognition_result.reason == speechsdk.ResultReason.RecognizedSpeech:
        return speech_recognition_result.text
    elif speech_recognition_result.reason == speechsdk.ResultReason.NoMatch:
        print("No speech could be recognized: {}".format(speech_recognition_result.no_match_details))
    elif speech_recognition_result.reason == speechsdk.ResultReason.Canceled:
        cancellation_details = speech_recognition_result.cancellation_details
        print("Speech Recognition canceled: {}".format(cancellation_details.reason))
        if cancellation_details.reason == speechsdk.CancellationReason.Error:
            print("Error details: {}".format(cancellation_details.error_details))
            print("Did you set the speech resource key and region values?")
    # Nothing was recognized: return an empty string so the caller always gets text
    return ""

6. tts(text): Use Azure Speech SDK to convert text to speech.

· Play speech using default speakers

CODE

# text to speech
def tts(text):
    speech_config.set_property(property_id=speechsdk.PropertyId.SpeechServiceResponse_RequestSentenceBoundary, value='true')
    audio_config = speechsdk.audio.AudioOutputConfig(use_default_speaker=True)
    speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=audio_config)
    speech_synthesizer.synthesis_word_boundary.connect(speech_synthesizer_word_boundary_cb)
    speech_synthesis_result = speech_synthesizer.speak_text_async(text).get()

7. speech_synthesizer_word_boundary_cb(evt):

A callback function that handles word boundary events during speech synthesis. It makes the words appear on screen one after another as they are spoken.

CODE

# display text one by one
def speech_synthesizer_word_boundary_cb(evt: speechsdk.SessionEventArgs):
    global text_display

    if not (evt.boundary_type == speechsdk.SpeechSynthesisBoundaryType.Sentence):
        text_result = evt.text
        text_display = text_display + "   " + text_result
        trans.config(text = text_display)
    
    if evt.text == ".":
        text_display = ""
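
The accumulate-and-reset behaviour of this callback can be tried out without the Azure SDK. Below is a minimal sketch of the same idea in plain Python; the class name `CaptionBuffer` and its `feed` method are hypothetical, and a single space is used as the separator instead of the three spaces in the code above.

```python
class CaptionBuffer:
    """Accumulates spoken words and starts over after a sentence-ending period."""

    def __init__(self, separator=" "):
        self.separator = separator
        self.text = ""

    def feed(self, word, is_sentence_event=False):
        """Add one boundary event and return the caption to show on screen."""
        if is_sentence_event:
            # Sentence events repeat the whole sentence, so they are skipped
            return self.text
        self.text = (self.text + self.separator + word).strip()
        shown = self.text
        if word == ".":
            # Clear after the period so the next sentence starts on an empty caption
            self.text = ""
        return shown


buffer = CaptionBuffer()
for token in ["Hello", "world", ".", "Next"]:
    print(buffer.feed(token))
```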

8. askOpenAI(question): Sends the conversation to the OpenAI GPT model and returns the generated answer. (You can choose a different GPT model version.)

CODE

# openai
def askOpenAI(question):
    completion = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages = question
    )
    return completion['choices'][0]['message']['content']
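
The `messages` argument passed to askOpenAI is the whole conversation so far, as a list of role/content dictionaries. The sketch below shows one way to manage that list; the helper names and the `max_messages` limit are my own assumptions and do not appear in the project code.

```python
SYSTEM_PROMPT = {"role": "system", "content": "You are a helpful assistant."}


def new_conversation():
    """Start a history that contains only the system prompt."""
    return [dict(SYSTEM_PROMPT)]


def add_turn(history, role, content):
    """Append one message; role must be 'user' or 'assistant'."""
    if role not in ("user", "assistant"):
        raise ValueError("unexpected role: " + role)
    history.append({"role": role, "content": content})
    return history


def trim_history(history, max_messages=20):
    """Keep the system prompt plus the most recent messages (hypothetical limit)."""
    if max_messages < 2:
        raise ValueError("max_messages must be at least 2")
    if len(history) <= max_messages:
        return history
    return [history[0]] + history[-(max_messages - 1):]


history = new_conversation()
add_turn(history, "user", "What is a Unihiker?")
add_turn(history, "assistant", "A single-board computer with a touchscreen.")
print(len(history), "messages")
```

Trimming is optional; it is included only because the list otherwise grows with every turn until the reset button is tapped.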

9. Event callback function:
- button_click1(): Set the flag variable to 1.
- button_click2(): Set the flag variable to 3.

CODE

def button_click1():
    global flag
    flag = 1

def button_click2():
    global flag
    flag = 3

10. Voice service configuration:
· speech_config: Configures the Azure Speech SDK with the provided key, region, language, and voice.
· Text-to-speech in the Speech service supports more than 400 voices across more than 140 languages and variants. You can find the full list, or try the voices, on the language support page (https://learn.microsoft.com/en-us/azure/ai-services/speech-service/language-support?tabs=tts).

CODE

# speech service configuration
speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=service_region)
speech_config.speech_synthesis_language = 'en-US'
speech_config.speech_synthesis_voice_name = "en-US-JennyNeural"

(The speaking voice is determined in the following order of priority:
· If neither SpeechSynthesisVoiceName nor SpeechSynthesisLanguage is set, the default voice for en-US speaks.
· If only SpeechSynthesisLanguage is set, the default voice for the specified locale speaks.
· If both SpeechSynthesisVoiceName and SpeechSynthesisLanguage are set, the SpeechSynthesisLanguage setting is ignored and the voice specified by SpeechSynthesisVoiceName speaks.
· If the voice element is set with Speech Synthesis Markup Language (SSML), the SpeechSynthesisVoiceName and SpeechSynthesisLanguage settings are ignored.)
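
The priority rules above can be written down as a small decision function. This is only an illustration of the rules as listed; `resolve_voice` is a hypothetical helper and not part of the Speech SDK.

```python
DEFAULT_LOCALE = "en-US"


def resolve_voice(voice_name=None, language=None, uses_ssml_voice=False):
    """Describe which setting decides the speaking voice, following the listed rules."""
    if uses_ssml_voice:
        # A voice set in SSML overrides both configuration properties
        return "voice set in SSML"
    if voice_name:
        # An explicit voice name wins over the language setting
        return "voice " + voice_name
    if language:
        return "default voice for " + language
    return "default voice for " + DEFAULT_LOCALE


print(resolve_voice(voice_name="en-US-JennyNeural", language="en-US"))
```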

11. Initialize GUI and audio objects:
- u_gui: Create an instance of the GUI class from the unihiker library.
- u_audio: Create an instance of the Audio class from the unihiker library.
- Create and configure various GUI elements such as images, buttons and text.
- The screen resolution is 240x320, so the unihiker library resolution is also 240x320. The origin of the coordinates is the upper left corner of the screen, the positive direction of the x-axis is to the right, and the positive direction of the y-axis is downward.

CODE

u_gui=GUI()
u_audio = Audio()

# GUI initialization
img1=u_gui.draw_image(image="background.jpg",x=0,y=0,w=240)
button=u_gui.draw_image(image="mic.jpg",x=13,y=240,h=60,onclick=button_click1)
refresh=u_gui.draw_image(image="refresh.jpg",x=157,y=240,h=60,onclick=button_click2)
init=u_gui.draw_text(text="Tap to speak",x=27,y=50,font_size=15, color="#00CCCC")
trans=u_gui.draw_text(text="",x=2,y=0,font_size=12, color="#000000")
trans.config(w=230)
result = ""
flag = 0
text_display = ""

message = [{"role": "system", "content": "You are a helpful assistant."}]
user = {"role": "user", "content": ""}
assistant = {"role": "assistant", "content": ""}
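
With a 240x320 screen and the origin in the upper left corner, element positions are simple arithmetic. The sketch below assumes each element is placed by its own top-left corner; the helper names are hypothetical.

```python
SCREEN_W, SCREEN_H = 240, 320  # Unihiker screen size in pixels


def centered_x(width):
    """x coordinate that centers an element of the given pixel width."""
    return (SCREEN_W - width) // 2


def from_bottom(height, margin=0):
    """y coordinate that places an element 'margin' pixels above the bottom edge."""
    return SCREEN_H - height - margin


# A 60-pixel-high button with a 20-pixel bottom margin sits at y=240,
# the same y value used for the mic and refresh buttons above
print(centered_x(60), from_bottom(60, margin=20))
```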

12. Main loop: The code enters an infinite loop, constantly checking the value of the flag variable and performing the corresponding operation.
- When flag is 0, the GUI buttons are enabled.
- When flag is 1, the code listens for voice input from the microphone, adds the user's message to the message list, and updates the GUI with the recognized text.
- When flag is 2, the code sends the message list to the OpenAI model and generates a response. The response is then synthesized into speech.
- When flag is 3, the message list is cleared and the system message is added again.

CODE

while True:
    if (flag == 0):
        button.config(image="mic.jpg",state="normal")
        refresh.config(image="refresh.jpg",state="normal")
        

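    if (flag == 3):
        # Start a fresh conversation that holds only the system prompt
        message = [{"role": "system", "content": "You are a helpful assistant."}]
        flag = 0  # return to idle so the reset runs once per tap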

    if (flag == 2):
        azure_synthesis_result = askOpenAI(message)
        assistant["content"] = azure_synthesis_result
        message.append(assistant.copy())

        tts(azure_synthesis_result)
        time.sleep(1)
        

        flag = 0
        trans.config(text="      ")
        button.config(image="",state="normal")
        refresh.config(image="",state="normal")
        init.config(x=15)
    
    if (flag == 1):
        init.config(x=600)
        trans.config(text="Listening...")
        button.config(image="",state="disable")
        refresh.config(image="",state="disable")
        result = recognize_from_microphone()
        user["content"] = result
        message.append(user.copy())
        trans.config(text=result)
        time.sleep(2)
        trans.config(text="Thinking...")
        flag = 2
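
The numeric flag values in the loop above can be read as a small state machine. The following sketch names the states and the automatic transitions between them; the `State` enum and `next_state` function are my own illustration, and letting RESET fall back to IDLE is a choice made for this sketch.

```python
from enum import IntEnum


class State(IntEnum):
    IDLE = 0       # waiting for a tap
    LISTENING = 1  # recording and recognizing speech
    ANSWERING = 2  # asking the model and speaking the reply
    RESET = 3      # clearing the conversation


def next_state(state):
    """State the loop moves to on its own after finishing the work for 'state'."""
    transitions = {
        State.LISTENING: State.ANSWERING,
        State.ANSWERING: State.IDLE,
        State.RESET: State.IDLE,
    }
    # IDLE stays IDLE until a button callback changes the flag
    return transitions.get(state, State.IDLE)


state = State.LISTENING
while state != State.IDLE:
    print(state.name)
    state = next_state(state)
```

Because `IntEnum` members compare equal to plain integers, `State.LISTENING == 1` holds, so named states could be mixed with the existing flag checks.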

Complete code for the first method: GPT & Azure

CODE

from unihiker import Audio
from unihiker import GUI
import openai
import time
import os 
from azure.cognitiveservices.speech import SpeechConfig 

speech_key = "xxxxxxxxx" # Fill key 
service_region = "xxxxx" # Enter Location/Region 

openai.api_key = "xxxxxxxxxx" # input OpenAI api key

try:
    import azure.cognitiveservices.speech as speechsdk
except ImportError:

    
    print("""
    Importing the Speech SDK for Python failed.
    Refer to
    https://docs.microsoft.com/azure/cognitive-services/speech-service/quickstart-python for
    installation instructions.
    """)
    import sys
    sys.exit(1)


# Set up the subscription info for the Speech Service:
# Replace with your own subscription key and service region (e.g., "japaneast").


# speech to text
def recognize_from_microphone():
    # Uses the global speech_config built from speech_key and service_region
    audio_config = speechsdk.audio.AudioConfig(use_default_microphone=True)
    speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)
    speech_recognition_result = speech_recognizer.recognize_once_async().get()

    # Exception reminder
    if speech_recognition_result.reason == speechsdk.ResultReason.RecognizedSpeech:
        # print("Recognized: {}".format(speech_recognition_result.text))
        return speech_recognition_result.text
    elif speech_recognition_result.reason == speechsdk.ResultReason.NoMatch:
        print("No speech could be recognized: {}".format(speech_recognition_result.no_match_details))
    elif speech_recognition_result.reason == speechsdk.ResultReason.Canceled:
        cancellation_details = speech_recognition_result.cancellation_details
        print("Speech Recognition canceled: {}".format(cancellation_details.reason))
        if cancellation_details.reason == speechsdk.CancellationReason.Error:
            print("Error details: {}".format(cancellation_details.error_details))
            print("Did you set the speech resource key and region values?")
    # Nothing was recognized: return an empty string so the caller always gets text
    return ""

# text to speech
def tts(text):
    speech_config.set_property(property_id=speechsdk.PropertyId.SpeechServiceResponse_RequestSentenceBoundary, value='true')
    audio_config = speechsdk.audio.AudioOutputConfig(use_default_speaker=True)
    speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=audio_config)
    speech_synthesizer.synthesis_word_boundary.connect(speech_synthesizer_word_boundary_cb)
    speech_synthesis_result = speech_synthesizer.speak_text_async(text).get()


# display text one by one
def speech_synthesizer_word_boundary_cb(evt: speechsdk.SessionEventArgs):
    global text_display

    if not (evt.boundary_type == speechsdk.SpeechSynthesisBoundaryType.Sentence):
        text_result = evt.text
        text_display = text_display + "   " + text_result
        trans.config(text = text_display)
    
    if evt.text == ".":
        text_display = ""



# openai
def askOpenAI(question):
    completion = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages = question
    )
    return completion['choices'][0]['message']['content']



# event callback functions
def button_click1():
    global flag
    flag = 1


def button_click2():
    global flag
    flag = 3


u_gui=GUI()
u_audio = Audio()


# speech service configuration
speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=service_region)
speech_config.speech_synthesis_language = 'en-US'
speech_config.speech_synthesis_voice_name = "en-US-JennyNeural"

# GUI initialization
img1=u_gui.draw_image(image="background.jpg",x=0,y=0,w=240)
button=u_gui.draw_image(image="mic.jpg",x=13,y=240,h=60,onclick=button_click1)
refresh=u_gui.draw_image(image="refresh.jpg",x=157,y=240,h=60,onclick=button_click2)
init=u_gui.draw_text(text="Tap to speak",x=27,y=50,font_size=15, color="#00CCCC")
trans=u_gui.draw_text(text="",x=2,y=0,font_size=12, color="#000000")
trans.config(w=230)
result = ""
flag = 0
text_display = ""

message = [{"role": "system", "content": "You are a helpful assistant."}]
user = {"role": "user", "content": ""}
assistant = {"role": "assistant", "content": ""}

while True:
    if (flag == 0):
        button.config(image="mic.jpg",state="normal")
        refresh.config(image="refresh.jpg",state="normal")
        

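    if (flag == 3):
        # Start a fresh conversation that holds only the system prompt
        message = [{"role": "system", "content": "You are a helpful assistant."}]
        flag = 0  # return to idle so the reset runs once per tap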

    if (flag == 2):
        azure_synthesis_result = askOpenAI(message)
        assistant["content"] = azure_synthesis_result
        message.append(assistant.copy())

        tts(azure_synthesis_result)
        time.sleep(1)
        

        flag = 0
        trans.config(text="      ")
        button.config(image="",state="normal")
        refresh.config(image="",state="normal")
        init.config(x=15)
    
    if (flag == 1):
        init.config(x=600)
        trans.config(text="Listening...")
        button.config(image="",state="disable")
        refresh.config(image="",state="disable")
        result = recognize_from_microphone()
        user["content"] = result
        message.append(user.copy())
        trans.config(text=result)
        time.sleep(2)
        trans.config(text="Thinking...")
        flag = 2

On the basis of the technical path above, OpenAI's cross-modal capabilities further strengthen its ecosystem, allowing developers to build OpenAI-based applications more quickly. Looking at a single modality, the recently updated DALL·E 3 is not inferior in visual quality to the previously leading Midjourney and Stable Diffusion. By combining vision capabilities, GPT-4, text-to-speech (TTS), and the Copilot partnership with Microsoft, this cross-modal integration will greatly simplify building complex logic and executing tasks through natural language. At the same time, GPT-4 has received a major update. The new GPT-4 Turbo version lets users upload external databases or files, handles context lengths of up to 128k (roughly a 300-page book), and has knowledge updated to April 2023. API prices were also heavily discounted.

II. The second method: OpenAI handles everything (gpt+whisper+tts)

With this integration, we can implement the same assistant using only OpenAI's APIs: "whisper + gpt + tts". The advantage is that you only need to register with OpenAI and obtain one key, and the spoken language is identified automatically. However, OpenAI's whisper does not support real-time transcription for the time being, so the code structure and the program's response differ from the first method.

· Initialize and record audio
· Use the whisper model to convert the recorded speech to text: The transcriptions API takes as input the audio file you want to transcribe and the desired output format for the transcription. The API currently supports multiple input and output file formats. File uploads are currently limited to 25 MB, and the following input file types are supported: mp3, mp4, mpeg, mpga, m4a, wav, and webm.
· Use the gpt-3.5-turbo model to generate answers.
· Use the tts-1 model to convert text into speech and output audio files.
· Play audio files
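
Since the transcription step only accepts certain file types and sizes, a recording can be checked before it is uploaded. This is a minimal sketch based on the limits listed above; `check_upload` is a hypothetical helper, and reading 25 MB as 25 x 1024 x 1024 bytes is my assumption.

```python
import os

MAX_UPLOAD_BYTES = 25 * 1024 * 1024  # 25 MB limit, interpreted in binary megabytes
SUPPORTED_EXTENSIONS = {"mp3", "mp4", "mpeg", "mpga", "m4a", "wav", "webm"}


def check_upload(filename, size_bytes):
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    extension = os.path.splitext(filename)[1].lstrip(".").lower()
    if extension not in SUPPORTED_EXTENSIONS:
        problems.append("unsupported file type: " + (extension or "none"))
    if size_bytes > MAX_UPLOAD_BYTES:
        problems.append("file is larger than 25 MB")
    return problems


print(check_upload("input.mp3", 120000))
```

On the device, the size of an existing recording could be obtained with `os.path.getsize("input.mp3")`.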

Add more user-friendly interaction:

· Automatic text scrolling

· Automatic language recognition and reply

CODE

from unihiker import Audio
from unihiker import GUI
import openai
import time

openai.api_key = "xxxxxxxxxxx" # input OpenAI api key

# openai speech to text
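def asr():
    # Open the recording in binary mode; the with-block closes the file afterwards
    with open("input.mp3", "rb") as audio_file:
        transcript = openai.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
            response_format="text"
        )

    # With response_format="text" the transcript comes back as a plain string
    return transcript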


# openai text to speech
def tts(text):
    response = openai.audio.speech.create(
        model="tts-1",
        # Experiment with different voices (alloy, echo, fable, onyx, nova, and shimmer) 
        voice="alloy",
        input=text,
    )

    response.stream_to_file("output.mp3")


# openai
def askOpenAI(question):
    completion = openai.chat.completions.create(
        model="gpt-3.5-turbo",
        messages = question
    )
    # print(completion.choices[0].message.content)
    return completion.choices[0].message.content


# text display
def text_update():
    global y1
    time.sleep(16)
    while True:
        y1 -= 2
        time.sleep(0.15)
        trans.config(y = y1)

# play the reply, then stop the scrolling thread and return to idle
def play_audio():
    global flag
    u_audio.play('output.mp3')
    u_gui.stop_thread(thread1)
    flag = 0


def monitor_silence():
    global is_recording, monitor_thread
    silence_time = 0

    while is_recording:
        sound_level = u_audio.sound_level()
        if sound_level < THRESHOLD:
            silence_time += 0.1
        else:
            silence_time = 0

        if silence_time >= SILENCE_DURATION:
            u_audio.stop_record()
            is_recording = 0
            u_gui.stop_thread(monitor_thread)
            
        time.sleep(0.1)  # detect once /0.1s

def start_recording_with_silence_detection(filename):
    global is_recording, monitor_thread
    is_recording = 1
    u_audio.start_record(filename)  # start record
    monitor_thread = u_gui.start_thread(monitor_silence)


# event callback function
def button_click1():
    global flag
    flag = 1
    

def button_click2():
    global flag
    flag = 3

def button_click3():
    global flag,thread1,thread2
    flag = 0
    u_gui.stop_thread(thread1)
    u_gui.stop_thread(thread2)





u_gui=GUI()
u_audio = Audio()


# GUI
img1=u_gui.draw_image(image="background.jpg",x=0,y=0,w=240)
button=u_gui.draw_image(image="mic.jpg",x=13,y=240,h=60,onclick=button_click1)
refresh=u_gui.draw_image(image="refresh.jpg",x=157,y=240,h=60,onclick=button_click2)
init=u_gui.draw_text(text="Tap to speak",x=27,y=50,font_size=15, color="#00CCCC")
trans=u_gui.draw_text(text="",x=5,y=0, color="#000000", w=230)
back=u_gui.draw_image(image="backk.jpg",x=0,y=268,onclick=button_click3)
DigitalTime=u_gui.draw_digit(text=time.strftime("%Y/%m/%d       %H:%M"),x=9,y=5,font_size=12, color="black")


result = ""
flag = 0
text_display = ""
y1 = 0

message = [{"role": "system", "content": "You are a helpful assistant."}]
user = {"role": "user", "content": ""}
assistant = {"role": "assistant", "content": ""}

# Threshold setting, the specific value needs to be adjusted according to the actual situation
THRESHOLD = 20  # Assuming this is the detected silence threshold
SILENCE_DURATION = 2  # 2 seconds silent time

# Recording control variables
is_recording = 0


while True:
    if (flag == 0):
        button.config(image="mic.jpg",state="normal")
        refresh.config(image="refresh.jpg",state="normal")
        back.config(image="",state="disable")
        DigitalTime.config(text=time.strftime("%Y/%m/%d       %H:%M"))
        

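    if (flag == 3):
        # Start a fresh conversation that holds only the system prompt
        message = [{"role": "system", "content": "You are a helpful assistant."}]
        flag = 0  # return to idle so the reset runs once per tap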

    if (flag == 2):
        DigitalTime.config(text=time.strftime(""))
        azure_synthesis_result = askOpenAI(message)
        assistant["content"] = azure_synthesis_result
        message.append(assistant.copy())
        tts(azure_synthesis_result)
        trans.config(text=azure_synthesis_result)
        back.config(image="backk.jpg",state="normal")

        thread1=u_gui.start_thread(text_update)
        thread2=u_gui.start_thread(play_audio)


        while not (flag == 0):
            time.sleep(0.05)  # wait for playback to end without a busy loop

        y1 = 0
        trans.config(text="      ", y = y1)
        button.config(image="",state="normal")
        refresh.config(image="",state="normal")
        init.config(x=15)
    
    if (flag == 1):
        DigitalTime.config(text=time.strftime(""))
        is_recording = 1
        init.config(x=600)
        trans.config(text="Listening...")
        start_recording_with_silence_detection('input.mp3')
        button.config(image="",state="disable")
        refresh.config(image="",state="disable")
        back.config(image="",state="disable")

        while not (is_recording == 0):
            time.sleep(0.05)  # wait for the silence monitor to stop the recording

        back.config(image="",state="disable")
        result = asr()
        user["content"] = result
        message.append(user.copy())
        trans.config(text=result)
        time.sleep(2)
        trans.config(text="Thinking...")
        flag = 2
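
The silence detection used by monitor_silence can be tested away from the device by feeding it a list of sound levels. The sketch below is a plain-Python version of the same idea; the function name is hypothetical, and it counts samples instead of adding up 0.1-second steps, which avoids floating-point drift.

```python
def seconds_until_stop(levels, threshold=20, silence_duration=2.0, interval=0.1):
    """Return the time in seconds at which recording would stop, or None.

    'levels' is a sequence of sound-level samples taken every 'interval' seconds.
    """
    silent_samples = 0
    needed = int(round(silence_duration / interval))
    for index, level in enumerate(levels):
        if level < threshold:
            silent_samples += 1
        else:
            silent_samples = 0  # any loud sample restarts the countdown
        if silent_samples >= needed:
            return round((index + 1) * interval, 3)
    return None


# One second of speech followed by two seconds of silence
samples = [50] * 10 + [5] * 20
print(seconds_until_stop(samples))
```

The default `threshold` and `silence_duration` mirror the THRESHOLD and SILENCE_DURATION values in the code above and would need the same tuning for the actual room and microphone.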

With the two methods above, we implemented a GPT-based voice assistant. The integration of the Azure Speech API and OpenAI GPT has opened up a new frontier for developing intelligent desktop assistants. Advances in natural language processing and speech recognition are making our interactions with computers more natural and efficient. By harnessing these technologies, we can build applications that perform complex tasks and provide expert-level services in their respective fields. In upcoming posts, I will continue to share the development process of this intelligent assistant and explore more possibilities that this integration can bring.

GPTs and rich APIs enable us to easily develop and implement personalized intelligent agents. An intelligent agent can be understood as a program that simulates intelligent human behavior when interacting with the external environment. For example, the control system of a self-driving car is an intelligent agent. At the developer conference, OpenAI employees showed an example: after a PDF of flight information is uploaded, the intelligent agent sorts out the ticket information within a second and displays it on a web page. Combined with more hardware interfaces, we can try to customize our own GPT applications that do not depend on a computer or mobile phone interface. As with this smart desktop assistant, more physical-level intelligent controls can be added in the future to achieve a more natural 'intelligent agent'.

The future we once envisioned for AI agents has now become a reality.

License

All Rights Reserved