Local Voice Assistant Step 2: Speech to Text and back
Having set up an ATOM Echo Voice Satellite and hooked it up to Home Assistant we now need to actually do something with the captured audio. Home Assistant largely deals with voice assistants using the Wyoming Protocol, which describes itself as essentially JSONL + PCM audio. It works nicely in that everything can exist as separate modules that just communicate over network sockets, and there are a whole bunch of Python implementations of the pieces necessary.
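To make the "JSONL + PCM" description a bit more concrete, here's a rough sketch of what a single audio chunk event looks like on the wire: a one-line JSON header describing the event, followed by the raw PCM bytes. The field names are my reading of the protocol description, so treat this as illustrative rather than authoritative; in practice the python3-wyoming library builds and parses these events for you.

# Illustrative only: hand-rolled framing of one Wyoming "audio-chunk" event.
# Field names are assumed from the protocol description; real code should
# use the python3-wyoming library rather than doing this by hand.
import json

pcm = b"\x00\x00" * 1600  # 100ms of 16kHz, 16-bit, mono silence

header = {
    "type": "audio-chunk",
    "data": {"rate": 16000, "width": 2, "channels": 1},
    "payload_length": len(pcm),
}

# One line of JSON, a newline, then the binary payload.
event_bytes = json.dumps(header).encode("utf-8") + b"\n" + pcm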
The first bit I looked at was speech to text; how do I get what I say to the voice satellite into something that Home Assistant can try to parse? There is a nice self-contained speech recognition tool called whisper.cpp, which is a low dependency implementation of inference using OpenAI's Whisper model. This is wrapped up for Wyoming as part of wyoming-whisper-cpp. Here we get into something that unfortunately seems common in this space: the repo contains a forked copy of whisper.cpp with enough differences that I couldn't trivially make it work with regular whisper.cpp. That means missing out on new development and potential improvements (the fork appears to be at v1.5.4; upstream is at v1.7.5 at the time of writing). However it was possible to get up and running easily enough.
[I note there is a Wyoming Whisper API client that can use the whisper.cpp server, and that might be a cleaner way to go in the future, especially if whisper.cpp ends up in Debian.]
I stated previously that I wanted all of this to be as clean an install on Debian stable as possible. Given most of this isn't packaged, that's meant I've packaged things up as I go. I'm not at the stage where anything is suitable for upload to Debian proper, but equally I've tried to make them a reasonable starting point. No pre-built binaries available, just Salsa git repos: https://salsa.debian.org/noodles/wyoming-whisper-cpp in this case. You need python3-wyoming from trixie if you're building for bookworm, but it doesn't need to be rebuilt.
You need a Whisper model that's been converted to ggml format; they can be found on Hugging Face. I've ended up using the base.en model. In random testing I found small.en gave more accurate results but took a little longer; it doesn't seem to make much of a difference for voice control rather than plain transcription.
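For reference, the pre-converted models live in ggerganov's whisper.cpp repository on Hugging Face; something like the following fetches base.en (the repository path and filename are assumed from the usual naming convention there, so adjust if the layout has changed):

# Sketch: download the pre-converted base.en ggml model.
# URL assumed from the whisper.cpp naming convention on Hugging Face.
import urllib.request

MODEL_URL = "https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-base.en.bin"
urllib.request.urlretrieve(MODEL_URL, "ggml-base.en.bin")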
[One of the open questions about uploading this to Debian is around the use of a prebuilt AI model. I don’t know what the right answer is here, and whether the voice infrastructure could ever be part of Debian proper, but the current discussion on the interpretation of the DFSG on AI models is very relevant.]
I run this in the same container as my Home Assistant install, using a systemd unit file dropped in /etc/systemd/system/wyoming-whisper-cpp.service:
[Unit]
Description=Wyoming whisper.cpp server
After=network.target

[Service]
Type=simple
DynamicUser=yes
ExecStart=wyoming-whisper-cpp --uri tcp://localhost:10030 --model base.en
MemoryDenyWriteExecute=false
ProtectControlGroups=true
PrivateDevices=false
ProtectKernelTunables=true
ProtectSystem=true
RestrictRealtime=true
RestrictNamespaces=true

[Install]
WantedBy=multi-user.target
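Before wiring it into Home Assistant you can sanity check the service directly. The sketch below uses the python3-wyoming client API as I understand it (AsyncTcpClient, Transcribe/Transcript, AudioStart/AudioChunk/AudioStop) to stream a 16-bit mono WAV file at the server and print the transcript back; treat the exact calls as assumptions rather than gospel.

#!/usr/bin/env python3
# Rough test client for the wyoming-whisper-cpp service on localhost:10030.
# Assumes the python3-wyoming client API and a 16-bit mono WAV file.
import asyncio
import sys
import wave

from wyoming.asr import Transcribe, Transcript
from wyoming.audio import AudioChunk, AudioStart, AudioStop
from wyoming.client import AsyncTcpClient


async def transcribe(path: str) -> str:
    client = AsyncTcpClient("localhost", 10030)
    await client.connect()
    try:
        with wave.open(path, "rb") as wav:
            rate = wav.getframerate()
            width = wav.getsampwidth()
            channels = wav.getnchannels()

            # Announce a transcription request, then stream the audio.
            await client.write_event(Transcribe().event())
            await client.write_event(
                AudioStart(rate=rate, width=width, channels=channels).event()
            )
            frames = wav.readframes(rate)  # roughly 1 second per chunk
            while frames:
                await client.write_event(
                    AudioChunk(
                        rate=rate, width=width, channels=channels, audio=frames
                    ).event()
                )
                frames = wav.readframes(rate)
            await client.write_event(AudioStop().event())

        # Wait for the transcript event to come back.
        while True:
            event = await client.read_event()
            if event is None:
                raise RuntimeError("connection closed")
            if Transcript.is_type(event.type):
                return Transcript.from_event(event).text
    finally:
        await client.disconnect()


if __name__ == "__main__":
    print(asyncio.run(transcribe(sys.argv[1])))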
It needs the Wyoming Protocol integration enabled in Home Assistant; you can "Add Entry" and enter localhost + 10030 for host + port and it'll get added. Then in the Voice Assistant configuration there'll be a whisper.cpp option available.
Text to speech turns out to be weirdly harder. The right answer is something like Wyoming Piper, but that turns out to be hard on bookworm. I'll come back to that in a future post. For now I took the easy option and used the built in "Google Translate" option in Home Assistant. That needed an extra stanza in configuration.yaml that wasn't entirely obvious:
media_source:
With this, and the ATOM voice satellite, I could now do basic voice control of my Home Assistant setup, with everything except the text-to-speech piece happening locally! Things such as “Hey Jarvis, turn on the study light” work out of the box. I haven’t yet got into defining my own phrases, partly because I know some of the things I want (“What time is it?”) are already added in later Home Assistant versions than the one I’m running.
Overall I found this initially complicated to set up, given my self-imposed constraints about actually understanding the building blocks and compiling them myself, but I've been pretty impressed with the work that's gone into it all. Next step: running a voice satellite on a Debian box.