If you're less concerned about privacy, I use Gemini 2.5 Flash for this and it's exceptionally good and fast as a HA assistant while being much cheaper than the electricity that would be needed to keep a 3090 awake.
The thing that kills this for me (and they even mentioned it) is wake word detection. I have both the HA voice preview and FPH Satellite1 devices, plus have experimented with a few other options like a Raspberry Pi with a conference mic.
Somehow nothing is even 50% good as my Echo devices at picking up the wake word. The assistant itself is far better, but that doesn't matter if it takes 2-3 tries to get it to listen to you. If someone solves this problem with open hardware I'll be immediately buying several.
On the plus side, mine misdetected a wake word during a funny conversation and said "Sorry, I can't find any area called _____[60 second repeat of funny conversation]___" and it made my family laugh harder than we've laughed in a really long time. I even went into the tts cache and saved the wav b/c it was sooo funny.
What's been surprising in my experience regarding the wake word is that it recognizes me (adult male) saying the wake word ~95% of the time. However, it only registers the rest of my family (women and children) ~30% of the time.
Wake word detection in low power DSP is a not-quite-COTS product but definitely exists. I believe PC manufacturers are looking at adding it to laptops soon, precisely to use with AI assistants.
One that I have been experimenting with is using analog phones (including rotary ones!) to act as the satellites. I live in an older home and have phone jacks in most of the rooms already so I only had to use a single analog telephone adapter. [0] The downside is I don't have wake word support, but it makes it more private and I don't find myself missing my smart speakers that much. At some point I would like to also support other types of calls on the phones, but for now I need to get an LLM hooked up to it.
actually the hardest part of a locally hosted voice assistant isn't the llm. it's making the tts tolerable to actually talk to every day.
the core issue is prosody: kokoro and piper are trained on read speech, but conversational responses have shorter breath groups and different stress patterns on function words. that's why numbers, addresses, and hedged phrases sound off even when everything else works.
the fix is training data composition. conversational and read speech have different prosody distributions and models don't generalize across them. for self-hosted, coqui xtts-v2 [1] is worth trying if you want more natural english output than kokoro.
btw i'm lily, cofounder of rime [2]. we're solving this for business voice agents at scale, not really the personal home assistant use case, but the underlying problem is the same.
Do people like talking to voice assistants? I've used one occasionally (mostly for timers when I'm cooking), but most of the time it would be faster for me to just do it myself, and feels much less awkward than talking to empty air, asking it to do things for me. It might be because I just really don't like making more noise than I have to
(Yes, I appreciate that some people may be disabled in such a way that it makes sense to use voice assistants, eg motor problems)
I'm still waiting till the promise of voice AI that was showed during the OpenAI demo in 2024 turn real somehow. It's not clear to me, why there has been zero progress since then.
140 comments
The thing that kills this for me (and they even mentioned it) is wake word detection. I have both the HA voice preview and FPH Satellite1 devices, plus have experimented with a few other options like a Raspberry Pi with a conference mic.
Somehow nothing is even 50% good as my Echo devices at picking up the wake word. The assistant itself is far better, but that doesn't matter if it takes 2-3 tries to get it to listen to you. If someone solves this problem with open hardware I'll be immediately buying several.
I'd prefer to physically press a button on an intercom box than having something churning away constantly processing sound.
Funky chicken for Gemini
Penguin dance for OpenAI
Claude?
[0] https://www.home-assistant.io/voice_control/worlds-most-priv...
the core issue is prosody: kokoro and piper are trained on read speech, but conversational responses have shorter breath groups and different stress patterns on function words. that's why numbers, addresses, and hedged phrases sound off even when everything else works.
the fix is training data composition. conversational and read speech have different prosody distributions and models don't generalize across them. for self-hosted, coqui xtts-v2 [1] is worth trying if you want more natural english output than kokoro.
btw i'm lily, cofounder of rime [2]. we're solving this for business voice agents at scale, not really the personal home assistant use case, but the underlying problem is the same.
[1] https://github.com/coqui-ai/TTS [2] https://rime.ai
(Yes, I appreciate that some people may be disabled in such a way that it makes sense to use voice assistants, eg motor problems)