My Journey to a reliable and enjoyable locally hosted voice assistant (2025) (community.home-assistant.io)

by Vaslo 140 comments 425 points

[−] hamdingers 61d ago
If you're less concerned about privacy, I use Gemini 2.5 Flash for this and it's exceptionally good and fast as a HA assistant while being much cheaper than the electricity that would be needed to keep a 3090 awake.

The thing that kills this for me (and they even mentioned it) is wake word detection. I have both the HA voice preview and FPH Satellite1 devices, plus have experimented with a few other options like a Raspberry Pi with a conference mic.

Somehow nothing is even 50% as good as my Echo devices at picking up the wake word. The assistant itself is far better, but that doesn't matter if it takes 2-3 tries to get it to listen to you. If someone solves this problem with open hardware I'll immediately buy several.

[−] acidburnNSA 61d ago
On the plus side, mine misdetected a wake word during a funny conversation and said "Sorry, I can't find any area called _____[60 second repeat of funny conversation]___" and it made my family laugh harder than we've laughed in a really long time. I even went into the tts cache and saved the wav b/c it was sooo funny.
[−] _spduchamp 61d ago
How about a button?

I'd prefer to physically press a button on an intercom box than having something churning away constantly processing sound.

[−] jcims 61d ago
I have a feeling beamforming microphone arrays might help here, something like this could improve the audio being processed substantially - https://www.minidsp.com/products/usb-audio-interface/uma-8-m....
[−] ethagnawl 61d ago
What's been surprising in my experience regarding the wake word is that it recognizes me (adult male) saying the wake word ~95% of the time. However, it only registers the rest of my family (women and children) ~30% of the time.
[−] robotswantdata 61d ago
What about your wifi APs sensing which room you are in, with your choice of hilarious dance moves as the trigger?

Funky chicken for Gemini

Penguin dance for OpenAI

Claude?

[−] senkora 61d ago
Why not use an easier to detect wake “word”, like two claps in quick succession? Or a couple of notes of a melody?
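
For illustration, a minimal sketch of the double-clap idea, assuming mono audio already normalized to the range -1.0 to 1.0. The function names, the 0.5 amplitude threshold, and the 0.15 to 0.6 second gap window are all hypothetical choices for this example, not values taken from Home Assistant or any wake word library. It only looks at loudness, so anything sharp and loud (a door slam, a dropped pan) would also count as a "clap".

```python
from typing import List, Sequence


def find_onsets(
    samples: Sequence[float],
    sample_rate: int,
    frame_ms: int = 10,
    threshold: float = 0.5,
) -> List[float]:
    """Return the start times (in seconds) of frames where the signal
    goes from quiet to loud, i.e. the beginning of each loud burst."""
    frame_len = max(1, sample_rate * frame_ms // 1000)
    onsets: List[float] = []
    was_loud = False
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        peak = max(abs(s) for s in frame)
        is_loud = peak >= threshold
        # Record only the quiet-to-loud transition, not every loud frame.
        if is_loud and not was_loud:
            onsets.append(start / sample_rate)
        was_loud = is_loud
    return onsets


def is_double_clap(
    samples: Sequence[float],
    sample_rate: int,
    min_gap_s: float = 0.15,
    max_gap_s: float = 0.6,
) -> bool:
    """True if two consecutive loud bursts start between min_gap_s and
    max_gap_s seconds apart (assumed window for 'quick succession')."""
    onsets = find_onsets(samples, sample_rate)
    return any(
        min_gap_s <= later - earlier <= max_gap_s
        for earlier, later in zip(onsets, onsets[1:])
    )
```

In practice the thresholds would need tuning per room and microphone, and a real detector would likely also check burst duration or spectrum to reject non-clap noises.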
[−] stavros 61d ago
[−] pjc50 60d ago
Wake word detection in low power DSP is a not-quite-COTS product but definitely exists. I believe PC manufacturers are looking at adding it to laptops soon, precisely to use with AI assistants.
[−] lostmsu 60d ago
Why do you even need a wake word? Have a model look at the full transcript and decide when to engage.
[−] homeonthemtn 60d ago
How are you using Gemini in HA?
[−] tkems 61d ago
One that I have been experimenting with is using analog phones (including rotary ones!) to act as the satellites. I live in an older home and have phone jacks in most of the rooms already so I only had to use a single analog telephone adapter. [0] The downside is I don't have wake word support, but it makes it more private and I don't find myself missing my smart speakers that much. At some point I would like to also support other types of calls on the phones, but for now I need to get an LLM hooked up to it.

[0] https://www.home-assistant.io/voice_control/worlds-most-priv...

[−] ljclifford 61d ago
actually the hardest part of a locally hosted voice assistant isn't the llm. it's making the tts tolerable to actually talk to every day.

the core issue is prosody: kokoro and piper are trained on read speech, but conversational responses have shorter breath groups and different stress patterns on function words. that's why numbers, addresses, and hedged phrases sound off even when everything else works.

the fix is training data composition. conversational and read speech have different prosody distributions and models don't generalize across them. for self-hosted, coqui xtts-v2 [1] is worth trying if you want more natural english output than kokoro.

btw i'm lily, cofounder of rime [2]. we're solving this for business voice agents at scale, not really the personal home assistant use case, but the underlying problem is the same.

[1] https://github.com/coqui-ai/TTS [2] https://rime.ai

[−] voidUpdate 61d ago
Do people like talking to voice assistants? I've used one occasionally (mostly for timers when I'm cooking), but most of the time it would be faster for me to just do it myself, and feels much less awkward than talking to empty air, asking it to do things for me. It might be because I just really don't like making more noise than I have to

(Yes, I appreciate that some people may be disabled in such a way that it makes sense to use voice assistants, eg motor problems)

[−] yanis_t 61d ago
I'm still waiting for the promise of voice AI that was shown during the OpenAI demo in 2024 to become real somehow. It's not clear to me why there has been zero progress since then.