Show HN: Ghost Pepper – Local hold-to-talk speech-to-text for macOS (github.com)

by MattHart88 200 comments 467 points


[−] arkensaw 39d ago
This is great, and I'm not knocking it, but every time I see these apps it reminds me of my phone.

My 2021 Google Pixel 6, when offline, can transcribe speech to text, and also corrects things contextually. It can make a mistake, and as I continue to speak, it will go back and correct something earlier in the sentence. What tech does Google have shoved in there that predates Whisper and Qwen by five years? And why do we now need a gigabyte of transformers to do it on a more powerful platform?

[−] pushedx 38d ago
It's the same model used for the Web Speech API, which can operate entirely offline.

Google mostly funded the training of this model around 10 years ago, and it's quite good.

There are many websites that are simple frontends for this model, which is built into WebKit- and Blink-based browsers. However, to my knowledge the model is a closed-source blob packed into the apps, hence the lack of Firefox support.

https://developer.mozilla.org/en-US/docs/Web/API/Web_Speech_...

https://www.google.com/intl/en/chrome/demos/speech.html

[−] com2kid 39d ago
Microsoft OneNote had this back in 2007 or so, granted the speech-to-text model wasn't nearly as advanced as models are now.

I was actually on the OneNote team when they were transitioning to an online-only transcription model, because there was no one left to maintain the legacy on-device system.

It wasn't any sort of planned technical direction, just a lack of anyone wanting to maintain the old system.

[−] rudhdb773b 38d ago
I remember trying out some voice-to-text around 2002 that I believe was included with Windows XP.. or maybe Office?

You had to go through some training exercises to tune it to your voice, but then it worked fairly well for transcription or even interacting with applications.

[−] silon42 38d ago
OS/2 had it built in in 1996.
[−] adamsmark 39d ago
The accuracy is much lower though.

I've switched away from Gboard to Futo on Android and exclusively use MacWhisper on macOS instead of the default Apple transcription model.

[−] dotancohen 39d ago
Any particular reason why you switched? I've been using Gboard for years, especially the speech-to-text in four languages. In the past few weeks, there was an update where the dictation feature is now in a separate "panel" of the keyboard, and it hardly works at all.

In English and Hebrew it stops after half a dozen words, and those words must be spoken slowly and mechanically for it to work at all. Russian and Arabic are right out - I can't coax any coherent sentence out of it.

I've gone through all permutations of relevant settings, such as "Faster Voice Dictation" (translated from Hebrew, I don't know what the original English option is called). I think there used to be an option for online or offline transcription, but that option is gone now.

This is ridiculous - I tried to copy the version information and there is no way to copy it in-app. Let's try the S24 OCR feature...

17.0.10.880768217 release-arm64-v8a 175712590 Primary (en_GB) 2025090100 = version up to date Primary on-device: No packs Fallback on-device: Packs: ru-RU: 200

I'll try to install the English, Hebrew, and Arabic packs, though I'm certain that I've installed them already.

[−] cootsnuck 39d ago
Interesting. My Pixel 7 transcription is barely usable for me. Makes way too many mistakes and defeats the purpose of me not having to type, but maybe that's just my experience.

The latest open source local STT models people are running on devices are significantly more robust (e.g. whisper models, parakeet models, etc.). So background noise, mumbling, and/or just not having a perfect audio environment doesn't trip up the SoTA models as much (all of them still do get tripped up).

I work in voice AI and am using these models (both proprietary and local open source) every day. Night and day different for me.

[−] taffydavid 38d ago
I've built my own STT apps testing Whisper, and while it's good, it does hallucinate quite a bit if there's noise, or just sometimes when the audio is perfectly clear.

It often gives the illusion of being very good, but I could record a half hour of me speaking and discover some very random stuff in the middle that I did not say.

[−] cootsnuck 38d ago
Yup, you're absolutely right. The open source models do have their rough edges. I use NVIDIA's Parakeet v3 model a lot locally, and it will occasionally do this thing where it just repeats a word like a dozen times.
[−] artdigital 38d ago
macOS and iOS can do that too with the baked-in dictation. Globe key + D on Mac.
[−] dust42 38d ago
When you activate it you agree that your voice input is sent to Apple. As far as I understand this project runs fully locally. Up to you to decide for whatever suits your needs best.
[−] stingraycharles 38d ago
Where did you get that the voice input is sent to Apple / the cloud?

As far as I understand Apple’s voice model runs locally for most languages.

Siri commands can be used for training, but they are also executed locally and sent to Apple separately (and this can be disabled).

[−] angristan 38d ago
I couldn't believe it either, but when you enable it in the macOS settings you get this popup:

> When you dictate text, information like your voice input and contact names are sent to Apple to help your Mac recognize what you’re saying.

[−] wat10000 38d ago
Elsewhere it says:

"When you use Dictation, your device will indicate in Keyboard Settings if your audio and transcripts are processed on your device and not sent to Apple servers. Otherwise, the things you dictate are sent to and processed on the server, but will not be stored unless you opt in to Improve Siri and Dictation."

And:

"Dictation processes many voice inputs on your Mac. Information will be sent to Apple in some cases."

In conclusion... I think they're trying to cover all their bases, but it sounds like things are processed locally as long as the hardware can handle it.

[−] victorbjorklund 38d ago
No, that is not correct. It runs one hundred percent locally. You can try it by turning off the internet on your phone and running it then. However, the built-in model isn't as good, so this is probably better.
[−] dwayne_dibley 38d ago
yup, this is how I 'type'
[−] nidnogg 38d ago
Nothing comes close to LLM transcription though. I just tried this. I said "globe key dictation, does this work?". Here's the transcription, verbatim:

"Fucking dictation, does this work"

[−] arkensaw 36d ago
fun fact: voice typing also worked excellently on Windows Phone, although only in the SMS app
[−] vharish 38d ago
IMO one of the best. It was surprisingly good. Yet they can't even replicate it on their own systems.
[−] atlgator 39d ago
This thread is a support group for people who have each independently built the same macOS speech-to-text app.
[−] theturtletalks 39d ago
I'm tracking them all here:

https://opensource.builders/alternatives/superwhisper

Just added Ghost Pepper, and you can actually create a skill.md with the features you need to build your own

[−] bytesandbits 39d ago
Handy with parakeet is pretty awesome by the way!
[−] perelin 39d ago
Agree. Slept on.

Wish they would do an iOS version, but the creator already kind of dismissed it.

[−] MegagramEnjoyer 39d ago
i like handy a lot, so clean
[−] dnlzro 39d ago
Another one to add (1.5k stars on GitHub): https://github.com/kitlangton/Hex
[−] earthnail 38d ago
Please add wordbird as well: https://github.com/tillahoffmann/wordbird

It has all the usual features, plus you can add project specific vocabulary in your repo. It detects the working folder based on the active window, reads a WORDBIRD.md file in that folder and corrects terms accordingly.

(My friend Till built it)
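A toy sketch of that correction pass, assuming a one-term-per-line WORDBIRD.md and plain fuzzy matching (the real wordbird may work quite differently):

```python
import difflib
import re

def load_vocab(text: str) -> list[str]:
    """Parse a WORDBIRD.md-style file: one term per line, '#' lines ignored.
    (This format is an assumption for illustration.)"""
    return [line.strip() for line in text.splitlines()
            if line.strip() and not line.startswith("#")]

def correct(transcript: str, vocab: list[str], cutoff: float = 0.8) -> str:
    """Replace any token that fuzzily matches a project vocabulary term."""
    def fix(match: re.Match) -> str:
        word = match.group(0)
        hits = difflib.get_close_matches(word, vocab, n=1, cutoff=cutoff)
        return hits[0] if hits else word
    return re.sub(r"[A-Za-z]+", fix, transcript)

vocab = load_vocab("# project terms\nKubernetes\nhyprwhspr\n")
print(correct("deploy it to Cubernetes", vocab))  # -> deploy it to Kubernetes
```

The cutoff keeps ordinary words untouched while still catching near-misses on the rare project terms the base model was never going to guess.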

[−] Barbing 38d ago
Very nice. Two great features I'd suggest highlighting, found in two apps (one of which you have listed).

1: livestream transcript directly into the cursor in real time (just like native macOS dictation)

2: show realtime transcript live in an overlay (still has to paste when done, unlike #1, but can still read live while dictating)

1- localvoxtral, 2- FluidVoice (bumping it to 7 features on your list)

[−] zgougou123 38d ago
You could add foxsay, a great one : https://github.com/skulkworks/foxsay
[−] raybb 38d ago
Do any of the apps support taking actions as you talk without having to hit stop?

Like telling it to edit the text or remove a word.

[−] foltik 39d ago
So... a vibe slop index to keep track of all the vibe slop apps?

The cherry on top: it’s completely broken! Enable the Context Awareness filter, the list shrinks. Now enable the Auto-pasting filter, the list grows back.

[−] v4nn4 38d ago
The filter selection seems to return a union, not an intersection, which is a bit confusing, at least to me.
[−] lizhang 39d ago
[−] karimf 39d ago
In the /r/macapps subreddit, they have a huge influx of new-app posts, and "whisper dictation" is one of the most saturated categories. [0]

>“Compare” - This is the most important part. Apps in the most saturated categories (whisper dictation, clipboard managers, wallpaper apps, etc.) must clearly explain their differentiation from existing solutions.

https://www.reddit.com/r/macapps/comments/1r6d06r/new_post_r...

[−] tpowell 39d ago
I cobbled my own together one night before I came across the thoughtfully-built KeyVox and got to talking shop with its creator. Our cups runneth over. https://github.com/macmixing/keyvox/
[−] aroman 39d ago
I did mine on NixOS with a nice little indicator built into Noctalia.

It's remarkable how similar its performance is to Wispr Flow... and it runs locally...

[−] fragmede 39d ago
Yeah, but mine... Oh. Hello. sighs It's been three weeks since I tried to add a feature to my version of the app. I don't miss it. I like this new life. Sober.
[−] hbbio 39d ago
In the most Apple fashion possible, I am waiting for macOS 27 or 28 to have this built in.
[−] colechristensen 39d ago
My name is Cole and I have a speech to text app.

When I most recently abandoned it, the trigger word would fire one time in five.

[−] lxe 39d ago
hahaha I’m glad I’m just a procedurally generated NPC

I built one for cross-platform — using parakeet mlx or faster-whisper. :)

[−] brcmthrowaway 39d ago
Oh to be 20-something and do a bunch of free work for your portfolio again
[−] perelin 39d ago
I recently attended an agentic SWE workshop and the starter project was this: a Wispr-style, local voice dictation app. Took everybody around 30 mins. tbh: I was kinda impressed.
[−] dcreater 39d ago
It's gotten so bad that it's a meme on the macapps subreddit.

This is the unfortunate real face of open source. So many devs each making little sandcastles on their own when, if efforts were combined, we could have had something truly solid and sustainable, instead of a litany of 90%-there apps each missing something or other, leaving people ending up using WisprFlow etc.

[−] nidnogg 38d ago
NGL had me chuckling a bit there when I remembered I had one of these to code on my backlog
[−] pmarreck 39d ago
Are there any better than Superwhisper? Because I haven't found any.
[−] rmac 39d ago
checking in

windows (kotlin multi platform) => https://github.com/maceip/daydream

parakeet-tdt-0.6b-v2

[−] jannniii 39d ago
github.com/randomm/kuiskaus
[−] seivan 39d ago
[dead]
[−] goodroot 39d ago
Nice one! For Linux folks, I developed https://github.com/goodroot/hyprwhspr.

On Linux, there's access to the latest Cohere Transcribe model and it works very, very well. Requires a GPU though. Larger local models generally shouldn't require a subordinate model for clean up.

Have you compared WhisperKit to faster-whisper or similar? You might be able to run turbov3 successfully and negate the need for cleanup.

Incidentally, waiting for Apple to blow this all up with native STT any day now. :)

[−] primaprashant 39d ago
Speech-to-text has become an integral part of my dev flow, especially for dictating detailed prompts to LLMs and coding agents.

I have collected the best open-source voice typing tools, categorized by platform, in this awesome-style GitHub repo. Hope you all find it useful!

https://github.com/primaprashant/awesome-voice-typing

[−] cupcake-unicorn 39d ago
https://handy.computer/ already exists?
[−] charlietran 39d ago
Thank you for sharing, I appreciate the emphasis on local speed and privacy. As a current user of Hex (https://github.com/kitlangton/Hex), which has similar goals, what are your thoughts on how they compare?
[−] parhamn 39d ago
I see a lot of whisper stuff out there. Are these the same old OpenAI whisper models, or have they been updated heavily?

I've been using parakeet v3, which is fantastic (and tiny). Confused why we're still seeing whisper out there; there's been a lot of development.

[−] konaraddi 39d ago
That’s awesome! Do you know how it compares to Handy? Handy is open source and local only too. It’s been around a while and what I’ve been using.

https://github.com/cjpais/handy

[−] ericmcer 39d ago
I see quite a few of these, the killer feature to me will be one that fine tunes the model based on your own voice.

E.g. if your name is Donold (pronounced like Donald), there is not a transcription model in existence that will transcribe your name correctly. That means forget ever inputting your name or email by voice; it will never come out right.

Combine that with any subtleties of speech you have, or industry jargon you frequently use and you will have a much more useful tool.

We have a ton of options for "predict the most common word that matches this audio data" but I haven't found any "predict MY most common word" setups.

[−] ipsum2 39d ago
Parakeet is significantly more accurate and faster than Whisper if it supports your language.
[−] kushalpandya 39d ago
Speech-to-text is basically the AI version of the todo app that we used to build every week when a new frontend framework was released.
[−] __mharrison__ 39d ago
Cool, I've been doing a lot of "coding" (and other typing tasks) recently by tapping a button on my Stream Deck. It starts recording until I tap it again, at which point it transcribes the recording and plops it into the paste buffer.

The button next to it pastes when I press it. If I press it again, it hits the enter command.

You can get a lot done with two buttons.
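That two-button flow is easy to picture as a tiny state machine; here's a rough sketch with the recording, transcription, and paste steps stubbed out (none of this is the commenter's actual Stream Deck setup):

```python
class TranscribeButton:
    """Toggle button: first tap starts recording, second tap transcribes
    and drops the result into (a stand-in for) the system paste buffer."""

    def __init__(self, transcribe):
        self.transcribe = transcribe  # audio bytes -> text, e.g. a whisper call
        self.recording = False
        self.clipboard = None         # stands in for the real clipboard

    def tap(self, audio: bytes = b""):
        if not self.recording:        # first tap: start recording
            self.recording = True
            return None
        self.recording = False        # second tap: stop, transcribe, copy
        self.clipboard = self.transcribe(audio)
        return self.clipboard

btn = TranscribeButton(lambda audio: "refactor the parser")
btn.tap()                    # start recording
print(btn.tap(b"\x00\x01"))  # stop -> "refactor the parser" in paste buffer
```

The second button would then just inject a paste keystroke, and enter on a repeat press.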

[−] nidnogg 38d ago
This got me thinking that the smaller these local-first LLMs get, the more they're going to look like the next bread and butter of app dev. Reminds me of how Electron gained a lot of traction by making it easy to package prettier apps, at the measly cost of gigabytes of RAM, give or take.
[−] mathis 39d ago
If you don't feel like downloading a large model, you can also use yap dictate. Yap leverages the built-in models exposed through Speech.framework on macOS 26 (Tahoe).

Project repo: https://github.com/finnvoor/yap

[−] fiatpandas 39d ago
The clean-up prompt needs adjusting. If your transcription is first person and in the voice of talking to an AI assistant, it really wants to "answer" you, completely ignoring its instructions. I fiddled with the prompt but couldn't figure out how to make it not act like an AI assistant.
[−] marktolson 38d ago
I got it to transcribe this: "Create tests and ensure all tests pass" and instead of transcribing exactly what I said it outputs nonsense around "I am a large language model and I cannot create and execute tests".

Other than that issue I like it.

[−] raybb 39d ago
Would also like to know how it compares to https://github.com/openwhispr/openwhispr

I like that openwhispr lets me run on-device and also set a remote provider.

[−] mft_ 37d ago
Does it show your spoken words on the screen live (i.e. streaming) or does it wait until you’ve finished speaking?

I find it very helpful to see my words live - for some reason it helps my simple brain structure what I’m saying, and I’m much more fluent as a result.

I went on a mission a few weeks ago and tried every freely available macOS STT app I could find (and there are lots of them), but none I tried had this feature and was otherwise satisfactory. (I vibe-coded a PoC which could do this, so it's definitely possible.)

[−] snickell 39d ago
Can somebody help me understand how they use these, I feel like I'm missing something or I'm bad at something?

I only spent 10 minutes with Handy, and a similar amount of time with SuperWhisper, so I'm pretty ignorant. I tried both with composing this comment and in a programming session with Codex. I was slightly frustrated not to be hands-free: instead of typing, my hands had to press and release a talk button (option-space in Handy, right-command in SuperWhisper), and even then I couldn't submit, so I still had to hit enter for Codex.

Additionally, for composing this message, I'm using the keyboard a ton because there's no way I can find to correct text I've dictated. Do other people get results reliable enough that they don't need backspace anymore? Or... what text do you not care enough to edit? Notes, maybe?

My point of comparison is using Dragon like 15 years ago. TBH, while the recognition is better (much better) in Handy/SuperWhisper, everything else felt MUCH worse. With Dragon, you are (were?) totally hands-free, you see text as you say it, and you could edit text really easily vocally when it made a mistake (which it did a fair bit, admittedly). And you could press enter and pretty functionally navigate without a keyboard too.

It's weird to see all these apps, and they all have the same limitations?

[−] acjacobson 38d ago
Nice app! Feedback, since you asked: the most obvious must-have feature IMO is pasting automatically. Don't require me to hit a shortcut (or at least make it configurable).

The next most critical thing, I think, is speed, and in my tests it's just a little bit slower than other solutions. That matters a lot with these tools.

The third thing, more of a nice-to-have, is controlling formatting. By this I mean: say a few sentences, then "new line", and the model interprets "new line" as formatting, not as literal text.
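The "new line" idea can be done as a cheap post-processing pass over the transcript; a minimal sketch, assuming a small hand-picked command table rather than anything the app actually ships:

```python
import re

# Hypothetical spoken-command table; longest phrases first so the regex
# alternation prefers "new paragraph" over "new line".
COMMANDS = {
    "new paragraph": "\n\n",
    "new line": "\n",
    "period": ".",
    "comma": ",",
}

def apply_commands(transcript: str) -> str:
    """Replace spoken formatting commands with the formatting itself."""
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, COMMANDS)) + r")\b", re.IGNORECASE
    )
    text = pattern.sub(lambda m: COMMANDS[m.group(1).lower()], transcript)
    text = re.sub(r" +([.,])", r"\1", text)  # no space before punctuation
    return re.sub(r" *\n *", "\n", text)     # trim spaces around newlines

print(apply_commands("Add tests period new line Run them"))
```

A real implementation would also need an escape ("literal new line") for the rare case where you actually want those words.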

[−] hyperhello 39d ago
Feature request or beg: let me play a speech video and have it transcribed for me.
[−] rcarmo 39d ago
Not sure why I should use this instead of the baked-in OS dictation features (which I use almost daily--just double-tap the globe key, and you're there). What's the advantage?
[−] nidnogg 38d ago
I really like the project and am eager to try and fit this into some of my workflows. However, this bothered me a bit:

"All models run locally, no private data leaves your computer. And it's spicy to offer something for free that other apps have raised $80M to build."

I’d straight up drop the comparison to big AI labs. This isn’t rebellious or subversive, it’s downstream of a ton of already-funded work. Calling it “spicy” is a bit misframed.

[−] ghm2199 39d ago
I've been using Handy for a month and it's awesome. I mainly use it with coding agents or when I don't want to type into text boxes. How is this different?

Part of the reason Handy is awesome is that it uses the same Rust infra for integrating with the model, which actually makes it possible to use the code as a library on Android or iOS. I have an Android app that runs a local model on the phone too using this.

[−] boudra 39d ago
Interesting, I'm surprised you went with Whisper, I found Parakeet (v2) to be a lot more accurate and faster, but maybe it's just my accent.

I implemented fully local hands free coding with Parakeet and Kokoro: https://github.com/getpaseo/paseo

[−] jwr 38d ago
I currently use MacWhisper and it is quite good, but it's great to see an alternative, especially as I've been looking to use more recent models!

I hope there will be a way to plug in other models: I currently work mostly with Whisper Large. Parakeet is slightly worse for non-English languages. But there are better recent developments.

[−] miki123211 38d ago
What do you actually use for STT, particularly if you prize performance over privacy and are comfortable using your own API keys?

I was on WhisperFlow for a while until the trial ran out, and I'm really tempted to subscribe. I don't think I can go back to a local solution after that, the performance difference is insane.

[−] ianmurrays 38d ago
I had Claude make this Hammerspoon config + daemon that does pretty much the same, in case anyone is interested.

https://github.com/ianmurrays/hammerspoon/blob/main/stt.lua

[−] ezVoodoo 38d ago
Hi, nice project! Quick question: when I speak Chinese, why does it output English as a translated output? I was using the multilingual (small) model. Do I need to use the Parakeet model to get Chinese output? Thx.
[−] maxmorrish 39d ago
love seeing more local-first tools like this. feels like there's been a real shift since the codebeautify breach last year; people are actually thinking about where their data goes now. nice work on keeping it all on device
[−] aristech 39d ago
Great job. What about supported languages? Do system languages get recognised?
[−] janalsncm 39d ago
I think the jab at the bottom of the readme is referring to Wispr Flow?

https://wisprflow.ai/new-funding

[−] pdyc 39d ago
Interesting. I wanted something like this, but I'm on Linux, so I modified the whisper.cpp example to run on the CLI. It's quite basic: Ctrl+Alt+S to start/stop, and when you stop it copies the text to the clipboard. That's it. Now it's my daily driver: https://github.com/newbeelearn/whisper.cpp
[−] Supercompressor 39d ago
I've been looking for the opposite: wanting to dump in text and have it read to me, coherently. Anyone have good recommendations?
[−] tito 39d ago
This is great. I'm typing this message now using Ghost Pepper. What benefits have you seen from the OCR screen sharing step?
[−] guzik 39d ago
Sadly the app doesn't work. There is no popup asking for microphone permission.

EDIT: I see there is an open issue for that on GitHub.

[−] kingofbits 36d ago
Nicely done! I've been abusing ChatGPT's overlay window for this, until now.
[−] pmarreck 39d ago
How does this compare with Superwhisper, which is otherwise excellent but not cheap?
[−] jannniii 39d ago
Oh dear, why does it not use apfel for cleanup? No model download necessary…
[−] gegtik 39d ago
how does this compare to macOS's built-in Siri dictation, in quality and in privacy?
[−] purplehat_ 39d ago
Hi Matt, there's lots of speech-to-text programs out there with varying levels of quality. 100% local is admirable but it's always a tradeoff and users have to decide for themselves what's worth it.

Would you consider making available a video showing someone using the app?

[−] therealdeal2020 38d ago
btw I know at least a dozen doctors that still pay for software like this. I think doctors are THE profession that likes to use speech-to-text all day every day
[−] imazio 39d ago
is this the support group for people building speech-to-text apps?

I built https://yakki.ai

No regrets so far! XP

[−] leeeeep101 36d ago
I also did this: dictatorflow (lee101/voicetype). I open sourced it too. Nice. Might be a good reference :)
[−] vaulpann 39d ago
very cool - huge open source drop!
[−] thatxliner 39d ago
Why isn't the cleanup done on the transcription (as opposed to the screen recording)?
[−] dakila5 39d ago
MacWhisper is also a good one
[−] douglaswlance 39d ago
Does it input the text as soon as it hears it, or does it wait until the end?