I've been working on a karaoke app called Nightingale. You point it at your music folder and it turns your songs into karaoke - separates vocals from instrumentals, generates word-level synced lyrics, and lets you sing with highlighted lyrics and pitch scoring. Works with video files too.
Everything runs locally on your machine, nothing gets uploaded. No accounts, no subscriptions, no telemetry.
It ships as a single binary for Linux, macOS, and Windows. On first launch it sets up its own isolated Python environment and downloads the ML models it needs - no manual installation of dependencies required.
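For context on what that first-launch setup involves: Python's standard library can create an isolated environment in a few lines. The sketch below is a generic illustration of the technique, not Nightingale's actual bootstrap code, and the `python-env` directory name is made up:

```python
import sys
import venv
from pathlib import Path

def bootstrap_env(cache_dir: Path) -> Path:
    """Create an isolated Python environment on first launch; reuse it afterwards.
    The cache layout here is hypothetical."""
    env_dir = cache_dir / "python-env"
    # The interpreter lands in bin/ on Unix but Scripts/ on Windows --
    # exactly the kind of layout detail that path-lookup bugs tend to hit.
    bin_dir = "Scripts" if sys.platform == "win32" else "bin"
    exe = "python.exe" if sys.platform == "win32" else "python"
    interpreter = env_dir / bin_dir / exe
    if not interpreter.exists():
        # with_pip=False keeps the sketch fast; a real bootstrap would
        # also install pip and the ML dependencies into env_dir.
        venv.EnvBuilder(with_pip=False, clear=True).create(env_dir)
    return interpreter
```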
My two biggest drivers for creating this were:
The lack of karaoke coverage for niche, avant-garde, and local tracks.
Nostalgia for the good old cheesy karaoke backgrounds with flowing rivers, city panoramas, etc.
Some highlights:
Stem separation using the UVR Karaoke model (preserves backing vocals) or Demucs
Automatic lyrics via WhisperX transcription, or fetched from LRCLIB when available
Pitch scoring with player profiles and scoreboards
Gamepad support and TV-friendly UI scaling for party setups
GPU acceleration on NVIDIA (CUDA) and Apple Silicon (CoreML/MPS)
Built with Rust and the Bevy engine
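The lyrics pipeline above (WhisperX timestamps, or lyrics fetched from LRCLIB) ultimately produces timed lyric lines, commonly exchanged in LRC format. A minimal parser for standard line-level LRC timestamps, as a generic sketch rather than Nightingale's actual code:

```python
import re

# Standard LRC timestamp tag, e.g. "[01:23.45]".
LRC_TAG = re.compile(r"\[(\d+):(\d+(?:\.\d+)?)\]")

def parse_lrc(text: str) -> list[tuple[float, str]]:
    """Return (seconds, lyric_text) pairs, sorted by time. A line may
    carry several tags (the same lyric repeated at different times)."""
    entries = []
    for line in text.splitlines():
        tags = LRC_TAG.findall(line)
        if not tags:
            continue  # metadata tags like [ar:...] / [ti:...] don't match the numeric pattern
        lyric = LRC_TAG.sub("", line).strip()
        for minutes, seconds in tags:
            entries.append((int(minutes) * 60 + float(seconds), lyric))
    return sorted(entries)
```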
The whole stack is open source. No premium tier, no "open core" - just the app. Feedback and contributions welcome.
Just tried it with B.E.D - Walk Away[0]; unfortunately it lost track of the lyrics after 30 seconds (model: "large-v3"). Will play around a bit more, as it would be great to have a working karaoke generator.
Some quick feedback:
- Needs a way to skip forwards/backwards during playback to validate the result
- Sentences seem to be recognized (the first letter is uppercased), but periods aren't added
- Needs an option to edit the results of the track analysis
Thanks for keeping it FOSS!
[0]: https://www.youtube.com/watch?v=_MFT4H3VoNE
Amazing work! I am thrilled someone was motivated to approach this problem and develop a creative solution like this. There are very limited options for Karaoke, especially in the FOSS space. Most Karaoke apps are super limited and that's driven many Karaoke enjoyers I know to YouTube in search of the songs they want to sing. This solution would give them the power to do even more songs, even better than what's out there now!
Questions for you:
1. What CUDA compute capability level is necessary for Nvidia GPU acceleration to work?
2. Are there any plans to support iGPU/NPU acceleration on AMD and Intel? Asking because those chips are most common in the low-cost mini computers sold these days.
My family members love Karaoke and will be happy to try this. Looking forward to it!
I studied signal processing in university and my career evolved to not use what I studied. Decades ago, giving an algorithm a sound file and isolating tracks was difficult.
How does your implementation accomplish this? Were you involved or did you use something off the shelf?
Edit: ah, it's using neural nets (Demucs). I wonder if a pure-math approach could compete?
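There is a classic pure-math trick: lead vocals are usually mixed dead center (identical in both stereo channels), so subtracting one channel from the other cancels them while side-panned instruments survive. It breaks down on stereo reverb tails and anything else mixed to the center (bass, kick), which is largely why Demucs-style neural models win. A minimal sketch on toy signals:

```python
def remove_center(left: list[float], right: list[float]) -> list[float]:
    """Phase cancellation: any signal identical in both channels
    (typically the center-panned lead vocal) subtracts to silence."""
    return [l - r for l, r in zip(left, right)]

# Toy signals: a "vocal" mixed identically into both channels, plus an
# "instrument" panned hard left, so the right channel carries vocal only.
vocal = [0.5, -0.5, 0.25]
instrument = [0.1, 0.2, 0.3]
left = [v + i for v, i in zip(vocal, instrument)]
right = vocal[:]
karaoke = remove_center(left, right)  # ≈ instrument, vocal cancelled
```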
Just downloaded the source and built this to play around with it. I was a bit surprised that the first thing it did when I ran it was start downloading binaries from the internet. It went off to fetch FFmpeg from some remote server, but I already have FFmpeg installed. Then it tried and failed to install its own Python interpreter, which is another thing that's already present on the system.
How come this is trying to install its own vendored dependencies, including executable binaries, instead of checking for what's already installed? That approach can lead to both security and performance issues.
Edit: the Python download isn't failing, but rather the application itself is looking for the executable interpreter in lib rather than bin once the download completes. I built the release tarball in the git repo, and I'm pretty amazed that such a basic error could make it into release code.
Further edit: I tried using the build script in the tarball rather than just doing a cargo build -r, and it started trying to install Docker containers! Docker to build a desktop application! What is going on here?
This gave me a blast from the past to Nightingale, the media player built on top of Firefox. It was a Firefox fork aiming to be a more powerful alternative to iTunes/Winamp, but since it was built on Firefox, you could also use it as an all-in-one media player and web browser.
The homepage still exists (https://getnightingale.com/), but it looks like many of the other pages like the blog and wiki are long gone. It hasn't been active in probably over a decade.
I've worked on a small toy project with a similar purpose in the past [1], though it's not nearly as polished as yours, and I've made some questionable decisions here and there.
I have questions about pitch tracking. It seems you do track the pitch for scoring, and there's a line at the top of the screen that seems related but that I can't figure out. For my use case, an important feature of karaoke apps is displaying how "high" the next note should be sung, or at least some hints. Is that something your app can do and I just haven't figured it out? Or would it be a feature request?
[1] https://github.com/eckter/karaoke_helper
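For reference, the usual way apps display how "high" a note should be sung is to quantize the detected fundamental frequency to the nearest semitone with the standard 12-tone equal temperament formula, midi = 69 + 12·log2(f/440). A sketch of that conversion (not taken from Nightingale's code):

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def freq_to_note(freq_hz: float) -> tuple[int, str]:
    """Quantize a detected fundamental frequency to the nearest MIDI
    note number and name, relative to A4 = 440 Hz."""
    midi = round(69 + 12 * math.log2(freq_hz / 440.0))
    octave = midi // 12 - 1  # MIDI 60 is C4 by convention
    return midi, f"{NOTE_NAMES[midi % 12]}{octave}"
```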
Looking at the commit history, this came together pretty fast. Assuming it's AI-assisted (hard to know for sure), it's a good example of the opposite of the "AI replaces developers" narrative.
AI is making whole categories of projects viable that simply weren't before. Not because they were technically impossible, but because they were too time-consuming for a niche audience to justify the effort.
My wife is a huge karaoke fan. I'm especially interested in the pitch scoring, since we usually play karaoke games on older consoles for that exact feature. Nobody really makes games like that anymore without a subscription (and most of the good modern karaoke platforms are exclusive to East Asia anyway). If this works well, it could make for some really fun social events; looking forward to trying this.
This looks great, but I don't understand what it's supposed to do. I assumed the idea was "remove the lyrics", but of the 5 songs I tried (from Cry Cry Cry, Indigo Girls, and Suzanne Vega), none seemed to change from the original at all - it shows the words on the screen (and the timing is perfect), but it isn't removing the singing. How do you turn off the singing?
Really nice project, I'm looking forward to trying it!
Would it be possible to process songs on one device, and then use the result in another, or even multiple? Or would it be possible to run as separate server / client?
I ask mainly because the device I connect to my TV is definitely not the most powerful one, so it would be nice if I can preprocess the songs elsewhere.
Nice work! If you are looking for ways to enhance this or complementary routes, one thing I was thinking about recently... As a musician, often I play songs I don't know the lyrics to. It would be cool to have an app that could follow along karaoke style with the words, as I sing and as the band plays. Right now I clip a phone to the mic stand, but after a lyric or two, I lose my place. This is probably multitudes more complex based on every "band/vocals" sounding different, but just something I was thinking about.
This is cool, though for me personally my minimum quality bar for lyrics and sync accuracy in karaoke is much higher than what's achievable with WhisperX / forced alignment. I moved on from that approach about 2 years ago in my pursuit of high-quality karaoke.
If anyone's interested in tooling to generate karaoke videos which unlocks a much higher quality bar, check out my karaoke-gen project - 2 free generation credits here: https://gen.nomadkaraoke.com, open-source code here: https://github.com/nomadkaraoke/karaoke-gen
Can the software dump the recognised pitch, lyrics, and timings to the Performous text format? There's no formal specification, but examples are available at https://performous.org/songs
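For anyone exploring this: community examples suggest Performous reads UltraStar-style .txt files (#TITLE/#ARTIST/#BPM headers, then note lines of beat/length/pitch/syllable). Since there is no formal spec, treat every detail below as an assumption, especially the beat math and the pitch base; a hypothetical dumper:

```python
def dump_ultrastar(title: str, artist: str, bpm: float,
                   notes: list[tuple[float, float, int, str]]) -> str:
    """Serialize (start_sec, duration_sec, midi_pitch, syllable) tuples
    into an UltraStar-style song file. Assumptions inferred from
    community examples, not a spec: the format counts quarter-beats
    (effective resolution = 4 * BPM), and pitch 0 = C4 (i.e. MIDI - 60)."""
    beats_per_sec = bpm * 4 / 60.0
    lines = [f"#TITLE:{title}", f"#ARTIST:{artist}", f"#BPM:{bpm:g}"]
    for start, dur, midi, syllable in notes:
        beat = round(start * beats_per_sec)
        length = max(1, round(dur * beats_per_sec))
        lines.append(f": {beat} {length} {midi - 60} {syllable}")
    lines.append("E")  # end-of-song marker
    return "\n".join(lines)
```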
I introduced it to a friend (who is not technical and has an RTX 3080). His experience: he selected a 3:32 song, and it took 10-15 minutes to scan it. In the end it did a "decent job of removing lyrics from the audio and providing accurate ones". "It seemed to be tracking my singing but there is no mic setup or anything like it." "And I think I accidentally wiped the AI model data it took forever to generate with a single click."
In the end he's gone back to karaoke videos on YouTube, but it seems promising.
Big karaoke fan, so thanks for doing this. I'm processing a first test song as I write. The pitch scoring sounds really interesting as both a competitive and maybe also a training tool.
A couple of immediate small pieces of feedback:
* The colour scheme on the queue/nn% buttons is really low contrast - white on pale yellow is very hard to read
* The 'models' button (bottom left) - I assumed this would give me details about which models are available and their sizes, but instead it deleted the downloaded models without warning. Maybe add an 'are you sure you want to...' check?
This is great! I thought of doing something like this for Karaoke, but was wondering about the copyright implications of doing it server-side.
We already do this for ingesting podcasts and cutting clips from them, with the text highlighted as people speak. AssemblyAI also supports speaker diarization.
For videos recorded in our own livestreaming studio, we can bypass all this by using the Web STT and TTS APIs, resulting in perfect timing and diarization without the need for server-side models.