One of my favorite moments in HN history was watching the authors of the various search tools decide on a common ".ignore" file as opposed to each having their own: https://news.ycombinator.com/item?id=12568245
I would argue that grep-like tools which read .gitignore violate the Principle of Least Astonishment (POLA). It would be fine if there were a --ignore flag to enable such functionality, but defaulting to it just feels wrong to me. Obviously smarter people than I disagree, but my dumdum head just feels that way.
> Obviously smarter people than I disagree, but my dumdum head just feels that way.
That's absolutely not it. What you're describing is part of the UNIX philosophy: programs should do one thing and do it well, and they should function in a way that makes them very versatile and composable, etc.
And that part of the philosophy works GREAT when everything follows another part of the philosophy: everything should be based on flat text files.
But for a number of reasons, and regardless of whatever we all think of those reasons, we live in a world that has a lot of stuff that is NOT the kind of flat text file grep was made for. Binary formats, minified JS, etc. And so to make the tool more practical on a modern *nix workstation, suddenly more people want defaults that are going to work on their flat text files and transparently ignore things like .git.
It's just that you've shown up to a wildly unprincipled world armed with principles.
Sure, but that UNIX philosophy is what got us "grep -r" as the way to search files across an entire directory, which would then compose with stuff like xargs and parallel to be able to do things concurrently. I'd argue that ripgrep shows that that bundling together stuff sometimes does end up with a user experience that people prefer. The nuance lies in figuring out where the balance between "not enough" and "too much" lies, and so far I've yet to see a pithy statement like the UNIX philosophy encapsulate it well.
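For what it's worth, the composition described above looks roughly like this (the directory layout and pattern are invented for illustration):

```shell
# Enumerate files with find, then fan them out to concurrent grep
# processes via xargs. -print0 / -0 keep odd filenames safe; -P 4
# runs up to four greps at once.
mkdir -p /tmp/tree/a /tmp/tree/b
printf 'TODO: fix parser\n' > /tmp/tree/a/notes.txt
printf 'nothing here\n'     > /tmp/tree/b/other.txt

find /tmp/tree -type f -print0 | xargs -0 -P 4 grep -l 'TODO'
```

ripgrep bundles approximately this pipeline (plus the ignore logic) into one binary, which is exactly the trade-off being debated here.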
Alternately, maybe people's idea of what "one thing" is ends up being more subjective than it sounds (or at least depends on context). "Searching through my code" at least sounds like a reasonable idea of "one thing", and it's not crazy that someone might consider "don't search through the stuff that isn't my code, like my npm dependencies or my Rust build artifacts" to be part of "doing it well". Having to specify that every time would be annoying, so you might want to put it in a config file; but then if it ends up being identical to your gitignore, having to manually symlink it or copy it each time you modify it is annoying, so it's also not crazy to just use the gitignore by default with a way to opt out of it. Now we're just back where we started: custom .ignore files, fallback to .gitignore, and a flag for when you want to skip that.
Back in the day I would have agreed with you, but ever since there is JS everywhere, you end up with minified JS that's megabytes big and matches everything. I still have muscle memory for grep -r, and it almost always ends up with some JS file I didn't know existed ruining the moment.
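One way to keep plain grep -r usable despite those bundles is to exclude them explicitly; a sketch with invented file names:

```shell
# Skip minified files and dependency directories when recursing.
mkdir -p /tmp/proj/node_modules
printf 'initApp()\n' > /tmp/proj/main.js
printf 'initApp()\n' > /tmp/proj/vendor.min.js        # stand-in for a huge minified file
printf 'initApp()\n' > /tmp/proj/node_modules/dep.js

grep -rl --exclude='*.min.js' --exclude-dir=node_modules 'initApp' /tmp/proj
# prints only /tmp/proj/main.js
```

Of course, typing those excludes every time is the annoyance that ignore-file defaults are trying to solve.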
> Obviously smarter people than I disagree, but my dumdum head just feels that way.
No, you are correct, do not doubt yourself. Baked-in behavior catering to a completely separate tool is bad design. Git is the current version-control software, but it's not the first nor the last. Imagine if we move to another source control and are burdened with .gitignore files. No thanks.
The Unix tools are designed to be good and explicit at their individual jobs so they can be easily composed together to form more complex tools that cater to the task at hand.
I have to agree here. I love ripgrep, but at times I've had to go back to regular grep because I couldn't figure out what it was ignoring and why, and there were far too many settings to figure it out.
It's a tough one. Lately I've been doing rg -u every single time because too many things get ignored and I can't be bothered to figure out how to configure it more cleanly to do what I want by default.
You are absolutely right. It is a good feature, but it must be a conscious decision. It should not be the default. You should set it in your shell alias or environment, just like you have something like
LESS="-FQMR"
(no bell, more status, raw characters, exit if less than one page).
Those are also completely reasonable to use, but they must be set consciously; otherwise they might give results that confuse the user.
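In ripgrep's case the analogous knob is its configuration file: it reads extra default flags from whatever file RIPGREP_CONFIG_PATH points at, one flag per line. A sketch (the specific flags chosen here are just examples):

```shell
# Make the filtering behavior an explicit, global choice,
# the same way the LESS variable is set above.
export RIPGREP_CONFIG_PATH="$HOME/.ripgreprc"

# One flag per line in the config file:
#   --no-ignore-vcs  -> stop consulting .gitignore
#   --hidden         -> search dotfiles too
cat > "$HOME/.ripgreprc" <<'EOF'
--no-ignore-vcs
--hidden
EOF
```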
I’ve read this multiple times over the years and this post is still the most interesting and informative piece describing the problem of making a fast grep-like tool. I love that it doesn’t just describe how ripgrep works but also how all the other tools work and then compares the various techniques. It’s simultaneously a tutorial and an expert deep dive. Just a beautiful piece of writing. In a perfect world, all code would be similarly documented.
Such a good read. I actually went back through it the other day to steal the search-for-the-least-common-byte idea to speed up my search tool https://github.com/boyter/cs which, when coupled with the SIMD upper/lower search technique from fzf, cut the wall-clock runtime by a third.
There was this post from cursor https://cursor.com/blog/fast-regex-search today about building an index for agents due to them hitting a limit on ripgrep, but I’m not sure what codebase they are hitting that warrants it. Especially since they would have to be at 100-200 GB to be getting to 15s of runtime. Unless it’s all matches that is.
When I first heard about ripgrep, my reaction was to laugh. grep was too established; no way something that isn't 100% compatible with grep could get any traction.
And I was dead wrong. Overnight everyone uses rg (me included).
I was using ripgrep once and it had a bug that led me down a terrifying rabbit hole - I can't recall what it was, but it involved not being able to find text that absolutely should have been there.
Eventually I was considering rebuilding the machine completely but for some reason after a very long time digging deep into the rabbit hole I tried plain old grep and there was the data exactly where it should have been.
So it's such a vague story but it was a while back - I don't remember the specifics but I sure recall the panic.
Ripgrep is used as the default search backend for ArchiveBox; such a good tool. I was on ag (the-silver-searcher) for years before I switched, but haven't gone back since.
One thing I learned over the years is that the closer my setup is to the default one, the better. I tried switching to the latest and greatest replacements, such as ack or ripgrep for grep, or httpie for curl, just to always return to the default options. Often, the return was caused by the frustration of not having the new tools installed on the random server I sshed into. It's probably just me being unable to persevere in keeping my environment customized, and I'm happy to see these alternative tools evolve and work for other people.
> The binary name for ripgrep is rg.

I don’t understand when people typeset some name in verbatim, lowercase, but then have another name for the actual command. That’s confusing to me.
Programmers are too enamored with lower-case names. Why not Ripgrep? Then I can surmise that there might not be some program ripgrep(1) (there might be a shorter version), since using capital letters is not traditional for CLI programs.
Look at Stacked Git: https://stacked-git.github.io/

> Stacked Git, StGit for short, is an application for managing Git commits as a stack of patches.
> ... The stg command line tool ...
Now, I’ve been puzzled in the past when inputting stgit doesn’t work. But here they call it StGit for short, and the actual command is typeset in verbatim (stg(1) would have also worked).
I don't remember why I didn't switch from ag, but I remember it was a conscious decision. I think it had something to do with configuration, rg using implicit '.ignore' file (a super-generic name instead of a proper tool-specific config) or even .gitignore, or something else very much unwarranted, that made it annoying to use. Cannot remember, really, only remember that I spent too much time trying to make it behave and decided it isn't worth it. Anyway, faster is nice, but somehow I don't ever feel that ag is too slow for anything. The switch from the previous one (what was it? ack?) felt like a drastic improvement, but ag vs. rg wasn't much difference to me in practice.
https://hwisnu.bearblog.dev/building-cgrep-using-safe_ch-cus...

It seems this was possible because ripgrep is inefficient in CPU usage when run multithreaded, using about 2x more CPU time than GNU grep.
I don't know if this is a coincidence or not, but Cursor just made a post breaking down why they moved to their own solution in place of ripgrep, and it makes a lot of sense from a cursory (haha) read.
Last week I experienced a data truncation issue where I ran an rg -zF fixed string search piped into another rg -F. The dataset was roughly 10 million lines. Doing a single rg -z with a regex glob in the middle didn't encounter that issue.
And burntsushi is one of us: he's regularly here on HN. Big thanks to him. As soon as rg came out I was building it on Linux. Now it ships stock with Debian (since Bookworm? Don't remember): thanks, thanks and more thanks.
It seems to me that rg is the number one most important part that enables LLMs to be smart agents in a codebase. Who would have thought that a code search tool would enable AGI?
Faster is not always the best thing. I still remember when VS Code changed to ripgrep I had to change my habits. Before then, I could just open VS Code on any folder and do something with it, even if the folder contained millions of small text files. It worked fine before, but then rg was picked, and it happily used all of my CPU cores scanning files, leaving me unable to do anything for a while.
To be honest I hate all the new rust replacement tools, they introduce new behavior just for the sake of it, it's annoying.
[1]: https://ugrep.com/
But that’s the kind of problem that only successful things have to worry about.
--ignore-file= flag would be nice I guess:

--ignore-file=.ignore
--ignore-file=.gitignore
--ignore-file=.dockerignore
--ignore-file=.npmignore
etc
but then, assuming all those share the same "ignore file syntax/grammar"...
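Something in this neighborhood already exists: ripgrep has --ignore-file PATH (interpreted with gitignore semantics), and GNU grep has a rough analogue in --exclude-from=FILE, though it takes glob patterns rather than the full gitignore grammar. A sketch with grep, using invented file names:

```shell
# Read exclusion globs from a file instead of repeating them per command.
mkdir -p /tmp/xf
printf 'pattern here\n' > /tmp/xf/app.js
printf 'pattern here\n' > /tmp/xf/bundle.min.js
printf '*.min.js\n'     > /tmp/xf/ignore.txt     # one glob per line

grep -rl --exclude-from=/tmp/xf/ignore.txt 'pattern' /tmp/xf
# prints only /tmp/xf/app.js
```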
It’s fast even on a 300 MHz Octane.
There's also RGA (ripgrep-all) which searches binary files like PDFs, ebooks, doc files: https://github.com/phiresky/ripgrep-all
https://hwisnu.bearblog.dev/levelized-cost-of-resources-in-b...
With 240 log files in various subfolders.
grep -q -r "22:02" --include="*.log" 4.15s user 0.09s system 99% cpu 4.269 total
grep -q -r "22:02" --include="*.log" 4.18s user 0.09s system 99% cpu 4.265 total
grep -q -r "22:02" --include="*.log" 4.31s user 0.09s system 99% cpu 4.401 total
rg -q "22:02" -t log 0.01s user 0.01s system 83% cpu 0.018 total
rg -q "22:02" -t log 0.01s user 0.01s system 93% cpu 0.017 total
rg -q "22:02" -t log 0.01s user 0.01s system 95% cpu 0.018 total
I really did not expect it to be that fast.
https://cursor.com/blog/fast-regex-search
https://reddit.com/r/rust/comments/1fvzfnb/gg_a_fast_more_li...
https://x.com/CharlieMQV/status/1972647630653227054
The TUI is great, and approximate matches are insanely useful.
Someone please make an awesome new sed and awk.