Generally, when I want to run something with that much parallelism, I just write a small Go program instead and let Go's runtime handle the scheduling. It works remarkably well, and there's no execve() overhead either.
So, there are a few reasons why forkrun might work better than this, depending on the situation:
1. if what you want to run is built to be called from a shell (including multi-step shell functions) and not Go. This is the main appeal of forkrun in my opinion - extreme performance without needing to rewrite anything.
2. if you are running on NUMA hardware. Forkrun deals with NUMA hardware remarkably well - it distributes work between nodes almost perfectly with almost 0 cross-node traffic.
Actually, the thing that might need that would be make -j. Maybe it would make sense to make a BSD or GNU make version that integrates those optimizations. I actually run a lot of stuff in Cygwin, and there are huge fork penalties on Windows, so my hope would be to actually get some real payloads faster...
I hate to say it, but forkrun probably won't work in Cygwin. I haven't tried it, but forkrun makes heavy use of Linux-only syscalls that I suspect aren't available in Cygwin.
forkrun might work under WSL2, as it's my understanding that WSL2 runs a full Linux kernel in a hypervisor.
AFAIK, the Go runtime is pretty NUMA-oblivious. The mcache helps a bit with locality of small allocations, but otherwise you aren't going to get the same benefits (though I absolutely hear you about avoiding execve overhead).
So...yes, the execve overhead is real. BUT there's still a lot you can accomplish with pure bash builtins (which don't have the execve overhead). And, if you're open to rewriting things (which would probably be required to some extent if you were to make something intended for shell to run in Go) you can port whatever you need to run into a bash builtin and bypass the execve overhead that way. In fact, doing that is EXACTLY what forkrun does, and is a big part of why it is so fast.
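To make the builtin-vs-external distinction concrete, here is a small demo of my own (not forkrun code): bash's `type -t` tells you whether a command runs in-process, and a quick loop shows why that matters. The timing numbers will vary by machine; the point is only the relative gap.

```shell
# Builtins run inside the bash process; external binaries pay fork+execve
# on every single invocation.
type -t echo          # prints: builtin
type -t /bin/echo     # prints: file  (an external binary, on typical Linux)

# Rough timing. Absolute numbers vary, but the external loop is usually
# orders of magnitude slower because each iteration forks and execs.
time for i in {1..1000}; do echo hi; done >/dev/null
time for i in {1..1000}; do /bin/echo hi; done >/dev/null
```

This is the same trade a loadable builtin exploits: once the hot command lives inside the shell process, the per-invocation fork+execve cost disappears.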
0. vfork (which is sometimes better than CoW fork) + execve if exec is the only outcome of the spawned child. Or, use posix_spawn where available.
1. Inner-loop hot path code {sh,c}ould be made a bash built-in after proving that it's the source of a real performance bottleneck. (Just say "no" to premature optimization.) Otherwise, rewrite the whole thing in something performant enough like C, C++, Rust, etc.
2. I'm curious what the performance of forkrun "echo ." on a billion jobs is vs., say, pure C doing it with 1 worker thread per core.
> I'm curious what the performance of forkrun "echo ." on a billion jobs is vs., say, pure C
Short answer: in its fastest mode, forkrun gets very close to the practical dispatch limit for this kind of workload. A tight C loop would still be faster, but at that point you're no longer comparing “parallel job dispatch”—you're comparing raw in-process execution.
Let me at least try to show what kind of performance forkrun gives here. Let's set up 1 billion newlines in a file on a tmpfs:
cd /tmp
yes $'\n' | head -n 1000000000 > f1
Now let's try frun echo:
time { frun echo <f1 >/dev/null; }
real 0m43.779s
user 20m3.801s
sys 0m11.017s
forkrun in its "standard mode" hits about 25 million lines per second running newlines through a no-op (:), and ever so slightly less (23 million lines a second) running them through echo. The vast majority of this time is bash overhead. forkrun breaks up the lines into batches of (up to) 4096 (but for 1 billion lines the average batch size is probably 4095). Then for each batch, a worker-specific data-reading fd is advanced to the correct byte offset where the data starts, and the worker runs
mapfile -t -n $N -u $fd A # N is typically 4096 here
echo "${A[@]}"
The second command (specifically the array expansion into a long list of quoted empty args) is what takes up the vast majority of the time. frun has a flag (-U) that causes it to replace "${A[@]}" with ${A[*]}, which (in the case of all-empty inputs) collapses the long list of quoted empty args into a string of whitespace -> 0 args. This considerably speeds things up when inputs are all empty.
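You can see that quoting difference directly in bash. This is my own minimal illustration, not forkrun's code:

```shell
# With all-empty inputs, quoted "${A[@]}" and unquoted ${A[*]} behave
# very differently:
A=("" "" "" "")        # a batch of 4 empty lines, as mapfile would store them

set -- "${A[@]}"       # quoted @: each empty element survives as its own arg
echo "$#"              # prints: 4

set -- ${A[*]}         # unquoted *: the empties join into bare whitespace,
echo "$#"              # which word-splits away entirely; prints: 0
```

Four quoted empty args become zero args, so the command being parallelized has far less argument-marshaling work to do.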
time { frun -U echo <f1 >/dev/null; }
real 0m13.295s
user 6m0.567s
sys 0m7.267s
And now we are at 75 million lines per second. But we are still largely limited by passing data through bash... which is why forkrun also has a mode (-s) where it bypasses bash mapfile + array expansion altogether and instead splices (via one of the forkrun loadable builtins) data directly to the stdin of whatever you are parallelizing. If you are parallelizing a bash builtin (where there is no execve cost), forkrun gets REALLY fast.
time { frun -s : < f1; }
real 0m0.985s
user 0m13.894s
sys 0m12.398s
which means it is delimiter scanning, dynamically batching, and distributing (in batches of up to 4096 lines) at a rate of OVER 1 BILLION LINES A SECOND, or ~250,000 batches per second.
At that point the bottleneck is basically just delimiter scanning and kernel-level data movement. There’s very little “scheduler overhead” left to remove—whether you write it in bash+C hybrids (like forkrun) or pure C.
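As an aside, the mapfile batch-read pattern from a few paragraphs up is easy to play with standalone. Here is a toy single-reader sketch of my own (not forkrun's actual code), showing how successive mapfile calls consume consecutive batches from one fd:

```shell
# Toy sketch of the batch-read pattern: each mapfile call picks up
# where the fd's offset left off, so batches never overlap.
printf '%s\n' a b c d e > /tmp/batch_demo.in
N=3
exec {fd}</tmp/batch_demo.in
mapfile -t -n "$N" -u "$fd" A   # first batch: up to N lines
echo "${#A[@]}"                 # prints: 3
mapfile -t -n "$N" -u "$fd" A   # second batch continues from the fd offset
echo "${#A[@]}"                 # prints: 2
exec {fd}<&-
rm -f /tmp/batch_demo.in
```

forkrun's real version adds per-worker fds and seeks each one to a precomputed byte offset before reading, but the core read primitive is this same mapfile -n/-u combination.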
I guess I've never really used parallel for anything that was bound by the dispatch speed of parallel itself. I've always used parallel for running stuff like ffmpeg in a folder of 200+ videos, and the speed with which parallel decides to queue up the jobs is going to be very thoroughly eaten by the cost of ffmpeg itself.
Still, worth a shot.
I have to ask, was this vibe-coded though? I ask because I see multiple em dashes in your description here, and a lot of no X, no Y... notation that Codex seems to be fond of.
ETA: Not vibe coded, I see stuff from four years ago...my mistake!
I like it, and I hope it's soon going to be available in various Linux distributions, along with other modern tools such as fd instead of find, ripgrep instead of grep, and fzf, for instance.
Have you ever run GNU Parallel on a powerful machine just to find one core pegged at 100% while the rest sit mostly idle?
I hit that wall...so I built forkrun.
forkrun is a self-tuning, drop-in replacement for GNU Parallel (and xargs -P) designed for high-frequency, low-latency shell workloads on modern and NUMA hardware (e.g., log processing, text transforms, HPC data prep pipelines).
On my 14-core/28-thread i9-7940x it achieves:
- 200,000+ batch dispatches/sec (vs ~500 for GNU Parallel)
- ~95–99% CPU utilization across all 28 logical cores (vs ~6% for GNU Parallel)
- Typically 50×–400× faster on real high-frequency low-latency workloads (vs GNU Parallel)
These benchmarks are intentionally worst-case (near-zero work per task), where dispatch overhead dominates. This is exactly the regime where GNU Parallel and similar tools struggle — and where forkrun is designed to perform.
A few of the techniques that make this possible:
- Born-local NUMA: stdin is splice()'d into a shared memfd, then pages are placed on the target NUMA node via set_mempolicy(MPOL_BIND) before any worker touches them, so every page is born local to the node that will process it.
- SIMD scanning: per-node indexers use AVX2/NEON to find line boundaries at memory bandwidth and publish byte-offsets and line-counts into per-node lock-free rings.
- Lock-free claiming: workers claim batches with a single atomic_fetch_add — no locks, no CAS retry loops; contention is reduced to a single atomic on one cache line.
- Memory management: a background thread uses fallocate(PUNCH_HOLE) to reclaim space without breaking the logical offset system.
…and that’s just the surface. The implementation uses many additional systems-level techniques (phase-aware tail handling, adaptive batching, early-flush detection, etc.) to eliminate overhead at every stage.
In its fastest (-b) mode (fixed-size batches, minimal processing), it can exceed 1B lines/sec. In typical streaming workloads it's often 50×–400× faster than GNU Parallel.
forkrun ships as a single bash file with an embedded, self-extracting C extension — no Perl, no Python, no install, full native support for parallelizing arbitrary shell functions. The binary is built in public GitHub Actions so you can trace it back to CI (see the GitHub "Blame" on the line containing the base64 embeddings).
- Benchmarking scripts and raw results: https://github.com/jkool702/forkrun/blob/main/BENCHMARKS
- Architecture deep-dive: https://github.com/jkool702/forkrun/blob/main/DOCS
- Repo: https://github.com/jkool702/forkrun
Trying it is literally two commands:
Happy to answer questions.

I tried rush in my test case, which is along the lines of this: … Although to adapt to your style I did this instead: … In a directory containing 14k data files. I think your reference should be rush, not a Perl script.