I will confess to skimming by the end. But I don’t think they explained how they solved the cache issue except to say they rewrote the software in Rust, which is pretty vague.
Was all the code they rewrote originally in Lua? So was it just a matter of moving from a dynamic language with pointer-heavy data structures to a static language with value types and more control over memory layout? Or was there something else going on?
The gains in lower memory footprint and lower demands on memory bandwidth from rewriting stuff to Rust are very real, and they're going to matter a lot with DRAM prices being up 5x or more. It doesn't surprise me at all that they would be getting these results.
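Not their actual code, obviously, but to illustrate what "value types and control over memory layout" buys you over a pointer-heavy dynamic-language representation, here's a toy Rust sketch (the struct names and fields are made up for illustration):

```rust
// Pointer-heavy layout: each field is a separate heap allocation,
// so iterating chases pointers and touches scattered cache lines.
struct BoxedEntry {
    key: Box<String>,
    ttl: Box<u64>,
}

// Value layout: fields are stored inline and entries sit contiguously
// in the Vec, so a scan streams through memory prefetcher-friendly.
#[derive(Clone)]
struct FlatEntry {
    key_hash: u64, // fixed-size value instead of an owned string
    ttl: u64,
}

fn sum_ttls_boxed(entries: &[BoxedEntry]) -> u64 {
    entries.iter().map(|e| *e.ttl).sum()
}

fn sum_ttls_flat(entries: &[FlatEntry]) -> u64 {
    entries.iter().map(|e| e.ttl).sum()
}

fn main() {
    let boxed: Vec<BoxedEntry> = (0..1_000)
        .map(|i| BoxedEntry { key: Box::new(i.to_string()), ttl: Box::new(i) })
        .collect();
    let flat: Vec<FlatEntry> = (0..1_000)
        .map(|i| FlatEntry { key_hash: i, ttl: i })
        .collect();
    println!("{} {}", sum_ttls_boxed(&boxed), sum_ttls_flat(&flat));
}
```

The flat version keeps entries contiguous, so every cache line fetched contains useful neighbors, which is exactly the kind of thing that shrinks both footprint and memory bandwidth demand.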
All of this should be tuned with an AMD CPU expert, with programmers adjusting code under their guidance to take advantage of every CPU feature.
Did AMD engineers or seasoned hardware experts from the server vendor assist in this implementation?
Were the "Nodes Per Socket", "CCX as NUMA", "Last Level Cache as NUMA" settings tested/optimized? I don't see them mentioned in the article. They can make A LOT of difference for different workloads, and there's no single setting/single recommendation that would fit all scenarios.
"The locality of cores, memory, and IO hub/devices in a NUMA-based system is an important factor when tuning for performance” - „AMD EPYC 9005 Processor Architecture Overview” page 7
What was the RAM configuration? 12 DIMM modules (optimal) or 24 (suboptimal)?
Was virtualization involved? If so, how was it configured? How does bare-metal performance compare to a virtualized system for this specific code?
So many opportunities to explore not mentioned in the text.
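For anyone who wants to check their own box: the effect of those BIOS knobs (NPS, CCX/LLC as NUMA) is at least visible from userspace in what the kernel exposes. A rough Rust sketch reading the standard Linux sysfs paths, with minimal error handling:

```rust
use std::fs;

// Print which logical CPUs belong to each NUMA node the kernel exposes.
// With NPS1 you'd typically see one node per socket; with "LLC as NUMA"
// every CCX/L3 shows up as its own node.
fn main() -> std::io::Result<()> {
    for entry in fs::read_dir("/sys/devices/system/node")? {
        let path = entry?.path();
        let name = path.file_name().unwrap().to_string_lossy().into_owned();
        if !name.starts_with("node") || !name[4..].chars().all(|c| c.is_ascii_digit()) {
            continue;
        }
        let cpulist = fs::read_to_string(path.join("cpulist"))?;
        let meminfo = fs::read_to_string(path.join("meminfo")).unwrap_or_default();
        let total_line = meminfo.lines().find(|l| l.contains("MemTotal")).unwrap_or("");
        println!("{name}: cpus {} | {}", cpulist.trim(), total_line.trim());
    }
    Ok(())
}
```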
Reminds me of that time when the cheap Celeron with a small cache was beating the expensive Pentium with a large cache (if I remember correctly, the Celeron's cache ran at core frequency while the Pentium's was on a separate die at half frequency, and the Celeron was very overclockable).
Is the Linux scheduler aware of shared CPU cache hierarchies? Is there any way we could make the scheduler do better cache utilization rather than pinning processes to cores or offloading these decisions to vendor-specific code?
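As far as I know, the scheduler does build its domains around the shared LLC, so it prefers to keep a waking task near its cache, but it won't hard-partition work across CCXs for you. If you want to do that yourself without vendor-specific tooling, the LLC groups are exposed in sysfs. A read-only Rust sketch (treating index3 as the L3 is an assumption that holds on current EPYC parts; the actual pinning via taskset or sched_setaffinity is left as a comment):

```rust
use std::collections::BTreeMap;
use std::fs;

// Group logical CPUs by the set of CPUs they share an L3 (LLC) with.
// On EPYC, each group is typically one CCX. Pinning a process to one
// group (e.g. `taskset -c <list>` or sched_setaffinity) keeps its
// working set inside a single L3 instead of bouncing between them.
fn main() -> std::io::Result<()> {
    let mut llc_groups: BTreeMap<String, Vec<String>> = BTreeMap::new();
    for entry in fs::read_dir("/sys/devices/system/cpu")? {
        let path = entry?.path();
        let name = path.file_name().unwrap().to_string_lossy().into_owned();
        if !name.starts_with("cpu")
            || name.len() == 3
            || !name[3..].chars().all(|c| c.is_ascii_digit())
        {
            continue;
        }
        // index3 is usually the unified L3; skip CPUs where it's absent.
        if let Ok(shared) = fs::read_to_string(path.join("cache/index3/shared_cpu_list")) {
            llc_groups.entry(shared.trim().to_string()).or_default().push(name);
        }
    }
    for (shared, cpus) in &llc_groups {
        println!("L3 shared by CPUs {shared} ({} logical CPUs found)", cpus.len());
    }
    Ok(())
}
```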
Nah I'm good thanks. Slop takes more effort to read and just raises questions of accuracy. It's just disrespectful to your reader to put that work on them. And in a marketing blog post it's just a bad idea.
> trading cache for cores
Viva el Celeron
> for 2x performance
You wish.
Are people at Cloudflare so young that they haven't heard of the Celeron and Duron?
I.e. what's the FL2 benchmark on Gen 12 compared to FL1?