Cloudflare's Gen 13 servers: trading cache for cores for 2x performance (blog.cloudflare.com)

by wmf 30 comments 84 points


[−] gdwatson 50d ago
I will confess to skimming by the end. But I don’t think they explained how they solved the cache issue except to say they rewrote the software in Rust, which is pretty vague.

Was all the code they rewrote originally in Lua? So was it just a matter of moving from a dynamic language with pointer-heavy data structures to a static language with value types and more control over memory layout? Or was there something else going on?
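The layout difference the parent describes can be sketched in a few lines of Python, which boxes every number the way a Lua table of values would, versus the `array` module's packed buffer, which behaves like a Rust `Vec` of plain structs. (A rough illustration of the general point, not Cloudflare's actual workload.)

```python
import sys
from array import array

# Pointer-heavy layout: a Python list holds one pointer per element,
# each pointing at a separately heap-allocated float object.
n = 1024
boxed = [i + 0.5 for i in range(n)]

# Value-type layout: 1024 doubles packed back to back, 8 bytes each,
# so a linear scan touches one contiguous region of cache lines.
inline = array("d", (i + 0.5 for i in range(n)))

list_bytes = sys.getsizeof(boxed) + sum(sys.getsizeof(x) for x in boxed)
array_bytes = sys.getsizeof(inline)

print(f"list of float objects: {list_bytes} bytes")
print(f"packed doubles:        {array_bytes} bytes")
```

On CPython the boxed version is several times larger per element, and (more importantly for cache behavior) scattered across the heap rather than contiguous.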

[−] zozbot234 50d ago
The gains in lower memory footprint and lower demands on memory bandwidth from rewriting stuff to Rust are very real, and they're going to matter a lot with DRAM prices being up 5x or more. It doesn't surprise me at all that they would be getting these results.
[−] jshier 50d ago
They posted about the Rust rewrite last year. https://blog.cloudflare.com/20-percent-internet-upgrade/
[−] alberth 50d ago
It seems like the unspoken takeaway is just how shockingly performant LuaJIT is, even relative to Rust.
[−] HackerThemAll 50d ago
All of this should be tuned with an AMD CPU expert, with programmers adjusting code under their guidance to leverage all CPU features.

Did AMD engineers, or seasoned hardware experts from the server vendor, assist in this implementation?

Were the "Nodes Per Socket", "CCX as NUMA", "Last Level Cache as NUMA" settings tested/optimized? I don't see them mentioned in the article. They can make A LOT of difference for different workloads, and there's no single setting/single recommendation that would fit all scenarios.

"The locality of cores, memory, and IO hub/devices in a NUMA-based system is an important factor when tuning for performance" ("AMD EPYC 9005 Processor Architecture Overview", page 7)

What was the RAM configuration? 12 DIMM modules (optimal) or 24 (suboptimal)?

Was virtualization involved? If so, how was it configured? How does bare-metal performance compare to a virtualized system for this specific code?

So many opportunities to explore not mentioned in the text.
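For what it's worth, the effect of the "Nodes Per Socket" / "LLC as NUMA" BIOS options mentioned above is visible from Linux sysfs: each exposed NUMA node appears as a directory, so seeing more nodes than sockets means one of those options is in effect. A minimal, Linux-only sketch (on other platforms the glob simply matches nothing):

```python
from pathlib import Path

# Each NUMA node the firmware exposes shows up as /sys/devices/system/node/nodeN.
# With NPS1 on a single socket you'd see one node; with "LLC as NUMA"
# enabled you'd see one node per L3 complex instead.
nodes = sorted(Path("/sys/devices/system/node").glob("node[0-9]*"))
if nodes:
    for node in nodes:
        cpus = (node / "cpulist").read_text().strip()
        print(f"{node.name}: CPUs {cpus}")
else:
    print("no NUMA topology exposed on this platform")
```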

[−] trhway 50d ago
Reminds me of the time when the cheap Celeron with a small cache beat the expensive Pentium with a large cache (if I remember correctly, the Celeron's cache ran at core frequency while the Pentium's was on a separate die at half frequency, and the Celeron was very overclockable)
[−] hulitu 50d ago

> trading cache for cores

Viva el Celeron

> for 2x performance

You wish.

Are people at Cloudflare so young that they haven't heard of the Celeron and Duron?

[−] synack 50d ago
Is the Linux scheduler aware of shared CPU cache hierarchies? Is there any way we could make the scheduler do better cache utilization rather than pinning processes to cores or offloading these decisions to vendor specific code?
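Short of vendor-specific code, the portable escape hatch is exactly the pinning the parent mentions. A Linux-only sketch using the stdlib (the core numbers are assumptions; on a real box you'd pick cores that share an L3 from the topology):

```python
import os

# Pin the current process to two cores that, on a typical part, share an
# L3 slice, rather than letting the scheduler migrate it across LLC
# domains. Cores {0, 1} are a placeholder choice for illustration.
if hasattr(os, "sched_setaffinity"):
    before = os.sched_getaffinity(0)
    target = {0, 1} & before or before  # only request cores we actually have
    os.sched_setaffinity(0, target)
    print("pinned to:", sorted(os.sched_getaffinity(0)))
    os.sched_setaffinity(0, before)  # restore the original mask
else:
    print("sched_setaffinity not available on this platform")
```

This answers "where does my process run" but not the question asked, which is whether the scheduler could make this choice itself from the cache hierarchy it already knows about.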
[−] attentive 50d ago
That was annoying to read because there is no easy way to see the impact of each change. It's FL2 + Gen 13 combined.

I.e. what's the FL2 benchmark on Gen 12 compared to FL1?

[−] AbuAssar 50d ago
Epyc’s naming is beautiful and consistent
[−] otterley 50d ago
This post sponsored by AMD®.
[−] howdyhowdy 50d ago
Someday someone will deploy CXL
[−] danpalmer 50d ago
> The tradeoff: The opportunity: Proving it out:

Nah, I'm good, thanks. Slop takes more effort to read and raises questions of accuracy. It's disrespectful to your readers to put that work on them, and in a marketing blog post it's simply a bad idea.

[−] 11thDwarf 50d ago
[flagged]