Love the format, and super cool to see a benchmark that so clearly shows DRAM refresh stalls, especially avoiding them via reverse engineering the channel layout! Ran it on my 9950X3D machine with dual-channel DDR5 and saw clear spikes from 70ns to 330ns every 15us or so.
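For anyone who wants to poke at this themselves, here's a rough Python sketch of the sampling side of the experiment (not the video's harness: Python can't clflush or pin channels, and interpreter overhead dominates a single read, so expect the shape of the distribution rather than the 70ns/330ns absolute numbers):

```python
import random
import time

def sample_read_latencies(buf_mib=64, samples=20000, stride=4096):
    """Time single strided reads over a buffer much larger than the LLC."""
    buf = bytearray(buf_mib * 1024 * 1024)
    slots = len(buf) // stride
    offsets = [random.randrange(slots) * stride for _ in range(samples)]
    latencies = []
    for off in offsets:
        t0 = time.perf_counter_ns()
        _ = buf[off]                      # one dependent read
        latencies.append(time.perf_counter_ns() - t0)
    latencies.sort()
    return latencies

lat = sample_read_latencies(buf_mib=64, samples=5000)
print(f"p50={lat[len(lat) // 2]}ns  p99.9={lat[int(len(lat) * 0.999)]}ns")
```

The buffer needs to be well past LLC size or most samples just time cache hits.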
The hedging technique is a cool demo too, but I’m not sure it’s practical.
At a high level it’s a bit contradictory; trying to reduce the tail latency of cold reads by doubling the cache footprint makes every other read even colder.
I understand the premise is “data larger than cache” given the clflush, but even then you’re spending 2x the memory bandwidth and cache pressure to shave ~250ns off spikes that only happen once every 15us. There’s just not a realistic scenario where that helps.
HFT especially is significantly more complex than a huge lookup table in DRAM. In the time you spend doing a handful of 70ns DRAM reads, your competitor has done hundreds of reads from cache and a bunch of math. It’s just far better to work with what you can fit in cache, and to shrink what doesn’t as much as possible.
Another point about HFT - They're mostly using FPGAs (some use custom silicon) which means that they have much tighter control over how DRAM is accessed and how the memory controller is configured. They could implement this in hardware if they really need to, but it wouldn't be at the OS level.
Not really. FPGAs run some simple proven strategies, but hardware guys can't keep up with how fast quants iterate on new strategies - so CPUs stay relevant.
> At a high level it’s a bit contradictory; trying to reduce the tail latency of cold reads by doubling the cache footprint makes every other read even colder.
That’s my main hang up as well. On one hand this is undeniably cool work, but on the other, efficient cache usage is how you maximize throughput.
This optimizes for (narrow) tail latency, but I do wonder at what performance cost. I would be super interested in hearing about real world use cases.
This might be useful in a case where a small lookup table or similar is often pushed out of cache, such that lookups are usually cold, yet the lookup data is small enough not to cause issues with cache pollution, increased bandwidth, or memory consumption.
It could be massively improved with a special CPU instruction for racing dram reads. That might make it actually useful for real applications. As it is, the threading model she used here would make it incredibly difficult to use this in a real program.
There’s no point racing DRAM reads explicitly. Refreshes are infrequent and the penalty is like 5x on an already fast operation, 1% of the time.
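Back-of-envelope on those numbers (using the 70ns/330ns/15us figures reported upthread, and assuming a worst case of back-to-back dependent DRAM reads):

```python
baseline_ns = 70        # typical DRAM read latency reported upthread
spike_ns = 330          # worst-case read that lands on a refresh stall
period_ns = 15_000      # one spike roughly every 15 us

reads_per_period = period_ns / baseline_ns          # ~214 back-to-back reads
spike_fraction = 1 / reads_per_period               # ~0.5% of reads stall
extra_per_read = spike_fraction * (spike_ns - baseline_ns)
overhead = extra_per_read / baseline_ns

print(f"{spike_fraction:.2%} of reads stall; amortized cost ~{overhead:.1%}")
# → 0.47% of reads stall; amortized cost ~1.7%
```

Real access streams aren't pure dependent DRAM reads, so the true amortized cost is lower still.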
What’s better is to “race” against cache, which is 100x faster than DRAM. CPUs already do this for independent loads via out-of-order execution: while one load is stalled waiting for DRAM, another can hit the cache and do some compute in parallel. It’s all already handled at the microarchitectural level.
There are already systems that do this in hardware. Any system that has memory mirroring RAS features can do this, notably IBM zEnterprise hardware, you know, the company that this video promoter claims to be one-upping.
The memory controller sends the read to the DIMM that is not refreshing. It is invisible to software, except for the side-effect of having better performance.
Mirroring is more of a reliability feature though, no? From my understanding it’s like RAID where you keep multiple copies plus parity so uncorrectable errors aren’t catastrophic. Makes sense for mainframes which need to survive hardware failures.
Refresh avoidance is a tangential thing the memory controller happens to be able to do in a scheme like that, but you’d really have to be looking at it in a vacuum to bill it as a benefit.
Like I said, it’s all about cache. You’re not going to DRAM if you actually care about performance fluctuations at the scale of refresh stalls.
Clearly, hitting a cache would be the better outcome. The technique suggested here could only apply to unavoidably cold reads, some kind of table that's massive and randomly accessed. Assume it exists, for whatever reason. To answer your question, refresh avoidance is an advertised benefit of hardware mirroring. Current IBM techno-advertising that you can Google yourself says this:
"IBM z17 implements an enhanced redundant array of independent memory (RAIM) design with the following features: ... Staggered memory refresh: Uses RAIM to mask memory refresh latency."
I can google, thanks. My point is that nobody is buying mainframes with redundant memory to avoid refresh stalls. It’s a mostly irrelevant freebie on hardware you bought for fault tolerance.
Do you have evidence that this is a fact? Have you looked at the computing requirements documents for, for example, stock exchanges?
I have it on good authority that stock exchanges ran on mainframes. They are essentially the counterparty (in a computing sense, not a financial sense) in each placed order.
If someone is willing to run a fiberoptic cable from Chicago to New York or New Jersey to exploit reduced propagation delay, admittedly much larger than a refresh stall, wouldn't you think that they, or someone else, would also be interested in predicting computing stalls? An exchange would face at least a significant reputational risk if it could be exploited that way.
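For scale, a quick comparison (the 1,200 km figure is a round number for the Chicago to New Jersey route, and ~2/3 of c is the usual approximation for light in fiber):

```python
distance_km = 1200          # assumed fiber route length, Chicago to NJ
speed_km_per_s = 200_000    # ~2/3 of c, typical for light in glass

one_way_us = distance_km / speed_km_per_s * 1e6     # microseconds
refresh_stall_us = 0.26     # ~260 ns extra on a read hitting a refresh

print(f"propagation ~{one_way_us:.0f} us, refresh stall ~{refresh_stall_us} us")
# → propagation ~6000 us, refresh stall ~0.26 us
```

So the stall is four to five orders of magnitude smaller than the wire delay it's being compared to.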
Isn't that rather trivial though as a source of tail latency? There's much worse spikes coming from other sources, e.g. power management states within the CPU and possibly other hardware. At the end of the day, this is why simple microcontrollers are still preferred for hard RT workloads. This work doesn't change that in any way.
It is not only not practical, it is a completely useless technique. I got downvoted to negative infinity for mentioning this, but I guess I am the only person who actually read the benchmark. The reason the technique "works" in the benchmark is that all the threads run free and just record their timestamps. The winner is decided post hoc. This behavior is utterly pointless for real systems. In a real system you need to decide the winner online, which means the winner needs to signal somehow that it has won, and suppress the side effects of the losers, a multi-core coordination problem that wipes out most of the benefit of the tail improvement but, more importantly, also massively worsens the median latency.
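To make that concrete, here's a hedged sketch of what deciding the winner online would look like (illustrative Python, not the benchmark's actual code; spawning threads per read is obviously far too slow for 70ns operations, but the shared synchronization point is the part that matters):

```python
import threading

def hedged_read(read_a, read_b):
    """Race two read callables and return whichever finishes first.

    Deciding the winner *online* means both racers contend on shared
    state; that synchronization, absent from a post-hoc benchmark, is
    paid on every read, fast or slow.
    """
    first_done = threading.Event()
    slot = []
    lock = threading.Lock()

    def racer(read):
        value = read()
        with lock:                 # loser's result is simply dropped here
            if not slot:
                slot.append(value)
                first_done.set()

    for read in (read_a, read_b):
        threading.Thread(target=racer, args=(read,)).start()
    first_done.wait()
    return slot[0]

print(hedged_read(lambda: "copy A", lambda: "copy B"))  # whichever won
```

A real version would use pinned, spinning worker threads instead of per-call spawns, but the cross-core handoff cost remains on the critical path.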
A more accurate but less inspiring title would be:
RAM Has a Design Tradeoff from 1966. I made another one on top.
The first tradeoff, of 6x fewer transistors for some extra latency, is immensely beneficial. The second, of reducing some of that extra latency for extra copies of static data, is beneficial only to some extremely niche application. Still a very educational video about modern memory architecture.
[EDIT: accidental extra copy of this comment deleted]
This is very much worth watching. It is a tour de force.
Halfway through this great video and I have two questions:
1) Can we take this library and turn it into a generic driver or something that applies the technique to all software (kernel and userspace) running on the system? i.e. if I want to halve my effective memory in order to completely eliminate the tail latency problem, without having to rewrite legacy software to implement this invention.
2) What model miniature smoke machine is that? I instruct volunteer firefighters and occasionally do scale model demos to teach ventilation concepts. Some research years back led me to the "Tiny FX" fogger which works great, but it's expensive and this thing looks even more convenient.
LaurieWired is so incredibly smart, and so incredibly nerdy :-D
Really enjoyed this video, and I'm pretty picky. I learned a lot, even though I already know (or thought I knew) quite a bit about this subject, as it was a particular interest of mine in Comp Sci school. I highly recommend it. Skip forward through chunks of the train part where she is messing around, but it does get more informative later, so don't skip all of it.
In effect, if the operating system knew about the DRAM layout, it could for instance duplicate critical data structures and race the processing. Maybe this would be helpful in networking.
On the other hand this can maybe get fixed in hardware by just copying the page that's being refreshed to the side somewhere, eliminating the whole waiting problem. Last but not least, AFAIK writes to a row already recharge the capacitors so there shouldn't be a need to refresh it. What am I missing?
This is a cool idea, very well presented for everyone to understand such an esoteric concept.
However I wonder if the core idea itself is useful in practice. With modern memory there are two main aspects it makes worse. First is cost: it needs double the memory for the same compute, which is not good with memory prices already soaring. The other main issue is throughput; I haven’t put enough thought into that yet, but it feels like it requires more orchestration and increases costs there too.
I suppose hardware folks could cook up some kind of RAID 5 style striped RAM layout and utilize a hedging strategy. Write out parity, drop a late parity read instead of using it for error correction. You get better results than double the RAM needed with a striping strategy.
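The XOR-parity core of that idea fits in a few lines (a toy two-stripes-plus-parity sketch, not a real memory-controller design):

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

# Two data stripes plus one parity stripe, RAID-5 style.
stripe_a = b"hot lookup data."
stripe_b = b"more table rows!"
parity = xor_bytes(stripe_a, stripe_b)

# If the read of stripe_b is the late one (say its rank is mid-refresh),
# rebuild it from the two reads that did come back instead of waiting:
recovered = xor_bytes(stripe_a, parity)
assert recovered == stripe_b
```

Storage overhead drops from 2x (mirroring) to (N+1)/N for N data stripes, but every hedged read now costs N stripes of bandwidth plus the XOR.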
I guess this would have been a nice way to reduce HDD latency as a new RAID mode... oh well.
This is so cool! One thing that occurred to me while watching the video: would it potentially make more sense to just do this in the kernel? That way, you don't have to fight virtual addressing, and I imagine (?) you could even know for sure which channel you're on instead of guessing.
I haven't had time to see the whole thing yet, but I'm quite surprised this yielded good results. If this works I would have expected CPU implementations to do some optimization around this by default given the memory latency bottleneck of the last 1.5 decades. What am I missing here?
Would be cool to see the performance difference for llama2.c or see it work for gddr on gpus too with nanogpt, though I guess the latter might or might not be possible because of architecture differences.
Voxel Space[1] could have used this, would that multicore had been prevalent at the time. I recall being fascinated that simply facing the camera north or south would knock off 2fps from an already slow frame rate.
Many of our maps' routes would be laid out in a predominantly east- or west-facing track to maximize our staying within cache lines as we marched our rays up the screen.
So, we needed as much main memory bandwidth as we could get. I remember experimenting with cache line warming to try to keep the memory controllers saturated with work with measurable success. But it would have been difficult in Voxel Space to predict which lines to warm (and when), so nothing came of it.
Tailslayer would have given us an edge by just splitting up the scene with multiprocessing and with a lot more RAM usage and without any other code. Alas, hardware like that was like 15 years in the future. Le sigh.
Laurie does an amazing job of reimagining Google's strange job optimisation technique (for jobs running on hard disk storage) that uses 2 CPUs to do the same job. The technique simply takes the result of the machine that finishes it first, discarding the slower job's results... It seems expensive in resources, but it works and allows high priority tasks to run optimally.
Laurie re-imagines this process but for RAM!! In doing this she needs to deal with Cores, RAM channels and other relatively undocumented CPU memory management features.
She was even able to work out various undocumented CPU/RAM settings by using her tool to find where timing differences exposed various CPU settings.
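That Google technique sounds like the "hedged requests" idea from Dean and Barroso's "The Tail at Scale" paper; the pattern itself is a few lines in Python (a sketch assuming the duplicated job is a pure function with no side effects to suppress):

```python
from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait

def hedged(job, arg, pool):
    """Submit the same job twice; take whichever copy finishes first."""
    futures = [pool.submit(job, arg) for _ in range(2)]
    done, not_done = wait(futures, return_when=FIRST_COMPLETED)
    for f in not_done:
        f.cancel()            # best effort; an already-running copy is ignored
    return next(iter(done)).result()

with ThreadPoolExecutor(max_workers=2) as pool:
    print(hedged(lambda x: x * x, 7, pool))  # → 49
```

The video's contribution is pushing that pattern down from whole jobs to individual DRAM reads, where the coordination costs bite much harder.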
She's turned "Tailslayer" into a lib now, available on Github, https://github.com/LaurieWired/tailslayer
You can see her having so much fun, doing cool victory dances as she works out ways of getting around each of the issues that she finds.
The experimentation, explanation and graphing of results is fantastic. Amazing stuff. Perhaps someone will use this somewhere?
As mentioned in the YT comments, the work done here is probably a Master's degrees worth of work, experimentation and documentation.
Go Laurie!
[1] https://en.wikipedia.org/wiki/Voxel_Space