Nice one. K-Means is one of those neat little powertools that once you get the hang of it you find more and more applications for, but it can be a bit slow for larger data sets. So this is very nice to have, thank you matt_d for posting.
Do they mean deterministic k-means, k-means++ ... ? Global optimal k-means is NP-Hard, so linear speedups aren't terribly helpful. It's nice, until you add more input. Standard k-means would be nice, or the k-means++ seed algorithm.
Does this have corresponding speed ups or memory gains for normal CPUs too? Just thinking about all the cups of coffee that have been made and drunk while scikit-learn kmeans chugs through a notebook :)
> Abstract: [...] Flash-kmeans introduces two core kernel-level innovations: (1) FlashAssign, which fuses distance computation with an online argmin to completely bypass intermediate memory materialization;
> (2) sort-inverse update, which explicitly constructs an inverse mapping to transform high-contention atomic scatters into high-bandwidth, segment-level localized reductions.
> Furthermore, we integrate algorithm-system co-designs, including chunked-stream overlap and cache-aware compile heuristics, to ensure practical deployability.
> [...] flash-kmeans achieves up to 17.9X end-to-end speedup over best baselines, while outperforming industry-standard libraries like cuML and FAISS by 33X and over 200X, respectively.
Nice. I guess this must have existed in some form for GPUs before this, and that this is just better with the latest software stack?
This could be generally useful for a lot of algorithms like kmeans(ish) like Cut algorithm that projects points into eigen space — as opposed to the kernel space in Kmeans — need the affinity matrix(distance for K points).
14 comments
http://arxiv.org/pdf/2505.18875
> Abstract: [...] Flash-kmeans introduces two core kernel-level innovations: (1) FlashAssign, which fuses distance computation with an online argmin to completely bypass intermediate memory materialization;
> (2) sort-inverse update, which explicitly constructs an inverse mapping to transform high-contention atomic scatters into high-bandwidth, segment-level localized reductions.
> Furthermore, we integrate algorithm-system co-designs, including chunked-stream overlap and cache-aware compile heuristics, to ensure practical deployability.
> [...] flash-kmeans achieves up to 17.9X end-to-end speedup over best baselines, while outperforming industry-standard libraries like cuML and FAISS by 33X and over 200X, respectively.
k-means clustering > Algorithms > Variations: https://en.wikipedia.org/wiki/K-means_clustering#Variations
This could be generally useful for a lot of algorithms like kmeans(ish) like Cut algorithm that projects points into eigen space — as opposed to the kernel space in Kmeans — need the affinity matrix(distance for K points).