Flash-KMeans: Fast and Memory-Efficient Exact K-Means

[−] frakt0x90 57d ago

They created this in service of their video generation model which "clusters and reorders tokens based on semantic similarity using k-means.":

http://arxiv.org/pdf/2505.18875

[−] jacquesm 57d ago

Nice one. K-Means is one of those neat little powertools that once you get the hang of it you find more and more applications for, but it can be a bit slow for larger data sets. So this is very nice to have, thank you matt_d for posting.

[−] leecarraher 57d ago

Do they mean deterministic k-means, k-means++ ... ? Global optimal k-means is NP-Hard, so linear speedups aren't terribly helpful. It's nice, until you add more input. Standard k-means would be nice, or the k-means++ seed algorithm.

[−] wood_spirit 57d ago

Does this have corresponding speed ups or memory gains for normal CPUs too? Just thinking about all the cups of coffee that have been made and drunk while scikit-learn kmeans chugs through a notebook :)

[−] westurner 57d ago

ScholarlyArticle: "Flash-KMeans: Fast and Memory-Efficient Exact K-Means" (2026) https://arxiv.org/abs/2603.09229 .. gscholar: https://scholar.google.com/scholar?hl=en&as_sdt=0%2C43&q=Fla... :

> Abstract: [...] Flash-kmeans introduces two core kernel-level innovations: (1) FlashAssign, which fuses distance computation with an online argmin to completely bypass intermediate memory materialization;

> (2) sort-inverse update, which explicitly constructs an inverse mapping to transform high-contention atomic scatters into high-bandwidth, segment-level localized reductions.

> Furthermore, we integrate algorithm-system co-designs, including chunked-stream overlap and cache-aware compile heuristics, to ensure practical deployability.

> [...] flash-kmeans achieves up to 17.9X end-to-end speedup over best baselines, while outperforming industry-standard libraries like cuML and FAISS by 33X and over 200X, respectively.

k-means clustering > Algorithms > Variations: https://en.wikipedia.org/wiki/K-means_clustering#Variations

[−] ghm2199 57d ago

Nice. I guess this must have existed in some form for GPUs before this, and that this is just better with the latest software stack?

This could be generally useful for a lot of algorithms like kmeans(ish) like Cut algorithm that projects points into eigen space — as opposed to the kernel space in Kmeans — need the affinity matrix(distance for K points).

[−] matrix2596 57d ago

looks like flash attention concepts applied to kmeans, nice speedup results

[−] maiconburn 57d ago

[dead]

Flash-KMeans: Fast and Memory-Efficient Exact K-Means (arxiv.org)

14 comments