Kyle's paper on Stochastic Attention accepted to ICML 2026!

Kyle Lee's paper on Stochastic Attention has been accepted to the International Conference on Machine Learning (ICML) 2026

May 5, 2026

What can probabilistic computing do about the formidable memory wall in AI inference?

In our recently accepted #ICML2026 paper, out on arXiv tonight, we develop a series of stochastic algorithms showing that much of the memory movement behind every token we generate is, in a precise sense, redundant.

We call our method SANTA: Stochastic Additive No-mulT Attention. SANTA sparsifies value-cache access by sampling a small set of indices from the post-softmax distribution and aggregating only those value rows.
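
To make the mechanism concrete, here is a minimal NumPy sketch of that sampling step as we read it from the description above; the function name, the choice of k_samples, and the uniform averaging are our assumptions for illustration, not the paper's implementation.

    import numpy as np

    def santa_attention(q, K, V, k_samples=16, rng=None):
        # Sketch of value-cache sparsification as described above:
        # sample row indices from the post-softmax distribution,
        # then gather-and-add only those value rows.
        rng = rng or np.random.default_rng()
        logits = (K @ q) / np.sqrt(q.shape[0])  # attention scores, shape (n,)
        logits -= logits.max()                  # numerical stability
        p = np.exp(logits)
        p /= p.sum()                            # post-softmax distribution
        idx = rng.choice(p.shape[0], size=k_samples, p=p)  # sampled indices
        return V[idx].mean(axis=0)  # unbiased estimate of sum_i p_i * V[i]

Because the indices are drawn from the softmax distribution itself, the sample mean is an unbiased Monte Carlo estimate of the exact attention output sum_i p_i V[i], and only k_samples rows of the value cache are ever read.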

Multiply-accumulates become gather-and-add operations. We measure end-to-end speedups over optimized kernels on present-day GPUs; however, most of the advantage lies in developing the right Stochastic Processing Units for the job.
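
As a back-of-the-envelope illustration (our numbers, not results from the paper): with a context length of n = 4096 and head dimension d = 128, exact value aggregation reads and multiply-accumulates all n x d = 524,288 value-cache entries per head per token, while sampling k = 16 rows touches only k x d = 2,048 entries using additions alone, a 256x reduction in value-cache traffic traded against Monte Carlo variance.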

We also introduce Bernoulli qK^T sampling to sparsify the score stage using stochastic ternary queries. Both techniques are orthogonal to ternary quantization, low-rank projections, and KV-cache compression.
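
The post doesn't spell out the Bernoulli scheme, so the sketch below shows one standard unbiased construction consistent with the description: stochastically round the query to {-1, 0, +1} so that the qK^T score stage reduces to signed additions over K. The function names and the scaling by s are our assumptions.

    import numpy as np

    def ternary_query(q, rng=None):
        # Keep sign(q_j) with probability |q_j| / s (Bernoulli), zero it
        # otherwise, so E[t] = q / s and t is ternary in {-1, 0, +1}.
        rng = rng or np.random.default_rng()
        s = np.abs(q).max()
        keep = rng.random(q.shape) < np.abs(q) / s
        return np.sign(q) * keep, s

    def bernoulli_scores(q, K, rng=None):
        t, s = ternary_query(q, rng)
        # s * (K @ t) is an unbiased estimate of K @ q. NumPy still
        # multiplies here, but with ternary t the product reduces to
        # signed adds and skips, i.e. a multiplier-free score stage.
        return s * (K @ t)

Zeroed query coordinates skip their K columns entirely, which is where the score-stage sparsification comes from in this reading.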

The bigger picture: these algorithms point toward sparse, multiplier-free inference, exactly the regime where probabilistic, near-memory, and in-memory compute hardware can deliver order-of-magnitude energy gains over general-purpose GPUs.