Skipping 90% of KV dequant work → +22.8% decode at 32K (llama.cpp, TurboQuant)
I’ve been working on an open source TurboQuant implementation for KV cache compression in llama.cpp and ran into a hard bottleneck: dequantization. At long context (32K on M5 Max), dequant alone was taking around 40 percent of decode time. I tried fixing it the usual way:
- register LUTs
- SIMD
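To make the "register LUT" approach concrete, here is a minimal sketch of lookup-table dequantization for a 4-bit quantized block. The block layout (`block_q4`, 32 codes packed two per byte with one float scale) and the function name are assumptions for illustration, not llama.cpp's actual structs:

```c
#include <stddef.h>
#include <stdint.h>

enum { BLOCK_SIZE = 32 };

/* Hypothetical 4-bit block: one scale plus 32 packed codes. */
typedef struct {
    float scale;                 /* per-block scale factor */
    uint8_t qs[BLOCK_SIZE / 2];  /* two 4-bit codes per byte */
} block_q4;

/* Dequantize one block via a 16-entry table: out[i] = lut[code] * scale.
 * The table maps each 4-bit code to its reconstruction value, replacing
 * per-element arithmetic with a single indexed load. */
static void dequant_block_lut(const block_q4 *b, const float lut[16], float *out) {
    for (size_t i = 0; i < BLOCK_SIZE / 2; ++i) {
        uint8_t byte = b->qs[i];
        out[2 * i]     = lut[byte & 0x0F] * b->scale; /* low nibble  */
        out[2 * i + 1] = lut[byte >> 4]   * b->scale; /* high nibble */
    }
}
```

A SIMD version would do the same table lookup with a shuffle instruction (e.g. `vqtbl1q_u8` on NEON) across 16 codes at once; the scalar loop above just shows the data flow.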
Receipts (all sources)
REDDIT · r/LocalLLaMA · 4h ago · ⬆ 237 · 💬 30