Skipping 90% of KV dequant work → +22.8% decode at 32K (llama.cpp, TurboQuant)
I’ve been working on an open source TurboQuant implementation for KV cache compression in llama.cpp and ran into a hard bottleneck: dequantization. At long context (32K on M5 Max), dequant alone was taking around 40 percent of decode time. I tried fixing it the usual way:
- register LUTs
- SIMD
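To make the "register LUT" approach concrete, here is a minimal sketch of lookup-table dequantization for a 4-bit quantized block. The block layout (`block_q4`, 32 codes packed two per byte with one float scale) and the function name are assumptions for illustration, not llama.cpp's actual structs:

```c
#include <stddef.h>
#include <stdint.h>

enum { BLOCK_SIZE = 32 };

/* Hypothetical 4-bit block: one scale plus 32 packed codes. */
typedef struct {
    float scale;                 /* per-block scale factor */
    uint8_t qs[BLOCK_SIZE / 2];  /* two 4-bit codes per byte */
} block_q4;

/* Dequantize one block via a 16-entry table: out[i] = lut[code] * scale.
 * The table maps each 4-bit code to its reconstruction value, replacing
 * per-element arithmetic with a single indexed load. */
static void dequant_block_lut(const block_q4 *b, const float lut[16], float *out) {
    for (size_t i = 0; i < BLOCK_SIZE / 2; ++i) {
        uint8_t byte = b->qs[i];
        out[2 * i]     = lut[byte & 0x0F] * b->scale; /* low nibble  */
        out[2 * i + 1] = lut[byte >> 4]   * b->scale; /* high nibble */
    }
}
```

A SIMD version would do the same table lookup with a shuffle instruction (e.g. `vqtbl1q_u8` on NEON) across 16 codes at once; the scalar loop above just shows the data flow.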
Receipts (all sources)
REDDIT · r/LocalLLaMA · 4h ago · ⬆ 237 · 💬 30