Big · 1 source · last seen 5h ago · first seen 5h ago

Llama.cpp with Turboquant, Heavy-Hitter Oracle (H2O), and StreamingLLM. Even more performance!

Following TheTom's great work yesterday showing Turboquant working in Llama.cpp, I added a few other things that bring some more complementary speedups to Llama.cpp. So far the CPU and CUDA versions build and are fully usable. I'm seeing full-speed token generation on my 16 GB 4060 Ti at up to 256k+ context.
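The post does not show how these pieces are implemented, so the following is only a rough sketch of the two KV-cache eviction ideas it names, not the author's code and not Llama.cpp's API. It assumes the usual descriptions: StreamingLLM keeps a few initial "attention sink" positions plus a sliding window of recent ones, and H2O keeps recent positions plus the older positions with the largest accumulated attention. The function names, parameters (`n_sink`, `n_heavy`, `window`), and the one-shot selection (rather than step-by-step eviction) are hypothetical simplifications for illustration.

```python
from typing import List, Sequence


def streaming_llm_keep(seq_len: int, n_sink: int, window: int) -> List[int]:
    """Positions kept by a StreamingLLM-style policy (illustrative only).

    Keeps the first `n_sink` positions (attention sinks) and the most
    recent `window` positions; everything in between is evicted.
    """
    sinks = range(min(n_sink, seq_len))
    recent = range(max(seq_len - window, 0), seq_len)
    return sorted(set(sinks) | set(recent))


def h2o_keep(acc_attention: Sequence[float], n_heavy: int, window: int) -> List[int]:
    """Positions kept by an H2O-style policy (illustrative only).

    `acc_attention[i]` is assumed to be the attention mass position i has
    received so far. Keeps the most recent `window` positions plus the
    `n_heavy` older positions with the highest accumulated attention.
    """
    seq_len = len(acc_attention)
    recent_start = max(seq_len - window, 0)
    older = range(recent_start)
    # Rank older positions by accumulated attention, highest first.
    heavy = sorted(older, key=lambda i: acc_attention[i], reverse=True)[:n_heavy]
    return sorted(set(heavy) | set(range(recent_start, seq_len)))


if __name__ == "__main__":
    # 10 cached positions, 2 sinks, window of 3.
    print(streaming_llm_keep(10, n_sink=2, window=3))  # [0, 1, 7, 8, 9]

    # Made-up accumulated attention scores for 8 cached positions.
    scores = [0.1, 0.9, 0.2, 0.8, 0.05, 0.3, 0.1, 0.1]
    print(h2o_keep(scores, n_heavy=2, window=3))  # [1, 3, 5, 6, 7]
```

Both policies cap how many positions stay in the cache regardless of context length, which is presumably why they pair well with a quantized KV cache for very long contexts on a 16 GB card.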

Lead: r/LocalLLaMA · Bigness: 49 · meta · cpp · turboquant · heavy-hitter · oracle



Receipts (all sources)


REDDIT · r/LocalLLaMA · 5h ago · ⬆ 13 · 💬 12

score 109
