Big · 1 source · last seen 51m ago · first seen 9h ago

llama.cpp: Prefetching weights when offloading to CPU

Hello r/LocalLLaMA, I put up an experimental PR that prefetches weights when offloading to CPU. Long story short, the results show it helps dense and smaller MoE models for PP (prompt processing). Give it a try if you are RAM-rich and GPU-poor like me. https://github.com/ggml-org/llama.cpp/pull/21067
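For readers unfamiliar with the idea, here is a minimal Python sketch of weight prefetching in general: while layer i is being computed, a background worker is already fetching the weights for layer i+1, so transfer time overlaps with compute. This is an illustration only and is not taken from the PR. The function names (`load_weights`, `apply_layer`, `forward_with_prefetch`) and the toy "layers" are hypothetical stand-ins, and the actual llama.cpp change may work quite differently.

```python
from concurrent.futures import ThreadPoolExecutor


def load_weights(layer_idx, host_weights):
    """Hypothetical stand-in for a slow copy of one layer's weights from host RAM."""
    return tuple(host_weights[layer_idx])  # the copy simulates the transfer


def apply_layer(x, weights):
    """Hypothetical stand-in for a layer's compute: scale, then shift."""
    scale, shift = weights
    return x * scale + shift


def forward_with_prefetch(x, host_weights):
    """Run layers in order, fetching layer i+1 while layer i computes."""
    n = len(host_weights)
    if n == 0:
        return x
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Kick off the first transfer before any compute starts.
        pending = pool.submit(load_weights, 0, host_weights)
        for i in range(n):
            weights = pending.result()  # wait until layer i's weights have arrived
            if i + 1 < n:
                # Start the next transfer now so it overlaps with the compute below.
                pending = pool.submit(load_weights, i + 1, host_weights)
            x = apply_layer(x, weights)
    return x


# Toy model: three "layers", each a (scale, shift) pair kept in host memory.
host_weights = [(2.0, 1.0), (3.0, 0.0), (1.0, -4.0)]
result = forward_with_prefetch(1.0, host_weights)
```

With a single worker, at most one transfer is in flight at a time, which mirrors simple double buffering: one buffer is being consumed while the next one is being filled.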

Lead: r/LocalLLaMA · Bigness: 55 · meta · cpp · prefetching · weights · offloading

📡 Coverage

1 news source

🟠 Hacker News

🔴 Reddit

58 upvotes across 1 sub

📈 Google Trends

Meta AI: 90/100

Receipts (all sources)

llama.cpp: Prefetching weights when offloading to CPU

REDDIT · r/LocalLLaMA · 9h ago · ⬆ 54 · 💬 21

score 114

What do you implement after Llama.cpp?

REDDIT · r/LocalLLaMA · 51m ago · ⬆ 4 · 💬 2

score 110

I'm having a lot of fun playing with llama-server testing various flags, models and runtimes. I'm starting to wonder what's next to build out my homelab AI stack. Do I use Open WebUI for RAG/Search? Should I take a stab at something like LangGraph? My goal is to create something as close to Claud

Related clusters

Llama.cpp with Turboquant, Heavy-Hitter Oracle (H2O), and StreamingLLM. Even more performance!

1 source · bigness 49 · 5h ago