Rising · 1 source · last seen 5h ago · first seen 5h ago

My biggest Issue with the Gemma-4 Models is the Massive KV Cache!!

I mean, I have 40GB of VRAM and I still cannot fit the entire Unsloth Gemma-4-31B-it-UD-Q8 (35GB) even at 2K context size unless I quantize the KV cache to Q4? WTF? For comparison, I can fit the entire UD-Q8 Qwen3.5-27B at full context without KV quantization! If I have to run a Q4 Gemm…
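For anyone sanity-checking the complaint: KV-cache memory is roughly 2 × layers × KV heads × head dim × context length × bytes per element, on top of the weights. The sketch below plugs in made-up configs (neither is the actual Gemma-4 or Qwen3.5 architecture, which the post doesn't give) just to show how layer and KV-head counts drive the gap between two models at the same context, and why quantizing the cache to Q4 roughly quarters it versus FP16.

```python
# Back-of-the-envelope KV-cache sizing. Every architecture number below is an
# ASSUMPTION for illustration only -- these are not the real Gemma-4 or
# Qwen3.5 configs (the post doesn't give them).

def kv_cache_gib(layers: int, kv_heads: int, head_dim: int, ctx: int,
                 bytes_per_elem: float) -> float:
    """Keys + values (the leading 2x) cached per layer, per KV head, per token."""
    return 2 * layers * kv_heads * head_dim * ctx * bytes_per_elem / 2**30

# Two hypothetical attention layouts for models of similar size:
configs = {
    "many KV heads (32)": (62, 32, 128),   # layers, KV heads, head dim
    "aggressive GQA (4)": (48, 4, 128),
}

for name, (layers, kv_heads, head_dim) in configs.items():
    for ctx in (2_048, 32_768):
        fp16 = kv_cache_gib(layers, kv_heads, head_dim, ctx, 2.0)  # 16-bit K/V
        q4 = kv_cache_gib(layers, kv_heads, head_dim, ctx, 0.5)    # ~4-bit K/V
        print(f"{name:>20} @ ctx {ctx:>6}: fp16 KV ≈ {fp16:6.2f} GiB, Q4 KV ≈ {q4:5.2f} GiB")
```

Under these assumed numbers the cache-heavy layout needs about 10× the KV memory of the GQA layout at every context length, which is the kind of gap that makes one Q8 model fit comfortably and another spill over the same 40GB budget.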

Lead: r/LocalLLaMA · Bigness: 28 · biggest-issue-gemma-4-massive-cache
📡 Coverage
10 · 1 news source
🟠 Hacker News: 0
🔴 Reddit: 66 · 118 upvotes across 1 sub
📈 Google Trends: 0
Full methodology: How scoring works

Receipts (all sources)

My biggest Issue with the Gemma-4 Models is the Massive KV Cache!!
REDDIT · r/LocalLLaMA · 5h ago · ⬆ 118 · 💬 68
score 124
