Rising · 1 source · last seen 5h ago · first seen 10h ago

110 tok/s with 12GB VRAM on Qwen3.6 35B A3B and ik_llama.cpp

Had been getting [great MTP performance](https://www.reddit.com/r/LocalLLaMA/comments/1t82zxv/80_toksec_and_128k_context_on_12gb_vram_with/) with [llama.cpp](https://github.com/ggml-org/llama.cpp) on my RTX 4070 Super 12GB, until they actually merged the MTP PR. Then, performance tanked and was bare…

Lead: r/LocalLLaMA · Bigness: 31
📡 Coverage: 10 · 1 news source
🟠 Hacker News: 0
🔴 Reddit: 75 · 265 upvotes across 1 sub
📈 Google Trends: 0

Receipts (all sources)

110 tok/s with 12GB VRAM on Qwen3.6 35B A3B and ik_llama.cpp
REDDIT · r/LocalLLaMA · 5h ago · ⬆ 167 · 💬 58
score 127


Qwen3.6 27B and llama.cpp appreciation post
REDDIT · r/LocalLLaMA · 10h ago · ⬆ 98 · 💬 57
score 116

To preface, here's my config:

    llama-server \
        --host 0.0.0.0 \
        --port 1235 \
        --models-preset %h/Software/models.ini \
        --models-max 1 \
        --sleep-idle-seconds 3600 \
        --timeout 3600 \
        --parallel 1 \
        --device ROCm0,ROCm1 [*] flash-attn
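Both posts turn on decode speed (tok/s), so here is a minimal sketch of how one might check that number against a llama-server like the one configured above (port 1235). It assumes the server's `/completion` endpoint returns a `timings` object with `predicted_per_second`, `predicted_n` and `predicted_ms` fields. It also assumes that in the multi-model preset setup a `model` name may need to be passed. The prompt, the `measure` helper and its defaults are hypothetical illustrations, not either poster's benchmarking method.

```python
import json
import urllib.request
from typing import Optional


def tokens_per_second(timings: dict) -> float:
    """Decode throughput from an (assumed) llama-server `timings` object.

    Uses the server-reported `predicted_per_second` when present; otherwise
    derives it from `predicted_n` generated tokens over `predicted_ms` ms.
    """
    if "predicted_per_second" in timings:
        return float(timings["predicted_per_second"])
    n = timings["predicted_n"]
    ms = timings["predicted_ms"]
    if ms <= 0:
        raise ValueError("predicted_ms must be positive")
    return n / (ms / 1000.0)


def measure(host: str = "127.0.0.1", port: int = 1235,
            n_predict: int = 256, model: Optional[str] = None) -> float:
    """Send one completion request and return the reported decode speed.

    Hypothetical helper: `model` is only included when given, on the
    assumption that a router/preset setup may need it to pick a model.
    """
    body = {"prompt": "Write a haiku about GPUs.", "n_predict": n_predict}
    if model is not None:
        body["model"] = model
    req = urllib.request.Request(
        f"http://{host}:{port}/completion",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=600) as resp:
        payload = json.load(resp)
    return tokens_per_second(payload["timings"])
```

With the server running, `print(f"{measure():.1f} tok/s")` prints one measurement. A single short request is noisy, so averaging several runs at the context length you actually use gives a fairer comparison between builds (for example, before and after an MTP change).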