Testing llama.cpp MTP support on Qwen3.6 - RTX 5090
**Setup:**

* RTX 5090, 32 GB, Linux
* Built llama.cpp from 4f13cb7 (the official `ghcr.io/ggml-org/llama.cpp:server-cuda` image hasn't picked up the merge yet as of writing, so I had to docker build from source with `CUDA_DOCKER_ARCH=120`)
* Unsloth's `Qwen3.6-27B-MTP-Q4_K_M.gguf`
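For anyone reproducing the build, here's a minimal sketch, assuming llama.cpp's `.devops/cuda.Dockerfile` and its `server` build target; the image tag `local/llama.cpp:server-cuda` is just a placeholder, and `CUDA_DOCKER_ARCH=120` corresponds to the 5090's compute capability 12.0:

```bash
# Clone and check out the commit that includes the MTP merge.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
git checkout 4f13cb7

# Build the CUDA server image from source.
# CUDA_DOCKER_ARCH=120 targets Blackwell (RTX 5090, sm_120).
docker build -t local/llama.cpp:server-cuda \
  --build-arg CUDA_DOCKER_ARCH=120 \
  --target server \
  -f .devops/cuda.Dockerfile .
```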
Saw some posts about PP (prompt processing) being slower, so I was cautious about trying it. Here's a real-world datapoint.

**Settings:**

* Headless RTX 3090 24G
* OpenCode
* Model: unsloth's `Qwen3.6-27B-MTP-Q4_K_M.gguf`
* 128k context
* q8_0 KV cache
* `--spec-draft-n-max: 3`
* `--draft-p-min: 0`

A launch sketch with these settings is shown below.

**Use Cases:**
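For concreteness, a hedged sketch of a `llama-server` launch matching the settings list above, reusing the locally built image. The image tag and model path are placeholders, `-ngl 99` (full offload) is an assumption not stated in the post, and the two draft flags are copied verbatim from the list, so verify their exact names in your build with `llama-server --help`:

```bash
# Launch sketch for the settings above.
# -c 131072 is the 128k context; the two cache-type flags give the
# q8_0 KV cache. The draft flags are quoted as-is from the post.
docker run --gpus all -p 8080:8080 \
  -v "$PWD/models:/models" \
  local/llama.cpp:server-cuda \
  -m /models/Qwen3.6-27B-MTP-Q4_K_M.gguf \
  -c 131072 \
  -ngl 99 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --spec-draft-n-max 3 \
  --draft-p-min 0 \
  --host 0.0.0.0 --port 8080
```

Depending on the build, you may also need to enable flash attention for the quantized V cache to be accepted.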
PR [22673](https://github.com/ggml-org/llama.cpp/pull/22673) has been merged into master! 🎉