
Running Llama 3.2 1B entirely on an AMD NPU on Linux (Strix Halo, IRON framework, 4.4 tok/s)

I got Llama 3.2 1B running inference entirely on the AMD NPU on Linux. Every operation (attention, GEMM, RoPE, RMSNorm, SiLU, KV cache) runs on the NPU; no CPU or GPU fallback. As far as I can tell, this is the first time anyone has publicly documented this working on Linux.

## Hardware

- AMD Ryze
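The excerpt doesn't include the author's kernel code, but for reference, here is what three of the listed per-token operations (RMSNorm, SiLU, RoPE) compute, as a minimal NumPy sketch. This is plain reference math, not the author's IRON/NPU kernels; the function names, the `eps` default, and the interleaved channel-pairing convention in `rope` are illustrative assumptions (Llama implementations differ on these details).

```python
import numpy as np

def rms_norm(x, weight, eps=1e-5):
    # RMSNorm: scale by the reciprocal root-mean-square of the
    # last axis, then apply a learned per-channel weight.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) * weight

def silu(x):
    # SiLU (swish): x * sigmoid(x), the activation in Llama's gated MLP.
    return x / (1.0 + np.exp(-x))

def rope(x, positions, base=10000.0):
    # Rotary position embedding: rotate consecutive (even, odd) channel
    # pairs by a position-dependent angle. x: (seq_len, head_dim).
    head_dim = x.shape[-1]
    inv_freq = 1.0 / (base ** (np.arange(0, head_dim, 2) / head_dim))
    angles = np.outer(positions, inv_freq)      # (seq_len, head_dim // 2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

# Example: rotate an 8-token sequence of 64-dim heads.
q = rope(np.random.randn(8, 64), np.arange(8))
```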

Lead: r/LocalLLaMA · Bigness: 17 · tags: running, meta, entirely, amd, npu
| Signal | Score | Detail |
| --- | --- | --- |
| 📡 Coverage | 10 | 1 news source |
| 🟠 Hacker News | 0 | |
| 🔴 Reddit | 35 | 13 upvotes across 1 sub |
| 📈 Google Trends | 0 | |

Full methodology: How scoring works
