Cluster · 1 source · last seen 2h ago · first seen 2h ago

[P] I replaced Dot-Product Attention with distance-based RBF-Attention (so you don't have to...)

I recently asked myself what would happen if we replaced the standard dot-product in self-attention with a different distance metric, e.g. an RBF kernel. Standard dot-product attention has this quirk where a key vector can "bully" the softmax simply by having a massive magnitude. A random key that
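The excerpt doesn't include the poster's code, but the swap it describes can be sketched in a few lines of NumPy: replace the dot-product score q·k with an RBF kernel exp(-γ‖q−k‖²), so attention weights depend on distance rather than magnitude. All names and the value of `gamma` here are illustrative assumptions, not taken from the original post.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dot_product_attention(Q, K, V):
    # Standard scaled dot-product attention: softmax(Q K^T / sqrt(d)) V
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V

def rbf_attention(Q, K, V, gamma=1.0):
    # RBF scores in log-space: log exp(-gamma * ||q - k||^2) = -gamma * ||q - k||^2,
    # so the softmax ranks keys by distance to the query, not by their norm.
    sq_dists = ((Q[:, None, :] - K[None, :, :]) ** 2).sum(axis=-1)
    return softmax(-gamma * sq_dists) @ V

# The "bullying" quirk: one key with a huge norm hijacks the dot-product
# softmax, while RBF attention simply sees it as far away.
Q = np.array([[1.0, 0.0]])
K = np.array([[1.0, 0.0],     # key identical to the query
              [0.0, 1.0],
              [100.0, 0.0]])  # massive-magnitude "bully" key
V = np.eye(3)  # identity values expose the attention weights directly
w_dot = dot_product_attention(Q, K, V)[0]  # nearly all weight on the bully key
w_rbf = rbf_attention(Q, K, V)[0]          # weight stays on the nearest key
```

With identity values, the output rows are the attention weights themselves: the dot-product variant puts essentially all mass on the large-norm key, while the RBF variant assigns it a vanishing weight because it is far from the query in Euclidean distance.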

Lead: r/MachineLearning · Bigness: 23 · replaced dot-product attention · distance-based rbf-attention
📡 Coverage
News: 10 (1 news source)
🟠 Hacker News: 0
🔴 Reddit: 54 (60 upvotes across 1 sub)
📈 Google Trends: 0
Full methodology: How scoring works

Receipts (all sources)

score 125
