Cluster · 1 source · first seen 2h ago · last seen 2h ago
[P] I replaced Dot-Product Attention with distance-based RBF-Attention (so you don't have to...)
I recently asked myself what would happen if we replaced the standard dot product in self-attention with a different distance metric, e.g. an RBF kernel. Standard dot-product attention has this quirk where a key vector can "bully" the softmax simply by having a massive magnitude. A random key that
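The idea in the post can be sketched in a few lines of NumPy. This is not the author's code, just a minimal illustration of the two scoring rules; `gamma` is an assumed RBF bandwidth parameter. Note how in the RBF variant a key's score depends on its squared distance to the query, so inflating a key's norm pushes it *away* from queries instead of letting it dominate the softmax:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def dot_product_attention(Q, K, V):
    # Standard scaled dot-product attention: a key with a huge norm
    # can dominate the softmax regardless of its direction.
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    return softmax(scores) @ V

def rbf_attention(Q, K, V, gamma=1.0):
    # Distance-based scores: kernel value exp(-gamma * ||q - k||^2),
    # normalized over keys (i.e. softmax over -gamma * squared distances).
    # A key can only score highly by being *close* to the query.
    sq_dists = (np.sum(Q**2, axis=1, keepdims=True)
                - 2.0 * Q @ K.T
                + np.sum(K**2, axis=1))
    scores = -gamma * sq_dists
    return softmax(scores) @ V
```

Both functions take queries `Q` of shape `(n, d)` and keys/values `K`, `V` of shape `(m, d)`, and return `(n, d)` outputs; the attention weights in each row sum to 1 by construction of the softmax.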
📡 Coverage: 10 (1 news source)
🟠 Hacker News: 0
🔴 Reddit: 54 (60 upvotes across 1 sub)
📈 Google Trends: 0
Full methodology: How scoring works
Receipts (all sources)
REDDIT · r/MachineLearning · 2h ago · ⬆ 60 · 💬 3 · score 125