Cluster · 1 source · last seen 2h ago · first seen 2h ago

[P] I replaced Dot-Product Attention with distance-based RBF-Attention (so you don't have to...)

I recently asked myself what would happen if we replaced the standard dot-product in self-attention with a different distance metric, e.g. an RBF kernel. Standard dot-product attention has this quirk where a key vector can "bully" the softmax simply by having a massive magnitude. A random key that
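The excerpt doesn't include the poster's code, but the swap it describes can be sketched in a few lines of NumPy: replace the dot-product score q·k with an RBF kernel exp(-γ‖q−k‖²), so attention weights depend on distance rather than magnitude. All names and the value of `gamma` here are illustrative assumptions, not taken from the original post.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dot_product_attention(Q, K, V):
    # Standard scaled dot-product attention: softmax(Q K^T / sqrt(d)) V
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V

def rbf_attention(Q, K, V, gamma=1.0):
    # RBF scores in log-space: log exp(-gamma * ||q - k||^2) = -gamma * ||q - k||^2,
    # so the softmax ranks keys by distance to the query, not by their norm.
    sq_dists = ((Q[:, None, :] - K[None, :, :]) ** 2).sum(axis=-1)
    return softmax(-gamma * sq_dists) @ V

# The "bullying" quirk: one key with a huge norm hijacks the dot-product
# softmax, while RBF attention simply sees it as far away.
Q = np.array([[1.0, 0.0]])
K = np.array([[1.0, 0.0],     # key identical to the query
              [0.0, 1.0],
              [100.0, 0.0]])  # massive-magnitude "bully" key
V = np.eye(3)  # identity values expose the attention weights directly
w_dot = dot_product_attention(Q, K, V)[0]  # nearly all weight on the bully key
w_rbf = rbf_attention(Q, K, V)[0]          # weight stays on the nearest key
```

With identity values, the output rows are the attention weights themselves: the dot-product variant puts essentially all mass on the large-norm key, while the RBF variant assigns it a vanishing weight because it is far from the query in Euclidean distance.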

Lead: r/MachineLearning · Bigness: 23 · replaced dot-product attention · distance-based rbf-attention
📡 Coverage
News: 10 (1 news source)
🟠 Hacker News: 0
🔴 Reddit: 54 (60 upvotes across 1 sub)
📈 Google Trends: 0
Full methodology: How scoring works

Receipts (all sources)

score 125
