
Attentional Sequence to Sequence

PostPosted: Sat Dec 07, 2024 7:36 pm
by hbyte
The attention mechanism used in transformers was born from attentional Seq2Seq models.

H(i) = decoder hidden (latent) vector used to predict word i
S(j) = encoder output (annotation) for the jth input word
X = input
C = context

Attention score: R(i,j) = H(i-1) . S(j)   (dot product of the previous decoder state with each encoder output)

a(i,j) = Softmax(R(i,j))   (softmax taken over all input positions j)

Context(i) = Sum_j{ a(i,j) * S(j) }
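
To make the three equations above concrete, here is a minimal NumPy sketch of one attention step, assuming dot-product scoring. The names enc_outputs (all the S(j) vectors stacked, one row per input word) and dec_hidden (H(i-1)) are mine, not from the original formulation.

Code:
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_context(dec_hidden, enc_outputs):
    # dec_hidden: H(i-1), shape (d,)
    # enc_outputs: all S(j) stacked, shape (T, d)
    scores = enc_outputs @ dec_hidden    # R(i,j) = H(i-1) . S(j), shape (T,)
    weights = softmax(scores)            # a(i,j), sums to 1 over j
    context = weights @ enc_outputs      # C(i) = Sum_j{ a(i,j) * S(j) }, shape (d,)
    return context, weights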

Output(i) = decoder prediction for the ith word

Output(i) = RNN(H(i-1), [X(i); C(i)])

where [X(i); C(i)] is the concatenation of the ith input embedding with its context vector.
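
One decoder step can then be sketched on top of attention_context above, assuming a plain tanh RNN cell; the weight names W_h, W_x and b are hypothetical placeholders for whatever cell you actually use (GRU, LSTM, etc.).

Code:
def decoder_step(h_prev, x_i, enc_outputs, W_h, W_x, b):
    # Output(i) = RNN(H(i-1), [X(i); C(i)])
    context, _ = attention_context(h_prev, enc_outputs)
    inp = np.concatenate([x_i, context])          # [X(i); C(i)]
    h_i = np.tanh(W_h @ h_prev + W_x @ inp + b)   # new hidden state H(i)
    return h_i   # project + softmax over the vocabulary to predict word i

The returned H(i) then feeds both the vocabulary softmax for word i and the attention step for word i+1.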