
Attentional Sequence to Sequence

PostPosted: Sat Dec 07, 2024 7:36 pm
by hbyte
The attention mechanism used in transformers was born from attentional Seq2Seq models.

H(i) = decoder hidden (latent) vector used to predict word i
S(j) = encoder output (annotation) for the jth input word
X = input
C = context

Attention score: R(i,j) = H(i-1) . S(j)   (dot product of the previous decoder state with each encoder output)

a(i,j) = Softmax(R(i,j))   (softmax taken over all input positions j)

Context(i) = Sum_j{ a(i,j) * S(j) }
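
To make the three equations above concrete, here is a minimal NumPy sketch of one attention step, assuming dot-product scoring. The names enc_outputs (all the S(j) vectors stacked, one row per input word) and dec_hidden (H(i-1)) are mine, not from the original formulation.

Code:
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_context(dec_hidden, enc_outputs):
    # dec_hidden: H(i-1), shape (d,)
    # enc_outputs: all S(j) stacked, shape (T, d)
    scores = enc_outputs @ dec_hidden    # R(i,j) = H(i-1) . S(j), shape (T,)
    weights = softmax(scores)            # a(i,j), sums to 1 over j
    context = weights @ enc_outputs      # C(i) = Sum_j{ a(i,j) * S(j) }, shape (d,)
    return context, weights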

Output(i) = decoder prediction for the ith word

Output(i) = RNN(H(i-1), [X(i); C(i)])

where [X(i); C(i)] is the concatenation of the ith input embedding with its context vector.
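
One decoder step can then be sketched on top of attention_context above, assuming a plain tanh RNN cell; the weight names W_h, W_x and b are hypothetical placeholders for whatever cell you actually use (GRU, LSTM, etc.).

Code:
def decoder_step(h_prev, x_i, enc_outputs, W_h, W_x, b):
    # Output(i) = RNN(H(i-1), [X(i); C(i)])
    context, _ = attention_context(h_prev, enc_outputs)
    inp = np.concatenate([x_i, context])          # [X(i); C(i)]
    h_i = np.tanh(W_h @ h_prev + W_x @ inp + b)   # new hidden state H(i)
    return h_i   # project + softmax over the vocabulary to predict word i

The returned H(i) then feeds both the vocabulary softmax for word i and the attention step for word i+1.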