Attentional Sequence to Sequence
Posted: Sat Dec 07, 2024 7:36 pm
The attention mechanism used in transformers grew out of attentional seq2seq models (Bahdanau et al., 2014), where a decoder RNN attends over the encoder's outputs at every step.
Notation:
H(i-1) = decoder hidden state used to predict the ith output word
S(j) = encoder output for the jth input word
X(i) = decoder input at step i
C(i) = context vector at step i

Attention score: r(i,j) = H(i-1) . S(j)
Attention weights: a(i,j) = softmax over j of r(i,j)
Context vector: C(i) = sum over j of a(i,j) * S(j)
Decoder prediction for the ith word:
Output(i) = RNN(H(i-1), [X(i); C(i)])
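Here is a minimal NumPy sketch of one decoder step following the equations above, assuming dot-product scores and a vanilla tanh RNN cell. The sizes, weights, and function names are made up for illustration; real implementations (e.g. Bahdanau attention) often use a learned additive score instead of a plain dot product.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, e = 5, 8, 4          # encoder length, hidden size, embedding size (made up)

def softmax(x):
    z = np.exp(x - x.max())             # numerically stable softmax
    return z / z.sum()

def attention_context(h_prev, S):
    """Compute C(i) from decoder state H(i-1) and encoder outputs S."""
    r = S @ h_prev                      # scores r(i,j) = H(i-1) . S(j)
    a = softmax(r)                      # weights a(i,j), normalized over j
    return a @ S, a                     # C(i) = sum over j of a(i,j) * S(j)

# Vanilla RNN decoder cell taking the concatenation [X(i); C(i)] as input.
Wh = rng.standard_normal((d, d)) * 0.1
Wx = rng.standard_normal((d, e + d)) * 0.1
b = np.zeros(d)

def decoder_step(h_prev, x_i, S):
    c_i, a = attention_context(h_prev, S)
    h_i = np.tanh(Wh @ h_prev + Wx @ np.concatenate([x_i, c_i]) + b)
    return h_i, a                       # h_i feeds the prediction for word i

S = rng.standard_normal((T, d))         # pretend encoder outputs S(1..T)
h = np.zeros(d)                         # initial decoder state
x = rng.standard_normal(e)              # pretend embedding for X(i)
h, a = decoder_step(h, x, S)
print(a.round(3), h.shape)              # weights sum to 1; new state is (8,)
```

Note that the softmax is taken over j (the encoder positions), so the weights a(i,j) sum to 1 for each decoder step i, and C(i) is a weighted average of the encoder outputs.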