Diffusion Probabilistic Model
My take on it is:
1. An autoencoder trained to encode and decode an image
2. Add Gaussian noise at successive time steps during encoding
3. Remove noise at successive time steps during decoding
4. Train on noise or image?? (the noise, as it turns out; see the loss at the end)
Generate an image from noise (or from another image)
(Used with CLIP, which embeds image + text jointly, to generate images from text)
How??
Gaussian noise N(0, 1): p(x) ∝ exp(-|x|^2 / 2) :- same exponential form as the Maxwell-Boltzmann distribution of particles
B(t) in (0, 1) :- the noise (variance) schedule over time steps t = 1..T
a(t) = 1 - B(t), abar(t) = a(1)*a(2)*...*a(t) :- the per-step and cumulative signal-keeping fractions
Btilde(t) = (1 - abar(t-1)) / (1 - abar(t)) * B(t) :- normalize the schedule (the posterior variance used when reversing); see the sketch below
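A minimal sketch of the schedule in NumPy, assuming the linear B(t) from the DDPM paper (1e-4 to 0.02 over T = 1000 steps); the variable names (T, beta, alpha, alpha_bar, beta_tilde) are mine:

```python
import numpy as np

T = 1000
beta = np.linspace(1e-4, 0.02, T)        # B(t): linear noise schedule
alpha = 1.0 - beta                       # a(t) = 1 - B(t)
alpha_bar = np.cumprod(alpha)            # abar(t): cumulative product
# Btilde(t) = (1 - abar(t-1)) / (1 - abar(t)) * B(t), with abar(0) taken as 1
alpha_bar_prev = np.append(1.0, alpha_bar[:-1])
beta_tilde = (1.0 - alpha_bar_prev) / (1.0 - alpha_bar) * beta
```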
Forward diffusion: x(t) = sqrt(1 - B(t)) * x(t-1) + sqrt(B(t)) * z(t)
Add noise using B(t) and z(t) ~ N(0, 1), so that as t -> T, x(t) -> N(0, 1)
z(t) is just the noise term (a fresh Gaussian sample at each step); see the sketch below
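A sketch of one forward step, reusing the schedule above. It also includes the closed-form jump straight from x(0) to x(t), a standard identity from the DDPM paper: x(t) = sqrt(abar(t)) * x(0) + sqrt(1 - abar(t)) * z.

```python
import numpy as np

T = 1000
beta = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - beta)
rng = np.random.default_rng(0)

def forward_step(x_prev, t):
    # one step: x(t) = sqrt(1 - B(t)) * x(t-1) + sqrt(B(t)) * z(t), z ~ N(0, I)
    z = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - beta[t]) * x_prev + np.sqrt(beta[t]) * z

def forward_jump(x0, t):
    # closed form: jump straight to step t without looping over 0..t
    z = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * z
```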
q(x) is the encoder - part of autoencoder
p(x) is the decoder
q(x) creates a latent variable described by a mean and std for each time step
p(x) creates a new image
at each time step, for each image, noise is added using the above schedule
ending at t = T with just a pure-noise image (a quick demo below)
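A quick check of the "pure noise at t = T" claim: push a constant image through all T steps and the output statistics approach N(0, 1). A toy sketch, with an 8x8 array of ones standing in for an image:

```python
import numpy as np

T = 1000
beta = np.linspace(1e-4, 0.02, T)
rng = np.random.default_rng(0)

x = np.ones((8, 8))                      # stand-in "image" of all ones
for t in range(T):
    x = np.sqrt(1.0 - beta[t]) * x + np.sqrt(beta[t]) * rng.standard_normal(x.shape)
print(x.mean(), x.std())                 # approach 0 and 1: indistinguishable from N(0, 1)
```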
at each time step, the less-noisy x(t-1) from the step before is available during training, so the information to reverse the noise at the current step is known, which means that
at each time step, for each image, the loss is calculated as :-
p(x(t-1) | x(t)) :- the reverse diffusion process, the probability of the less-noisy x(t-1) given x(t)
L(t) = E_q[ -log p(x(t-1) | x(t)) ] :- the expected negative log-likelihood of the reverse step
In practice this simplifies to training on the noise: L_simple(t) = E[ |z(t) - z_pred(x(t), t)|^2 ], the MSE between the true noise z(t) and the network's prediction (so the answer to point 4 above: train on the noise)
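A sketch of that simplified loss, where predict_noise is a hypothetical placeholder for the trained network (a real model would be a neural net such as a U-Net taking (x(t), t)):

```python
import numpy as np

T = 1000
beta = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - beta)
rng = np.random.default_rng(0)

def predict_noise(x_t, t):
    # hypothetical stand-in for the learned network z_pred(x(t), t)
    return np.zeros_like(x_t)

def simple_loss(x0):
    t = int(rng.integers(T))                    # sample a random time step
    z = rng.standard_normal(x0.shape)           # the true noise z ~ N(0, I)
    # closed-form jump to x(t)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * z
    return np.mean((z - predict_noise(x_t, t)) ** 2)   # MSE on the noise, not the image
```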
It's the noise added between consecutive steps t-1 and t that the network learns to predict.
If you learn the noise then you are also learning the signal: subtracting the predicted noise recovers the image.
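And the reverse (sampling) step this enables, a sketch assuming a trained z_pred: subtract the scaled predicted noise, undo the per-step rescaling, and add back fresh noise with the Btilde(t) variance from above (the DDPM paper also uses plain B(t) here):

```python
import numpy as np

T = 1000
beta = np.linspace(1e-4, 0.02, T)
alpha = 1.0 - beta
alpha_bar = np.cumprod(alpha)
beta_tilde = (1.0 - np.append(1.0, alpha_bar[:-1])) / (1.0 - alpha_bar) * beta
rng = np.random.default_rng(0)

def reverse_step(x_t, t, z_pred):
    # mean of p(x(t-1) | x(t)): remove the predicted noise, undo the sqrt(a(t)) scaling
    mean = (x_t - beta[t] / np.sqrt(1.0 - alpha_bar[t]) * z_pred) / np.sqrt(alpha[t])
    if t == 0:
        return mean                              # final step: return the clean estimate
    return mean + np.sqrt(beta_tilde[t]) * rng.standard_normal(x_t.shape)
```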