The basic theoretical justification for “consistency models” is the same as for what I’m proposing, yes, but:
- that’s using distillation to improve consistency, while I’m proposing vector search to directly train a network in a consistent way
- it does unconditional generation, not text-conditioned generation
- it doesn’t separate distance and direction
The SnapFusion paper is similar to that paper, but with generation conditioned on text descriptions, which is why I linked that.
In the paper, consistency models are also trained from scratch, in addition to being distilled from diffusion models. I think it would probably just work for text-conditioned generation, but without more thought it’s unclear to me how to do the equivalent of classifier-free guidance.
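
For reference, here is a minimal sketch of what classifier-free guidance does in an ordinary diffusion sampler, as I understand it: at each denoising step the network is evaluated twice, once with the text condition and once without, and the two noise predictions are extrapolated. The function name `cfg_combine` and the plain-list representation are my own illustrative choices, not from either paper, and some implementations parameterize the scale differently (e.g. `(1 + w) * cond - w * uncond`).

```python
def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Combine two noise predictions for one denoising step.

    Hypothetical helper for illustration only.
    guidance_scale = 0 returns the unconditional prediction,
    guidance_scale = 1 returns the conditional prediction,
    guidance_scale > 1 extrapolates past the conditional one.
    """
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy two-element "noise predictions" standing in for real network outputs.
eps_uncond = [0.0, 1.0]
eps_cond = [1.0, 3.0]
guided = cfg_combine(eps_uncond, eps_cond, 2.0)
```

The open question is what plays the role of these two predictions when the network maps a noisy input straight to a sample: combining the outputs of a conditional and an unconditional pass in the same way is one guess, but I haven't checked whether that behaves sensibly.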