Abstract
Large-scale diffusion models have achieved unprecedented results in (conditional) image synthesis; however, they generally require a large amount of GPU memory and are slow at inference time. To overcome these limitations, we propose to distill the knowledge of pre-trained (teacher) diffusion models into smaller student diffusion models via an approximate score matching objective. For classifier-free guided generation on CIFAR-10, our student model achieves an FID-5K of 8.03 using 273 GFLOPs. In comparison, the larger teacher model only achieves an FID-5K of 294 using 424 GFLOPs. We present initial experiments on distilling the knowledge of Stable Diffusion, a large-scale text-to-image diffusion model, and discuss several promising future directions.
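To illustrate the general idea of score-matching distillation described above, here is a minimal sketch (not the authors' code) of one training step in which a smaller student network is regressed onto a frozen teacher's noise prediction at a randomly sampled timestep. The setup is assumed: PyTorch, an epsilon-prediction parameterization, and hypothetical `student`, `teacher`, and `alphas_cumprod` objects.

```python
import torch

def distillation_loss(student, teacher, x0, alphas_cumprod):
    """Approximate score-matching distillation step (illustrative sketch).

    The student is trained to match the frozen teacher's predicted noise
    (equivalently, its score estimate) on a noised version of clean data x0.
    """
    b = x0.shape[0]
    # Sample a random diffusion timestep per example.
    t = torch.randint(0, len(alphas_cumprod), (b,), device=x0.device)
    a_bar = alphas_cumprod[t].view(b, 1, 1, 1)

    # Forward diffusion: corrupt x0 with Gaussian noise at level t.
    noise = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

    # Teacher provides the regression target; its weights stay frozen.
    with torch.no_grad():
        eps_teacher = teacher(x_t, t)

    eps_student = student(x_t, t)
    return torch.mean((eps_student - eps_teacher) ** 2)
```

In practice one would backpropagate this loss only through the student's parameters; guidance-conditioned variants (e.g. feeding the classifier-free guidance scale to both networks) follow the same pattern.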
Publication
CVPR Workshop on Generative Models for Computer Vision