Multimodal Representation Understanding Based on Text-to-Image Generation

Final project for the course CSC2516 - Neural Networks and Deep Learning at the University of Toronto.

Abstract

Representation learning capability is an important factor in assessing how well a model performs on transfer learning tasks. The state-of-the-art model CLIP has shown that representations produced by multimodal learning generalize strongly and can achieve high performance on downstream tasks with zero-shot learning alone. In this project, we use text-to-image generation as the downstream task to compare the representations learned by CLIP with those learned by BERT. We show that the representations learned by CLIP yield better image generation performance and stronger zero-shot generation ability.
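
As a rough illustration of the comparison described above, the sketch below shows one way to extract text representations from CLIP and BERT with Hugging Face `transformers` so that either can condition the same text-to-image generator. This is not the project's actual pipeline; the model checkpoints, pooling choices, and example caption are assumptions for illustration.

```python
# Minimal sketch (assumed setup, not the project's code): extract comparable
# sentence embeddings from CLIP's text encoder and from BERT.
import torch
from transformers import CLIPTokenizer, CLIPTextModel, BertTokenizer, BertModel

caption = "a small bird with a red head and black wings"  # hypothetical example caption

# CLIP text encoder: use the pooled (EOS-token) hidden state as the sentence embedding.
clip_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
clip_txt = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")
with torch.no_grad():
    clip_inputs = clip_tok(caption, return_tensors="pt", padding=True, truncation=True)
    clip_emb = clip_txt(**clip_inputs).pooler_output        # shape: (1, 512)

# BERT encoder: use the [CLS] token hidden state as the sentence embedding.
bert_tok = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
with torch.no_grad():
    bert_inputs = bert_tok(caption, return_tensors="pt", padding=True, truncation=True)
    bert_emb = bert(**bert_inputs).last_hidden_state[:, 0]  # shape: (1, 768)

# Either embedding can then be fed to the same text-to-image generator,
# allowing a like-for-like comparison of the two text representations.
print(clip_emb.shape, bert_emb.shape)
```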

Full Report / Code