Ji Xin, PhD candidate
David R. Cheriton School of Computer Science
Pretrained language models such as BERT have brought significant improvements to NLP applications. These models are based on the Transformer architecture and are pretrained on large-scale unsupervised data. Despite their success, they are notoriously slow at inference, which makes them difficult to deploy in real-time scenarios.
In this talk, I will introduce a simple but effective method, DeeBERT, which accelerates BERT inference by early exiting: inference samples are allowed to exit early after passing through only part of the BERT model. Experiments show that DeeBERT saves up to 36% of inference time while maintaining the same model quality. Further analyses reveal different behaviours across BERT's transformer layers, as well as redundancy in the model.
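To make the early-exit idea concrete, below is a minimal PyTorch sketch of a transformer stack where each layer is followed by a small "off-ramp" classifier, and a sample exits as soon as the classifier's prediction is confident enough (low softmax entropy). The module structure, the entropy threshold, and the use of the first token's hidden state are illustrative assumptions for this toy example, not the actual DeeBERT implementation.

```python
import torch
import torch.nn as nn


class EarlyExitEncoder(nn.Module):
    """Toy transformer stack with an off-ramp classifier after every layer."""

    def __init__(self, num_layers=12, hidden=768, num_classes=2, entropy_threshold=0.3):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=hidden, nhead=12, batch_first=True)
            for _ in range(num_layers)
        )
        # One lightweight classifier per layer: the "off-ramps".
        self.ramps = nn.ModuleList(nn.Linear(hidden, num_classes) for _ in range(num_layers))
        self.entropy_threshold = entropy_threshold

    @staticmethod
    def entropy(logits):
        # Shannon entropy of the softmax distribution; low entropy = confident.
        probs = torch.softmax(logits, dim=-1)
        return -(probs * torch.log(probs + 1e-12)).sum(dim=-1)

    def forward(self, x):
        # x: (batch=1, seq_len, hidden); early exit is decided per sample.
        for i, (layer, ramp) in enumerate(zip(self.layers, self.ramps)):
            x = layer(x)
            logits = ramp(x[:, 0])           # classify from the first token's state
            if self.entropy(logits).item() < self.entropy_threshold:
                return logits, i + 1         # confident enough: exit here
        return logits, len(self.layers)      # otherwise use the full stack


# Usage: a random "sentence" of 16 token embeddings.
model = EarlyExitEncoder().eval()
with torch.no_grad():
    logits, exit_layer = model(torch.randn(1, 16, 768))
print(f"exited after layer {exit_layer}")
```

The time saving comes from skipping all layers after the exit point; easier samples exit earlier, so the average cost per sample drops while hard samples can still use the full model.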
I will also discuss different ways to train the early-exit architecture effectively, along with other problems encountered in the process. Our work provides a new perspective on applying deep transformer-based pretrained models to downstream tasks.