Please note: This PhD defence will take place online.
He (Richard) Bai, PhD candidate
David R. Cheriton School of Computer Science
Supervisor: Professor Ming Li
This thesis is about modeling text and speech sequences to achieve lower perplexity and better generation, and to benefit downstream language tasks. Specifically, we address the problem of modeling natural language sequences (text and speech) with Transformer-based language models. We present three new techniques that improve sequence modeling in different ways.
First, we propose Segment-Aware Language Modeling, which encodes richer positional information for language modeling by replacing the original token position encoding with a combined position encoding of paragraph, sentence, and token. By applying our approach to Transformer-XL, we train a new language model, Segatron-XL, that achieves a 6.6–7.8% relative reduction in perplexity. Additionally, BERT pretrained with our method, SegaBERT, outperforms BERT on general language understanding, sentence representation learning, and machine reading comprehension tasks. Furthermore, our SegaBERT-large model outperforms RoBERTa-large on zero-shot STS tasks. These experimental results demonstrate that the proposed Segatron is effective both for language models with relative position embeddings and for pretrained language models with absolute position embeddings.
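The core mechanism can be illustrated with a minimal PyTorch-style sketch (module and parameter names such as SegmentAwarePositionEncoding, max_para, and max_sent are hypothetical; the actual Segatron implementation differs in detail): each token is assigned a paragraph, sentence, and token index, and the sum of their embeddings replaces the single absolute position embedding.

```python
import torch
import torch.nn as nn

class SegmentAwarePositionEncoding(nn.Module):
    """Combined paragraph/sentence/token position embedding (illustrative sketch)."""

    def __init__(self, d_model, max_para=64, max_sent=128, max_tok=512):
        super().__init__()
        self.para_emb = nn.Embedding(max_para, d_model)  # paragraph index in document
        self.sent_emb = nn.Embedding(max_sent, d_model)  # sentence index in paragraph
        self.tok_emb = nn.Embedding(max_tok, d_model)    # token index in sentence

    def forward(self, para_idx, sent_idx, tok_idx):
        # Each *_idx is a LongTensor of shape (batch, seq_len); sentence and
        # token indices reset to 0 at each paragraph and sentence boundary.
        # The three embeddings are summed, replacing the usual single
        # absolute token-position embedding.
        return self.para_emb(para_idx) + self.sent_emb(sent_idx) + self.tok_emb(tok_idx)
```

In Segatron-XL the same paragraph/sentence/token decomposition is applied to the relative position encodings; the absolute-embedding sketch above corresponds more closely to the SegaBERT setting.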
Second, we propose Hypernym-Instructed Language Modeling, which maps words sharing a common WordNet hypernym to the same class and trains large neural LMs by gradually annealing from class prediction to token prediction during training. Class-based prediction leads to an implicit context aggregation for similar words and thus can improve generalization for rare words. Empirically, this curriculum learning strategy consistently reduces perplexity across several large, state-of-the-art Transformer-based models on two datasets, Wikipedia and arXiv. Our analysis shows that the performance improvement is achieved without sacrificing performance on rare words.
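A minimal sketch of the annealing schedule, assuming a precomputed token-to-hypernym-class mapping (all names here, such as anneal_targets and token_to_class, are hypothetical, and the linear decay is just one possible schedule): early in training the prediction target is the token's hypernym-class token, and as training proceeds the target increasingly becomes the token itself.

```python
import torch

def anneal_targets(token_ids, token_to_class, step, total_steps):
    """Curriculum targets: predict the hypernym class early in training,
    the exact token later (illustrative sketch).

    token_ids:      LongTensor (batch, seq_len) of target token ids
    token_to_class: LongTensor (vocab,) mapping each token id to the id of a
                    shared hypernym-class token (precomputed from WordNet);
                    tokens without a hypernym map to themselves
    """
    # Probability of keeping the class-level target decays linearly to 0.
    p_class = max(0.0, 1.0 - step / total_steps)
    class_ids = token_to_class[token_ids]
    use_class = torch.rand_like(token_ids, dtype=torch.float) < p_class
    return torch.where(use_class, class_ids, token_ids)
```

Because rare words often share a hypernym class with frequent ones, class-level targets pool their training contexts, which is the implicit context aggregation the method relies on.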
Third, we propose Alignment-Aware Acoustic and Text Modeling (A3T), which reconstructs masked acoustic signals given text input and acoustic-text alignment during training. In this way, the pretrained model can generate high-quality reconstructed spectrograms, which can be applied directly to speech editing and new-speaker TTS. Experiments show that A3T outperforms state-of-the-art models on speech editing and improves multi-speaker speech synthesis without an external speaker verification model.
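The pretraining objective can be sketched as follows (illustrative only; mask_aligned_spans and the 50% masking ratio are assumptions, and the actual A3T model uses its own alignment-aware masking and architecture): phoneme-aligned spans of the spectrogram are masked, and the model learns to reconstruct them from the text and the surrounding unmasked frames.

```python
import torch

def mask_aligned_spans(mel, alignment, mask_ratio=0.5):
    """Zero out the spectrogram frames of randomly chosen phonemes
    (illustrative sketch of alignment-aware masking).

    mel:       FloatTensor (frames, n_mels) target spectrogram
    alignment: list of (start_frame, end_frame) spans, one per phoneme
    """
    masked = mel.clone()
    mask = torch.zeros(mel.size(0), dtype=torch.bool)
    for start, end in alignment:
        if torch.rand(()) < mask_ratio:  # mask this phoneme's whole span
            masked[start:end] = 0.0
            mask[start:end] = True
    return masked, mask

# Training step (pseudocode-level, hypothetical model interface): the model
# sees the phoneme sequence plus the partially masked spectrogram and is
# trained to reconstruct the masked frames.
#   pred = model(phoneme_ids, masked_mel)
#   loss = ((pred - mel)[mask] ** 2).mean()
```

Speech editing then amounts to masking the frames to be changed and letting the pretrained model regenerate them conditioned on the edited text.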