Please note: This PhD defence will take place online.
Georgios Michalopoulos, PhD candidate
David R. Cheriton School of Computer Science
Supervisors: Professors Ian McKillop, Helen Chen
The digital transformation of our society is creating a tremendous amount of data at an unprecedented rate. A large part of this data is in unstructured text format. While enjoying the benefit of instantaneous data access, we are also burdened by information overload. In healthcare, clinicians have to spend a significant portion of their time reading, writing and synthesizing data in electronic patient record systems. Information overload is reported as one of the main factors contributing to physician burnout; however, information overload is not unique to healthcare. We need better practical tools to help us access the right information at the right time. This has led to a heightened interest in high-performing Natural Language Processing research and solutions.
Natural Language Processing (NLP), or Computational Linguistics, is a sub-field of computer science that focuses on analyzing and representing human language. The most recent advancements in NLP are large pre-trained contextual language models (e.g., transformer-based models), which are pre-trained on massive corpora and whose context-sensitive embeddings (i.e., learned representations of words) are used in downstream tasks. The introduction of these models has led to significant performance gains in various downstream tasks, including sentiment analysis, entity recognition, and question answering. Such models can change the embedding of a word based on its imputed meaning, which is derived from the surrounding context.
Contextual models can only encode the knowledge available in raw text corpora. Injecting structured domain-specific knowledge into these contextual models could further improve their performance and efficiency. However, this is not a trivial task. It requires a deep understanding of the model's architecture and the nature and structure of the domain knowledge incorporated into the model. Another challenge facing NLP is the "low-resource" problem, arising from a shortage of publicly available (domain-specific) large datasets for training purposes. The low-resource challenge is especially acute in the biomedical domain, where strict regulation for privacy protection prohibits many datasets from being publicly available to the NLP community. The severe shortage of clinical experts further exacerbates the lack of labeled training datasets for clinical NLP research.
We approach these challenges from the knowledge augmentation angle. This thesis explores how knowledge found in structured knowledge bases, either in general-purpose lexical databases (e.g., WordNet) or domain-specific knowledge bases (e.g., the Unified Medical Language System or the International Classification of Diseases), can be used to address the low-resource problem. We show that by incorporating domain-specific prior knowledge into a deep learning NLP architecture, we can force an NLP model to learn the associations between distinctive terminologies that it otherwise may not have the opportunity to learn due to the scarcity of domain-specific datasets.
Four distinct yet complementary strategies have been pursued. First, we investigate how contextual models can use structured knowledge contained in the lexical database WordNet to distinguish between semantically similar words. We update the input policy of a contextual model by introducing a new mix-up embedding strategy for the input embedding of the target word. We also introduce additional information, such as the degree of similarity between the definitions of the target and the candidate words. We demonstrate that this supplemental information has enabled the model to select candidate words that are semantically similar to the target word rather than those that are only appropriate for the sentence’s context.
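The two ingredients named in this paragraph, a mix-up of the target word's input embedding and a definition-similarity signal, can be illustrated with a minimal sketch. The abstract does not specify the exact formulation, so everything below is an assumption made for illustration: the mix-up is taken to be a linear interpolation between the target embedding and the mean of its synonyms' embeddings, the definition similarity is taken to be a cosine between definition vectors, and the names `mixup`, `cosine`, and `lam` are hypothetical.

```python
import math


def mixup(target_vec, synonym_vecs, lam=0.75):
    """Interpolate the target embedding with the mean synonym embedding.

    Assumption: lam weights the original target embedding and (1 - lam)
    weights the averaged synonym embedding. Not the thesis's exact policy.
    """
    if not synonym_vecs:
        return list(target_vec)
    dim = len(target_vec)
    mean_syn = [
        sum(vec[i] for vec in synonym_vecs) / len(synonym_vecs)
        for i in range(dim)
    ]
    return [lam * t + (1.0 - lam) * s for t, s in zip(target_vec, mean_syn)]


def cosine(u, v):
    """Cosine similarity, used here as a stand-in for definition similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy vectors standing in for real embeddings.
mixed = mixup([1.0, 0.0], [[0.0, 1.0]], lam=0.75)
gloss_similarity = cosine([1.0, 0.0], [1.0, 0.0])
```

In a sketch like this, the mixed vector would replace the target word's input embedding, while the similarity score would be one extra feature when ranking candidate words.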
Having successfully proven that lexical knowledge can aid a contextual model in distinguishing between semantically similar words, we extend this approach to highly specialized vocabularies such as those found in medical text. We explore whether using domain-specific (medical) knowledge from a clinical Metathesaurus (UMLS Metathesaurus) in the architecture of a transformer-based encoder model can aid the model in building 'semantically enriched' contextual representations that will benefit from both the contextual learning and the domain knowledge. We also investigate whether incorporating structured medical knowledge into the pre-training phase of a transformer-based model can incentivize the model to learn more accurately the association between distinctive terminologies. This strategy is proven to be effective through a series of benchmark comparisons with other related models.
After demonstrating the effect of structured domain (medical) knowledge on the performance of a transformer-based encoder model, we extend the medical features and illustrate that structured medical knowledge can also boost the performance of a (medical) summarization transformer-based sequence-to-sequence model. We introduce a guidance signal consisting of the medical terminologies in the input sequence. Moreover, the input policy is modified by utilizing the semantic types from UMLS, and we also propose a novel weighted loss function. Our study demonstrates the benefit of these strategies in providing a stronger incentive for the model to include relevant medical facts in the summarized output.
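One way a weighted loss could push a summarization model toward retaining medical facts is sketched below. This is not the loss proposed in the thesis, whose form the abstract does not give. It assumes a token-level negative log-likelihood in which reference tokens flagged as medical terms receive a larger weight. The function name `weighted_nll` and the parameter `medical_weight` are hypothetical.

```python
import math


def weighted_nll(token_probs, is_medical, medical_weight=2.0):
    """Weighted average negative log-likelihood over reference tokens.

    token_probs: probability the model assigned to each reference token.
    is_medical:  flags marking which reference tokens are medical terms.
    Assumption: medical tokens count medical_weight times as much as others.
    """
    weights = [medical_weight if flag else 1.0 for flag in is_medical]
    total = sum(-w * math.log(p) for w, p in zip(weights, token_probs))
    return total / sum(weights)


# The same pair of probabilities costs more when the low-probability
# token is the medical one.
loss_missed_medical = weighted_nll([0.1, 0.9], [True, False])
loss_missed_other = weighted_nll([0.9, 0.1], [True, False])
```

Under this assumption, a model that assigns low probability to a medical term is penalized more than one that makes the same error on an ordinary token.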
We further examine whether an NLP model can take advantage of both the relational information between different labels and contextual embedding information by introducing a novel attention mechanism (instead of augmenting the architecture of contextual models with structured information as described in the previous paragraphs). We tackle the challenge of automatic ICD coding, which is the task of assigning codes of the International Classification of Diseases (ICD) system to medical notes. Through a novel attention mechanism, we integrate the information from a Graph Convolutional Network (GCN) that considers the relationship between various codes with the contextual sentence embeddings of the medical notes. Our experiments reveal that this enhancement effectively boosts the model's performance in the automatic ICD coding task.
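The general idea of combining graph-derived code representations with sentence embeddings through attention can be sketched as follows. This is a simplified illustration, not the thesis's mechanism: the graph convolution uses row normalisation rather than the symmetric normalisation common in the GCN literature, the attention is a plain dot-product softmax, and all function names are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gcn_layer(adj, feats, weight):
    """One simplified graph convolution: ReLU(D^-1 (A + I) X W)."""
    n = len(adj)
    # Add self-loops so each code keeps part of its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    # Row-normalise so each code averages over itself and its neighbours.
    norm = [[value / sum(row) for value in row] for row in a_hat]
    hidden = matmul(matmul(norm, feats), weight)
    return [[max(0.0, value) for value in row] for row in hidden]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def label_attention(label_vecs, sent_vecs):
    """For each code, attend over sentence embeddings with dot-product scores.

    Returns the code-specific note vectors and the attention weights.
    """
    dim = len(sent_vecs[0])
    note_vecs, weights = [], []
    for label in label_vecs:
        scores = [sum(l * s for l, s in zip(label, sent)) for sent in sent_vecs]
        attn = softmax(scores)
        weights.append(attn)
        note_vecs.append(
            [sum(w * sent[d] for w, sent in zip(attn, sent_vecs)) for d in range(dim)]
        )
    return note_vecs, weights


# Two related codes (toy graph) and two toy sentence embeddings.
identity = [[1.0, 0.0], [0.0, 1.0]]
code_vecs = gcn_layer([[0.0, 1.0], [1.0, 0.0]], identity, identity)
note_vecs, attn = label_attention(code_vecs, [[2.0, 0.0], [0.0, 2.0]])
```

Each resulting code-specific note vector could then feed a per-code classifier. The graph step lets related codes share information before the attention step selects the sentences relevant to each code.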
The main contribution of this thesis is two-fold: (1) it contributes to the computer science literature by demonstrating how domain-specific knowledge can be effectively incorporated into contextual models to improve performance in NLP tasks that lack sufficient training resources; and (2) the knowledge augmentation strategies and the contextual models developed in this research are shown to improve NLP performance in the biomedical field, where publicly available training datasets are scarce but domain-specific knowledge bases and data standards have achieved wide adoption in electronic medical record systems.