Catching Zika Fever: Application of Crowdsourcing and Machine Learning for Tracking Health Misinformation on Twitter - Medical lexicon Dataset

Amira Ghenai

University of Waterloo

aghenai@uwaterloo.ca

This site provides the medical lexicon generated in the work presented in "Catching Zika Fever: Tracking Health Misinformation in Twitter".

See also: The full paper is available here.

We compute the medical lexicon of `infectious disease' wikipedia pages using two different corpuses:

Medical corpus (all wikipedia pages about infectious disease)
Wikipedia corpus (top words in all wikipedia)

For every word in every corpus, we compute the probabilities as follows:

In the medical corpus, we compute the probability of every word as follows: mp_w (medical corpus probability of word) = frequency of w in medical corpus / total number of words in medical corpus.
In the wikipedia corpus, we compute the probability of every word as follows: wp_w = frequency of w in wikipedia / total number of wikipedia words.
Then, for every word w in both corpus:
- If w is in the medical corpus but not in the wikipedia corpus or the opposite is true, p_w = 0.
- if w is in both the medical and wikipedia corpus, p_w = mp_w - wp_w (medical corpus probability - wikipedia corpus probability)
Finally, we sort words with p_w in descending order and pick the top words to form the corpus.

The set of medical and wikipedia lexicons may be downloaded here:

Both the medical_corpus.txt and the wikipedia_corpus.txt files contain 22123 words. Every word is in a separate line and every line is in the format of: WORD [TAB] FREQUENCY.
For more details, please read the paper.

Please cite the original publication when using the dataset:
Amira Ghenai, Yelena Mejova, 2017, January. Catching Zika Fever: Application of Crowdsourcing and Machine Learning for Tracking Health Misinformation on Twitter. The Fifth IEEE International Conference on Healthcare Informatics (ICHI 2017), Park City, Utah.