Aranea::Lexicon::Lexicon


NAME

Aranea::Lexicon::Lexicon - procedures for accessing lexicons (WordNet, CELEX, etc.)


SYNOPSIS


 use Aranea::Lexicon::Lexicon;
 
 my $root = stem('does');
 # result = [ [ 'do',  'v' ],
 #            [ 'doe', 'n' ] ]
 
 my $infl = inflect_verb('sing');
 # result =
 # [ [ 'a1S', 'sang',    ''      ],
 #   [ 'a2S', 'sang',    ''      ],
 #   [ 'a3S', 'sang',    ''      ],
 #   [ 'aP',  'sang',    ''      ],
 #   [ 'e1S', 'sing',    '@'     ],
 #   [ 'e2S', 'sing',    '@'     ],
 #   [ 'e3S', 'sings',   '@+s'   ],
 #   [ 'eP',  'sing',    '@'     ],
 #   [ 'i',   'sing',    '@'     ],
 #   [ 'pa',  'sung',    ''      ],
 #   [ 'pe',  'singing', '@+ing' ] ]
 
 my $b1 = isPreposition('of');
 # result is true
 
 my $b2 = isStopword('the');
 # result is true
 
 my $b3 = isVerbInfinitive('test');
 # result is true


DESCRIPTION

stem($word)
WordNet lemmatizer. Returns a reference to a table (an array of array references): the first column holds the lemma, the second column its part of speech. Rows are sorted by ascending lemma length.
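
As a rough illustration of how the returned table might be consumed, the sketch below walks the rows and groups lemmas by part of speech. It does not call the module: $root is a stand-in copied from the SYNOPSIS output for stem('does'), and the names %lemmas_by_pos and $shortest are chosen here for illustration only.

```perl
use strict;
use warnings;

# Stand-in for the documented result of stem('does'); in real use this
# would presumably be: my $root = stem('does');
my $root = [ [ 'do', 'v' ], [ 'doe', 'n' ] ];

# Group the lemmas by part of speech (second column).
my %lemmas_by_pos;
for my $row (@$root) {
    my ($lemma, $pos) = @$row;
    push @{ $lemmas_by_pos{$pos} }, $lemma;
}

# Rows are ordered by ascending lemma length, so row 0 is the shortest lemma.
my $shortest = $root->[0][0];

print "shortest lemma: $shortest\n";
print "$_: @{ $lemmas_by_pos{$_} }\n" for sort keys %lemmas_by_pos;
```

Because the rows are already sorted by length, taking the first row is enough when only the shortest candidate lemma is wanted.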

isPreposition($word)
Checks to see if the argument is a preposition.

isStopWord($word)
Checks to see if the argument is a stopword. The stopword list is taken from WordNet 1.6.

isVerbInfinitive($word)
Checks to see if the argument is a verb in its infinitive form. This list of verbs is taken from START.

inflect_verb($word)
Given a verb infinitive, returns its inflection table (conjugations). The first column of the table is the inflection form, the second column is the inflected verb itself, and the third column describes how the inflected form is composed.

The following canonical inflection forms are provided (first column): a1S, a2S, a3S, aP, e1S, e2S, e3S, eP, i, pa, pe.

Key to the form codes: S = singular, P = plural, b = positive, c = comparative, s = superlative, i = infinitive, p = participle, e = present tense, a = past tense, 1 = 1st person verb, 2 = 2nd person verb, 3 = 3rd person verb, r = rare form, X = headword form.
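
The form codes can be read symbol by symbol using the key above. The following sketch shows one way to do that and to index the table by form code. It is not part of the module: $infl is a stand-in holding a few rows from the SYNOPSIS output for inflect_verb('sing'), and describe_form, %key and %form are hypothetical names introduced for this example.

```perl
use strict;
use warnings;

# Stand-in rows copied from the SYNOPSIS output for inflect_verb('sing').
my $infl = [
    [ 'a1S', 'sang',    ''      ],
    [ 'e3S', 'sings',   '@+s'   ],
    [ 'pa',  'sung',    ''      ],
    [ 'pe',  'singing', '@+ing' ],
];

# Subset of the key that applies to the verb form codes.
my %key = (
    S => 'singular',   P => 'plural',
    i => 'infinitive', p => 'participle',
    e => 'present',    a => 'past',
    1 => '1st person', 2 => '2nd person', 3 => '3rd person',
);

# Hypothetical helper: spell out a form code one symbol at a time.
# Unknown symbols are kept, prefixed with '?'.
sub describe_form {
    my ($code) = @_;
    return join ' ', map { $key{$_} // "?$_" } split //, $code;
}

# Index the table by form code (first column) for direct lookup.
my %form = map { $_->[0] => $_->[1] } @$infl;

print "$_->[0] (", describe_form($_->[0]), "): $_->[1]\n" for @$infl;
```

With the table indexed this way, $form{e3S} gives the third person singular present form without scanning the rows.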

get_idf
Returns the inverse document frequency of a token t in the AQUAINT corpus. idf = log(N/d), where N is the total number of documents in the corpus, and d is the number of documents in the corpus that contain the term t.
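
To make the formula concrete, here is a small sketch that evaluates log(N/d) directly. It does not call get_idf: idf_from_counts is a hypothetical helper, the counts are made up rather than taken from AQUAINT, and the natural logarithm is an assumption, since the base is not stated here.

```perl
use strict;
use warnings;

# Hypothetical helper: idf = log(N/d), using Perl's natural-log builtin.
# The log base used by get_idf itself is not documented; base e is assumed.
sub idf_from_counts {
    my ($n_docs, $doc_freq) = @_;
    die "document frequency must be positive\n" unless $doc_freq > 0;
    return log($n_docs / $doc_freq);
}

# Made-up counts for illustration only.
my $common = idf_from_counts(1000, 1000);   # term occurs in every document
my $rare   = idf_from_counts(1000, 10);     # term occurs in 1% of documents

printf "idf(common) = %.3f, idf(rare) = %.3f\n", $common, $rare;
```

A term that appears in every document gets an idf of 0, and the value grows as the term becomes rarer.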

get_log_tf
Returns the log term frequency of a token t in the AQUAINT corpus, i.e., log(c/W), where c is the number of occurrences of token t, and W is the total number of tokens in the corpus.
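
The same kind of sketch applies to the log term frequency. Again this does not call get_log_tf: log_tf_from_counts is a hypothetical helper, the counts are invented, and the natural logarithm is assumed because the base is not stated here.

```perl
use strict;
use warnings;

# Hypothetical helper: log term frequency = log(c/W), natural log assumed.
sub log_tf_from_counts {
    my ($count, $total_tokens) = @_;
    die "count must be positive\n" unless $count > 0;
    return log($count / $total_tokens);
}

# Made-up counts: 50 occurrences in a corpus of one million tokens.
my $log_tf = log_tf_from_counts(50, 1_000_000);

printf "log tf = %.3f\n", $log_tf;
```

Since c can never exceed W, the ratio is at most 1 and the result is never positive; more frequent tokens give values closer to 0.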