This directory contains the following scripts:

sentSimilarity.pl
process_td.pl
rank_new.pl
new_idf.pl

and the following data files:

myDictionary.txt.sorted
stoplist.txt

Installation Steps:
- Install Perl and the CPAN modules listed below before running the scripts.
- CPAN modules: Text::Similarity::Overlaps
                Text::OverlapFinder
                Getopt::Long
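
A quick way to see which of the modules above are already installed is sketched below. The function name check_perl_modules is hypothetical and not part of this package; it only assumes that perl and the standard cpan client are on your PATH.

```shell
# Report which required Perl modules can be loaded.
# check_perl_modules is an illustrative helper, not one of the scripts above.
check_perl_modules() {
    missing=0
    for m in Text::Similarity::Overlaps Text::OverlapFinder Getopt::Long; do
        # "perl -MModule -e 1" exits non-zero when the module cannot be loaded.
        if perl -M"$m" -e 1 2>/dev/null; then
            echo "ok       $m"
        else
            echo "missing  $m (try: cpan $m)"
            missing=1
        fi
    done
    return "$missing"
}
```

Call check_perl_modules after sourcing the snippet; it returns non-zero if any module is missing.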

Steps to Produce Semantically Related Words:
Follow these steps to produce semantically related words and rank the resulting word list:
1. Given an input file that contains a list of sentences (e.g., commons.sentence), first run new_idf.pl to compute an idf score for every word in the file.
Usage: ./new_idf.pl <input file> <output prefix>

This produces a tfidf file (e.g., commons.tf) that maps each word to its idf value.
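
For intuition about what an idf score is, here is a minimal sketch. It is not new_idf.pl: it assumes the common definition idf(w) = ln(N / df(w)), treats each non-empty line as one sentence, and splits words on whitespace with no stoplist. The formula, tokenization, and output layout actually used by new_idf.pl may differ. The function name idf_sketch and the sample sentences are made up for illustration.

```shell
# Illustrative only: print "word<TAB>idf" for sentences read from stdin or files.
idf_sketch() {
    awk '
    NF == 0 { next }                   # skip empty lines
    {
        n++                            # number of sentences seen
        split("", seen)                # reset the per-sentence word set
        for (i = 1; i <= NF; i++) {
            w = tolower($i)
            if (!(w in seen)) { seen[w] = 1; df[w]++ }
        }
    }
    END {
        # df[w] = number of sentences containing w
        for (w in df) printf "%s\t%.4f\n", w, log(n / df[w])
    }' "$@" | sort
}

# Three made-up sentences: "the" occurs in all of them, so its idf is 0.
printf '%s\n' 'open the file' 'close the file' 'read the stream' | idf_sketch
```

Words that occur in every sentence score 0, and rarer words score higher.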

2. Use sentSimilarity.pl to produce semantically related words from the input file.
Usage: ./sentSimilarity.pl -i <input file> -thres <threshold (default 0.6)> -output <output file> -idf <tfidf file>

You can change the configuration (e.g., threshold, gap, whether to use idf). See the settings section of sentSimilarity.pl and our MSR'12 paper for details.

3. Use process_td.pl to process the raw output from sentSimilarity.pl. This script removes duplicates and performs stemming.
Usage: ./process_td.pl <input file> <output file (should end with .csv)> <output cluster file>

4. Use rank_new.pl to rank the output from process_td.pl (use the .csv output file, not the cluster file).
Usage: ./rank_new.pl <prefix of the output file from step 3>
The output file prefix_new_rank.csv is the final ranking file; the ranking is based on average idf value and support.
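
The four steps above can be chained roughly as follows. This is a sketch under stated assumptions, not a script shipped with this package: the function name run_pipeline and the .raw and .cluster file names are hypothetical, new_idf.pl is assumed to write its tfidf file to <prefix>.tf (as in the commons.tf example), and the prefix expected by rank_new.pl is assumed to be the step 3 output file name without .csv.

```shell
# Run all four steps from the directory that holds the scripts.
# run_pipeline is an illustrative helper; adjust file names to your setup.
run_pipeline() {
    input=$1            # sentence file, e.g. commons.sentence
    prefix=$2           # output prefix, e.g. commons
    thres=${3:-0.6}     # similarity threshold (0.6 is the documented default)

    # Step 1: idf scores. Assumes the result lands in <prefix>.tf.
    ./new_idf.pl "$input" "$prefix" || return 1

    # Step 2: raw related-word output (<prefix>.raw is a hypothetical name).
    ./sentSimilarity.pl -i "$input" -thres "$thres" \
        -output "$prefix.raw" -idf "$prefix.tf" || return 1

    # Step 3: remove duplicates and stem (<prefix>.cluster is a hypothetical name).
    ./process_td.pl "$prefix.raw" "$prefix.csv" "$prefix.cluster" || return 1

    # Step 4: rank. Assumes the prefix is the .csv name without its extension.
    ./rank_new.pl "$prefix" || return 1
}
```

For example, run_pipeline commons.sentence commons 0.6 would, under these assumptions, leave the final ranking in commons_new_rank.csv.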
