BOA classifier - Bag of Wikipedia Articles

The SCM and THD algorithms were designed for English. While adapting these algorithms to other languages is conceivable, we decided to develop the Bag of Articles (BOA) algorithm, which is language-agnostic because it is based on the statistical Rocchio classifier. Since the algorithm uses Wikipedia as its source of classification data, it does not require any labeled training instances. WordNet is used in a novel way to compute term weights; it also serves as a positive term list and is used for lemmatization.
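Since BOA builds on the Rocchio classifier, the core idea can be sketched briefly: each class is represented by the centroid of the term vectors of its (Wikipedia-derived) articles, and a new document is assigned to the class whose centroid it is most similar to. The following is an illustrative sketch only, not the WikiIndex implementation; the toy documents, class labels, and function names are all invented:

```python
from collections import Counter
import math

def tf_vector(text):
    """Term-frequency vector for a whitespace-tokenized text."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(vectors):
    """Average of a class's article vectors (the Rocchio prototype)."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return Counter({t: w / len(vectors) for t, w in total.items()})

# Hypothetical "bags of articles": stand-ins for Wikipedia article text.
classes = {
    "sports": ["football is a team sport played with a ball",
               "tennis is a racket sport played on a court"],
    "music":  ["a guitar is a string instrument used in music",
               "a piano is a keyboard instrument used in music"],
}
prototypes = {label: centroid([tf_vector(d) for d in docs])
              for label, docs in classes.items()}

def classify(text):
    """Assign the text to the class with the nearest prototype."""
    v = tf_vector(text)
    return max(prototypes, key=lambda label: cosine(v, prototypes[label]))
```

In BOA the class vectors come from Wikipedia articles retrieved for each label, and WordNet-based term weighting replaces the plain term frequencies used here.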

Demo

  • Web interface is planned but not yet available
  • The application can be downloaded as a .jar file: Documentation, Download

Example index and experiments

  • microtest (given as example in the documentation): index and data, config file and sample results
    • run java -jar WikiIndex.jar experiment.xml
  • test (still very small but with real data): index and data, config file, GAConfig file, sample results.
    • run java -jar WikiIndex.jar experiment.xml
    • run java -jar WikiIndex.jar experiment.xml GAConfig.accuracy for evolutionary optimization of the parameters in experiment.xml
  • Experiments on Czech Traveler and WordSim353 datasets
    • the English Wikipedia index required by these experiments is about 20 GB, so it is currently not provided for space reasons
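The GAConfig.accuracy run above performs an evolutionary search over the parameters in experiment.xml, using classification accuracy as the fitness. A minimal sketch of that idea, assuming a stand-in numeric fitness function (in WikiIndex the fitness would be the accuracy reported for an experiment run; all names and values here are hypothetical):

```python
import random

def fitness(params):
    # Stand-in for classifier accuracy on a validation set;
    # this toy surface has its maximum at (3, -1).
    x, y = params
    return -(x - 3.0) ** 2 - (y + 1.0) ** 2

def evolve(pop_size=30, generations=60, mutation=0.3, seed=0):
    """Keep the fittest half, breed children by averaging parents
    plus Gaussian mutation, and return the best individual found."""
    rng = random.Random(seed)
    pop = [(rng.uniform(-10, 10), rng.uniform(-10, 10))
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            children.append(((a[0] + b[0]) / 2 + rng.gauss(0, mutation),
                             (a[1] + b[1]) / 2 + rng.gauss(0, mutation)))
        pop = survivors + children
    return max(pop, key=fitness)
```

The real search mutates the parameter values stored in experiment.xml rather than a pair of floats, but the select-crossover-mutate loop is the same.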

Example index and experiments - SCM (WordNet similarity measures)

  • Czech Traveler Dataset: data, experiments
    • run e.g. with java -jar WikiIndex.jar WordNet-all_SSM.xml
  • WordSim353: data, experiments
    • run e.g. with java -jar WikiIndex.jar WordNet-all_SSM.xml

Note: before running the experiments, adjust the paths in the experiment config and GAConfig files!