BOA classifier - Bag of Wikipedia Articles

The SCM and THD algorithms were designed for English. While adapting them to other languages is conceivable, we decided to develop the Bag of Articles (BOA) algorithm, which is language-agnostic because it is based on the statistical Rocchio classifier. Since the algorithm uses Wikipedia as its source of data for classification, it does not require any labeled training instances. WordNet is used in a novel way to compute term weights; it also serves as a positive term list and for lemmatization.
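
To make the Rocchio idea concrete, the following is a minimal sketch of a centroid-based classifier, not the BOA implementation itself. It assumes plain term counts as weights (the WordNet-based weighting, positive term list and lemmatization are omitted), and the toy "articles", the class labels and all function names are hypothetical stand-ins for Wikipedia pages and the real index.

```python
import math
from collections import Counter


def tf_vector(text):
    """Map lowercased whitespace tokens to raw term counts."""
    return Counter(text.lower().split())


def centroid(vectors):
    """Average a list of sparse term vectors into one prototype vector."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return {term: count / len(vectors) for term, count in total.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(text, prototypes):
    """Return the label whose prototype is most similar to the text."""
    query = tf_vector(text)
    return max(prototypes, key=lambda label: cosine(query, prototypes[label]))


# Hypothetical toy "articles" standing in for Wikipedia pages per class.
ARTICLES = {
    "hotel": ["hotel room reception guest stay", "guest room booking hotel"],
    "restaurant": ["restaurant menu food waiter", "food dinner menu chef"],
}

# One prototype (centroid) per class, built from unlabeled article text only.
PROTOTYPES = {
    label: centroid([tf_vector(doc) for doc in docs])
    for label, docs in ARTICLES.items()
}

print(classify("booking a room for one guest", PROTOTYPES))
```

In this sketch no labeled training instances are needed: each class is described only by the articles assigned to it, and an input text is given the label of the nearest class centroid.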


  • A web interface is planned but not yet available
  • The application can be downloaded as a .jar file: Documentation, Download

Example index and experiments

  • microtest (given as example in the documentation): index and data, config file and sample results
    • run java -jar WikiIndex.jar experiment.xml
  • test (still very small but with real data): index and data, config file, GAConfig file, sample results.
    • run java -jar WikiIndex.jar experiment.xml
    • run java -jar WikiIndex.jar experiment.xml GAConfig.accuracy for evolutionary optimization of the parameters in experiment.xml
  • Experiments on Czech Traveler and WordSim353 datasets
    • the English Wikipedia index required by these experiments is about 20 GB, so it is currently not provided for space reasons

Example index and experiments - SCM (WordNet similarity measures)

  • Czech Traveler Dataset: data, experiments
    • run with, e.g., java -jar WikiIndex.jar WordNet-all_SSM.xml
  • WordSim353: data, experiments
    • run with, e.g., java -jar WikiIndex.jar WordNet-all_SSM.xml

Note that before running the experiments you need to adjust the paths in the experiment config and GAConfig files!