BOA Configuration

This document lists the configuration options for our classifier program, WikiIndex.jar. While it is primarily an implementation of BOA, it also supports WordNet similarity measures, which are at the core of the SCM algorithm.

The result of the program is a Spearman rank correlation coefficient or accuracy, depending on the experiment type.


1  Experiment Types

The implementation supports several experiment types. The root element of the configuration file names the ExperimentConfig class that is used to parse the file (e.g. evaluation.BOAExperimentConfig in the config file from Sec. 4). The class that carries out the experiment has the name of the root element without the “Config” suffix, e.g. evaluation.BOAExperiment.
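The naming rule above can be illustrated with a short Python sketch. It is not part of WikiIndex.jar; the function name experiment_class_name is hypothetical and only mirrors the rule that the experiment class is the root element name without the “Config” suffix.

```python
import xml.etree.ElementTree as ET

CONFIG_SUFFIX = "Config"


def experiment_class_name(config_xml: str) -> str:
    """Return the experiment class name implied by the root element (hypothetical helper)."""
    root_tag = ET.fromstring(config_xml).tag
    if not root_tag.endswith(CONFIG_SUFFIX):
        raise ValueError(f"root element {root_tag!r} does not end with {CONFIG_SUFFIX!r}")
    # Strip the suffix: evaluation.BOAExperimentConfig -> evaluation.BOAExperiment
    return root_tag[: -len(CONFIG_SUFFIX)]


example = (
    "<evaluation.BOAExperimentConfig>"
    "<experimentName>testExperiment</experimentName>"
    "</evaluation.BOAExperimentConfig>"
)
print(experiment_class_name(example))
```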

The following experiment types are currently defined:

  • BOAExperiment is intended for classification experiments with the BOA algorithm. The ground truth is given in the form of a correct class for each unlabeled instance. The return value of this experiment is accuracy, accuracy = correct/all, where correct is the number of testing instances for which the classification result matches the entry in the ground-truth file, and all is the total number of testing instances involved.
  • SCMExperiment is intended for classification experiments as described above with SCM.
  • WordSim353 is intended for experiments on the WordSim353 collection, where the BOA classifier is used to compute a similarity between two input instances. The ground-truth is given in the form of similarity values given for the input word pairs. The configuration file for this type of experiment does not contain the EntityClassifierTestingConfig element. The input dataset consists of pairs of terms, each pair being assigned a similarity score. These similarity scores can be translated to a total order of the pairs. The result of the experiment is a value of the Spearman Rank correlation coefficient.
  • WordSim353WordNet is intended for experiments on WordSim353 with SCM.
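The two result measures named in the list above can be sketched in Python as follows. This is a minimal illustration, not the code used by WikiIndex.jar: the function names are hypothetical, and the treatment of ties (tied values share the mean of their ranks) is an assumption, since this document does not state how ties are handled.

```python
from typing import List, Sequence


def accuracy(predicted: Sequence[str], ground_truth: Sequence[str]) -> float:
    """accuracy = correct / all, as defined for the classification experiments."""
    if not ground_truth or len(predicted) != len(ground_truth):
        raise ValueError("need two non-empty sequences of equal length")
    correct = sum(p == g for p, g in zip(predicted, ground_truth))
    return correct / len(ground_truth)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions (assumption)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Spearman rank correlation computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one ranking is constant")
    return cov / (var_x * var_y) ** 0.5
```

For a WordSim353-style run, xs would hold the similarity scores from the ground truth and ys the similarities computed by the classifier for the same word pairs.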

Individual experiments are executed using the evaluation.ExperimentRunner class with one argument – the path to the XML configuration file:

 java -jar WikiIndex.jar ExperimentConfig.xml

2  Experiment Configuration File

The experiment configuration file is an XML file with a number of parameters, which can be divided into several groups: global, search-related, modality, and term-weighting functions.

The names of parameters often include the full path to the class that uses the parameter. For the sake of brevity, the configuration files were simplified by abridging these rather lengthy parameter names. Parameters common to both SCM and BOA are listed in Sec. 2.1. Settings specific to BOA are listed in Subs. 2.2, and settings specific to SCM in Subs. 2.3.

In addition to the description of the parameters, we give two examples. An example of a BOA experiment on a classification task is given in Sec. 4. An example of SCM on a WordSim353 task is given in Sec. 5.

2.1  Common Parameters

Table 1 gives an overview of the majority of the available parameters that are common to BOA and SCM. The parameters additionalDebugFilesDir, debugLogPath and protocolPath can be omitted. In that case, the directory where the experiment file resides is taken as BASE, and protocolPath is set to BASE/protocol.xml, additionalDebugFilesDir to the BASE/debug_details directory and debugLogPath to BASE/debug.log.
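The fallback to BASE can be sketched as follows. This is an illustration only: the helper name resolve_output_paths is hypothetical, and it reads the BASE/debug_details default as belonging to additionalDebugFilesDir.

```python
from pathlib import Path
from typing import Dict, Optional


def resolve_output_paths(
    config_file: str,
    protocol_path: Optional[str] = None,
    debug_log_path: Optional[str] = None,
    additional_debug_files_dir: Optional[str] = None,
) -> Dict[str, Path]:
    """Fill in missing output locations relative to BASE (the config file's directory)."""
    base = Path(config_file).parent
    return {
        "protocolPath": Path(protocol_path) if protocol_path else base / "protocol.xml",
        "debugLogPath": Path(debug_log_path) if debug_log_path else base / "debug.log",
        "additionalDebugFilesDir": (
            Path(additional_debug_files_dir)
            if additional_debug_files_dir
            else base / "debug_details"
        ),
    }


# The config path below is a made-up example.
paths = resolve_output_paths("experiments/test/BOAConfig.xml")
for name, path in paths.items():
    print(name, "->", path)
```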


Parameter                 Type     Description
Global parameters:
experimentClass           enum     either evaluation.WordSim353 (BOA) or evaluation.WordSim353WordNet (SCM) for computation on WordSim353 or a similar dataset, with Spearman correlation as the result; or evaluation.BOAExperiment (BOA) or evaluation.SCMExperiment (SCM) for a classification task, with accuracy as the result
Global path parameters:
loggerLevel               enum     logging granularity (ERROR, INFO, DEBUG, TRACE)
consoleLoggingEnabled     boolean  if set to true, log messages will also be sent to standard output
protocolPath              string   the detailed result of the experiment run is saved to this file
debugLogPath              string   logging messages will be saved to this file
additionalDebugFilesDir   string   if loggerLevel is set to DEBUG, one .csv file per vector similarity computation will be saved to this directory
Table 1: Global technical parameters.

2.2  BOA Specific Parameters


Global parameters
Parameter            Type     Description
similarityFunction   enum     either dotProduct or cosineSim
wiki_linksDir        string   path to Lucene wiki.links directory
wiki_mainIndexDir    string   path to Lucene wiki directory
wiki_useRAMDir       boolean  Lucene index will be loaded to RAM Directory. Note that for English Wikipedia index this requires an excessive amount of RAM.
Table 2: Global Parameters.
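For orientation, the two similarityFunction values can be illustrated on sparse term vectors. The sketch below assumes that dotProduct and cosineSim denote the standard dot product and cosine similarity; the dictionary representation (term mapped to weight) and the function names are illustrative assumptions, not the data structures used by the implementation.

```python
import math
from typing import Dict

TermVector = Dict[str, float]  # term -> weight (illustrative representation)


def dot_product(u: TermVector, v: TermVector) -> float:
    """Sum of weight products over the terms shared by both vectors."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(weight * v.get(term, 0.0) for term, weight in u.items())


def cosine_sim(u: TermVector, v: TermVector) -> float:
    """Dot product normalised by the Euclidean lengths of both vectors."""
    norm = math.sqrt(dot_product(u, u)) * math.sqrt(dot_product(v, v))
    return dot_product(u, v) / norm if norm else 0.0


u = {"t1": 1.0, "t2": 2.0}
v = {"t2": 3.0, "t3": 4.0}
print(dot_product(u, v), cosine_sim(u, v))
```

Unlike the dot product, the cosine variant does not depend on the length of the vectors, which matters when term vectors are built from articles of very different sizes.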


EntityClassifierTrainingConfig – Training parameters
Parameter                             Type     Description
entityNamesAreWikipediaArticleTitles  boolean  the input strings are considered as titles of Wikipedia articles (true) or as noun phrases (false)
maxTermVectorLength                   integer  maximum length of term vectors returned by σ
skipUnresolvedTitles                  boolean  if the entity cannot be mapped to Wikipedia, the training quits if set to false; if set to true, the training continues with the unmapped training class being omitted
stopWordListPath                      string   empty if a stop word list is not to be used
Table 3: Training Parameters


EntityClassifierSearchConfig
Parameter                        Type     Description
disambiguationCutoff             integer  number of best matching articles to retrieve per query. Default is 1, greater values leave space for disambiguation
searchBackupPages                integer  number of search results to ask the search engine in addition to disambiguationCutoff. The excess hits are used if the system fails to parse some of the top disambiguationCutoff results.
searchServiceURL                 string   location of the routerAPISeach.php script wrapping the Wikipedia Lucene Extension search
searchType                       enum     possible values are rawexplain, search. Refer to Lucene search documentation.
pathToFileWithLuceneArticleKeys  string   entity names (first column) are replaced with Wikipedia article titles (second column).
searchProvider                   enum     Service – search by service at searchServiceURL, KeysFromFile – use keys at pathToFileWithLuceneArticleKeys
Table 4: Search Parameters


MultipleWeightSparseTermVectorType
Parameter    Type  Description
TV_Char      enum  always value TF
TV_Scope     enum  always value entity
TV_UseType   enum  value training or testing
Table 5: Basic Term Weighting Parameters


Basic Wordnet Config
Parameter                 Type     Description
discardTermsNotInWordnet  boolean
TVChar_Wordnet_JWNL:
infoContentFileName       string   path to file with precomputed information content values
jwnlinitPath              string   path to file with WordNet setup (options: file-based/memory-based); influences speed but not results
TVChar_Wordnet_JWSL:
wordnetLucenefolder       string   path to the JWSL Lucene index directory
Table 6: Basic WordNet Config


Modalities: TV_Linkxxx
Possible values: TV_LinkSimByCat, TV_LinkOut, TV_LinkIN
Parameter                 Type     Description
crawlingDepth             integer  corresponds to the L_{max}^{m} threshold
WeightingFactor           ⟨0,1⟩    corresponds to W_{m}
WeightFactor_levelx       ⟨0,1⟩    corresponds to W_{m,l}. Must be set for x = 0 … crawlingDepth.
maxLinksToFollow          integer  number of articles in level n+1 related to article a on level n to use
articleSelectionStrategy  enum     firstn or mostsim
Table 7: Modality Config Parameters


Term weighting functions: TVChar_xxx
where xxx ∈ {TermFrequency, IDFentireWikipedia, IDFtrainingSet, Wordnet_Aggregate, Wordnet_JWNL, Wordnet_JWSL}
Parameter            Type   Description
WeightFactor_levelx  ⟨0,1⟩  corresponds to weight W_{m,l,t}. Must be set for x = 0 … crawlingDepth.
Table 8: Term Weighting Function Config Parameters


Additional settings for TVChar_Wordnet_xxx
where xxx ∈ {Aggregate, JWNL, JWSL}
Parameter                                     Type   Description
roundToZeroIfUnderWordnetSimThreshold_levelx  float  corresponds to weight T_{l}^{low}; this property must be present for x = 0 … crawlingDepth
roundToOneIfAboveWordnetSimThreshold_levelx   float  corresponds to weight T_{l}^{high}; this property must be present for x = 0 … crawlingDepth
Additional settings for TVChar_Wordnet_JWNL:
Wordnet_simMetric                             enum   shef.nlp.wordnet.similarity.{JCn,Lin}
Additional settings for TVChar_Wordnet_JWSL:
Wordnet_simMetric                             enum   {Resnik, Jiang, Lin, Pirro and Seco}
Additional settings for TVChar_Wordnet_Aggregate:
WordnetWeight                                 nested structure
Table 9: WordNet Term Weighting Function Specific Config Parameters

2.3  SCM Specific Parameters

SCM uses a subset of the parameters available for BOA. There are only two SCM-specific parameters:

  • the JWSLMeasures element contains a semicolon-separated list of JWSL WordNet measures to be used,
  • the JWNLMeasures element contains a semicolon-separated list of JWordnetSim measures to be used.
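As an illustration of the list format, the following sketch reads both elements from a configuration fragment and splits them on semicolons. The helper name read_measures is hypothetical; the element values are taken from the sample configuration in Sec. 5, trimmed to the two relevant elements.

```python
import xml.etree.ElementTree as ET
from typing import List


def read_measures(config_xml: str, element: str) -> List[str]:
    """Split the semicolon-separated content of a direct child of the root element."""
    text = ET.fromstring(config_xml).findtext(element, default="")
    return [measure.strip() for measure in text.split(";") if measure.strip()]


fragment = """
<evaluation.WordSim353WordNetConfig>
 <JWSLMeasures>Resnik;Jiang;Lin;Pirro and Seco</JWSLMeasures>
 <JWNLMeasures>shef.nlp.wordnet.similarity.JCn;shef.nlp.wordnet.similarity.Lin</JWNLMeasures>
</evaluation.WordSim353WordNetConfig>
"""
print(read_measures(fragment, "JWSLMeasures"))
print(read_measures(fragment, "JWNLMeasures"))
```

Note that a measure name may contain spaces (Pirro and Seco), so only the semicolon acts as a separator.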

3  Parameter Estimation

Parameter estimation is executed by the following command:

 java -jar WikiIndex.jar BOAConfig.xml GAConfig

The two arguments are:

  • the path to the XML file, which is a normal BOA configuration file as exemplified in Sec. 4 (BOAConfig.xml),
  • the path to the Genetic Algorithm configuration file (GAConfig), which contains the settings of the genetic algorithm and a list of parameters that are subject to optimization.

3.1  GAConfig Configuration File

This file consists of two parts. The first part contains generic settings for the genetic algorithm, and the second part defines the features that are subject to evolution. The syntax for entries in the first part is simple: the name of the parameter is followed by a space and then by the value.

The syntax for the second part is as follows:

parameter name,min value,max value,parameter type,context  

The context is given by a regular expression. The name of the parameter is searched in the config file fragment matching the context and replaced by a new value.
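The two line formats can be read with a few lines of Python. The sketch below is an illustration, not the parser used by the implementation: the names Feature, parse_setting and parse_feature are hypothetical. It relies on one observation from the example listing in this section, namely that enum features put their semicolon-separated alternatives in the min value field and leave the max value field empty.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Feature:
    name: str
    min_value: str
    max_value: str
    kind: str
    context: str

    def enum_values(self) -> List[str]:
        """Alternatives of an enum feature (stored in the min value field)."""
        return self.min_value.split(";") if self.kind == "enum" else []


def parse_setting(line: str) -> Tuple[str, str]:
    """First part: parameter name, a space, then the value (which may contain spaces)."""
    name, value = line.strip().split(None, 1)
    return name, value


def parse_feature(line: str) -> Feature:
    """Second part: name,min,max,type,context; the context keeps any further commas."""
    fields = [field.strip() for field in line.split(",", 4)]
    if len(fields) != 5:
        raise ValueError(f"expected 5 comma-separated fields, got {len(fields)}")
    feature = Feature(*fields)
    re.compile(feature.context)  # fail early if the context is not a valid regex
    return feature


depth = parse_feature(
    "wikiindex.characteristic.TV_LinkOut_crawlingDepth, 0;1;2, , enum, "
    "TV_LinkOut.*?</wikiindex.termvector.MultipleWeightSparseTermVectorType>"
    ".*?EntityClassifierTrainingConfig"
)
print(depth.name, depth.enum_values())
print(parse_setting("maxGenerations 50"))
```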

An example of GAConfig file for a BOA classification experiment follows.

This example defines 3 features that do not depend on phase, 17 features for the training phase, and 9 features for the classification phase.

 
maxGenerations 50
populationSize 60
maxThreads 8
maxGensWithoutImprovement 5
mutationProb 0.2
experimentExecutionType separateJVM
executionCommandforJVMExecutionType java -jar -Xmx2000M ~/WikiIndex.jar 
debugLogPath /home/tomas/code/WIKIENTITYCLAS/WikiIndex/experiments/test/GA.log

wikiindex.characteristic.TVChar_Wordnet_JWNL_discardTermsNotInWordnet, true;false, , enum, .*
wikiindex.config.EntityClassifierTrainingConfig_maxTermVectorLength, 10, 50, gaussianInteger, .*
evaluation.BOAExperimentConfig_similarityFunction, dotProduct;cosineSim, ,enum, .*

wikiindex.characteristic.TV_LinkOut_crawlingDepth, 0;1;2, , enum, TV_LinkOut.*?</wikiindex.termvector.MultipleWeightSparseTermVectorType>.*?EntityClassifierTrainingConfig
wikiindex.characteristic.TV_LinkOut_WeightFactor_level0, 0, 1, float, TV_LinkOut.*?</wikiindex.termvector.MultipleWeightSparseTermVectorType>.*?EntityClassifierTrainingConfig
wikiindex.characteristic.TV_LinkOut_WeightFactor_level1, 0, 1, float, TV_LinkOut.*?</wikiindex.termvector.MultipleWeightSparseTermVectorType>.*?EntityClassifierTrainingConfig
wikiindex.characteristic.TV_LinkOut_WeightFactor_level2, 0, 1, float, TV_LinkOut.*?</wikiindex.termvector.MultipleWeightSparseTermVectorType>.*?EntityClassifierTrainingConfig
wikiindex.characteristic.TV_LinkOut_maxLinksToFollow, 1, 20, integer, TV_LinkOut.*?</wikiindex.termvector.MultipleWeightSparseTermVectorType>.*?EntityClassifierTrainingConfig
wikiindex.characteristic.TV_LinkOut_weightingFactor, 0, 1, float, TV_LinkOut.*?</wikiindex.termvector.MultipleWeightSparseTermVectorType>.*?EntityClassifierTrainingConfig
wikiindex.characteristic.TV_LinkOut_articleSelectionStrategy, firstn;mostsimilar, ,enum, TV_LinkOut.*?</wikiindex.termvector.MultipleWeightSparseTermVectorType>.*?EntityClassifierTrainingConfig
wikiindex.characteristic.TV_LinkOut_aggregationType, CustomAggregator_PreserveBasicWeight;WeightedGeometricAverage;CustomAggregator, ,enum, TV_LinkOut.*?</wikiindex.termvector.MultipleWeightSparseTermVectorType>.*?EntityClassifierTrainingConfig

wikiindex.characteristic.TVChar_TermFrequency_WeightFactor_level0, 0, 1, float, TV_LinkOut.*?</wikiindex.termvector.MultipleWeightSparseTermVectorType>.*?EntityClassifierTrainingConfig
wikiindex.characteristic.TVChar_TermFrequency_WeightFactor_level1, 0, 1, float, TV_LinkOut.*?</wikiindex.termvector.MultipleWeightSparseTermVectorType>.*?EntityClassifierTrainingConfig
wikiindex.characteristic.TVChar_TermFrequency_WeightFactor_level2, 0, 1, float, TV_LinkOut.*?</wikiindex.termvector.MultipleWeightSparseTermVectorType>.*?EntityClassifierTrainingConfig
wikiindex.characteristic.TVChar_IDFentireWikipedia_WeightFactor_level0, 0, 1, float, TV_LinkOut.*?</wikiindex.termvector.MultipleWeightSparseTermVectorType>.*?EntityClassifierTrainingConfig
wikiindex.characteristic.TVChar_IDFentireWikipedia_WeightFactor_level1, 0, 1, float, TV_LinkOut.*?</wikiindex.termvector.MultipleWeightSparseTermVectorType>.*?EntityClassifierTrainingConfig
wikiindex.characteristic.TVChar_IDFentireWikipedia_WeightFactor_level2, 0, 1, float, TV_LinkOut.*?</wikiindex.termvector.MultipleWeightSparseTermVectorType>.*?EntityClassifierTrainingConfig
wikiindex.characteristic.TVChar_IDFtrainingSet_WeightFactor_level0, 0, 1, float, TV_LinkOut.*?</wikiindex.termvector.MultipleWeightSparseTermVectorType>.*?EntityClassifierTrainingConfig
wikiindex.characteristic.TVChar_IDFtrainingSet_WeightFactor_level1, 0, 1, float, TV_LinkOut.*?</wikiindex.termvector.MultipleWeightSparseTermVectorType>.*?EntityClassifierTrainingConfig
wikiindex.characteristic.TVChar_IDFtrainingSet_WeightFactor_level2, 0, 1, float, TV_LinkOut.*?</wikiindex.termvector.MultipleWeightSparseTermVectorType>.*?EntityClassifierTrainingConfig

wikiindex.characteristic.TV_LinkIN_crawlingDepth, 0;1, , enum, EntityClassifierTestingConfig.*?TV_LinkIN.*?</wikiindex.termvector.MultipleWeightSparseTermVectorType>
wikiindex.characteristic.TV_LinkIN_WeightFactor_level0, 0, 1, float, EntityClassifierTestingConfig.*?TV_LinkIN.*?</wikiindex.termvector.MultipleWeightSparseTermVectorType>
wikiindex.characteristic.TV_LinkIN_WeightFactor_level1, 0, 1, float, EntityClassifierTestingConfig.*?TV_LinkIN.*?</wikiindex.termvector.MultipleWeightSparseTermVectorType>
wikiindex.characteristic.TV_LinkIN_maxLinksToFollow, 1, 20, integer, EntityClassifierTestingConfig.*?TV_LinkIN.*?</wikiindex.termvector.MultipleWeightSparseTermVectorType>
wikiindex.characteristic.TV_LinkIN_weightingFactor, 0, 1, float, EntityClassifierTestingConfig.*?TV_LinkIN.*?</wikiindex.termvector.MultipleWeightSparseTermVectorType>
wikiindex.characteristic.TV_LinkIN_articleSelectionStrategy, firstn;mostsimilar, ,enum, EntityClassifierTestingConfig.*?TV_LinkIN.*?</wikiindex.termvector.MultipleWeightSparseTermVectorType>
wikiindex.characteristic.TV_LinkIN_aggregationType, CustomAggregator_PreserveBasicWeight;WeightedGeometricAverage;CustomAggregator, ,enum, EntityClassifierTestingConfig.*?TV_LinkIN.*?</wikiindex.termvector.MultipleWeightSparseTermVectorType>

wikiindex.characteristic.TVChar_TermFrequency_WeightFactor_level0, 0, 1, float, EntityClassifierTestingConfig.*?TV_LinkIN.*?</wikiindex.termvector.MultipleWeightSparseTermVectorType>
wikiindex.characteristic.TVChar_TermFrequency_WeightFactor_level1, 0, 1, float, EntityClassifierTestingConfig.*?TV_LinkIN.*?</wikiindex.termvector.MultipleWeightSparseTermVectorType>

It should be noted that no integrity checks are performed to ensure that value changes within the provided bounds will generate a valid configuration file. In the listing provided above, changing

wikiindex.characteristic.TV_LinkOut_crawlingDepth, 0;1;2, , enum, TV_LinkOut

to

wikiindex.characteristic.TV_LinkOut_crawlingDepth, 0;1;2;3, , enum, TV_LinkOut

can result in an invalid configuration if the crawlingDepth feature is set to value 3 through mutation. In this case, the implementation will search for level 3 parameters in the BOA Config XML file (refer to Sec. 4), such as:

<LinkOut_WeightFactor_level3>*</LinkOut_WeightFactor_level3>

If the corresponding parameters are not present, the program will finish with an error. However, the BOA config file can contain extra parameters. For example, setting crawlingDepth to 1 for the in-link modality will result in a valid configuration; the extra parameters for level 2, which may be present in the configuration file, will be ignored. An example of such a parameter is

<LinkIN_WeightFactor_level2>*</LinkIN_WeightFactor_level2>
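Since the program itself performs no such check, a small pre-flight test can be run before starting the genetic algorithm. The sketch below is an assumption-laden illustration: the helper name missing_weight_levels is hypothetical, it looks at the whole text it is given rather than at one phase, and it only covers the modality weight factors.

```python
import re
from typing import List


def missing_weight_levels(config_xml: str, prefix: str, crawling_depth: int) -> List[str]:
    """Return the <prefix>_WeightFactor_levelX tags absent for X = 0 .. crawling_depth."""
    missing = []
    for level in range(crawling_depth + 1):
        tag = f"{prefix}_WeightFactor_level{level}"
        pattern = rf"<{re.escape(tag)}>.*?</{re.escape(tag)}>"
        if not re.search(pattern, config_xml, re.DOTALL):
            missing.append(tag)
    return missing


link_out = """
<LinkOut_WeightFactor_level0>0.5</LinkOut_WeightFactor_level0>
<LinkOut_WeightFactor_level1>0.4</LinkOut_WeightFactor_level1>
<LinkOut_WeightFactor_level2>0.1</LinkOut_WeightFactor_level2>
"""
print(missing_weight_levels(link_out, "LinkOut", 2))  # depth 2 is covered
print(missing_weight_levels(link_out, "LinkOut", 3))  # depth 3 lacks level 3
```

Running the check for the largest crawlingDepth value listed in the GAConfig file shows in advance whether a mutation could produce an invalid configuration.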

4  BOA Experiment Config Example

Note that some lines which are not needed by the example, but are required for various integrity and technical reasons, were omitted from the listing.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<evaluation.BOAExperimentConfig>
 <additionalDebugFilesDir>debug-details</additionalDebugFilesDir>
 <consoleLoggingEnabled>true</consoleLoggingEnabled>
 <debugLogPath>debug.log</debugLogPath>
 <experimentClass>evaluation.BOAExperiment</experimentClass>
 <experimentName>testExperiment</experimentName>
 <groundtruthFile>groundtruth.csv</groundtruthFile>
 <groundtruth_col>0</groundtruth_col>
 <loggerLevel>DEBUG</loggerLevel>
 <protocolPath>protocolf1.csv</protocolPath>
 <serializationPath>serializedClassifier</serializationPath>
 <test_col>0</test_col>
 <test_from>0</test_from>
 <test_to>0</test_to>
 <testingFile>testing.csv</testingFile>
 <train_entityname_col>0</train_entityname_col>
 <train_col>1</train_col>
 <train_from>0</train_from>
 <train_to>1</train_to>
 <trainingFile>training.csv</trainingFile>
 <similarityFunction>cosineSim</similarityFunction> 
 <wiki_linksDir>wiki.links</wiki_linksDir>
 <wiki_mainIndexDir>wiki</wiki_mainIndexDir>
 <wikiindex.config.SerializationPolicyEnum>noserialize</wikiindex.config.SerializationPolicyEnum>
 <wikiindex.config.EntityClassifierTrainingConfig>
  <wikiindex.config.EntityClassifierTrainingConfig>
   <entityNamesAreWikipediaArticleTitles>true</entityNamesAreWikipediaArticleTitles>
   <maxTermVectorLength>10</maxTermVectorLength>
   <skipUnresolvedTitles>false</skipUnresolvedTitles>
   <stopWordListPath>stopwordlist.txt</stopWordListPath>
   <wikiindex.config.EntityClassifierSearchConfig>
    <wikiindex.config.EntityClassifierSearchConfig>
     <disambiguationCutoff>1</disambiguationCutoff>
     <searchBackupPages>10</searchBackupPages>
     <searchServiceURL>routerAPISeach.php</searchServiceURL>
     <searchType>search</searchType>
     <pathToFileWithLuceneArticleKeys>searchkeys.csv</pathToFileWithLuceneArticleKeys>
     <searchProvider>KeysFromFile</searchProvider>
    </wikiindex.config.EntityClassifierSearchConfig>
   </wikiindex.config.EntityClassifierSearchConfig>
   <wikiindex.termvector.MultipleWeightSparseTermVectorType>
    <wikiindex.termvector.MultipleWeightSparseTermVectorType>
     <wikiindex.characteristic.TV_Char>TF</wikiindex.characteristic.TV_Char>
     <wikiindex.characteristic.TV_Scope>entity</wikiindex.characteristic.TV_Scope>
     <wikiindex.characteristic.TV_UseType>training</wikiindex.characteristic.TV_UseType>
     <LinkType>
      <wikiindex.characteristic.TV_LinkOut>
       <LinkOut_WeightFactor_level0>0.5</LinkOut_WeightFactor_level0>
       <LinkOut_WeightFactor_level1>0.4</LinkOut_WeightFactor_level1>
       <LinkOut_WeightFactor_level2>0.1</LinkOut_WeightFactor_level2>
       <LinkOut_crawlingDepth>2</LinkOut_crawlingDepth>
       <LinkOut_maxLinksToFollow>3</LinkOut_maxLinksToFollow>
       <LinkOut_weightingFactor>0.4</LinkOut_weightingFactor>
       <LinkOut_articleSelectionStrategy>firstn</LinkOut_articleSelectionStrategy>
       <LinkOut_aggregationType>WeightedGeometricAverage</LinkOut_aggregationType>
      </wikiindex.characteristic.TV_LinkOut>
     </LinkType>
     <TermVectorChars>
      <wikiindex.characteristic.TVChar_TermFrequency>
       <TermFrequency_WeightFactor_level0>0.3</TermFrequency_WeightFactor_level0>
       <TermFrequency_WeightFactor_level1>0.4</TermFrequency_WeightFactor_level1>
       <TermFrequency_WeightFactor_level2>0.5</TermFrequency_WeightFactor_level2>
      </wikiindex.characteristic.TVChar_TermFrequency>
      <wikiindex.characteristic.TVChar_IDFentireWikipedia>
       <IDFentireWikipedia_WeightFactor_level0>0.7</IDFentireWikipedia_WeightFactor_level0>
       <IDFentireWikipedia_WeightFactor_level1>0.6</IDFentireWikipedia_WeightFactor_level1>
       <IDFentireWikipedia_WeightFactor_level2>0.5</IDFentireWikipedia_WeightFactor_level2>
      </wikiindex.characteristic.TVChar_IDFentireWikipedia>
     </TermVectorChars>
    </wikiindex.termvector.MultipleWeightSparseTermVectorType>
    <wikiindex.termvector.MultipleWeightSparseTermVectorType>
     <wikiindex.characteristic.TV_Char>TF</wikiindex.characteristic.TV_Char>
     <wikiindex.characteristic.TV_Scope>entity</wikiindex.characteristic.TV_Scope>
     <wikiindex.characteristic.TV_UseType>training</wikiindex.characteristic.TV_UseType>
     <LinkType>
      <wikiindex.characteristic.TV_LinkIN>
       <LinkIN_WeightFactor_level0>0.5</LinkIN_WeightFactor_level0>
       <LinkIN_WeightFactor_level1>0.5</LinkIN_WeightFactor_level1>
       <LinkIN_articleSelectionStrategy>firstn</LinkIN_articleSelectionStrategy>
       <LinkIN_crawlingDepth>1</LinkIN_crawlingDepth>
       <LinkIN_maxLinksToFollow>20</LinkIN_maxLinksToFollow>
       <LinkIN_weightingFactor>0.6</LinkIN_weightingFactor>
       <LinkIN_aggregationType>WeightedGeometricAverage</LinkIN_aggregationType>
      </wikiindex.characteristic.TV_LinkIN>
     </LinkType>
     <TermVectorChars>
      <wikiindex.characteristic.TVChar_TermFrequency>
       <TermFrequency_WeightFactor_level0>0.6</TermFrequency_WeightFactor_level0>
       <TermFrequency_WeightFactor_level1>0.5</TermFrequency_WeightFactor_level1>
      </wikiindex.characteristic.TVChar_TermFrequency>
      <wikiindex.characteristic.TVChar_IDFtrainingSet>
       <IDFtrainingSet_WeightFactor_level0>0.4</IDFtrainingSet_WeightFactor_level0>
       <IDFtrainingSet_WeightFactor_level1>0.5</IDFtrainingSet_WeightFactor_level1>
      </wikiindex.characteristic.TVChar_IDFtrainingSet>
     </TermVectorChars>
    </wikiindex.termvector.MultipleWeightSparseTermVectorType>
   </wikiindex.termvector.MultipleWeightSparseTermVectorType>
  </wikiindex.config.EntityClassifierTrainingConfig>
 </wikiindex.config.EntityClassifierTrainingConfig>
 <wikiindex.config.EntityClassifierTestingConfig>
  <wikiindex.config.EntityClassifierTestingConfig>
   <entityNamesAreWikipediaArticleTitles>true</entityNamesAreWikipediaArticleTitles>
   <skipUnresolvedTitles>true</skipUnresolvedTitles>
   <testingNBestToRetainInProtocol>10</testingNBestToRetainInProtocol>
   <wikiindex.termvector.MultipleWeightSparseTermVectorType>
    <wikiindex.termvector.MultipleWeightSparseTermVectorType>
     <wikiindex.characteristic.TV_Char>TF</wikiindex.characteristic.TV_Char>
     <wikiindex.characteristic.TV_Scope>entity</wikiindex.characteristic.TV_Scope>
     <wikiindex.characteristic.TV_UseType>testing</wikiindex.characteristic.TV_UseType>
     <LinkType>
      <wikiindex.characteristic.TV_LinkIN>
       <LinkIN_WeightFactor_level0>0.3</LinkIN_WeightFactor_level0>
       <LinkIN_WeightFactor_level1>0.5</LinkIN_WeightFactor_level1>
       <LinkIN_WeightFactor_level2>0.2</LinkIN_WeightFactor_level2>
       <LinkIN_articleSelectionStrategy>firstn</LinkIN_articleSelectionStrategy>
       <LinkIN_crawlingDepth>2</LinkIN_crawlingDepth>
       <LinkIN_maxLinksToFollow>20</LinkIN_maxLinksToFollow>
       <LinkIN_weightingFactor>1.0</LinkIN_weightingFactor>
       <LinkIN_aggregationType>WeightedGeometricAverage</LinkIN_aggregationType>
      </wikiindex.characteristic.TV_LinkIN>
     </LinkType>
     <TermVectorChars>
      <wikiindex.characteristic.TVChar_TermFrequency>
       <TermFrequency_WeightFactor_level0>1.0</TermFrequency_WeightFactor_level0>
       <TermFrequency_WeightFactor_level1>1.0</TermFrequency_WeightFactor_level1>
       <TermFrequency_WeightFactor_level2>1.0</TermFrequency_WeightFactor_level2>
      </wikiindex.characteristic.TVChar_TermFrequency>
     </TermVectorChars>
    </wikiindex.termvector.MultipleWeightSparseTermVectorType>
   </wikiindex.termvector.MultipleWeightSparseTermVectorType>
   <wikiindex.config.EntityClassifierSearchConfig>
    <wikiindex.config.EntityClassifierSearchConfig>
     <disambiguationCutoff>1</disambiguationCutoff>
     <pathToFileWithLuceneArticleKeys>searchkeys.csv</pathToFileWithLuceneArticleKeys>
     <searchProvider>KeysFromFile</searchProvider>
     <searchType>search</searchType>
    </wikiindex.config.EntityClassifierSearchConfig>
   </wikiindex.config.EntityClassifierSearchConfig>
  </wikiindex.config.EntityClassifierTestingConfig>
 </wikiindex.config.EntityClassifierTestingConfig>
</evaluation.BOAExperimentConfig>

5  SCM Experiment Config Example

Below is a sample configuration for a WordSim353 experiment with SCM using all JWordnetSim measures (JWNL in the config file) and all JWSL measures using the most frequent sense strategy for both libraries.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<evaluation.WordSim353WordNetConfig>
 <additionalDebugFilesDir>debug-details</additionalDebugFilesDir>
 <consoleLoggingEnabled>false</consoleLoggingEnabled>
 <debugLogPath>debug.log</debugLogPath>
 <experimentClass>evaluation.WordSim353WordNet</experimentClass>
 <experimentName>testExperiment</experimentName>
 <groundtruthFile>combined.csv</groundtruthFile>
 <groundtruth_col>2</groundtruth_col>
 <loggerLevel>INFO</loggerLevel>
 <protocolPath>protocolf1.csv</protocolPath>
 <serializationPath>serializedClassifier</serializationPath>
 <test_col>0</test_col>
 <test_from>1</test_from>
 <test_to>353</test_to>
 <train_col>1</train_col>
 <train_from>1</train_from>
 <train_to>353</train_to>
 <trainingFile>combined.csv</trainingFile>
 <testingFile>combined.csv</testingFile>  
 <wikiindex.config.SerializationPolicyEnum>noserialize</wikiindex.config.SerializationPolicyEnum>
 <useJWSL>true</useJWSL>
 <useJWNL>true</useJWNL>
 <similarityFunction>dotProduct</similarityFunction>
 <JWSL_wordnetLucenefolder>wn_index</JWSL_wordnetLucenefolder>
 <JWNL_WordnetICfolder></JWNL_WordnetICfolder>
 <JWNL_infoContentFileName>ic-bnc-resnik-add1.dat</JWNL_infoContentFileName>
 <JWNL_jwnlinitPath>map_propertiesWN20.xml</JWNL_jwnlinitPath>
 <JWSLMeasures>Resnik;Jiang;Lin;Pirro and Seco</JWSLMeasures>
 <JWNLMeasures>shef.nlp.wordnet.similarity.JCn;shef.nlp.wordnet.similarity.Lin</JWNLMeasures>
 <JWSL_senseSelectionStrategy>MostFrequentSense</JWSL_senseSelectionStrategy>
 <JWNL_senseSelectionStrategy>MostFrequentSense</JWNL_senseSelectionStrategy>
</evaluation.WordSim353WordNetConfig>

6  Other Configuration Files

The configuration XML file references four files: training.csv, testing.csv, groundtruth.csv and searchkeys.csv.

There is also a column number associated with each of the first three files, which makes it possible to use a single file and store the information in different columns. In all these files, a semicolon is used to separate columns. The use of the first three files depends on the active ExperimentConfig class, as denoted by the root element of the configuration file. We therefore first describe searchkeys.csv, which is common to all experiment types.

The searchkeys.csv file is used to map a noun phrase to an entity article. Each line corresponds to one mapping: the first entry is the noun phrase and the second entry is the title of the entity article. This file is used for benchmarking to avoid repeated time-intensive disambiguation of the same noun phrase. The use of this file is set up independently for the training and testing phases; one file can also be used for both phases.
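The mapping can be loaded with a standard CSV reader. The following sketch is an illustration rather than the loader used by the implementation; the function name load_search_keys is hypothetical, and it assumes that entries may be enclosed in double quotes, as in the Toy example.

```python
import csv
import io
from typing import Dict


def load_search_keys(text: str) -> Dict[str, str]:
    """Map each noun phrase (first column) to an entity article title (second column)."""
    reader = csv.reader(io.StringIO(text), delimiter=";", quotechar='"')
    return {row[0]: row[1] for row in reader if len(row) >= 2}


toy_keys = '"t5 t6 t8";"a5"\n"class 1";"a1"\n'
print(load_search_keys(toy_keys))
```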

6.1  Classification Task

This section applies to BOAExperimentConfig and SCMExperimentConfig experiment types.

  • training.csv lists target classes. The name of the class is taken from the column at position train_entityname_col and is optionally followed by the name of one entity article in column train_col. If the second column is missing, the name of the class is interpreted as a noun phrase. Only lines whose numbers fall within the range given by the train_from and train_to parameters (zero-based, bounds inclusive) are processed.
  • testing.csv lists unlabeled instances (noun phrases) in column test_col. Only lines whose numbers fall within the range given by the test_from and test_to parameters (zero-based, bounds inclusive) are processed.
  • groundtruth.csv contains, in column groundtruth_col, the name of the correct target class for the unlabeled instance identified by the line number. Only lines whose numbers fall within the range given by the test_from and test_to parameters (zero-based, bounds inclusive) are processed.
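The column and line-range selection described in the list above can be sketched as follows. The helper select_column is hypothetical and splits naively on semicolons without handling quoting; the file contents are those of the Toy example from this section, and the column and range values follow the sample configuration in Sec. 4.

```python
from typing import List, Optional


def select_column(lines: List[str], col: int, first: int, last: int) -> List[Optional[str]]:
    """Return column `col` of the lines numbered first..last (zero-based, bounds inclusive)."""
    values = []
    for number, line in enumerate(lines):
        if first <= number <= last:
            fields = line.split(";")
            # A missing column yields None, e.g. a class given without an entity article.
            values.append(fields[col] if col < len(fields) else None)
    return values


training = ["class 1;a1", "class 2;a3"]
testing = ["t5 t6 t8"]
ground_truth = ["class 1"]

class_names = select_column(training, col=0, first=0, last=1)     # train_entityname_col
class_articles = select_column(training, col=1, first=0, last=1)  # train_col
instances = select_column(testing, col=0, first=0, last=0)        # test_col
expected = select_column(ground_truth, col=0, first=0, last=0)    # groundtruth_col
print(class_names, class_articles, instances, expected)
```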

For the Toy example, these files look as follows. For groundtruth.csv, we assume that class 1 was given as the correct class for the testing entity.

training.csv:
class 1;a1
class 2;a3

testing.csv:
t5 t6 t8

groundtruth.csv:
class 1

searchkeys.csv:
"t5 t6 t8";"a5"
"class 1";"a1"

6.2  Word Similarity Computation Task

This section applies to WordSim353Config and WordSim353WordNetConfig experiment types.

  • training.csv lists the first word in the pair identified by the line number.
  • testing.csv lists the second word in the pair identified by the line number.
  • groundtruth.csv lists the average similarity value for the pair identified by the line number.

Only the lines with numbers falling in range of the test_from and test_to parameters (zero-based, inclusive the bounds) are processed. The train_from and train_to parameters are ignored.

7  Creating the Index

The implementation offers a utility which creates a Lucene “Wikipedia” index from arbitrary data provided by the user. This index can be used in place of the index produced by Lucene-Search Mediawiki Extension from Wikipedia dumps.

The Index utility is executed by running the WikiIndex.jar program with one parameter – the path to the root directory with the index files.

 java -jar WikiIndex.jar data/microtest/

The input data for the utility need to be placed in the docs and linkdocs subdirectories of the root directory. For the Toy example, this structure is as follows:

 root (dir)
  - wiki  (dir)
    - docs (dir)
      - a1.1.cat (file)
      - a2.2.cat (file)
      - a3.3.cat (file)
      - a4.4.cat (file)
      - a5.5.cat (file)
      - a6.6.cat (file)
    - linkdocs (dir)
      - a1.1 (file)
      - a2.2 (file)
      - a3.3 (file)
      - a4.4 (file)
      - a5.5 (file)
      - a6.6 (file)      

The tool supports three modalities: in-link, out-link and same category.

The docs directory contains files named according to the following mask: article name.id.category name*. Each file (article) must have at least one category; multiple categories are also allowed. The content of the files corresponds to the content of the articles. For example, the content of the file a1.1.cat:

t1 t2 t2 

The files in the linkdocs directory follow the mask articlename.id and contain the links that lead from the article identified by the filename. The target articles are listed one per line and are identified by the article title.

For example, the content of the file a1.1:

a2
a3

The output of the program consists of the wiki and wiki.links subdirectories, created in the root directory, which contain the Main and Links Lucene indexes.

