Targeted Hypernym Discovery (THD) - mining hypernyms from Wikipedia

For a query entity (a noun phrase), the Targeted Hypernym Discovery (THD) algorithm uses lexico-syntactic patterns to extract a hypernym from the Wikipedia article that defines the noun phrase. This hypernym can be used within the SCM classifier to map the noun phrase to a WordNet synset, or it can be taken as the classification result itself, which yields an unsupervised classification system.
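
The idea of a lexico-syntactic pattern can be illustrated with a minimal Python sketch. It is not the hearst.jape grammar, which operates on GATE annotations rather than raw strings; the regular expressions, the boundary word list, and the function name extract_hypernym are all assumptions made for illustration.

```python
import re

# Hypothetical, simplified "X is a/an/the Y" pattern. The real THD grammar
# (hearst.jape) matches over GATE annotations, not over plain text.
ISA_PATTERN = re.compile(
    r"\b(?:is|was|are|were)\s+(?:a|an|the)\s+(?P<np>[^.,;()]+)",
    re.IGNORECASE,
)

# Assumed words that end the noun phrase in this toy stand-in for a chunker.
NP_BOUNDARY = re.compile(
    r"\b(?:of|in|that|which|who|from|for|with|by|on|at|and|or)\b",
    re.IGNORECASE,
)


def extract_hypernym(sentence):
    """Return the full and one-word hypernym found in a defining sentence."""
    match = ISA_PATTERN.search(sentence)
    if match is None:
        return None
    # Cut the candidate phrase at the first boundary word.
    phrase = NP_BOUNDARY.split(match.group("np"), maxsplit=1)[0].strip()
    if not phrase:
        return None
    # Treat the last word of the phrase as the head noun (one-word hypernym).
    return {"full": phrase, "head": phrase.split()[-1]}


if __name__ == "__main__":
    print(extract_hypernym("The kiwi is a flightless bird of New Zealand."))
```

Taking the last word as the head is a rough heuristic for English noun phrases; in THD itself the noun phrase boundaries come from the noun phrase chunker.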

Demo

The Workflow and components

Running THD within GATE requires two pipelines: the corpus acquisition pipeline and the corpus annotation pipeline with Wikipedia articles (see Figure).

  • The corpus acquisition pipeline contains only the WikipediaPR, which populates the corpus for a given hypernym query.
  • The corpus annotation pipeline relies mostly on existing GATE modules for text preprocessing. The only contributed PR in this pipeline is the QueryHighlightPR, which highlights the hypernym query in the text. Noun phrases are identified using the Ramshaw and Marcus noun phrase chunker available in GATE. The other modules come from ANNIE, the GATE reference information extraction system. The JAPE transducer needs to be configured to use the provided hearst.jape grammar.
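
Conceptually, highlighting the query means marking every occurrence of the query string in the article text. The following Python sketch shows this idea only; it is not the QueryHighlightPR implementation, and the function name highlight_query and the case-insensitive whole-word matching are assumptions.

```python
import re


def highlight_query(text, query):
    """Return (start, end) character offsets of each occurrence of query."""
    # Assumption: match case-insensitively and only on whole words.
    pattern = re.compile(
        r"(?<!\w)" + re.escape(query) + r"(?!\w)",
        re.IGNORECASE,
    )
    return [(m.start(), m.end()) for m in pattern.finditer(text)]


if __name__ == "__main__":
    article = "The kiwi is a bird. Kiwi eggs are large."
    for start, end in highlight_query(article, "kiwi"):
        print(start, end, article[start:end])
```

Offsets of this kind correspond to how GATE annotations mark a span of the document, which lets later stages such as the JAPE transducer refer to the highlighted query.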

Download

The application consists of two GATE modules and a JAPE grammar:

  • WikipediaPR
  • QueryHighlightPR
  • JAPE grammar

Compatibility (tested): GATE 4, GATE 6, GATE 7

Installation and usage instructions:

  1. Unpack WikipediaPR.zip and QueryHighlightPR.zip to the GATE plugins folder.
  2. Load the PRs by checking Load Now or Load Always (recommended) in GATE: File -> Manage CREOLE Plugins
  3. Make sure the following standard GATE PRs are loaded: Tagger_NP_Chunking, ANNIE
  4. Create the corpus acquisition pipeline (screenshot) and the corpus annotation pipeline (screenshot) according to the figure above
  5. The JAPE transducer outputs a one-word hypernym, which is marked with the "isa" annotation (screenshot).
  6. The NounChunk annotation embedding the "isa" annotation provides a more precise "full hypernym".
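
The relation between the two annotations in steps 5 and 6 can be sketched in Python as follows. This is an illustration with plain character offsets, not GATE API code; the Span type and the function full_hypernym are hypothetical names.

```python
from typing import List, NamedTuple


class Span(NamedTuple):
    """A simplified annotation: character offsets plus an annotation type."""
    start: int
    end: int
    kind: str


def full_hypernym(text: str, isa: Span, chunks: List[Span]) -> str:
    """Return the text of the NounChunk that encloses the "isa" annotation."""
    for chunk in chunks:
        if chunk.start <= isa.start and isa.end <= chunk.end:
            return text[chunk.start:chunk.end]
    # No enclosing chunk found: fall back to the one-word hypernym.
    return text[isa.start:isa.end]


if __name__ == "__main__":
    sentence = "The kiwi is a flightless bird."
    isa = Span(25, 29, "isa")                # covers "bird"
    chunks = [
        Span(0, 8, "NounChunk"),             # covers "The kiwi"
        Span(12, 29, "NounChunk"),           # covers "a flightless bird"
    ]
    print(full_hypernym(sentence, isa, chunks))
```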