Information extraction and wrapper induction

Description:

Information extraction:
The identification and extraction of instances of a particular class of events or relationships in a natural language text, and their transformation into a structured representation (e.g. a database). (after Grishman 1997, Eikvil 1999)
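
As a rough illustration of this definition, the sketch below turns sentences describing one class of events into structured records. The event class (company acquisitions), the `Acquisition` record, the sample sentence and the hand-written pattern are all assumptions made for this example; real systems typically learn such patterns instead of hard-coding them.

```python
import re
from dataclasses import dataclass


@dataclass
class Acquisition:
    """Structured representation of one extracted event (hypothetical schema)."""
    buyer: str
    target: str
    year: int


# Hand-written pattern for a single event class; an assumption for this sketch.
PATTERN = re.compile(
    r"(?P<buyer>[A-Z]\w+) acquired (?P<target>[A-Z]\w+) in (?P<year>\d{4})"
)


def extract_acquisitions(text):
    """Return one Acquisition record per pattern match in the text."""
    return [
        Acquisition(m["buyer"], m["target"], int(m["year"]))
        for m in PATTERN.finditer(text)
    ]


if __name__ == "__main__":
    sample = (
        "Analysts were surprised when Acme acquired Globex in 1999. "
        "Later, Initech acquired Hooli in 2004, according to reports."
    )
    for record in extract_acquisitions(sample):
        print(record)
```

Each record could then be written as a row of a database table, which is the "structured representation" the definition refers to.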

Wrapper Induction:

  • Automatic generation of wrappers from a few (annotated) sample pages
  • Assumptions:
    Regularity in the presentation of information, which often consists of machine-generated answers to queries:
    • same header
    • same tail
    • in between, a table/list of items that constitute the answer to the query
  • Learn the delimiters between items of interest
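
The idea of learning delimiters from a few annotated sample pages can be sketched as follows. This is a simplified illustration loosely in the spirit of left/right-delimiter wrappers, not the algorithm from the publications listed here: it takes the longest common suffix of the text preceding each annotated item as the left delimiter and the longest common prefix of the text following it as the right delimiter. The sample pages, annotations and function names are assumptions made for this example.

```python
import os


def common_suffix(strings):
    """Longest string that every input ends with."""
    reversed_prefix = os.path.commonprefix([s[::-1] for s in strings])
    return reversed_prefix[::-1]


def learn_delimiters(pages, annotations):
    """Learn a (left, right) delimiter pair from annotated sample pages.

    pages: list of page strings.
    annotations: for each page, the items of interest in page order.
    """
    befores, afters = [], []
    for page, items in zip(pages, annotations):
        pos = 0
        for item in items:
            start = page.index(item, pos)  # raises ValueError if not found
            end = start + len(item)
            befores.append(page[:start])
            afters.append(page[end:])
            pos = end
    return common_suffix(befores), os.path.commonprefix(afters)


def extract(page, left, right):
    """Apply a learned wrapper: return every span between left and right."""
    if not left or not right:
        raise ValueError("delimiters must be non-empty")
    items, pos = [], 0
    while True:
        start = page.find(left, pos)
        if start == -1:
            break
        start += len(left)
        end = page.find(right, start)
        if end == -1:
            break
        items.append(page[start:end])
        pos = end + len(right)
    return items


# Two annotated sample pages sharing the same header and tail (made-up data).
SAMPLE_PAGES = [
    "<HTML><BODY><P>Country codes</P><B>Congo</B> <I>242</I><BR>"
    "<B>Spain</B> <I>34</I><BR><HR>End</BODY></HTML>",
    "<HTML><BODY><P>Country codes</P><B>Egypt</B> <I>20</I><BR>"
    "<HR>End</BODY></HTML>",
]
COUNTRY_ANNOTATIONS = [["Congo", "Spain"], ["Egypt"]]

if __name__ == "__main__":
    left, right = learn_delimiters(SAMPLE_PAGES, COUNTRY_ANNOTATIONS)
    unseen = (
        "<HTML><BODY><P>Country codes</P><B>Belize</B> <I>501</I><BR>"
        "<B>Norway</B> <I>47</I><BR><HR>End</BODY></HTML>"
    )
    print(repr(left), repr(right))
    print(extract(unseen, left, right))
```

With so few samples the learned delimiters can overfit: if every annotated code on the sample pages happened to be followed by the same characters, those characters would become part of the right delimiter. This is one reason the choice of sample pages matters.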

Publications:

  • Califf/Mooney/2003a: Bottom-up relational learning of pattern matching rules for information extraction
  • Kushmerick/etal/97a: Wrapper induction for information extraction