edu.udo.cs.wvtool.crawler
Class WVToolCrawler

java.lang.Object
  extended by websphinx.Crawler
      extended by edu.udo.cs.wvtool.crawler.WVToolCrawler
All Implemented Interfaces:
java.io.Serializable, java.lang.Runnable

public abstract class WVToolCrawler
extends websphinx.Crawler

An abstract base class that must be extended by all specialized crawlers used to construct a crawled input list.
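
The subclassing pattern can be sketched as follows. This is a minimal, self-contained illustration, not the WVTool implementation: PageStub and CrawlerSketch are hypothetical stand-ins for websphinx.Page and WVToolCrawler, HtmlOnlyCrawler is an invented subclass, and the assumption that visit records a page only when vectorizePage returns true is inferred from the method signatures rather than confirmed by this page. The key and value types of the map are likewise assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for websphinx.Page, holding only what this sketch needs.
class PageStub {
    final String url;
    final String contentType;

    PageStub(String url, String contentType) {
        this.url = url;
        this.contentType = contentType;
    }
}

// Hypothetical stand-in for WVToolCrawler, reduced to the members listed on this page.
abstract class CrawlerSketch {
    public String contentEncoding;
    public String contentLanguage;
    public String contentType;

    // ASSUMPTION: keyed by URL string; the real map's key and value types are not documented here.
    private final Map<String, String> urlsToVectorize = new LinkedHashMap<>();

    CrawlerSketch(String encoding, String language, String type) {
        this.contentEncoding = encoding;
        this.contentLanguage = language;
        this.contentType = type;
    }

    // ASSUMPTION: a visited page is recorded only if the subclass accepts it.
    public void visit(PageStub page) {
        if (vectorizePage(page)) {
            urlsToVectorize.put(page.url, page.contentType);
        }
    }

    // Subclasses decide which pages belong in the input list.
    protected abstract boolean vectorizePage(PageStub page);

    public final Map<String, String> getURLS() {
        return urlsToVectorize;
    }
}

// Invented example subclass: accepts only pages whose type matches the configured one.
class HtmlOnlyCrawler extends CrawlerSketch {
    HtmlOnlyCrawler() {
        super("UTF-8", "english", "text/html");
    }

    @Override
    protected boolean vectorizePage(PageStub page) {
        return contentType.equals(page.contentType);
    }
}
```

In this sketch, calling visit on an HTML page and then on a PDF page leaves a single entry in the map returned by getURLS. Against the real library, a subclass would extend WVToolCrawler directly and receive websphinx.Page objects from the crawl.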

Author:
Michael Wurst
See Also:
Serialized Form

Field Summary
 java.lang.String contentEncoding
          the encoding of the crawled documents
 java.lang.String contentLanguage
          the language the documents are written in (english, german, ...)
 java.lang.String contentType
          the MIME content type of the crawled documents
private  java.util.Map urlsToVectorize
           
 
Fields inherited from class websphinx.Crawler
ALL_LINKS, HYPERLINKS, HYPERLINKS_AND_IMAGES, SERVER, SUBTREE, WEB
 
Constructor Summary
WVToolCrawler(java.lang.String encoding, java.lang.String language, java.lang.String type)
           
 
Method Summary
 java.util.Map getURLS()
           
protected abstract  boolean vectorizePage(websphinx.Page page)
           
 void visit(websphinx.Page page)
           
 
Methods inherited from class websphinx.Crawler
addClassifier, addCrawlListener, addLinkListener, addRoot, clear, clearVisited, enumerateClassifiers, enumerateQueue, expand, getAction, getActiveThreads, getClassifiers, getCrawledRoots, getDepthFirst, getDomain, getDownloadParameters, getIgnoreVisitedLinks, getLinkPredicate, getLinksTested, getLinkType, getMaxDepth, getName, getPagePredicate, getPagesLeft, getPagesVisited, getRootHrefs, getRoots, getState, getSynchronous, main, markVisited, pause, removeAllClassifiers, removeClassifier, removeCrawlListener, removeLinkListener, run, sendCrawlEvent, sendLinkEvent, sendLinkEvent, setAction, setDepthFirst, setDomain, setDownloadParameters, setIgnoreVisitedLinks, setLinkPredicate, setLinkType, setMaxDepth, setName, setPagePredicate, setRoot, setRootHrefs, setRoots, setSynchronous, shouldVisit, stop, submit, submit, toString, visited
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

urlsToVectorize

private final java.util.Map urlsToVectorize

contentType

public java.lang.String contentType
the MIME content type of the crawled documents


contentEncoding

public java.lang.String contentEncoding
the encoding of the crawled documents


contentLanguage

public java.lang.String contentLanguage
the language the documents are written in (english, german, ...)

Constructor Detail

WVToolCrawler

public WVToolCrawler(java.lang.String encoding,
                     java.lang.String language,
                     java.lang.String type)

Method Detail

visit

public void visit(websphinx.Page page)
Overrides:
visit in class websphinx.Crawler

vectorizePage

protected abstract boolean vectorizePage(websphinx.Page page)

getURLS

public final java.util.Map getURLS()
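
Because getURLS() is declared with the raw type java.util.Map, callers cannot rely on particular key or value types from this page alone. A cautious way to inspect the result is to iterate it through wildcard types, as in the sketch below. UrlMapReader and describe are hypothetical helper names introduced for illustration; they are not part of WVTool.

```java
import java.util.Map;

// Hypothetical helper, not part of WVTool.
class UrlMapReader {
    // Renders each entry as "key -> value" without assuming the key or value types.
    static String describe(Map<?, ?> urls) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<?, ?> entry : urls.entrySet()) {
            sb.append(entry.getKey())
              .append(" -> ")
              .append(entry.getValue())
              .append('\n');
        }
        return sb.toString();
    }
}
```

A raw Map returned by a crawler can be passed to describe directly, since any Map is assignable to Map<?, ?>.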