edu.udo.cs.wvtool.wordlist
Class WVTWordList

java.lang.Object
  extended by edu.udo.cs.wvtool.wordlist.WVTWordList

public class WVTWordList
extends java.lang.Object

This class represents a word list. It is used to store information about individual words, to count words and to calculate the vectors.

Version:
$Id: WVTWordList.java,v 1.3 2006/06/16 14:59:43 mjwurst Exp $
Author:
Michael Wurst

Field Summary
private  boolean appendWords
          indicates whether missing words should be added to the list
private  int numClasses
          the number of possible class values
private  int numDocuments
          the number of documents processed so far
private  int numLocalTerms
          the number of terms processed in the current document so far
private  boolean updateOnlyCurrent
          indicates whether the document and class frequencies should be updated as well, or only the frequencies for the current document
private  java.util.List wordList
          A sequential indexing structure that ensures a fixed order of all words in the list
private  java.util.Map wordMap
          A hash map used to find words efficiently
 
Constructor Summary
WVTWordList(int numClasses)
          Create a new instance of WVTWordList.
WVTWordList(java.util.List words, int numClasses)
           
WVTWordList(java.io.Reader in)
          Create a new instance of WVTWordList by reading it from a stream.
 
Method Summary
 void addWordOccurance(java.lang.String word)
          Count the occurrence of the given word.
 void closeDocument(WVTDocumentInfo d)
          Used to reset the calculation for individual documents after the given document has been processed.
 int[] getClassFrequencies(int classValue)
          Get the document frequencies of documents having a given class value.
 int[] getDocumentFrequencies()
          Get the document frequencies.
 int[] getFrequenciesForCurrentDocument()
          Get the word frequencies for the document that is currently processed.
 int getFrequencyByRank(int p)
          Returns the document frequency of the word that is on the p-th rank, assuming that each word occupies exactly one rank.
 int getNumDocuments()
          Returns the number of documents processed so far.
 int getNumWords()
          Return the number of words in the list.
 int getTermCountForCurrentDocument()
           
 java.lang.String getWord(int index)
          Returns the word with the given index.
 boolean isAppendWords()
          Returns whether missing words are added to the list.
 boolean isUpdateOnlyCurrent()
          Returns whether only the frequencies for the current document are updated.
 void pruneByFrequency(int min, int max)
          Prune the word list by document frequencies.
 void setAppendWords(boolean appendWords)
          Sets whether missing words should be added to the list.
 void setUpdateOnlyCurrent(boolean updateOnlyCurrent)
          Sets whether only the frequencies for the current document should be updated.
 void store(java.io.Writer out)
          Write the wordlist to a stream.
 void storePlain(java.io.Writer out)
          Write the wordlist to a stream without any additional info.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

wordMap

private java.util.Map wordMap
A hash map used to find words efficiently


wordList

private java.util.List wordList
A sequential indexing structure that ensures a fixed order of all words in the list


numClasses

private int numClasses
the number of possible class values


appendWords

private boolean appendWords
indicates whether missing words should be added to the list


updateOnlyCurrent

private boolean updateOnlyCurrent
indicates whether the document and class frequencies should be updated as well, or only the frequencies for the current document


numDocuments

private int numDocuments
the number of documents processed so far


numLocalTerms

private int numLocalTerms
the number of terms processed in the current document so far
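
The roles of wordMap, wordList, appendWords and numLocalTerms can be illustrated with a minimal sketch. WordIndexSketch and all of its members are hypothetical stand-ins, not the WVTool implementation; in particular, ignoring unknown words (and not counting them as terms) when appendWords is false is an assumption.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in that mirrors the roles of the fields described above.
// It is not the WVTool implementation.
final class WordIndexSketch {
    private final Map<String, Integer> wordMap = new HashMap<>(); // word -> position, for fast lookup
    private final List<String> wordList = new ArrayList<>();      // position -> word, fixes the order
    private final List<Integer> localCounts = new ArrayList<>();  // counts for the current document
    private boolean appendWords = true;
    private int numLocalTerms = 0;

    void setAppendWords(boolean appendWords) {
        this.appendWords = appendWords;
    }

    // Count one occurrence of the word in the current document.
    void addWordOccurrence(String word) {
        Integer index = wordMap.get(word);
        if (index == null) {
            if (!appendWords) {
                return; // assumption: unknown words are skipped in this mode
            }
            index = wordList.size();
            wordMap.put(word, index);
            wordList.add(word);
            localCounts.add(0);
        }
        localCounts.set(index, localCounts.get(index) + 1);
        numLocalTerms++;
    }

    String getWord(int index) {
        return wordList.get(index);
    }

    int getNumWords() {
        return wordList.size();
    }

    int getLocalCount(String word) {
        Integer index = wordMap.get(word);
        return index == null ? 0 : localCounts.get(index);
    }

    int getNumLocalTerms() {
        return numLocalTerms;
    }
}
```

Keeping both structures means a word can be found in constant expected time through the map, while the list guarantees that every word keeps the same position in any frequency array derived from it.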

Constructor Detail

WVTWordList

public WVTWordList(int numClasses)
Create a new instance of WVTWordList.

Parameters:
numClasses - the number of possible class values


WVTWordList

public WVTWordList(java.util.List words,
                   int numClasses)

WVTWordList

public WVTWordList(java.io.Reader in)
Create a new instance of WVTWordList by reading it from a stream.

Parameters:
in - the stream from which to read the information

Method Detail

addWordOccurance

public void addWordOccurance(java.lang.String word)
Count the occurrence of the given word.

Parameters:
word - the word


closeDocument

public void closeDocument(WVTDocumentInfo d)
Used to reset the calculation for individual documents after the given document has been processed.

Parameters:
d - information about the document
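
The per-document cycle implied by addWordOccurance, getFrequenciesForCurrentDocument and closeDocument might look like the following sketch. DocumentCycleSketch is a hypothetical class over a fixed vocabulary; it omits the WVTDocumentInfo argument and the class frequencies, and it assumes that closing a document increments the document frequency of every word seen in it unless updateOnlyCurrent is set. None of this is confirmed by the WVTool source.

```java
import java.util.Arrays;

// Hypothetical sketch of the per-document cycle; not the WVTool implementation.
final class DocumentCycleSketch {
    private final String[] words;
    private final int[] localFrequencies;    // counts within the current document
    private final int[] documentFrequencies; // number of documents containing each word
    private int numDocuments = 0;
    private boolean updateOnlyCurrent = false;

    DocumentCycleSketch(String... words) {
        this.words = words.clone();
        this.localFrequencies = new int[words.length];
        this.documentFrequencies = new int[words.length];
    }

    void setUpdateOnlyCurrent(boolean updateOnlyCurrent) {
        this.updateOnlyCurrent = updateOnlyCurrent;
    }

    // Linear search keeps the sketch short; a real list would use a map.
    void addWordOccurrence(String word) {
        for (int i = 0; i < words.length; i++) {
            if (words[i].equals(word)) {
                localFrequencies[i]++;
                return;
            }
        }
    }

    int[] getFrequenciesForCurrentDocument() {
        return localFrequencies.clone();
    }

    int[] getDocumentFrequencies() {
        return documentFrequencies.clone();
    }

    int getNumDocuments() {
        return numDocuments;
    }

    // Assumed behaviour: fold the finished document into the global
    // statistics (unless updateOnlyCurrent is set), then reset the counts.
    void closeDocument() {
        if (!updateOnlyCurrent) {
            for (int i = 0; i < words.length; i++) {
                if (localFrequencies[i] > 0) {
                    documentFrequencies[i]++;
                }
            }
            numDocuments++;
        }
        Arrays.fill(localFrequencies, 0);
    }
}
```

In this reading, updateOnlyCurrent would be set when an already built word list is applied to new documents, so that the stored document frequencies stay untouched.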


getFrequenciesForCurrentDocument

public int[] getFrequenciesForCurrentDocument()
Get the word frequencies for the document that is currently processed.

Returns:
an array containing the word frequencies


getTermCountForCurrentDocument

public int getTermCountForCurrentDocument()

getDocumentFrequencies

public int[] getDocumentFrequencies()
Get the document frequencies.

Returns:
an array containing the document frequencies


getClassFrequencies

public int[] getClassFrequencies(int classValue)
Get the document frequencies of documents having a given class value.

Parameters:
classValue - the class value
Returns:
an array containing the document frequencies for the given class


store

public void store(java.io.Writer out)
Write the wordlist to a stream.

Parameters:
out - the stream to which to write the word list


storePlain

public void storePlain(java.io.Writer out)
Write the wordlist to a stream without any additional info.

Parameters:
out - the stream to which to write the word list


isAppendWords

public boolean isAppendWords()
Returns whether missing words are added to the list.

Returns:
true if missing words are added to the list


isUpdateOnlyCurrent

public boolean isUpdateOnlyCurrent()
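Returns whether only the frequencies for the current document are updated.

Returns:
true if only the frequencies for the current document are updated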


setAppendWords

public void setAppendWords(boolean appendWords)
Sets whether missing words should be added to the list.

Parameters:
appendWords - true to add missing words to the list


setUpdateOnlyCurrent

public void setUpdateOnlyCurrent(boolean updateOnlyCurrent)
Sets whether only the frequencies for the current document should be updated.

Parameters:
updateOnlyCurrent - true to update only the frequencies for the current document


getNumDocuments

public int getNumDocuments()
Returns the number of documents processed so far.

Returns:
the number of documents


getNumWords

public int getNumWords()
Return the number of words in the list.

Returns:
the number of words


pruneByFrequency

public void pruneByFrequency(int min,
                             int max)
Prune the word list by document frequencies.

Parameters:
min - minimum document frequency; all words with a lower frequency will be deleted
max - maximum document frequency; all words with a higher frequency will be deleted
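
The pruning rule described by the two parameters can be sketched as a filter that keeps the words whose document frequency lies in the inclusive range from min to max. PruneSketch is a hypothetical helper, not part of WVTool; unlike the documented method, which modifies the word list itself, it returns a new list.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper illustrating the pruning rule; not the WVTool implementation.
final class PruneSketch {
    // Keep every word whose document frequency df satisfies min <= df <= max.
    // documentFrequencies[i] is the document frequency of words.get(i).
    static List<String> pruneByFrequency(List<String> words, int[] documentFrequencies,
                                         int min, int max) {
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < words.size(); i++) {
            int df = documentFrequencies[i];
            if (df >= min && df <= max) {
                kept.add(words.get(i)); // the original order is preserved
            }
        }
        return kept;
    }
}
```

For example, with the words "the", "cat" and "zyx" and the document frequencies 10, 3 and 1, calling the helper with min = 2 and max = 5 keeps only "cat".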


getFrequencyByRank

public int getFrequencyByRank(int p)
Returns the document frequency of the word that is on the p-th rank, assuming that each word occupies exactly one rank. Ranks start at 1.

Parameters:
p - the rank of the word starting with 1 for the first rank
Returns:
the frequency of the word on the p-th rank.
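
A rank lookup of this kind can be sketched by sorting a copy of the document frequencies. RankSketch is a hypothetical helper, not part of WVTool. It assumes that rank 1 belongs to the word with the highest document frequency, that words with equal frequencies occupy consecutive ranks, and that a rank outside the list is rejected with an exception; the documentation above does not state any of these.

```java
import java.util.Arrays;

// Hypothetical helper illustrating a 1-based rank lookup; not the WVTool implementation.
final class RankSketch {
    // Returns the document frequency at rank p, where rank 1 is the highest
    // frequency (assumption) and every word takes exactly one rank.
    static int getFrequencyByRank(int[] documentFrequencies, int p) {
        if (p < 1 || p > documentFrequencies.length) {
            throw new IllegalArgumentException("rank out of range: " + p);
        }
        int[] sorted = documentFrequencies.clone();
        Arrays.sort(sorted);              // ascending order
        return sorted[sorted.length - p]; // rank 1 is the last element
    }
}
```

With the document frequencies 5, 3, 5 and 1, ranks 1 and 2 both yield 5, rank 3 yields 3 and rank 4 yields 1, because the two words that share the frequency 5 each occupy their own rank.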


getWord

public java.lang.String getWord(int index)
Returns the word with the given index.

Parameters:
index - the index of the word