edu.udo.cs.yale.example
Class FileDataRowReader

java.lang.Object
  extended by edu.udo.cs.yale.example.AbstractDataRowReader
      extended by edu.udo.cs.yale.example.FileDataRowReader
All Implemented Interfaces:
DataRowReader, java.util.Iterator<DataRow>

public class FileDataRowReader
extends AbstractDataRowReader

FileDataRowReader implements a DataRowReader that reads DataRows from files. It is the main data reader for many file formats (including CSV) and is used by the ExampleSource operator and the attribute editor.

This class supports reading data from multiple source files. Each attribute (including special attributes like labels, weights, ...) may be read from a different file. Please note that reading stops at the end of the shortest file, i.e. if one of the data source files has fewer lines than the others, only that number of data rows will be read.
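The shortest-file rule described above can be illustrated with a minimal sketch. It is not the actual implementation of this class: the class name LockstepReadSketch and the method readRows are hypothetical, and in-memory StringReaders stand in for the source files.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not part of Yale: reads several sources in lockstep.
public class LockstepReadSketch {

    /** Reads one line from every reader per row and stops when any reader is exhausted. */
    static List<String[]> readRows(BufferedReader[] readers) {
        List<String[]> rows = new ArrayList<>();
        try {
            while (true) {
                String[] row = new String[readers.length];
                for (int i = 0; i < readers.length; i++) {
                    row[i] = readers[i].readLine();
                    if (row[i] == null) {
                        // The shortest source has ended: drop the partial row and stop.
                        return rows;
                    }
                }
                rows.add(row);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        BufferedReader[] readers = {
            new BufferedReader(new StringReader("1,2\n3,4\n5,6\n")), // three lines
            new BufferedReader(new StringReader("yes\nno\n"))        // two lines
        };
        System.out.println(readRows(readers).size()); // prints 2
    }
}
```

The third line of the first source is never turned into a data row because the second source has already ended.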

The split points can be defined with regular expressions (please refer to the Java API for the syntax). Quoting is supported but not recommended because it slows down reading. The user should ensure that the split characters do not occur inside the data columns. Please refer to YaleLineReader for further information.
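As a rough illustration of regular-expression split points, the following sketch splits a line with java.util.regex.Pattern. It does not reproduce YaleLineReader (no quoting, no comment handling); the class name SplitSketch and the method splitLine are hypothetical.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical sketch, not YaleLineReader: splits a line at regex-defined separators.
public class SplitSketch {

    /** Splits a line into columns; the limit -1 keeps trailing empty columns. */
    static String[] splitLine(Pattern separators, String line) {
        return separators.split(line, -1);
    }

    public static void main(String[] args) {
        // Comma or semicolon, each optionally surrounded by whitespace.
        Pattern separators = Pattern.compile("\\s*[,;]\\s*");
        System.out.println(Arrays.toString(splitLine(separators, "1.5, 2.0;abc")));
        // prints [1.5, 2.0, abc]
    }
}
```

Compiling the pattern once and reusing it for every line avoids recompiling the expression per row. A separator character that also occurs inside a value would split that value, which is why the data columns should not contain split characters.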

Unknown attribute values can be marked with empty strings or "?".
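A minimal sketch of how such markers might be interpreted for a numerical column follows. Representing an unknown value as Double.NaN is an assumption made for this example, and the names MissingValueSketch and parseNumeric are hypothetical.

```java
// Hypothetical sketch: maps the unknown-value markers to NaN (an assumed representation).
public class MissingValueSketch {

    /** Returns NaN for an empty token or "?", otherwise the parsed number. */
    static double parseNumeric(String token) {
        String trimmed = token.trim();
        if (trimmed.isEmpty() || trimmed.equals("?")) {
            return Double.NaN;
        }
        return Double.parseDouble(trimmed);
    }

    public static void main(String[] args) {
        System.out.println(parseNumeric("2.5")); // prints 2.5
        System.out.println(parseNumeric("?"));   // prints NaN
        System.out.println(parseNumeric(""));    // prints NaN
    }
}
```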

Version:
$Id: FileDataRowReader.java,v 2.21 2006/03/27 13:21:58 ingomierswa Exp $
Author:
Ingo Mierswa

Field Summary
private  Attribute[] attributes
          The attribute descriptions.
private static int COLUMN_NR
           
private  java.lang.String[][] currentData
          This array holds the current data.
private  int[][] dataSourceIndex
          Array of size [number of attributes][2].
private  boolean eof
          Remember if an end of file has occurred.
private  int[] expectedNumberOfColumns
          This array holds the information how many columns each data source should provide.
private static int FILE_NR
           
private  java.io.BufferedReader[] fileReader
          The file readers.
private  boolean lineRead
          Remember if a line has already been read.
private  int linesRead
          The number of lines read so far (i.e. the number of examples).
private  int maxNumber
          The maximum number of examples to read (sampling).
private  RandomGenerator random
          The random generator used for sampling.
private  double sampleRatio
          The sample ratio.
private  YaleLineReader yaleLineReader
          This reader maps lines read from a file to Yale columns.
 
Constructor Summary
FileDataRowReader(DataRowFactory factory, java.util.List<AttributeDataSource> attributeDataSources, double sampleRatio, int sampleSize, java.lang.String separatorsRegExpr, char[] commentChars, boolean useQuotes, char decimalPointCharacter, RandomGenerator random)
          Constructs a new FileDataRowReader.
FileDataRowReader(DataRowFactory factory, java.util.List<AttributeDataSource> attributeDataSources, double sampleRatio, int sampleSize, java.lang.String separatorsRegExpr, char[] commentChars, boolean useQuotes, RandomGenerator random)
          Constructs a new FileDataRowReader.
 
Method Summary
 boolean hasNext()
          Checks if another line exists and reads it.
private  void initReader(DataRowFactory factory, java.util.List<AttributeDataSource> attributeDataSources, int sampleSize, java.lang.String separatorsRegExpr, char[] commentChars, boolean useQuotes, char decimalPointCharacter)
          Read the complete data.
 DataRow next()
          Returns the next DataRow.
private  boolean readLine()
          Reads a line of data from all file readers.
 void skipLine()
          Skips the next line, if present.
 
Methods inherited from class edu.udo.cs.yale.example.AbstractDataRowReader
getFactory, remove
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

FILE_NR

private static final int FILE_NR
See Also:
Constant Field Values

COLUMN_NR

private static final int COLUMN_NR
See Also:
Constant Field Values

fileReader

private java.io.BufferedReader[] fileReader
The file readers.


attributes

private Attribute[] attributes
The attribute descriptions.


eof

private boolean eof
Remember if an end of file has occurred.


lineRead

private boolean lineRead
Remember if a line has already been read.


sampleRatio

private double sampleRatio
The sample ratio.


maxNumber

private int maxNumber
The maximum number of examples to read (sampling).


linesRead

private int linesRead
The number of lines read so far (i.e. the number of examples).


currentData

private java.lang.String[][] currentData
This array holds the current data. The first dimension distinguishes the different sources and the second holds the data read from the corresponding source.


expectedNumberOfColumns

private int[] expectedNumberOfColumns
This array holds the number of columns each data source is expected to provide. If a source provides a different number of columns, an IOException will be thrown. This information is only used for checks and for better error messages.


yaleLineReader

private YaleLineReader yaleLineReader
This reader maps lines read from a file to Yale columns.


random

private RandomGenerator random
The random generator used for sampling.


dataSourceIndex

private int[][] dataSourceIndex
Array of size [number of attributes][2]. For each attribute i the value of dataSourceIndex[i][FILE_NR] is used as an index into fileReader and the value of dataSourceIndex[i][COLUMN_NR] specifies the index of the column to use for attribute i.
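The two-level lookup can be sketched as follows. The constant values 0 and 1 are assumptions (this page does not state them), the column constant is taken to be COLUMN_NR from the field summary, and the names DataSourceIndexSketch and lookup are hypothetical.

```java
// Hypothetical sketch of the attribute -> (file, column) lookup.
public class DataSourceIndexSketch {
    // Assumed values; the documentation does not list the actual constants.
    private static final int FILE_NR = 0;
    private static final int COLUMN_NR = 1;

    /** Returns the raw string value of the given attribute in the current row. */
    static String lookup(int attribute, int[][] dataSourceIndex, String[][] currentData) {
        int file = dataSourceIndex[attribute][FILE_NR];
        int column = dataSourceIndex[attribute][COLUMN_NR];
        return currentData[file][column];
    }

    public static void main(String[] args) {
        // Current line of source 0 has two columns, source 1 has one column.
        String[][] currentData = { {"5.1", "3.5"}, {"setosa"} };
        // Attributes 0 and 1 come from source 0, attribute 2 (the label) from source 1.
        int[][] dataSourceIndex = { {0, 0}, {0, 1}, {1, 0} };
        System.out.println(lookup(2, dataSourceIndex, currentData)); // prints setosa
    }
}
```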

Constructor Detail

FileDataRowReader

public FileDataRowReader(DataRowFactory factory,
                         java.util.List<AttributeDataSource> attributeDataSources,
                         double sampleRatio,
                         int sampleSize,
                         java.lang.String separatorsRegExpr,
                         char[] commentChars,
                         boolean useQuotes,
                         RandomGenerator random)
                  throws java.io.IOException
Constructs a new FileDataRowReader. Uses '.' as decimal point.

Parameters:
factory - Factory used to create data rows.
attributeDataSources - List of AttributeDataSources.
sampleRatio - the ratio of examples which will be read. Only used if sampleSize is -1.
sampleSize - Limits the sample to the first sampleSize lines read from the files. Use -1 for no limit; in that case sampleRatio is used.
separatorsRegExpr - a regular expression describing the separator characters for the columns of each line
commentChars - defines which characters are used to comment the rest of a line
useQuotes - indicates if quotes should be used and parsed. Slows down reading and should be avoided if possible
random - the random generator used for sampling
Throws:
java.io.IOException
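The interplay of sampleSize and sampleRatio described in the parameter list can be sketched as follows. This is one plausible reading of the parameter descriptions, not the actual implementation: java.util.Random stands in for RandomGenerator, and the names SamplingSketch and sample are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Hypothetical sketch of the sampling rules; java.util.Random replaces RandomGenerator.
public class SamplingSketch {

    /**
     * If sampleSize is not -1, keeps the first sampleSize lines.
     * Otherwise keeps each line with probability sampleRatio.
     */
    static List<String> sample(List<String> lines, double sampleRatio, int sampleSize, Random random) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            if (sampleSize != -1) {
                if (kept.size() >= sampleSize) {
                    break; // limit reached, ignore the remaining lines
                }
                kept.add(line);
            } else if (random.nextDouble() < sampleRatio) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("a", "b", "c", "d", "e");
        System.out.println(sample(lines, 0.5, 2, new Random(1)));  // prints [a, b]
        System.out.println(sample(lines, 1.0, -1, new Random(1))); // prints [a, b, c, d, e]
    }
}
```

With a ratio strictly between 0 and 1 the number of rows kept varies with the random generator, so only the limiting cases are shown with their output.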


FileDataRowReader

public FileDataRowReader(DataRowFactory factory,
                         java.util.List<AttributeDataSource> attributeDataSources,
                         double sampleRatio,
                         int sampleSize,
                         java.lang.String separatorsRegExpr,
                         char[] commentChars,
                         boolean useQuotes,
                         char decimalPointCharacter,
                         RandomGenerator random)
                  throws java.io.IOException
Constructs a new FileDataRowReader.

Parameters:
factory - Factory used to create data rows.
attributeDataSources - List of AttributeDataSources.
sampleRatio - the ratio of examples which will be read. Only used if sampleSize is -1.
sampleSize - Limits the sample to the first sampleSize lines read from the files. Use -1 for no limit; in that case sampleRatio is used.
separatorsRegExpr - a regular expression describing the separator characters for the columns of each line
commentChars - defines which characters are used to comment the rest of a line
useQuotes - indicates if quotes should be used and parsed. Slows down reading and should be avoided if possible
decimalPointCharacter - indicates the character used to define a decimal point separator
random - the random generator used for sampling
Throws:
java.io.IOException
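To illustrate the purpose of decimalPointCharacter, the sketch below parses a number written with a custom decimal separator. How this class actually applies the character is not documented here; replacing it with '.' before parsing is an assumption, and the names DecimalPointSketch and parse are hypothetical.

```java
// Hypothetical sketch: parses numbers that use a custom decimal point character.
public class DecimalPointSketch {

    /** Replaces the custom decimal point with '.' and parses the result. */
    static double parse(String token, char decimalPointCharacter) {
        return Double.parseDouble(token.trim().replace(decimalPointCharacter, '.'));
    }

    public static void main(String[] args) {
        System.out.println(parse("3,14", ',')); // prints 3.14
        System.out.println(parse("3.14", '.')); // prints 3.14
    }
}
```

If the decimal point character is ',', the column separator expression must not match a comma, otherwise each number would be split into two columns.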

Method Detail

skipLine

public void skipLine()
Skips the next line, if present.


initReader

private void initReader(DataRowFactory factory,
                        java.util.List<AttributeDataSource> attributeDataSources,
                        int sampleSize,
                        java.lang.String separatorsRegExpr,
                        char[] commentChars,
                        boolean useQuotes,
                        char decimalPointCharacter)
                 throws java.io.IOException
Read the complete data.

Throws:
java.io.IOException


readLine

private boolean readLine()
                  throws java.io.IOException
Reads a line of data from all file readers. Returns true if the line was readable, i.e. the end of the source files was not yet reached.

Throws:
java.io.IOException


hasNext

public boolean hasNext()
Checks whether another line exists and, if so, reads it. The next line is read only once, even if this method is invoked more than once before next() is called.
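The read-once behaviour can be sketched with a small iterator over a single reader that uses flags named after the lineRead and eof fields. It is a simplified illustration, not the code of this class; the class name LookaheadIteratorSketch is hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical sketch: hasNext() buffers at most one line until next() consumes it.
public class LookaheadIteratorSketch implements Iterator<String> {
    private final BufferedReader reader;
    private String currentLine;
    private boolean lineRead = false; // true while a buffered line waits for next()
    private boolean eof = false;      // true once the reader is exhausted

    public LookaheadIteratorSketch(BufferedReader reader) {
        this.reader = reader;
    }

    @Override
    public boolean hasNext() {
        if (!lineRead && !eof) {
            try {
                currentLine = reader.readLine();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            if (currentLine == null) {
                eof = true;
            } else {
                lineRead = true;
            }
        }
        return !eof;
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        lineRead = false; // the buffered line is consumed
        return currentLine;
    }

    public static void main(String[] args) {
        Iterator<String> it = new LookaheadIteratorSketch(new BufferedReader(new StringReader("a\nb\n")));
        it.hasNext();
        it.hasNext(); // the second call does not read another line
        System.out.println(it.next()); // prints a
        System.out.println(it.next()); // prints b
        System.out.println(it.hasNext()); // prints false
    }
}
```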


next

public DataRow next()
Returns the next DataRow.



Copyright © 2001-2006