org.apache.solr.internal.csv
Class CSVParser

java.lang.Object
  extended by org.apache.solr.internal.csv.CSVParser

public class CSVParser
extends Object

Parses CSV files according to the specified configuration. Because CSV exists in many dialects, the parser is configured through a CSVStrategy.

Parsing of a csv-string having tabs as separators, '"' as an optional value encapsulator, and comments starting with '#':

  String[][] data = 
   (new CSVParser(new StringReader("a\tb\nc\td"), new CSVStrategy('\t','"','#'))).getAllValues();
 

Parsing of a csv-string in Excel CSV format:

  String[][] data =
   (new CSVParser(new StringReader("a;b\nc;d"), CSVStrategy.EXCEL_STRATEGY)).getAllValues();
 

The parser's internal state is fully determined by the strategy and the state of the reader.

See the package documentation for more details.


Field Summary
protected static int TT_EOF
          Token (which can have content) when end of file is reached.
protected static int TT_EORECORD
          Token with content when end of a line is reached.
protected static int TT_INVALID
          Token has no valid content, i.e. is in its initialized state.
protected static int TT_TOKEN
          Token with content, at beginning or in the middle of a line.
 
Constructor Summary
CSVParser(Reader input)
          CSV parser using the default CSVStrategy.
CSVParser(Reader input, char delimiter)
          Deprecated. Use CSVParser(Reader, CSVStrategy).
CSVParser(Reader input, char delimiter, char encapsulator, char commentStart)
          Deprecated. Use CSVParser(Reader, CSVStrategy).
CSVParser(Reader input, CSVStrategy strategy)
          Customized CSV parser using the given CSVStrategy.
 
Method Summary
 String[][] getAllValues()
          Parses the CSV according to the given strategy and returns the content as an array of records (each record is an array of single values).
 String[] getLine()
          Parses from the current point in the stream until the end of the current line.
 int getLineNumber()
          Returns the current line number in the input stream.
 CSVStrategy getStrategy()
          Returns the CSVStrategy in use.
protected  org.apache.solr.internal.csv.CSVParser.Token nextToken()
          Convenience method for nextToken(null).
protected  org.apache.solr.internal.csv.CSVParser.Token nextToken(org.apache.solr.internal.csv.CSVParser.Token tkn)
          Returns the next token.
 String nextValue()
          Parses the CSV according to the given strategy and returns the next csv-value as string.
protected  int unicodeEscapeLexer(int c)
          Decodes Unicode escapes.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

TT_INVALID

protected static final int TT_INVALID
Token has no valid content, i.e. is in its initialized state.

See Also:
Constant Field Values

TT_TOKEN

protected static final int TT_TOKEN
Token with content, at beginning or in the middle of a line.

See Also:
Constant Field Values

TT_EOF

protected static final int TT_EOF
Token (which can have content) when end of file is reached.

See Also:
Constant Field Values

TT_EORECORD

protected static final int TT_EORECORD
Token with content when end of a line is reached.

See Also:
Constant Field Values
Constructor Detail

CSVParser

public CSVParser(Reader input)
CSV parser using the default CSVStrategy.

Parameters:
input - a Reader containing "csv-formatted" input

CSVParser

public CSVParser(Reader input,
                 char delimiter)
Deprecated. Use CSVParser(Reader, CSVStrategy).

Customized value delimiter parser. The parser follows the default CSVStrategy except for the delimiter setting.

Parameters:
input - a Reader based on "csv-formatted" input
delimiter - the char used to separate values

CSVParser

public CSVParser(Reader input,
                 char delimiter,
                 char encapsulator,
                 char commentStart)
Deprecated. Use CSVParser(Reader, CSVStrategy).

Customized CSV parser. The parser parses according to the given CSV dialect settings. Leading whitespace is trimmed, Unicode escapes are not interpreted, and empty lines are ignored.

Parameters:
input - a Reader based on "csv-formatted" input
delimiter - the char used to separate values
encapsulator - the char used to encapsulate values
commentStart - the char that marks the start of a comment

CSVParser

public CSVParser(Reader input,
                 CSVStrategy strategy)
Customized CSV parser using the given CSVStrategy.

Parameters:
input - a Reader containing "csv-formatted" input
strategy - the CSVStrategy used for CSV parsing
Method Detail

getAllValues

public String[][] getAllValues()
                        throws IOException
Parses the CSV according to the given strategy and returns the content as an array of records (each record is an array of single values).

The returned content starts at the current parse-position in the stream.

Returns:
matrix of records x values ('null' when end of file)
Throws:
IOException - on parse error or input read-failure

nextValue

public String nextValue()
                 throws IOException
Parses the CSV according to the given strategy and returns the next csv-value as string.

Returns:
next value in the input stream ('null' when end of file)
Throws:
IOException - on parse error or input read-failure

getLine

public String[] getLine()
                 throws IOException
Parses from the current point in the stream until the end of the current line.

Returns:
array of values until the end of the line ('null' when end of file has been reached)
Throws:
IOException - on parse error or input read-failure

getLineNumber

public int getLineNumber()
Returns the current line number in the input stream. ATTENTION: if the CSV contains multi-line values, the returned number does not correspond to the record number.

Returns:
current line number
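
The difference between line numbers and record numbers can be illustrated with a minimal, self-contained sketch. It does not use CSVParser and is not its implementation: the class LineVersusRecordSketch and its count method are hypothetical names, and the sketch assumes '"' as the encapsulator and '\n' as the only line terminator.

```java
/** Illustrative only: counts physical lines and logical records in CSV text. */
final class LineVersusRecordSketch {

    /** Returns {physicalLines, records}, assuming '"' encapsulates values. */
    static int[] count(String csv) {
        int lines = 0;
        int records = 0;
        boolean inQuotes = false; // true while inside an encapsulated value
        boolean pending = false;  // true when the last record has no trailing newline yet
        for (int i = 0; i < csv.length(); i++) {
            char ch = csv.charAt(i);
            if (ch == '"') {
                inQuotes = !inQuotes; // a doubled quote toggles twice, leaving the state unchanged
                pending = true;
            } else if (ch == '\n') {
                lines++;
                if (!inQuotes) { // a newline inside quotes does not end the record
                    records++;
                    pending = false;
                }
            } else {
                pending = true;
            }
        }
        if (pending) { // final record without a terminating newline
            lines++;
            records++;
        }
        return new int[] {lines, records};
    }

    public static void main(String[] args) {
        String csv = "id,note\n1,\"first line\nsecond line\"\n2,plain\n";
        int[] c = count(csv);
        // prints "physical lines: 4, records: 3"
        System.out.println("physical lines: " + c[0] + ", records: " + c[1]);
    }
}
```

In this input the second record spans two physical lines, so a line counter runs ahead of the record count as soon as that value has been read.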

nextToken

protected org.apache.solr.internal.csv.CSVParser.Token nextToken()
                                                          throws IOException
Convenience method for nextToken(null).

Throws:
IOException - on stream access error

nextToken

protected org.apache.solr.internal.csv.CSVParser.Token nextToken(org.apache.solr.internal.csv.CSVParser.Token tkn)
                                                          throws IOException
Returns the next token. A token corresponds to a term, a record change or an end-of-file indicator.

Parameters:
tkn - an existing Token object to reuse. The caller is responsible for initializing the Token.
Returns:
the next token found
Throws:
IOException - on stream access error
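
The token kinds described above (value, record change, end of file) can be sketched with a small stand-alone tokenizer. This is an assumption-laden illustration, not the CSVParser code: TokenSketch, its Type enum, and its Token class are hypothetical stand-ins for the protected TT_* constants and the nested Token type, and the sketch handles neither encapsulators nor comments.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: a delimiter-splitting tokenizer with a reusable token. */
final class TokenSketch {

    /** Hypothetical counterparts of TT_INVALID, TT_TOKEN, TT_EORECORD and TT_EOF. */
    enum Type { INVALID, TOKEN, EORECORD, EOF }

    /** Hypothetical reusable token holding a type and its content. */
    static final class Token {
        Type type = Type.INVALID;
        final StringBuilder content = new StringBuilder();

        /** Puts the token back into its initialized state. */
        Token reset() {
            type = Type.INVALID;
            content.setLength(0);
            return this;
        }
    }

    /** Fills the (already initialized) token with the next value and its kind. */
    static Token nextToken(Reader in, char delimiter, Token tkn) throws IOException {
        while (true) {
            int c = in.read();
            if (c == -1) { tkn.type = Type.EOF; return tkn; }           // end of input, may carry content
            if (c == '\n') { tkn.type = Type.EORECORD; return tkn; }     // last value of a record
            if (c == delimiter) { tkn.type = Type.TOKEN; return tkn; }   // value followed by more values
            tkn.content.append((char) c);
        }
    }

    /** Tokenizes the whole input and renders each token as "TYPE:content". */
    static List<String> describe(String csv, char delimiter) {
        List<String> out = new ArrayList<>();
        Token tkn = new Token();
        try (Reader in = new StringReader(csv)) {
            do {
                nextToken(in, delimiter, tkn.reset()); // the caller initializes the token before reuse
                out.add(tkn.type + ":" + tkn.content);
            } while (tkn.type != Type.EOF);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [TOKEN:a, EORECORD:b, TOKEN:c, EOF:d]
        System.out.println(describe("a\tb\nc\td", '\t'));
    }
}
```

Reusing one token object across calls avoids an allocation per value, which is why the reset is left to the caller in this sketch.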

unicodeEscapeLexer

protected int unicodeEscapeLexer(int c)
                          throws IOException
Decodes Unicode escapes. Interprets "\\uXXXX" escape sequences, where XXXX is a hexadecimal number.

Parameters:
c - current char which is discarded because it's the "\\" of "\\uXXXX"
Returns:
the decoded character
Throws:
IOException - on a malformed Unicode escape sequence or read error
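
As a rough illustration of this kind of decoding, the following stand-alone sketch reads a 'u' and four hexadecimal digits after a backslash and combines them into one character. It is not the CSVParser implementation: UnicodeEscapeSketch, decodeAfterBackslash, and decode are hypothetical names, and the sketch assumes that every backslash in the input starts a Unicode escape.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

/** Illustrative only: decodes backslash-u escapes with four hex digits. */
final class UnicodeEscapeSketch {

    /** Called after a backslash was consumed; expects 'u' and four hex digits next. */
    static int decodeAfterBackslash(Reader in) throws IOException {
        if (in.read() != 'u') {
            throw new IOException("expected 'u' after backslash");
        }
        int value = 0;
        for (int i = 0; i < 4; i++) {
            int c = in.read();
            int digit = (c == -1) ? -1 : Character.digit(c, 16);
            if (digit < 0) {
                throw new IOException("malformed unicode escape");
            }
            value = (value << 4) | digit; // shift in one hex digit at a time
        }
        return value;
    }

    /** Decodes all escapes in the text; wraps I/O failures in an unchecked exception. */
    static String decode(String text) {
        StringBuilder out = new StringBuilder();
        try (Reader in = new StringReader(text)) {
            int c;
            while ((c = in.read()) != -1) {
                out.append((char) (c == '\\' ? decodeAfterBackslash(in) : c));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints ABC
        System.out.println(decode("A\\u0042C"));
    }
}
```

As in the documented method, the backslash itself carries no information once it has been recognized, so only the characters that follow it are inspected.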

getStrategy

public CSVStrategy getStrategy()
Returns the CSVStrategy in use. The returned strategy should not be modified.

Returns:
strategy currently being used


Copyright © 2000-2013 Apache Software Foundation. All Rights Reserved.