Class | Description |
---|---|
BaseCharFilter | Base utility class for implementing a CharFilter. |
HTMLStripCharFilter | A CharFilter that wraps another Reader and attempts to strip out HTML constructs. |
HTMLStripCharFilterFactory | Factory for HTMLStripCharFilter. |
MappingCharFilter | Simplistic CharFilter that applies the mappings contained in a NormalizeCharMap to the character stream and corrects the resulting changes to the offsets. |
MappingCharFilterFactory | Factory for MappingCharFilter. |
NormalizeCharMap | Holds a map of String input to String output, to be used with MappingCharFilter. |
NormalizeCharMap.Builder | Builds a NormalizeCharMap. |
Normalization of text before the tokenizer.
CharFilters are chainable filters that normalize text before tokenization and provide mappings between normalized text offsets and the corresponding offset in the original text.
CharFilters modify an input stream via a series of substring replacements (including deletions and insertions) to produce an output stream. There are three possible replacement cases: the replacement string has the same length as the original substring; the replacement is shorter; and the replacement is longer. In the latter two cases (when the replacement has a different length than the original), one or more offset correction mappings are required.
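The three cases can be summarized as a count of the offset correction mappings each replacement needs. The following is a minimal, self-contained sketch; the class ReplacementCases and the method mappingsRequired are hypothetical names used only for illustration and are not part of this package.

```java
final class ReplacementCases {

    /**
     * Returns how many offset correction mappings a single replacement needs:
     * none for equal lengths, one when the replacement is shorter, and one per
     * extra character when the replacement is longer.
     */
    static int mappingsRequired(String original, String replacement) {
        int diff = original.length() - replacement.length();
        if (diff == 0) {
            return 0;       // same length: offsets are unchanged
        }
        if (diff > 0) {
            return 1;       // shorter replacement: a single mapping
        }
        return -diff;       // longer replacement: one mapping per extra character
    }

    public static void main(String[] args) {
        System.out.println(mappingsRequired("ab", "cd"));   // prints 0
        System.out.println(mappingsRequired("<b>", ""));    // prints 1
        System.out.println(mappingsRequired("&", "and"));   // prints 2
    }
}
```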
When the replacement is shorter than the original (e.g. when the replacement is the empty string), a single offset correction mapping should be added at the replacement's end offset in the output stream. The cumulativeDiff parameter to the addOffCorrectMap() method will be the sum of all previous replacement offset adjustments plus the difference between the lengths of the original substring and the replacement string (a positive value).
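As a worked illustration of the shorter-replacement case, the sketch below strips the three characters "<b>" from the input "a<b>c", producing "ac". It does not use Lucene; the class ShorterReplacementSketch and its methods addMapping and correct are hypothetical stand-ins that assume a correction is looked up as "the last mapping at or before the given output offset", and they are not the library's implementation.

```java
import java.util.Map;
import java.util.TreeMap;

final class ShorterReplacementSketch {

    // Output-stream offset -> cumulative diff that applies from that offset onward.
    private final TreeMap<Integer, Integer> corrections = new TreeMap<>();

    /** Hypothetical stand-in for the offset-correction method described above. */
    void addMapping(int outputOffset, int cumulativeDiff) {
        corrections.put(outputOffset, cumulativeDiff);
    }

    /** Maps an offset in the output stream back to an offset in the original text. */
    int correct(int outputOffset) {
        Map.Entry<Integer, Integer> entry = corrections.floorEntry(outputOffset);
        return entry == null ? outputOffset : outputOffset + entry.getValue();
    }

    /** Builds the corrections for "a<b>c" -> "ac" (the tag is replaced by the empty string). */
    static ShorterReplacementSketch stripBoldTag() {
        ShorterReplacementSketch sketch = new ShorterReplacementSketch();
        int previousDiff = 0;      // no earlier replacements in this input
        int replacementEnd = 1;    // the empty replacement starts and ends at output offset 1
        int diff = "<b>".length() - "".length();   // 3, a positive value
        sketch.addMapping(replacementEnd, previousDiff + diff);
        return sketch;
    }

    public static void main(String[] args) {
        ShorterReplacementSketch sketch = stripBoldTag();
        System.out.println(sketch.correct(0));   // 'a' in the output, prints 0
        System.out.println(sketch.correct(1));   // 'c' in the output, prints 4
    }
}
```

Offsets before the mapping are returned unchanged, while every offset at or after it is shifted forward by the three removed characters.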
When the replacement is longer than the original (e.g. when the original is the empty string), you should add as many offset correction mappings as the difference between the lengths of the replacement string and the original substring, starting at the end offset the original substring would have had in the output stream. For each of these mappings, the cumulativeDiff parameter to the addOffCorrectMap() method will be the sum of all previous replacement offset adjustments plus the difference between the lengths of the original substring and the replacement string so far (a negative value).
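The longer-replacement case can be sketched the same way. The example below replaces "&" with "and" in the input "a&b", producing "aandb", and records one mapping per extra character. It is a self-contained illustration that does not use Lucene; LongerReplacementSketch, addMapping, addLongerReplacement and correct are hypothetical names, and the floor-lookup behaviour of correct is an assumption made for the sketch.

```java
import java.util.Map;
import java.util.TreeMap;

final class LongerReplacementSketch {

    // Output-stream offset -> cumulative diff that applies from that offset onward.
    private final TreeMap<Integer, Integer> corrections = new TreeMap<>();

    /** Hypothetical stand-in for the offset-correction method described above. */
    void addMapping(int outputOffset, int cumulativeDiff) {
        corrections.put(outputOffset, cumulativeDiff);
    }

    /** Maps an offset in the output stream back to an offset in the original text. */
    int correct(int outputOffset) {
        Map.Entry<Integer, Integer> entry = corrections.floorEntry(outputOffset);
        return entry == null ? outputOffset : outputOffset + entry.getValue();
    }

    /** Adds one mapping per extra character of a replacement that is longer than the original. */
    void addLongerReplacement(int outputStart, int originalLength,
                              int replacementLength, int previousDiff) {
        int extra = replacementLength - originalLength;
        // First mapping goes where the original substring would have ended in the output.
        int firstOffset = outputStart + originalLength;
        for (int i = 0; i < extra; i++) {
            // The replacement "so far" is (i + 1) characters longer than the original.
            addMapping(firstOffset + i, previousDiff - (i + 1));
        }
    }

    /** Builds the corrections for "a&b" -> "aandb". */
    static LongerReplacementSketch expandAmpersand() {
        LongerReplacementSketch sketch = new LongerReplacementSketch();
        sketch.addLongerReplacement(1, "&".length(), "and".length(), 0);
        return sketch;
    }

    public static void main(String[] args) {
        LongerReplacementSketch sketch = expandAmpersand();
        for (int offset = 0; offset <= 4; offset++) {
            System.out.println(offset + " -> " + sketch.correct(offset));
        }
    }
}
```

Under these assumptions the two mappings are added at output offsets 2 and 3 with cumulative diffs of -1 and -2, so all three characters of "and" map back to the position of "&", and "b" maps back to its original offset 2.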
Copyright © 2000-2014 Apache Software Foundation. All Rights Reserved.