Class ScannerSrxTextIterator
All Implemented Interfaces: Iterator<String>, TextIterator
Quick and dirty implementation of TextIterator using Scanner.
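To illustrate the general idea of splitting text with Scanner, here is a minimal sketch. It is not this class's code: the class name ScannerSplitSketch, the method segments, and the break pattern (whitespace preceded by '.', '!' or '?') are assumptions made for the example, and no exception rules are applied.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

public class ScannerSplitSketch {

    // Hypothetical break rule: whitespace that follows '.', '!' or '?'.
    // A real SRX document would supply its own patterns.
    private static final Pattern BREAK = Pattern.compile("(?<=[.!?])\\s+");

    // Reads the text through a Scanner and returns one token per segment.
    static List<String> segments(String text) {
        List<String> result = new ArrayList<>();
        try (Scanner scanner = new Scanner(new StringReader(text))) {
            // The delimiter pattern marks the places where the text is split.
            scanner.useDelimiter(BREAK);
            while (scanner.hasNext()) {
                result.add(scanner.next());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (String segment : segments("Hello world. How are you? Fine.")) {
            System.out.println(segment);
        }
    }
}
```

Because the delimiter consumes the whitespace between sentences, the segments in this sketch come back without it. An implementation that must preserve the whitespace would need a different pattern.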
Preliminary tests showed that it takes between 50% and 100% longer to complete than the default text iterator. The probable cause is slow matching of exception rules, although splitting with break rules only is also slower.
Like other iterators that scan with one big pattern, this implementation cannot resolve overlapping rules, and there seems to be no easy solution. Overlapping rules should not occur in input patterns, but in a large SRX file that uses cascading they are very easy to miss.
One solution could be to sort patterns by length, but this is sometimes
impossible. For example:
Rules are "(ab)+" and "a(b)+"
Inputs are "ababx" and "abbbx"
For the first input, the order of the exception rules should be reversed so
that the text is split as early as possible, but for the second input it
should not.
Another solution could be to use reluctant quantifiers instead of greedy ones,
but that changes the input patterns provided by the user and is therefore
undesirable.
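The example above can be checked with plain java.util.regex. The following is a sketch only: the class RuleOrderSketch and its helpers prefixMatchLength and combine are hypothetical names, and joining the two rules with "|" is an assumption about how a one-big-pattern scan might combine them, not a copy of this class's code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RuleOrderSketch {

    // Number of characters matched at the start of input, or -1 if the
    // regex does not match there.
    static int prefixMatchLength(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        return matcher.lookingAt() ? matcher.end() : -1;
    }

    // Joins two rules into one alternation, in the given order.
    // Java tries alternatives left to right and keeps the first that matches.
    static String combine(String first, String second) {
        return "(?:" + first + ")|(?:" + second + ")";
    }

    public static void main(String[] args) {
        String ruleA = "(ab)+";
        String ruleB = "a(b)+";
        for (String input : new String[] {"ababx", "abbbx"}) {
            System.out.println(input
                    + " A|B=" + prefixMatchLength(combine(ruleA, ruleB), input)
                    + " B|A=" + prefixMatchLength(combine(ruleB, ruleA), input)
                    + " reluctant=" + prefixMatchLength(combine("(ab)+?", "a(b)+?"), input));
        }
    }
}
```

For "ababx" the order A|B matches 4 characters and B|A matches 2, while for "abbbx" the result is reversed, so no fixed order gives the shortest match for both inputs. With reluctant quantifiers both inputs match 2 characters, at the cost of altering the rules.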
Field Summary
Fields: scanner, exceptionMap, noBreakRules
Constructor Summary
ScannerSrxTextIterator(SrxDocument document, String languageCode, Reader reader, Map<String, Object> parameterMap)
ScannerSrxTextIterator(SrxDocument document, String languageCode, String text, Map<String, Object> parameterMap)
private ScannerSrxTextIterator(SrxDocument document, String languageCode, Scanner scanner)
Method Summary
private String createBreakRegexLookahead
private String createBreakRegexNoLookahead
private String createExceptionRegex(Rule rule)
createExceptions(List<LanguageRule> languageRuleList)
private String createSeparator(List<LanguageRule> languageRuleList)
boolean hasNext()
private boolean isException(StringBuilder segment)
next()
Methods inherited from class net.loomchild.segment.AbstractTextIterator
remove, toString
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface java.util.Iterator
forEachRemaining
Field Details
scanner
exceptionMap
noBreakRules
private boolean noBreakRules
Constructor Details
ScannerSrxTextIterator
ScannerSrxTextIterator
ScannerSrxTextIterator
Method Details
createSeparator
createExceptions
createBreakRegexLookahead
createBreakRegexNoLookahead
createExceptionRegex
hasNext
public boolean hasNext()
Returns: true if there are more segments
next
Returns: the next segment in the text, or null if the end of the text has been reached.
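The documented contract for hasNext and next suggests a consumer loop like the sketch below. NullAtEndIterator is a hypothetical stand-in that only mimics the documented behaviour (next returns null at the end of the text); it is not part of the library, and SegmentLoopSketch and collect are likewise names invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class SegmentLoopSketch {

    // Hypothetical stand-in for a text iterator: next() returns null once
    // the end of the text has been reached, as the documentation describes.
    static class NullAtEndIterator implements Iterator<String> {
        private final Iterator<String> delegate;

        NullAtEndIterator(List<String> segments) {
            this.delegate = segments.iterator();
        }

        @Override
        public boolean hasNext() {
            return delegate.hasNext();
        }

        @Override
        public String next() {
            return delegate.hasNext() ? delegate.next() : null;
        }
    }

    // Collects all segments, guarding against a null returned at the end.
    static List<String> collect(Iterator<String> iterator) {
        List<String> segments = new ArrayList<>();
        while (iterator.hasNext()) {
            String segment = iterator.next();
            if (segment == null) {
                break; // defensive: end of text reached
            }
            segments.add(segment);
        }
        return segments;
    }

    public static void main(String[] args) {
        Iterator<String> iterator =
                new NullAtEndIterator(Arrays.asList("First. ", "Second."));
        System.out.println(collect(iterator));
    }
}
```

Checking hasNext before each call keeps the loop safe for any Iterator<String>, and the null check covers callers that invoke next directly.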
isException