Package com.ctc.wstx.util
Class WordSet
java.lang.Object
com.ctc.wstx.util.WordSet
An efficient (both memory and time) implementation of a Set used to
verify that a given
word is contained within the set. The general usage pattern is expected
to be such that most checks are positive, ie. that the word indeed
is contained in the set.
Performance of the set is comparable to that of TreeSet
for Strings, ie. 2-3x slower than HashSet
when
using pre-constructed Strings. This is generally result of algorithmic
complexity of structures; Word and Tree sets are roughly logarithmic
to the whole data, whereas Hash set is linear to the length of key.
However:
- WordSet can use char arrays as keys, without constructing Strings. In cases where there is no (need for) Strings, WordSet seems to be about twice as fast, even without considering additional GC caused by temporary String instances.
- WordSet is more compact in its memory presentation; if Strings are shared its size is comparable to optimally filled HashSet, and if no such Strings exists, its much more compact (relatively speaking)
Although this is an efficient set for specific set of usage patterns, one restriction is that the full set of words to include has to be known before constructing the set. Also, the size of the set is limited to total word content of about 20k characters; factory method does verify the limit and indicates if an instance can not be created.
-
Nested Class Summary
Nested Classes -
Field Summary
FieldsModifier and TypeFieldDescription(package private) static final char
(package private) final char[]
Compressed presentation of the word set.(package private) static final int
This is actually just a guess; but in general linear search should be faster for short sequences (definitely for 4 or less; maybe up to 8 or less?)(package private) static final int
Offset added to numbers to mark 'negative' numbers. -
Constructor Summary
Constructors -
Method Summary
Modifier and TypeMethodDescriptionstatic char[]
constructRaw
(TreeSet<String> wordSet) static WordSet
constructSet
(TreeSet<String> wordSet) static boolean
contains
(char[] data, char[] str, int start, int end) boolean
contains
(char[] buf, int start, int end) static boolean
boolean
-
Field Details
-
CHAR_NULL
static final char CHAR_NULL- See Also:
-
NEGATIVE_OFFSET
static final int NEGATIVE_OFFSETOffset added to numbers to mark 'negative' numbers. Asymmetric, since range of negative markers needed is smaller than positive numbers...- See Also:
-
MIN_BINARY_SEARCH
static final int MIN_BINARY_SEARCHThis is actually just a guess; but in general linear search should be faster for short sequences (definitely for 4 or less; maybe up to 8 or less?)- See Also:
-
mData
final char[] mDataCompressed presentation of the word set.
-
-
Constructor Details
-
WordSet
private WordSet(char[] data)
-
-
Method Details
-
constructSet
-
constructRaw
-
contains
public boolean contains(char[] buf, int start, int end) -
contains
public static boolean contains(char[] data, char[] str, int start, int end) -
contains
-
contains
-