Class WordSet

java.lang.Object
com.ctc.wstx.util.WordSet

public final class WordSet extends Object
An efficient (both memory and time) implementation of a Set used to verify that a given word is contained within the set. The general usage pattern is expected to be such that most checks are positive, ie. that the word indeed is contained in the set.

Performance of the set is comparable to that of TreeSet for Strings, ie. 2-3x slower than HashSet when using pre-constructed Strings. This is generally result of algorithmic complexity of structures; Word and Tree sets are roughly logarithmic to the whole data, whereas Hash set is linear to the length of key. However:

  • WordSet can use char arrays as keys, without constructing Strings. In cases where there is no (need for) Strings, WordSet seems to be about twice as fast, even without considering additional GC caused by temporary String instances.
  • WordSet is more compact in its memory presentation; if Strings are shared its size is comparable to optimally filled HashSet, and if no such Strings exists, its much more compact (relatively speaking)

Although this is an efficient set for specific set of usage patterns, one restriction is that the full set of words to include has to be known before constructing the set. Also, the size of the set is limited to total word content of about 20k characters; factory method does verify the limit and indicates if an instance can not be created.

  • Nested Class Summary

    Nested Classes
    Modifier and Type
    Class
    Description
    private static final class 
     
  • Field Summary

    Fields
    Modifier and Type
    Field
    Description
    (package private) static final char
     
    (package private) final char[]
    Compressed presentation of the word set.
    (package private) static final int
    This is actually just a guess; but in general linear search should be faster for short sequences (definitely for 4 or less; maybe up to 8 or less?)
    (package private) static final int
    Offset added to numbers to mark 'negative' numbers.
  • Constructor Summary

    Constructors
    Modifier
    Constructor
    Description
    private
    WordSet(char[] data)
     
  • Method Summary

    Modifier and Type
    Method
    Description
    static char[]
     
    static WordSet
     
    static boolean
    contains(char[] data, char[] str, int start, int end)
     
    boolean
    contains(char[] buf, int start, int end)
     
    static boolean
    contains(char[] data, String str)
     
    boolean
     

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Field Details

    • CHAR_NULL

      static final char CHAR_NULL
      See Also:
    • NEGATIVE_OFFSET

      static final int NEGATIVE_OFFSET
      Offset added to numbers to mark 'negative' numbers. Asymmetric, since range of negative markers needed is smaller than positive numbers...
      See Also:
    • mData

      final char[] mData
      Compressed presentation of the word set.
  • Constructor Details

    • WordSet

      private WordSet(char[] data)
  • Method Details

    • constructSet

      public static WordSet constructSet(TreeSet<String> wordSet)
    • constructRaw

      public static char[] constructRaw(TreeSet<String> wordSet)
    • contains

      public boolean contains(char[] buf, int start, int end)
    • contains

      public static boolean contains(char[] data, char[] str, int start, int end)
    • contains

      public boolean contains(String str)
    • contains

      public static boolean contains(char[] data, String str)