Class Trie

java.lang.Object
com.ibm.icu.impl.Trie
Direct Known Subclasses:
CharTrie, IntTrie

public abstract class Trie extends Object

A trie is a kind of compressed, serializable table of values associated with Unicode code points (0..0x10ffff).

This class defines the basic structure of a trie and provides methods to retrieve the offsets to the actual data.

Data will be the form of an array of basic types, char or int.

The actual data format will have to be specified by the user in the inner static interface com.ibm.icu.impl.Trie.DataManipulate.

This trie implementation is optimized for getting offset while walking forward through a UTF-16 string. Therefore, the simplest and fastest access macros are the fromLead() and fromOffsetTrail() methods. The fromBMP() method are a little more complicated; they get offsets even for lead surrogate codepoints, while the fromLead() method get special "folded" offsets for lead surrogate code units if there is relevant data associated with them. From such a folded offsets, an offset needs to be extracted to supply to the fromOffsetTrail() methods. To handle such supplementary codepoints, some offset information are kept in the data.

Methods in com.ibm.icu.impl.Trie.DataManipulate are called to retrieve that offset from the folded value for the lead surrogate unit.

For examples of use, see com.ibm.icu.impl.CharTrie or com.ibm.icu.impl.IntTrie.

Since:
release 2.1, Jan 01 2002
See Also:
  • Nested Class Summary

    Nested Classes
    Modifier and Type
    Class
    Description
    static interface 
    Character data in com.ibm.impl.Trie have different user-specified format for different purposes.
    private static class 
     
  • Field Summary

    Fields
    Modifier and Type
    Field
    Description
    protected static final int
    Length of the BMP portion of the index (stage 1) array.
    protected static final int
    Number of data values in a stage 2 (data array) block.
    protected static final int
    Size of Trie header in bytes
    protected static final int
     
    protected static final int
     
    protected static final int
    Latin 1 option mask
    private static final int
    Header option formatting
    protected static final int
    Constant number to authenticate the byte block
    protected static final int
    Shift size for shifting right the input index.
    protected static final int
    Shift size for shifting left the index array values.
    protected static final int
    Mask for getting the lower bits from the input index.
    protected static final int
    Lead surrogate code points' index displacement in the index array.
    protected int
    Length of the data array
    Internal TrieValue which handles the parsing of the data value.
    protected int
    Start index of the data portion of the trie.
    protected char[]
    Index or UTF16 characters
    private boolean
    Flag indicator for Latin quick access data block
    private int
    Trie options field.
    protected static final int
    Number of bits of a trail surrogate that are used in index table lookups.
    protected static final int
    Number of index (stage 1) entries per lead surrogate.
    protected static final int
    Surrogate mask to use when shifting offset to retrieve supplementary values
  • Constructor Summary

    Constructors
    Modifier
    Constructor
    Description
    protected
    Trie(char[] index, int options, Trie.DataManipulate dataManipulate)
    Trie constructor
    protected
    Trie(ByteBuffer bytes, Trie.DataManipulate dataManipulate)
    Trie constructor for CharTrie use.
  • Method Summary

    Modifier and Type
    Method
    Description
    private final boolean
    checkHeader(int signature)
    Authenticates raw data header.
    boolean
    equals(Object other)
    Checks if the argument Trie has the same data as this Trie.
    protected final int
    getBMPOffset(char ch)
    Gets the offset to data which the BMP character points to Treats a lead surrogate as a normal code point.
    protected final int
    Internal trie getter from a code point.
    protected abstract int
    Gets the default initial value
    protected final int
    getLeadOffset(char ch)
    Gets the offset to the data which this lead surrogate character points to.
    protected final int
    getRawOffset(int offset, char ch)
    Gets the offset to the data which the index ch after variable offset points to.
    int
    Gets the serialized data file size of the Trie.
    protected abstract int
    getSurrogateOffset(char lead, char trail)
    Gets the offset to the data which the surrogate pair points to.
    protected abstract int
    getValue(int index)
    Gets the value at the argument index
    int
     
    protected final boolean
    Determines if this is a 16 bit trie
    protected final boolean
    Determines if this is a 32 bit trie
    final boolean
    Determines if this trie has a linear latin 1 array
    protected void
    Parses the byte buffer and creates the trie index with it.

    Methods inherited from class java.lang.Object

    clone, finalize, getClass, notify, notifyAll, toString, wait, wait, wait
  • Field Details

    • LEAD_INDEX_OFFSET_

      protected static final int LEAD_INDEX_OFFSET_
      Lead surrogate code points' index displacement in the index array. 0x10000-0xd800=0x2800 0x2800 >> INDEX_STAGE_1_SHIFT_
      See Also:
    • INDEX_STAGE_1_SHIFT_

      protected static final int INDEX_STAGE_1_SHIFT_
      Shift size for shifting right the input index. 1..9
      See Also:
    • INDEX_STAGE_2_SHIFT_

      protected static final int INDEX_STAGE_2_SHIFT_
      Shift size for shifting left the index array values. Increases possible data size with 16-bit index values at the cost of compactability. This requires blocks of stage 2 data to be aligned by DATA_GRANULARITY. 0..INDEX_STAGE_1_SHIFT
      See Also:
    • DATA_BLOCK_LENGTH

      protected static final int DATA_BLOCK_LENGTH
      Number of data values in a stage 2 (data array) block.
      See Also:
    • INDEX_STAGE_3_MASK_

      protected static final int INDEX_STAGE_3_MASK_
      Mask for getting the lower bits from the input index. DATA_BLOCK_LENGTH - 1.
      See Also:
    • SURROGATE_BLOCK_BITS

      protected static final int SURROGATE_BLOCK_BITS
      Number of bits of a trail surrogate that are used in index table lookups.
      See Also:
    • SURROGATE_BLOCK_COUNT

      protected static final int SURROGATE_BLOCK_COUNT
      Number of index (stage 1) entries per lead surrogate. Same as number of index entries for 1024 trail surrogates, ==0x400>>INDEX_STAGE_1_SHIFT_
      See Also:
    • BMP_INDEX_LENGTH

      protected static final int BMP_INDEX_LENGTH
      Length of the BMP portion of the index (stage 1) array.
      See Also:
    • SURROGATE_MASK_

      protected static final int SURROGATE_MASK_
      Surrogate mask to use when shifting offset to retrieve supplementary values
      See Also:
    • m_index_

      protected char[] m_index_
      Index or UTF16 characters
    • m_dataManipulate_

      protected Trie.DataManipulate m_dataManipulate_
      Internal TrieValue which handles the parsing of the data value. This class is to be implemented by the user
    • m_dataOffset_

      protected int m_dataOffset_
      Start index of the data portion of the trie. CharTrie combines index and data into a char array, so this is used to indicate the initial offset to the data portion. Note this index always points to the initial value.
    • m_dataLength_

      protected int m_dataLength_
      Length of the data array
    • HEADER_LENGTH_

      protected static final int HEADER_LENGTH_
      Size of Trie header in bytes
      See Also:
    • HEADER_OPTIONS_LATIN1_IS_LINEAR_MASK_

      protected static final int HEADER_OPTIONS_LATIN1_IS_LINEAR_MASK_
      Latin 1 option mask
      See Also:
    • HEADER_SIGNATURE_

      protected static final int HEADER_SIGNATURE_
      Constant number to authenticate the byte block
      See Also:
    • HEADER_OPTIONS_SHIFT_MASK_

      private static final int HEADER_OPTIONS_SHIFT_MASK_
      Header option formatting
      See Also:
    • HEADER_OPTIONS_INDEX_SHIFT_

      protected static final int HEADER_OPTIONS_INDEX_SHIFT_
      See Also:
    • HEADER_OPTIONS_DATA_IS_32_BIT_

      protected static final int HEADER_OPTIONS_DATA_IS_32_BIT_
      See Also:
    • m_isLatin1Linear_

      private boolean m_isLatin1Linear_
      Flag indicator for Latin quick access data block
    • m_options_

      private int m_options_

      Trie options field.

      options bit field:
      9 1 = Latin-1 data is stored linearly at data + DATA_BLOCK_LENGTH
      8 0 = 16-bit data, 1=32-bit data
      7..4 INDEX_STAGE_1_SHIFT // 0..INDEX_STAGE_2_SHIFT
      3..0 INDEX_STAGE_2_SHIFT // 1..9

  • Constructor Details

    • Trie

      protected Trie(ByteBuffer bytes, Trie.DataManipulate dataManipulate)
      Trie constructor for CharTrie use.
      Parameters:
      bytes - data of an ICU data file, containing the trie
      dataManipulate - object containing the information to parse the trie data
    • Trie

      protected Trie(char[] index, int options, Trie.DataManipulate dataManipulate)
      Trie constructor
      Parameters:
      index - array to be used for index
      options - used by the trie
      dataManipulate - object containing the information to parse the trie data
  • Method Details

    • isLatin1Linear

      public final boolean isLatin1Linear()
      Determines if this trie has a linear latin 1 array
      Returns:
      true if this trie has a linear latin 1 array, false otherwise
    • equals

      public boolean equals(Object other)
      Checks if the argument Trie has the same data as this Trie. Attributes are checked but not the index data.
      Overrides:
      equals in class Object
      Parameters:
      other - Trie to check
      Returns:
      true if the argument Trie has the same data as this Trie, false otherwise
    • hashCode

      public int hashCode()
      Overrides:
      hashCode in class Object
    • getSerializedDataSize

      public int getSerializedDataSize()
      Gets the serialized data file size of the Trie. This is used during trie data reading for size checking purposes.
      Returns:
      size size of serialized trie data file in terms of the number of bytes
    • getSurrogateOffset

      protected abstract int getSurrogateOffset(char lead, char trail)
      Gets the offset to the data which the surrogate pair points to.
      Parameters:
      lead - lead surrogate
      trail - trailing surrogate
      Returns:
      offset to data
    • getValue

      protected abstract int getValue(int index)
      Gets the value at the argument index
      Parameters:
      index - value at index will be retrieved
      Returns:
      32 bit value
    • getInitialValue

      protected abstract int getInitialValue()
      Gets the default initial value
      Returns:
      32 bit value
    • getRawOffset

      protected final int getRawOffset(int offset, char ch)
      Gets the offset to the data which the index ch after variable offset points to. Note for locating a non-supplementary character data offset, calling

      getRawOffset(0, ch);

      will do. Otherwise if it is a supplementary character formed by surrogates lead and trail. Then we would have to call getRawOffset() with getFoldingIndexOffset(). See getSurrogateOffset().
      Parameters:
      offset - index offset which ch is to start from
      ch - index to be used after offset
      Returns:
      offset to the data
    • getBMPOffset

      protected final int getBMPOffset(char ch)
      Gets the offset to data which the BMP character points to Treats a lead surrogate as a normal code point.
      Parameters:
      ch - BMP character
      Returns:
      offset to data
    • getLeadOffset

      protected final int getLeadOffset(char ch)
      Gets the offset to the data which this lead surrogate character points to. Data at the returned offset may contain folding offset information for the next trailing surrogate character.
      Parameters:
      ch - lead surrogate character
      Returns:
      offset to data
    • getCodePointOffset

      protected final int getCodePointOffset(int ch)
      Internal trie getter from a code point. Could be faster(?) but longer with if((c32)<=0xd7ff) { (result)=_TRIE_GET_RAW(trie, data, 0, c32); } Gets the offset to data which the codepoint points to
      Parameters:
      ch - codepoint
      Returns:
      offset to data
    • unserialize

      protected void unserialize(ByteBuffer bytes)

      Parses the byte buffer and creates the trie index with it.

      The position of the input ByteBuffer must be right after the trie header.

      This is overwritten by the child classes.

      Parameters:
      bytes - buffer containing trie data
    • isIntTrie

      protected final boolean isIntTrie()
      Determines if this is a 32 bit trie
      Returns:
      true if options specifies this is a 32 bit trie
    • isCharTrie

      protected final boolean isCharTrie()
      Determines if this is a 16 bit trie
      Returns:
      true if this is a 16 bit trie
    • checkHeader

      private final boolean checkHeader(int signature)
      Authenticates raw data header. Checking the header information, signature and options.
      Parameters:
      signature - This contains the options and type of a Trie
      Returns:
      true if the header is authenticated valid