Class CFSA

All Implemented Interfaces:
Iterable<ByteBuffer>

public final class CFSA extends FSA
CFSA (Compact Finite State Automaton) binary format implementation. This is a slightly reorganized version of FSA5 offering smaller automata size at some (minor) performance penalty.

Note: Serialize to CFSA2 for new code.

The encoding of automaton body is as follows.

 ---- FSA header (standard)
 Byte                            Description 
       +-+-+-+-+-+-+-+-+\
     0 | | | | | | | | | +------ '\'
       +-+-+-+-+-+-+-+-+/
       +-+-+-+-+-+-+-+-+\
     1 | | | | | | | | | +------ 'f'
       +-+-+-+-+-+-+-+-+/
       +-+-+-+-+-+-+-+-+\
     2 | | | | | | | | | +------ 's'
       +-+-+-+-+-+-+-+-+/
       +-+-+-+-+-+-+-+-+\
     3 | | | | | | | | | +------ 'a'
       +-+-+-+-+-+-+-+-+/
       +-+-+-+-+-+-+-+-+\
     4 | | | | | | | | | +------ version (fixed 0xc5)
       +-+-+-+-+-+-+-+-+/
       +-+-+-+-+-+-+-+-+\
     5 | | | | | | | | | +------ filler character
       +-+-+-+-+-+-+-+-+/
       +-+-+-+-+-+-+-+-+\
     6 | | | | | | | | | +------ annot character
       +-+-+-+-+-+-+-+-+/
       +-+-+-+-+-+-+-+-+\
     7 |C|C|C|C|G|G|G|G| +------ C - node data size (ctl), G - address size (gotoLength)
       +-+-+-+-+-+-+-+-+/
       +-+-+-+-+-+-+-+-+\
  8-32 | | | | | | | | | +------ labels mapped for type (1) of arc encoding. 
       : : : : : : : : : |
       +-+-+-+-+-+-+-+-+/
 
 ---- Start of a node; only if automaton was compiled with NUMBERS option.
 
 Byte
        +-+-+-+-+-+-+-+-+\
      0 | | | | | | | | | \  LSB
        +-+-+-+-+-+-+-+-+  +
      1 | | | | | | | | |  |      number of strings recognized
        +-+-+-+-+-+-+-+-+  +----- by the automaton starting
        : : : : : : : : :  |      from this node.
        +-+-+-+-+-+-+-+-+  +
  ctl-1 | | | | | | | | | /  MSB
        +-+-+-+-+-+-+-+-+/
        
 ---- A vector of node's arcs. Conditional format, depending on flags.
 
 1) NEXT bit set, mapped arc label. 
 
                +--------------- arc's label mapped in M bits if M's field value > 0
                | +------------- node pointed to is next
                | | +----------- the last arc of the node
         _______| | | +--------- the arc is final
        /       | | | |
       +-+-+-+-+-+-+-+-+\
     0 |M|M|M|M|M|1|L|F| +------ flags + (M) index of the mapped label.
       +-+-+-+-+-+-+-+-+/
 
 2) NEXT bit set, label separate.
 
                +--------------- arc's label stored separately (M's field is zero).
                | +------------- node pointed to is next
                | | +----------- the last arc of the node
                | | | +--------- the arc is final
                | | | |
       +-+-+-+-+-+-+-+-+\
     0 |0|0|0|0|0|1|L|F| +------ flags
       +-+-+-+-+-+-+-+-+/
       +-+-+-+-+-+-+-+-+\
     1 | | | | | | | | | +------ label
       +-+-+-+-+-+-+-+-+/
 
 3) NEXT bit not set. Full arc.
 
                  +------------- node pointed to is next
                  | +----------- the last arc of the node
                  | | +--------- the arc is final
                  | | |
       +-+-+-+-+-+-+-+-+\
     0 |A|A|A|A|A|0|L|F| +------ flags + (A) address field, lower bits
       +-+-+-+-+-+-+-+-+/
       +-+-+-+-+-+-+-+-+\
     1 | | | | | | | | | +------ label
       +-+-+-+-+-+-+-+-+/
       : : : : : : : : :       
       +-+-+-+-+-+-+-+-+\
 gtl-1 |A|A|A|A|A|A|A|A| +------ address, continuation (MSB)
       +-+-+-+-+-+-+-+-+/
 
  • Field Details

    • VERSION

      public static final byte VERSION
      Automaton header version value.
      See Also:
    • BIT_FINAL_ARC

      public static final int BIT_FINAL_ARC
      Bitmask indicating that an arc corresponds to the last character of a sequence available when building the automaton.
      See Also:
    • BIT_LAST_ARC

      public static final int BIT_LAST_ARC
      Bitmask indicating that an arc is the last one of the node's list and the following one belongs to another node.
      See Also:
    • BIT_TARGET_NEXT

      public static final int BIT_TARGET_NEXT
      Bitmask indicating that the target node of this arc follows it in the compressed automaton structure (no goto field).
      See Also:
    • arcs

      public byte[] arcs
      An array of bytes with the internal representation of the automaton. Please see the documentation of this class for more information on how this structure is organized.
    • nodeDataLength

      public final int nodeDataLength
      The length of the node header structure (if the automaton was compiled with NUMBERS option). Otherwise zero.
    • flags

      private final Set<FSAFlags> flags
      Flags for this automaton version.
    • gtl

      public final int gtl
      Number of bytes each address takes in full, expanded form (goto length).
    • labelMapping

      public final byte[] labelMapping
      Label mapping for arcs of type (1) (see class documentation). The array is indexed by mapped label's value and contains the original label.
  • Constructor Details

  • Method Details

    • getRootNode

      public int getRootNode()
      Returns the start node of this automaton. May return 0 if the start node is also an end node.
      Specified by:
      getRootNode in class FSA
      Returns:
      Returns the identifier of the root node of this automaton. Returns 0 if the start node is also the end node (the automaton is empty).
    • getFirstArc

      public final int getFirstArc(int node)
      Specified by:
      getFirstArc in class FSA
      Parameters:
      node - Identifier of the node.
      Returns:
      Returns the identifier of the first arc leaving node or 0 if the node has no outgoing arcs.
    • getNextArc

      public final int getNextArc(int arc)
      Specified by:
      getNextArc in class FSA
      Parameters:
      arc - The arc's identifier.
      Returns:
      Returns the identifier of the next arc after arc and leaving node. Zero is returned if no more arcs are available for the node.
    • getArc

      public int getArc(int node, byte label)
      Specified by:
      getArc in class FSA
      Parameters:
      node - Identifier of the node.
      label - The arc's label.
      Returns:
      Returns the identifier of an arc leaving node and labeled with label. An identifier equal to 0 means the node has no outgoing arc labeled label.
    • getEndNode

      public int getEndNode(int arc)
      Specified by:
      getEndNode in class FSA
      Parameters:
      arc - The arc's identifier.
      Returns:
      Return the end node pointed to by a given arc. Terminal arcs (those that point to a terminal state) have no end node representation and throw a runtime exception.
    • getArcLabel

      public byte getArcLabel(int arc)
      Specified by:
      getArcLabel in class FSA
      Parameters:
      arc - The arc's identifier.
      Returns:
      Return the label associated with a given arc.
    • getRightLanguageCount

      public int getRightLanguageCount(int node)
      Overrides:
      getRightLanguageCount in class FSA
      Parameters:
      node - Identifier of the node.
      Returns:
      Returns the number of sequences reachable from the given state if the automaton was compiled with FSAFlags.NUMBERS. The size of the right language of the state, in other words.
    • isArcFinal

      public boolean isArcFinal(int arc)
      Specified by:
      isArcFinal in class FSA
      Parameters:
      arc - The arc's identifier.
      Returns:
      Returns true if the destination node at the end of this arc corresponds to an input sequence created when building this automaton.
    • isArcTerminal

      public boolean isArcTerminal(int arc)
      Specified by:
      isArcTerminal in class FSA
      Parameters:
      arc - The arc's identifier.
      Returns:
      Returns true if this arc does not have a terminating node (@link FSA.getEndNode(int) will throw an exception). Implies FSA.isArcFinal(int).
    • isArcLast

      public boolean isArcLast(int arc)
      Returns true if this arc has NEXT bit set.
      Parameters:
      arc - The node's arc identifier.
      Returns:
      Returns true if the argument is the last arc of a node.
      See Also:
    • isNextSet

      public boolean isNextSet(int arc)
      Parameters:
      arc - The node's arc identifier.
      Returns:
      Returns true if BIT_TARGET_NEXT is set for this arc.
      See Also:
    • isLabelCompressed

      public boolean isLabelCompressed(int arc)
      Parameters:
      arc - The node's arc identifier.
      Returns:
      Returns true if the label is compressed inside flags byte.
    • getFlags

      public Set<FSAFlags> getFlags()

      For this automaton version, an additional FSAFlags.NUMBERS flag may be set to indicate the automaton contains extra fields for each node.

      Specified by:
      getFlags in class FSA
      Returns:
      Returns a set of flags for this FSA instance.
    • getDestinationNodeOffset

      final int getDestinationNodeOffset(int arc)
      Returns the address of the node pointed to by this arc.
    • skipArc

      private int skipArc(int offset)
      Read the arc's layout and skip as many bytes, as needed, to skip it.