Interface ISequenceEncoder

All Known Implementing Classes:
NoEncoder, TrimInfixAndSuffixEncoder, TrimPrefixAndSuffixEncoder, TrimSuffixEncoder

public interface ISequenceEncoder
The logic of encoding one sequence of bytes relative to another sequence of bytes. The "base" form and the "derived" form are typically the stem of a word and the inflected form of a word.

Encoding the derived form relative to the base form makes the data for the automaton smaller and more repetitive, which results in higher compression rates.

See the implementing classes listed above for details.
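
To make the idea of relative encoding concrete, here is a minimal sketch of a suffix-trimming scheme. It is an illustration only and is not taken from any of the implementing classes: the class name SuffixTrimSketch, the helper names, and the one-byte "trim count" layout are all assumptions made for this example. The encoded form is one byte holding the number of bytes to drop from the end of source, followed by the bytes to append.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; not the library's actual encoder or wire format.
public class SuffixTrimSketch {
    /** Encodes target as: [bytes to drop from the end of source][suffix to append]. */
    static ByteBuffer encode(ByteBuffer reuse, ByteBuffer source, ByteBuffer target) {
        int max = Math.min(source.remaining(), target.remaining());
        int shared = 0; // length of the common prefix
        while (shared < max
                && source.get(source.position() + shared) == target.get(target.position() + shared)) {
            shared++;
        }
        int trim = source.remaining() - shared;
        if (trim > 255) {
            throw new IllegalArgumentException("Trim count does not fit in one byte: " + trim);
        }
        int suffixLen = target.remaining() - shared;
        ByteBuffer out = prepare(reuse, 1 + suffixLen);
        out.put((byte) trim);
        for (int i = 0; i < suffixLen; i++) {
            out.put(target.get(target.position() + shared + i));
        }
        out.flip();
        return out;
    }

    /** Reverses encode(): keeps the untrimmed part of source, then appends the stored suffix. */
    static ByteBuffer decode(ByteBuffer reuse, ByteBuffer source, ByteBuffer encoded) {
        int trim = encoded.get(encoded.position()) & 0xFF;
        int keep = source.remaining() - trim;
        int suffixLen = encoded.remaining() - 1;
        ByteBuffer out = prepare(reuse, keep + suffixLen);
        for (int i = 0; i < keep; i++) {
            out.put(source.get(source.position() + i));
        }
        for (int i = 0; i < suffixLen; i++) {
            out.put(encoded.get(encoded.position() + 1 + i));
        }
        out.flip();
        return out;
    }

    /** Simplification: clears and reuses the buffer if its capacity suffices, else allocates. */
    static ByteBuffer prepare(ByteBuffer reuse, int needed) {
        if (reuse == null || reuse.capacity() < needed) {
            return ByteBuffer.allocate(needed);
        }
        reuse.clear();
        return reuse;
    }

    static ByteBuffer bytes(String s) {
        return ByteBuffer.wrap(s.getBytes(StandardCharsets.UTF_8));
    }

    static String text(ByteBuffer b) {
        return StandardCharsets.UTF_8.decode(b.duplicate()).toString();
    }

    public static void main(String[] args) {
        ByteBuffer source = bytes("studies");
        ByteBuffer encoded = encode(null, source, bytes("study"));
        System.out.println(encoded.remaining());                // prints 2 (trim count + "y")
        System.out.println(text(decode(null, source, encoded))); // prints "study"
    }
}
```

In this sketch, many word pairs that share a long prefix collapse to the same short code (for example, a trim count followed by a one-letter suffix), which is the kind of repetition the description above refers to.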

  • Method Details

    • encode

      ByteBuffer encode(ByteBuffer reuse, ByteBuffer source, ByteBuffer target)
      Encodes target relative to source, optionally reusing the provided ByteBuffer.
      Parameters:
      reuse - A ByteBuffer to reuse for the result; a new one is allocated if it does not have enough remaining space.
      source - The source byte sequence.
      target - The target byte sequence to encode relative to source.
      Returns:
      The ByteBuffer containing the encoded target.
    • decode

      ByteBuffer decode(ByteBuffer reuse, ByteBuffer source, ByteBuffer encoded)
      Decodes encoded relative to source, optionally reusing the provided ByteBuffer.
      Parameters:
      reuse - A ByteBuffer to reuse for the result; a new one is allocated if it does not have enough remaining space.
      source - The source byte sequence.
      encoded - The previously encoded byte sequence.
      Returns:
      The ByteBuffer containing the decoded target.
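
      Because either method may return a newly allocated buffer rather than reuse, callers should always continue with the returned ByteBuffer. The sketch below illustrates that calling pattern. It assumes a local stand-in interface (SequenceEncoder) that mirrors the two signatures above and a hypothetical pass-through CopyEncoder; neither name comes from the library, and the capacity-based reuse check is a simplification.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration of the "reuse" calling pattern; names are not from the library.
public class ReuseContractSketch {
    /** Local stand-in mirroring the encode/decode signatures documented above. */
    interface SequenceEncoder {
        ByteBuffer encode(ByteBuffer reuse, ByteBuffer source, ByteBuffer target);
        ByteBuffer decode(ByteBuffer reuse, ByteBuffer source, ByteBuffer encoded);
    }

    /** Pass-through encoder: the encoded form is simply a copy of the target. */
    static final class CopyEncoder implements SequenceEncoder {
        @Override
        public ByteBuffer encode(ByteBuffer reuse, ByteBuffer source, ByteBuffer target) {
            return copy(reuse, target);
        }

        @Override
        public ByteBuffer decode(ByteBuffer reuse, ByteBuffer source, ByteBuffer encoded) {
            return copy(reuse, encoded);
        }

        private static ByteBuffer copy(ByteBuffer reuse, ByteBuffer data) {
            int needed = data.remaining();
            if (reuse == null || reuse.capacity() < needed) {
                reuse = ByteBuffer.allocate(needed); // too small: hand back a fresh buffer
            }
            reuse.clear();
            reuse.put(data.duplicate()); // duplicate() leaves the caller's position untouched
            reuse.flip();
            return reuse;
        }
    }

    public static void main(String[] args) {
        SequenceEncoder encoder = new CopyEncoder();
        ByteBuffer source = ByteBuffer.wrap(new byte[] {1, 2, 3});
        ByteBuffer scratch = ByteBuffer.allocate(2); // deliberately too small for the first call

        ByteBuffer first = encoder.encode(scratch, source, ByteBuffer.wrap(new byte[] {1, 2, 3, 4}));
        System.out.println(first == scratch); // prints false: a new buffer was allocated

        scratch = first; // keep whatever came back and offer it again
        ByteBuffer second = encoder.encode(scratch, source, ByteBuffer.wrap(new byte[] {9}));
        System.out.println(second == scratch); // prints true: the buffer was reused
    }
}
```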
    • prefixBytes

      @Deprecated int prefixBytes()
      Deprecated.
      The number of prefix bytes in the encoded form that should be ignored (needed for separator lookup). This is an ugly workaround for GH-85. The proper fix is to know in advance whether the dictionary contains tags, so that the separator can be scanned for right-to-left.
      See Also:
      • "https://github.com/morfologik/morfologik-stemming/issues/85"