Package com.ibm.icu.impl
Class Normalizer2Impl
java.lang.Object
com.ibm.icu.impl.Normalizer2Impl
Low-level implementation of the Unicode Normalization Algorithm.
For the data structure and details see the documentation at the end of
C++ normalizer2impl.h and in the design doc at
https://icu.unicode.org/design/normalization/custom
-
Nested Class Summary
Nested Classes
Modifier and Type / Class / Description
static final class
private static final class
static final class
Writable buffer that takes care of canonical ordering.
static final class
-
Field Summary
FieldsModifier and TypeFieldDescriptionprivate static final int
private static final int
private static final int
private static final int
private CodePointTrie
private ArrayList<UnicodeSet>
private int
static final int
static final int
static final int
static final int
static final int
static final int
static final int
private static final int
private VersionInfo
static final int
static final int
static final int
static final int
static final int
private String
static final int
static final int
private static final Normalizer2Impl.IsAcceptable
static final int
static final int
static final int
static final int
static final int
static final int
static final int
static final int
Mappings are comp-normalized.
static final int
Mappings are not comp-normalized but have a comp boundary before.
static final int
Mappings do not have a comp boundary before.
static final int
Mappings to the empty string.
static final int
Mappings & compositions in [minYesNo..minYesNoMappingsOnly[.
static final int
Mappings only in [minYesNoMappingsOnly..minNoNo[.
static final int
static final int
static final int
static final int
static final int
static final int
private int
static final int
static final int
static final int
static final int
private String
static final int
static final int
private int
private int
private int
private int
private int
private int
private int
private int
private int
private int
private CodePointTrie.Fast16
static final int
private static final CodePointMap.ValueFilter
private byte[]
-
Constructor Summary
Constructors -
Method Summary
Modifier and TypeMethodDescriptionvoid
private void
addComposites
(int list, UnicodeSet set) void
addLcccChars
(UnicodeSet set) void
private void
addToStartSet
(MutableCodePointTrie mutableTrie, int origin, int decompLead) private static int
Finds the recomposition result for a forward-combining "lead" character, specified with a pointer to its compositions list, and a backward-combining "trail" character.boolean
compose
(CharSequence s, int src, int limit, boolean onlyContiguous, boolean doCompose, Normalizer2Impl.ReorderingBuffer buffer) void
composeAndAppend
(CharSequence s, boolean doCompose, boolean onlyContiguous, Normalizer2Impl.ReorderingBuffer buffer) int
composePair
(int a, int b) int
composeQuickCheck
(CharSequence s, int src, int limit, boolean onlyContiguous, boolean doSpan) Very similar to compose(): Make the same changes in both places if relevant.private void
decompose
(int c, int norm16, Normalizer2Impl.ReorderingBuffer buffer) int
decompose
(CharSequence s, int src, int limit, Normalizer2Impl.ReorderingBuffer buffer) void
decompose
(CharSequence s, int src, int limit, StringBuilder dest, int destLengthEstimate) Decomposes s[src, limit[ and writes the result to dest.decompose
(CharSequence s, StringBuilder dest) void
decomposeAndAppend
(CharSequence s, boolean doDecompose, Normalizer2Impl.ReorderingBuffer buffer) private int
decomposeShort
(CharSequence s, int src, int limit, boolean stopAtCompBoundary, boolean onlyContiguous, Normalizer2Impl.ReorderingBuffer buffer) Builds the canonical-iterator data for this instance.private int
findNextCompBoundary
(CharSequence s, int p, int limit, boolean onlyContiguous) private int
findNextFCDBoundary
(CharSequence s, int p, int limit) private int
findPreviousCompBoundary
(CharSequence s, int p, boolean onlyContiguous) private int
findPreviousFCDBoundary
(CharSequence s, int p) boolean
getCanonStartSet
(int c, UnicodeSet set) Returns true if there are characters whose decomposition starts with c.int
getCC
(int norm16) private int
getCCFromNoNo
(int norm16) static int
getCCFromNormalYesOrMaybe
(int norm16) static int
getCCFromYesOrMaybe
(int norm16) int
getCCFromYesOrMaybeCP
(int c) private int
getCompositionsList
(int norm16) private int
getCompositionsListForComposite
(int norm16) private int
getCompositionsListForDecompYes
(int norm16) private int
getCompositionsListForMaybe
(int norm16) int
getCompQuickCheck
(int norm16) getDecomposition
(int c) Gets the decomposition for one code point.int
getFCD16
(int c) Returns the FCD data for code point c.int
getFCD16FromNormData
(int c) Gets the FCD value from the regular normalization data.int
getNorm16
(int c) private int
getPreviousTrailCC
(CharSequence s, int start, int p) getRawDecomposition
(int c) Gets the raw decomposition for one code point.int
getRawNorm16
(int c) (package private) int
getTrailCCFromCompYesAndZeroCC
(int norm16) private int
boolean
hasCompBoundaryAfter
(int c, boolean onlyContiguous) private boolean
hasCompBoundaryAfter
(CharSequence s, int start, int p, boolean onlyContiguous) boolean
hasCompBoundaryBefore
(int c) private boolean
hasCompBoundaryBefore
(int c, int norm16) Does c have a composition boundary before it? True if its decomposition begins with a character that has ccc=0 && NFC_QC=Yes (isCompYesAndZeroCC()).private boolean
hasCompBoundaryBefore
(CharSequence s, int src, int limit) boolean
hasDecompBoundaryAfter
(int c) boolean
hasDecompBoundaryBefore
(int c) boolean
hasFCDBoundaryAfter
(int c) boolean
hasFCDBoundaryBefore
(int c) boolean
isAlgorithmicNoNo
(int norm16) boolean
isCanonSegmentStarter
(int c) Returns true if code point c starts a canonical-iterator string segment.boolean
isCompInert
(int c, boolean onlyContiguous) boolean
isCompNo
(int norm16) private boolean
isCompYesAndZeroCC
(int norm16) boolean
isDecompInert
(int c) private boolean
isDecompNoAlgorithmic
(int norm16) boolean
isDecompYes
(int norm16) private boolean
isDecompYesAndZeroCC
(int norm16) boolean
isFCDInert
(int c) private boolean
isHangulLV
(int norm16) private boolean
isHangulLVT
(int norm16) private static boolean
isInert
(int norm16) private static boolean
isJamoL
(int norm16) private static boolean
isJamoVT
(int norm16) private boolean
isMaybe
(int norm16) private boolean
isMaybeOrNonZeroCC
(int norm16) private boolean
isMostDecompYesAndZeroCC
(int norm16) A little faster and simpler than isDecompYesAndZeroCC() but does not include the MaybeYes which combine-forward and have ccc=0.private boolean
isTrailCC01ForCompBoundaryAfter
(int norm16) For FCC: Given norm16 HAS_COMP_BOUNDARY_AFTER, does it have tccc<=1?load
(ByteBuffer bytes) int
makeFCD
(CharSequence s, int src, int limit, Normalizer2Impl.ReorderingBuffer buffer) void
makeFCDAndAppend
(CharSequence s, boolean doMakeFCD, Normalizer2Impl.ReorderingBuffer buffer) private int
mapAlgorithmic
(int c, int norm16) private boolean
norm16HasCompBoundaryAfter
(int norm16, boolean onlyContiguous) private boolean
norm16HasCompBoundaryBefore
(int norm16) boolean
norm16HasDecompBoundaryAfter
(int norm16) boolean
norm16HasDecompBoundaryBefore
(int norm16) private void
recompose
(Normalizer2Impl.ReorderingBuffer buffer, int recomposeStartIndex, boolean onlyContiguous) boolean
singleLeadMightHaveNonZeroFCD16
(int lead) Returns true if the single-or-lead code unit lead might have non-zero FCD data.
-
Field Details
-
IS_ACCEPTABLE
-
DATA_FORMAT
private static final int DATA_FORMAT
-
segmentStarterMapper
-
MIN_YES_YES_WITH_CC
public static final int MIN_YES_YES_WITH_CC
-
JAMO_VT
public static final int JAMO_VT
-
MIN_NORMAL_MAYBE_YES
public static final int MIN_NORMAL_MAYBE_YES
-
JAMO_L
public static final int JAMO_L
-
INERT
public static final int INERT
-
HAS_COMP_BOUNDARY_AFTER
public static final int HAS_COMP_BOUNDARY_AFTER
-
OFFSET_SHIFT
public static final int OFFSET_SHIFT
-
DELTA_TCCC_0
public static final int DELTA_TCCC_0
-
DELTA_TCCC_1
public static final int DELTA_TCCC_1
-
DELTA_TCCC_GT_1
public static final int DELTA_TCCC_GT_1
-
DELTA_TCCC_MASK
public static final int DELTA_TCCC_MASK
-
DELTA_SHIFT
public static final int DELTA_SHIFT
-
MAX_DELTA
public static final int MAX_DELTA
-
IX_NORM_TRIE_OFFSET
public static final int IX_NORM_TRIE_OFFSET
-
IX_EXTRA_DATA_OFFSET
public static final int IX_EXTRA_DATA_OFFSET
-
IX_SMALL_FCD_OFFSET
public static final int IX_SMALL_FCD_OFFSET
-
IX_RESERVED3_OFFSET
public static final int IX_RESERVED3_OFFSET
-
IX_TOTAL_SIZE
public static final int IX_TOTAL_SIZE
-
IX_MIN_DECOMP_NO_CP
public static final int IX_MIN_DECOMP_NO_CP
-
IX_MIN_COMP_NO_MAYBE_CP
public static final int IX_MIN_COMP_NO_MAYBE_CP
-
IX_MIN_YES_NO
public static final int IX_MIN_YES_NO
Mappings & compositions in [minYesNo..minYesNoMappingsOnly[.
-
IX_MIN_NO_NO
public static final int IX_MIN_NO_NO
Mappings are comp-normalized.
-
IX_LIMIT_NO_NO
public static final int IX_LIMIT_NO_NO
-
IX_MIN_MAYBE_YES
public static final int IX_MIN_MAYBE_YES
-
IX_MIN_YES_NO_MAPPINGS_ONLY
public static final int IX_MIN_YES_NO_MAPPINGS_ONLY
Mappings only in [minYesNoMappingsOnly..minNoNo[.
-
IX_MIN_NO_NO_COMP_BOUNDARY_BEFORE
public static final int IX_MIN_NO_NO_COMP_BOUNDARY_BEFORE
Mappings are not comp-normalized but have a comp boundary before.
-
IX_MIN_NO_NO_COMP_NO_MAYBE_CC
public static final int IX_MIN_NO_NO_COMP_NO_MAYBE_CC
Mappings do not have a comp boundary before.
-
IX_MIN_NO_NO_EMPTY
public static final int IX_MIN_NO_NO_EMPTY
Mappings to the empty string.
-
IX_MIN_LCCC_CP
public static final int IX_MIN_LCCC_CP
-
IX_COUNT
public static final int IX_COUNT
-
MAPPING_HAS_CCC_LCCC_WORD
public static final int MAPPING_HAS_CCC_LCCC_WORD
-
MAPPING_HAS_RAW_MAPPING
public static final int MAPPING_HAS_RAW_MAPPING
-
MAPPING_LENGTH_MASK
public static final int MAPPING_LENGTH_MASK
-
COMP_1_LAST_TUPLE
public static final int COMP_1_LAST_TUPLE
-
COMP_1_TRIPLE
public static final int COMP_1_TRIPLE
-
COMP_1_TRAIL_LIMIT
public static final int COMP_1_TRAIL_LIMIT
-
COMP_1_TRAIL_MASK
public static final int COMP_1_TRAIL_MASK
-
COMP_1_TRAIL_SHIFT
public static final int COMP_1_TRAIL_SHIFT
-
COMP_2_TRAIL_SHIFT
public static final int COMP_2_TRAIL_SHIFT
-
COMP_2_TRAIL_MASK
public static final int COMP_2_TRAIL_MASK
-
dataVersion
-
minDecompNoCP
private int minDecompNoCP -
minCompNoMaybeCP
private int minCompNoMaybeCP -
minLcccCP
private int minLcccCP -
minYesNo
private int minYesNo -
minYesNoMappingsOnly
private int minYesNoMappingsOnly -
minNoNo
private int minNoNo -
minNoNoCompBoundaryBefore
private int minNoNoCompBoundaryBefore -
minNoNoCompNoMaybeCC
private int minNoNoCompNoMaybeCC -
minNoNoEmpty
private int minNoNoEmpty -
limitNoNo
private int limitNoNo -
centerNoNoDelta
private int centerNoNoDelta -
minMaybeYes
private int minMaybeYes -
normTrie
-
maybeYesCompositions
-
extraData
-
smallFCD
private byte[] smallFCD -
canonIterData
-
canonStartSets
-
CANON_NOT_SEGMENT_STARTER
private static final int CANON_NOT_SEGMENT_STARTER
-
CANON_HAS_COMPOSITIONS
private static final int CANON_HAS_COMPOSITIONS
-
CANON_HAS_SET
private static final int CANON_HAS_SET
-
CANON_VALUE_MASK
private static final int CANON_VALUE_MASK
-
-
Constructor Details
-
Normalizer2Impl
public Normalizer2Impl()
-
-
Method Details
-
load
-
load
-
addLcccChars
-
addPropertyStarts
-
addCanonIterPropertyStarts
-
ensureCanonIterData
Builds the canonical-iterator data for this instance. This must be called before isCanonSegmentStarter(int) or getCanonStartSet(int, UnicodeSet); otherwise those methods crash.
Returns:
this
-
getNorm16
public int getNorm16(int c) -
getRawNorm16
public int getRawNorm16(int c) -
getCompQuickCheck
public int getCompQuickCheck(int norm16) -
isAlgorithmicNoNo
public boolean isAlgorithmicNoNo(int norm16) -
isCompNo
public boolean isCompNo(int norm16) -
isDecompYes
public boolean isDecompYes(int norm16) -
getCC
public int getCC(int norm16) -
getCCFromNormalYesOrMaybe
public static int getCCFromNormalYesOrMaybe(int norm16) -
getCCFromYesOrMaybe
public static int getCCFromYesOrMaybe(int norm16) -
getCCFromYesOrMaybeCP
public int getCCFromYesOrMaybeCP(int c) -
getFCD16
public int getFCD16(int c)
Returns the FCD data for code point c.
Parameters:
c - A Unicode code point.
Returns:
The lccc(c) in bits 15..8 and tccc(c) in bits 7..0.
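The packed return value can be split with a shift and a mask. The following is a minimal sketch of that bit layout only; the class `Fcd16Demo` and its helpers are hypothetical names and are not part of `Normalizer2Impl`, and the sample value is illustrative rather than read from ICU data.

```java
/** Sketch: unpack a 16-bit FCD value into lead and trail combining classes. */
final class Fcd16Demo {
    /** Lead combining class (lccc), stored in bits 15..8. */
    static int leadCC(int fcd16) {
        return (fcd16 >> 8) & 0xff;
    }

    /** Trail combining class (tccc), stored in bits 7..0. */
    static int trailCC(int fcd16) {
        return fcd16 & 0xff;
    }

    /** Hypothetical helper: packs two combining classes into the described layout. */
    static int pack(int lccc, int tccc) {
        return (lccc << 8) | tccc;
    }

    public static void main(String[] args) {
        // Illustrative value only: lccc = 0, tccc = 230.
        int fcd16 = pack(0, 230);
        System.out.println("lccc=" + leadCC(fcd16) + " tccc=" + trailCC(fcd16));
    }
}
```

A caller that only needs to know whether a code point can interact with its neighbors can test the whole value against zero before unpacking either half.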
-
singleLeadMightHaveNonZeroFCD16
public boolean singleLeadMightHaveNonZeroFCD16(int lead)
Returns true if the single-or-lead code unit lead might have non-zero FCD data.
-
getFCD16FromNormData
public int getFCD16FromNormData(int c)
Gets the FCD value from the regular normalization data.
-
getDecomposition
Gets the decomposition for one code point.
Parameters:
c - code point
Returns:
c's decomposition, if it has one; returns null if it does not have a decomposition
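The "decomposition or null" contract can be illustrated with the JDK alone. This is an analogy, not a call into `Normalizer2Impl`: the sketch assumes canonical (NFD) data and uses `java.text.Normalizer`, and the helper name `decompositionOrNull` is hypothetical.

```java
import java.text.Normalizer;

/** Sketch: mimic a "decomposition or null" lookup using the JDK normalizer. */
final class DecompositionDemo {
    /**
     * Returns the canonical decomposition of code point c,
     * or null if c is unchanged by NFD (that is, it has no decomposition).
     */
    static String decompositionOrNull(int c) {
        String s = new String(Character.toChars(c));
        String nfd = Normalizer.normalize(s, Normalizer.Form.NFD);
        return nfd.equals(s) ? null : nfd;
    }

    public static void main(String[] args) {
        // U+00E9 decomposes to U+0065 U+0301; U+0061 has no decomposition.
        String d = decompositionOrNull(0x00E9);
        System.out.println(d == null ? "null" : d.length() + " code units");
        System.out.println(decompositionOrNull('a'));
    }
}
```

Returning null instead of a copy of the input lets callers skip unchanged code points without comparing strings.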
-
getRawDecomposition
Gets the raw decomposition for one code point.
Parameters:
c - code point
Returns:
c's raw decomposition, if it has one; returns null if it does not have a decomposition
-
isCanonSegmentStarter
public boolean isCanonSegmentStarter(int c)
Returns true if code point c starts a canonical-iterator string segment. ensureCanonIterData() must have been called before this method, or else this method will crash.
Parameters:
c - A Unicode code point.
Returns:
true if c starts a canonical-iterator string segment.
-
getCanonStartSet
Returns true if there are characters whose decomposition starts with c. If so, then the set is cleared and then filled with those characters. ensureCanonIterData() must have been called before this method, or else this method will crash.
Parameters:
c - A Unicode code point.
set - A UnicodeSet to receive the characters whose decompositions start with c, if there are any.
Returns:
true if there are characters whose decomposition starts with c.
-
decompose
-
decompose
public void decompose(CharSequence s, int src, int limit, StringBuilder dest, int destLengthEstimate)
Decomposes s[src, limit[ and writes the result to dest. destLengthEstimate is the initial dest buffer capacity and can be -1. The remark that limit can be NULL if src is NUL-terminated is inherited from the C++ API; in this Java signature limit is an int index and cannot be null.
-
decompose
-
decomposeAndAppend
public void decomposeAndAppend(CharSequence s, boolean doDecompose, Normalizer2Impl.ReorderingBuffer buffer) -
compose
public boolean compose(CharSequence s, int src, int limit, boolean onlyContiguous, boolean doCompose, Normalizer2Impl.ReorderingBuffer buffer) -
composeQuickCheck
public int composeQuickCheck(CharSequence s, int src, int limit, boolean onlyContiguous, boolean doSpan)
Very similar to compose(): Make the same changes in both places if relevant.
doSpan: spanQuickCheckYes (ignore bit 0 of the return value)
!doSpan: quickCheck
Returns:
bits 31..1: spanQuickCheckYes (==s.length() if "yes") and bit 0: set if "maybe"; otherwise, if the span length<s.length() then the quick check result is "no"
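The return value packs a span length and a "maybe" flag into one int. A minimal sketch of how a caller might decode it, following only the bit layout described above; `QuickCheckResultDemo` and its methods are hypothetical names, and the string results stand in for whatever quick-check type the caller uses.

```java
/** Sketch: decode the packed composeQuickCheck() result. */
final class QuickCheckResultDemo {
    /** Span length (spanQuickCheckYes), stored in bits 31..1. */
    static int spanLength(int result) {
        return result >>> 1;
    }

    /** Bit 0 is set when the quick check answer is "maybe". */
    static boolean isMaybe(int result) {
        return (result & 1) != 0;
    }

    /** Maps a packed result and the input length to "yes", "maybe", or "no". */
    static String interpret(int result, int length) {
        if (isMaybe(result)) {
            return "maybe";
        }
        // Without the maybe bit, a span shorter than the input means "no".
        return spanLength(result) == length ? "yes" : "no";
    }

    public static void main(String[] args) {
        // Illustrative packed values for an input of length 5.
        System.out.println(interpret(5 << 1, 5));
        System.out.println(interpret((3 << 1) | 1, 5));
        System.out.println(interpret(3 << 1, 5));
    }
}
```

When doSpan is true, only `spanLength` is meaningful, since bit 0 is documented as ignorable in that mode.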
-
composeAndAppend
public void composeAndAppend(CharSequence s, boolean doCompose, boolean onlyContiguous, Normalizer2Impl.ReorderingBuffer buffer) -
makeFCD
-
makeFCDAndAppend
public void makeFCDAndAppend(CharSequence s, boolean doMakeFCD, Normalizer2Impl.ReorderingBuffer buffer) -
hasDecompBoundaryBefore
public boolean hasDecompBoundaryBefore(int c) -
norm16HasDecompBoundaryBefore
public boolean norm16HasDecompBoundaryBefore(int norm16) -
hasDecompBoundaryAfter
public boolean hasDecompBoundaryAfter(int c) -
norm16HasDecompBoundaryAfter
public boolean norm16HasDecompBoundaryAfter(int norm16) -
isDecompInert
public boolean isDecompInert(int c) -
hasCompBoundaryBefore
public boolean hasCompBoundaryBefore(int c) -
hasCompBoundaryAfter
public boolean hasCompBoundaryAfter(int c, boolean onlyContiguous) -
isCompInert
public boolean isCompInert(int c, boolean onlyContiguous) -
hasFCDBoundaryBefore
public boolean hasFCDBoundaryBefore(int c) -
hasFCDBoundaryAfter
public boolean hasFCDBoundaryAfter(int c) -
isFCDInert
public boolean isFCDInert(int c) -
isMaybe
private boolean isMaybe(int norm16) -
isMaybeOrNonZeroCC
private boolean isMaybeOrNonZeroCC(int norm16) -
isInert
private static boolean isInert(int norm16) -
isJamoL
private static boolean isJamoL(int norm16) -
isJamoVT
private static boolean isJamoVT(int norm16) -
hangulLVT
private int hangulLVT() -
isHangulLV
private boolean isHangulLV(int norm16) -
isHangulLVT
private boolean isHangulLVT(int norm16) -
isCompYesAndZeroCC
private boolean isCompYesAndZeroCC(int norm16) -
isDecompYesAndZeroCC
private boolean isDecompYesAndZeroCC(int norm16) -
isMostDecompYesAndZeroCC
private boolean isMostDecompYesAndZeroCC(int norm16)
A little faster and simpler than isDecompYesAndZeroCC() but does not include the MaybeYes which combine-forward and have ccc=0. (Standard Unicode 10 normalization does not have such characters.)
-
isDecompNoAlgorithmic
private boolean isDecompNoAlgorithmic(int norm16) -
getCCFromNoNo
private int getCCFromNoNo(int norm16) -
getTrailCCFromCompYesAndZeroCC
int getTrailCCFromCompYesAndZeroCC(int norm16) -
mapAlgorithmic
private int mapAlgorithmic(int c, int norm16) -
getCompositionsListForDecompYes
private int getCompositionsListForDecompYes(int norm16)
Returns:
index into maybeYesCompositions, or -1
-
getCompositionsListForComposite
private int getCompositionsListForComposite(int norm16)
Returns:
index into maybeYesCompositions
-
getCompositionsListForMaybe
private int getCompositionsListForMaybe(int norm16) -
getCompositionsList
private int getCompositionsList(int norm16)
Parameters:
norm16 - norm16 value of a code point that must have compositions
Returns:
index into maybeYesCompositions
-
decomposeShort
private int decomposeShort(CharSequence s, int src, int limit, boolean stopAtCompBoundary, boolean onlyContiguous, Normalizer2Impl.ReorderingBuffer buffer) -
decompose
-
combine
Finds the recomposition result for a forward-combining "lead" character, specified with a pointer to its compositions list, and a backward-combining "trail" character.
If the lead and trail characters combine, then this function returns the following "compositeAndFwd" value:
Bits 21..1 composite character
Bit 0 set if the composite is a forward-combining starter
Otherwise it returns -1.
The compositions list has (trail, compositeAndFwd) pair entries, encoded as either pairs or triples of 16-bit units. The last entry has the high bit of its first unit set.
The list is sorted by ascending trail characters (there are no duplicates). A linear search is used.
See normalizer2impl.h for a more detailed description of the compositions list format.
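The lookup and the "compositeAndFwd" value can be sketched as follows. This is a simplified illustration under stated assumptions: the list here is a flat int array of (trail, compositeAndFwd) pairs, which is NOT the real 16-bit pair/triple encoding documented in normalizer2impl.h, and the class `CompositionsListDemo`, its helpers, and the sample entries are hypothetical.

```java
/** Sketch: linear search over a sorted (trail, compositeAndFwd) list. */
final class CompositionsListDemo {
    /** Packs a composite (bits 21..1) with its forward-combining flag (bit 0). */
    static int pack(int composite, boolean combinesForward) {
        return (composite << 1) | (combinesForward ? 1 : 0);
    }

    /** Extracts the composite code point from a compositeAndFwd value. */
    static int composite(int compositeAndFwd) {
        return compositeAndFwd >> 1;
    }

    /** True if the composite is itself a forward-combining starter. */
    static boolean combinesForward(int compositeAndFwd) {
        return (compositeAndFwd & 1) != 0;
    }

    /**
     * Simplified list layout (an assumption for this sketch):
     * {trail0, value0, trail1, value1, ...}, sorted by ascending trail.
     * Returns the compositeAndFwd value, or -1 if lead and trail do not combine.
     */
    static int combine(int[] list, int trail) {
        for (int i = 0; i < list.length; i += 2) {
            if (list[i] == trail) {
                return list[i + 1];
            }
            if (list[i] > trail) {
                break; // sorted list: the trail cannot appear later
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Illustrative entries for a lead 'A': grave -> U+00C0, ring above -> U+00C5.
        int[] list = {
            0x0300, pack(0x00C0, false),
            0x030A, pack(0x00C5, true),
        };
        int v = combine(list, 0x030A);
        System.out.println(Integer.toHexString(composite(v)) + " fwd=" + combinesForward(v));
        System.out.println(combine(list, 0x0301));
    }
}
```

Because the list is sorted and free of duplicates, the search can stop at the first trail greater than the one requested.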
-
addComposites
Parameters:
list - some character's compositions list
set - recursively receives the composites from these compositions
-
recompose
private void recompose(Normalizer2Impl.ReorderingBuffer buffer, int recomposeStartIndex, boolean onlyContiguous) -
composePair
public int composePair(int a, int b) -
hasCompBoundaryBefore
private boolean hasCompBoundaryBefore(int c, int norm16)
Does c have a composition boundary before it? True if its decomposition begins with a character that has ccc=0 && NFC_QC=Yes (isCompYesAndZeroCC()). As a shortcut, this is true if c itself has ccc=0 && NFC_QC=Yes (isCompYesAndZeroCC()) so we need not decompose.
-
norm16HasCompBoundaryBefore
private boolean norm16HasCompBoundaryBefore(int norm16) -
hasCompBoundaryBefore
-
norm16HasCompBoundaryAfter
private boolean norm16HasCompBoundaryAfter(int norm16, boolean onlyContiguous) -
hasCompBoundaryAfter
-
isTrailCC01ForCompBoundaryAfter
private boolean isTrailCC01ForCompBoundaryAfter(int norm16)
For FCC: Given a norm16 value with HAS_COMP_BOUNDARY_AFTER set, does it have tccc<=1?
-
findPreviousCompBoundary
-
findNextCompBoundary
-
findPreviousFCDBoundary
-
findNextFCDBoundary
-
getPreviousTrailCC
-
addToStartSet
-