Class ScriptIterator
java.lang.Object
org.apache.lucene.analysis.icu.segmentation.ScriptIterator
An iterator that locates ISO 15924 script boundaries in text.
This is not the same as simply looking at the Unicode block, or even the Script property. Some characters are 'common' across multiple scripts, and some 'inherit' the script value of text surrounding them.
This is similar to ICU (internal-only) UScriptRun, with the following differences:
- Doesn't attempt to match paired punctuation. For tokenization purposes, this is not necessary. It's also quite expensive.
- Non-spacing marks inherit the script of their base character, following recommendations from UTR #24.
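As a rough illustration of why a per-character Script lookup is not enough, the sketch below prints the raw Script property of three code points. It uses the JDK's java.lang.Character.UnicodeScript as a dependency-free stand-in for ICU's UScript, which is an assumption: the two APIs are not identical. The class name ScriptPropertyDemo is hypothetical.

```java
import java.lang.Character.UnicodeScript;

final class ScriptPropertyDemo {
    /** Raw Script property of a code point, looked up with the JDK rather than ICU. */
    static UnicodeScript rawScript(int codepoint) {
        return UnicodeScript.of(codepoint);
    }

    public static void main(String[] args) {
        // A letter has a concrete script.
        System.out.println(rawScript('a'));     // LATIN
        // Punctuation and digits are shared across scripts.
        System.out.println(rawScript('!'));     // COMMON
        // A combining acute accent takes the script of its base character.
        System.out.println(rawScript(0x0301));  // INHERITED
    }
}
```

A script-run iterator has to resolve the COMMON and INHERITED values from context, which is what separates it from a plain property lookup.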
Field Summary
Fields
Modifier and Type            Field         Description
private static final int[]   basicLatin    Linear fast path for the Basic Latin case
private final boolean        combineCJ
private int                  index
private int                  limit
private int                  scriptCode
private int                  scriptLimit
private int                  scriptStart
private int                  start
private char[]               text
Constructor Summary
Constructors
Modifier             Constructor
(package private)    ScriptIterator(boolean combineCJ)
Method Summary
Modifier and Type           Method                                                       Description
private int                 getScript(int codepoint)                                     Fast version of UScript.getScript().
(package private) int       getScriptCode()                                              Get the UScript script code for this script run
(package private) int       getScriptLimit()                                             Get the index of the first character after the end of this script run
(package private) int       getScriptStart()                                             Get the start of this script run
private static boolean      isCombiningMark(int codepoint)                               Determine if codepoint is a combining mark (General_Category of Mc, Mn, Me)
private static boolean      isSameScript(int currentScript, int script, int codepoint)   Determine if two scripts are compatible.
(package private) boolean   next()                                                       Iterates to the next script run, returning true if one exists.
(package private) void      setText(char[] text, int start, int length)                  Set a new region of text to be examined by this iterator
Field Details

text
private char[] text

start
private int start

limit
private int limit

index
private int index

scriptStart
private int scriptStart

scriptLimit
private int scriptLimit

scriptCode
private int scriptCode

combineCJ
private final boolean combineCJ

basicLatin
private static final int[] basicLatin
Linear fast path for the Basic Latin case
Constructor Details

ScriptIterator
ScriptIterator(boolean combineCJ)
Parameters:
combineCJ - if true, Han, Hiragana, and Katakana are all returned as UScript.JAPANESE
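The effect of combineCJ can be sketched as a mapping from a code point to an ISO 15924 code, where "Jpan" is the ISO 15924 code for Han + Hiragana + Katakana. This is a minimal sketch built on the JDK's Character.UnicodeScript, not the class's actual implementation; the class name CombineCJSketch, the method scriptCode, and the fallback label are hypothetical.

```java
import java.lang.Character.UnicodeScript;

final class CombineCJSketch {
    /**
     * Returns an ISO 15924 four-letter code for the three Japanese-related scripts.
     * When combineCJ is true, Han, Hiragana, and Katakana are all reported as "Jpan".
     */
    static String scriptCode(int codepoint, boolean combineCJ) {
        UnicodeScript script = UnicodeScript.of(codepoint);
        switch (script) {
            case HAN:
                return combineCJ ? "Jpan" : "Hani";
            case HIRAGANA:
                return combineCJ ? "Jpan" : "Hira";
            case KATAKANA:
                return combineCJ ? "Jpan" : "Kana";
            default:
                // Hypothetical fallback: the JDK enum name, which is not an ISO 15924 code.
                return script.name();
        }
    }
}
```

With combineCJ set, a sentence that mixes kanji and kana forms a single run instead of breaking at every Han/Hiragana/Katakana transition.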
Method Details

getScriptStart
int getScriptStart()
Get the start of this script run
Returns:
start position of script run

getScriptLimit
int getScriptLimit()
Get the index of the first character after the end of this script run
Returns:
position of the first character after this script run

getScriptCode
int getScriptCode()
Get the UScript script code for this script run
Returns:
code for the script of the current run

next
boolean next()
Iterates to the next script run, returning true if one exists.
Returns:
true if there is another script run, false otherwise.
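Because ScriptIterator is package private, it cannot be called from application code. The stand-in below sketches the same setText / next / getter protocol so the calling pattern is visible. It is an assumption-laden approximation: it uses the JDK's Character.UnicodeScript instead of ICU's UScript, it lets COMMON and INHERITED characters join whatever run they appear in, and the names ScriptRunSketch and describe are hypothetical.

```java
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.List;

final class ScriptRunSketch {
    private char[] text;
    private int index;
    private int limit;
    private int scriptStart;
    private int scriptLimit;
    private UnicodeScript scriptCode;

    /** Set a new region of text to examine. */
    void setText(char[] text, int start, int length) {
        this.text = text;
        this.index = start;
        this.limit = start + length;
        this.scriptStart = start;
        this.scriptLimit = start;
        this.scriptCode = null;
    }

    /** Advance to the next run; returns false once the region is exhausted. */
    boolean next() {
        if (scriptLimit >= limit) {
            return false;
        }
        scriptCode = UnicodeScript.COMMON; // unresolved until a concrete script is seen
        scriptStart = scriptLimit;
        while (index < limit) {
            int cp = Character.codePointAt(text, index, limit);
            UnicodeScript sc = UnicodeScript.of(cp);
            if (!compatible(scriptCode, sc)) {
                break; // a different concrete script starts the next run
            }
            if (generic(scriptCode) && !generic(sc)) {
                scriptCode = sc; // the first concrete script names the run
            }
            index += Character.charCount(cp);
        }
        scriptLimit = index;
        return true;
    }

    int getScriptStart() { return scriptStart; }
    int getScriptLimit() { return scriptLimit; }
    UnicodeScript getScriptCode() { return scriptCode; }

    private static boolean generic(UnicodeScript s) {
        return s == UnicodeScript.COMMON || s == UnicodeScript.INHERITED;
    }

    private static boolean compatible(UnicodeScript a, UnicodeScript b) {
        return a == b || generic(a) || generic(b);
    }

    /** Typical calling pattern: set the text once, then loop on next(). */
    static List<String> describe(String s) {
        ScriptRunSketch it = new ScriptRunSketch();
        char[] buf = s.toCharArray();
        it.setText(buf, 0, buf.length);
        List<String> runs = new ArrayList<>();
        while (it.next()) {
            runs.add(it.getScriptCode() + "[" + it.getScriptStart() + "," + it.getScriptLimit() + ")");
        }
        return runs;
    }
}
```

In this sketch the getters are only meaningful after next() has returned true, and the limit is exclusive, so consecutive runs tile the region without gaps.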
isSameScript
private static boolean isSameScript(int currentScript, int script, int codepoint)
Determine if two scripts are compatible.
isCombiningMark
private static boolean isCombiningMark(int codepoint)
Determine if codepoint is a combining mark (General_Category of Mc, Mn, Me)
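The General_Category test described above can be approximated with the JDK alone. The following is a sketch of that check using java.lang.Character.getType, not the class's ICU-based code; the class name CombiningMarkSketch is hypothetical.

```java
final class CombiningMarkSketch {
    /** True if the code point's General_Category is Mn, Me, or Mc. */
    static boolean isCombiningMark(int codepoint) {
        int type = Character.getType(codepoint);
        return type == Character.NON_SPACING_MARK          // Mn
            || type == Character.ENCLOSING_MARK            // Me
            || type == Character.COMBINING_SPACING_MARK;   // Mc
    }
}
```

For example, U+0301 (combining acute accent, Mn), U+20DD (combining enclosing circle, Me), and U+0903 (Devanagari sign visarga, Mc) all qualify, while a plain letter such as 'a' does not.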
setText
void setText(char[] text, int start, int length)
Set a new region of text to be examined by this iterator
Parameters:
text - text buffer to examine
start - offset into buffer
length - maximum length to examine
getScript
private int getScript(int codepoint)
Fast version of UScript.getScript(). Basic Latin is handled with an array lookup.
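The array-lookup fast path can be sketched as follows: precompute the script of every Basic Latin code point (U+0000 to U+007F) once, then fall back to the general lookup for everything else. This is an illustration under stated assumptions, using the JDK's Character.UnicodeScript in place of ICU's UScript; the names FastScriptLookupSketch and BASIC_LATIN are hypothetical.

```java
import java.lang.Character.UnicodeScript;

final class FastScriptLookupSketch {
    /** Precomputed scripts for the 128 Basic Latin code points. */
    private static final UnicodeScript[] BASIC_LATIN = new UnicodeScript[128];

    static {
        for (int i = 0; i < BASIC_LATIN.length; i++) {
            BASIC_LATIN[i] = UnicodeScript.of(i);
        }
    }

    /** Array lookup for Basic Latin, general property lookup otherwise. */
    static UnicodeScript getScript(int codepoint) {
        if (codepoint >= 0 && codepoint < BASIC_LATIN.length) {
            return BASIC_LATIN[codepoint];
        }
        return UnicodeScript.of(codepoint);
    }
}
```

The table trades 128 array slots for skipping the general property lookup on ASCII-heavy text, where most characters fall inside the table.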