Module | Mail::Multibyte::Unicode |
In: | lib/mail/multibyte/unicode.rb |
NORMALIZATION_FORMS | = | [:c, :kc, :d, :kd] | A list of all available normalization forms. See www.unicode.org/reports/tr15/tr15-29.html for more information about normalization. | |
UNICODE_VERSION | = | '5.2.0' | The Unicode version that is supported by the implementation | |
HANGUL_SBASE | = | 0xAC00 | Hangul character boundaries and properties | |
HANGUL_LBASE | = | 0x1100 | ||
HANGUL_VBASE | = | 0x1161 | ||
HANGUL_TBASE | = | 0x11A7 | ||
HANGUL_LCOUNT | = | 19 | ||
HANGUL_VCOUNT | = | 21 | ||
HANGUL_TCOUNT | = | 28 | ||
HANGUL_NCOUNT | = | HANGUL_VCOUNT * HANGUL_TCOUNT | ||
HANGUL_SCOUNT | = | 11172 | ||
HANGUL_SLAST | = | HANGUL_SBASE + HANGUL_SCOUNT | ||
HANGUL_JAMO_FIRST | = | 0x1100 | ||
HANGUL_JAMO_LAST | = | 0x11FF | ||
WHITESPACE | = | [
  (0x0009..0x000D).to_a, # White_Space # Cc [5] <control-0009>..<control-000D>
  0x0020, # White_Space # Zs SPACE
  0x0085, # White_Space # Cc <control-0085>
  0x00A0, # White_Space # Zs NO-BREAK SPACE
  0x1680, # White_Space # Zs OGHAM SPACE MARK
  0x180E, # White_Space # Zs MONGOLIAN VOWEL SEPARATOR
  (0x2000..0x200A).to_a, # White_Space # Zs [11] EN QUAD..HAIR SPACE
  0x2028, # White_Space # Zl LINE SEPARATOR
  0x2029, # White_Space # Zp PARAGRAPH SEPARATOR
  0x202F, # White_Space # Zs NARROW NO-BREAK SPACE
  0x205F, # White_Space # Zs MEDIUM MATHEMATICAL SPACE
  0x3000, # White_Space # Zs IDEOGRAPHIC SPACE
].flatten.freeze | All Unicode whitespace codepoints | |
LEADERS_AND_TRAILERS | = | WHITESPACE + [65279] | The BOM (byte order mark, U+FEFF) can also be treated as whitespace: it is a non-rendering character used to distinguish between little-endian and big-endian encodings. Byte order is not an issue in UTF-8, so the BOM must be ignored. | |
TRAILERS_PAT | = | /(#{codepoints_to_pattern(LEADERS_AND_TRAILERS)})+\Z/u | ||
LEADERS_PAT | = | /\A(#{codepoints_to_pattern(LEADERS_AND_TRAILERS)})+/u |
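As a rough illustration of how the two patterns above can be used, the following sketch builds equivalent regular expressions and strips leading and trailing whitespace and BOM characters. The names `codepoints_to_pattern_sketch`, `LEADERS_SKETCH`, `TRAILERS_SKETCH` and `strip_sketch` are hypothetical, and the body of the pattern helper is an assumption about what `codepoints_to_pattern` does (join the characters with `|`); it is not taken from the library source.

```ruby
# Same codepoints as the WHITESPACE constant, without the per-line comments.
WHITESPACE = [
  (0x0009..0x000D).to_a, 0x0020, 0x0085, 0x00A0, 0x1680, 0x180E,
  (0x2000..0x200A).to_a, 0x2028, 0x2029, 0x202F, 0x205F, 0x3000
].flatten.freeze

# 0xFEFF == 65279, the byte order mark.
LEADERS_AND_TRAILERS = (WHITESPACE + [0xFEFF]).freeze

# Hypothetical stand-in for codepoints_to_pattern: turn each codepoint into
# its UTF-8 character, escape it for use in a regexp, and join with "|".
def codepoints_to_pattern_sketch(codepoints)
  codepoints.map { |cp| Regexp.escape([cp].pack('U')) }.join('|')
end

ALTERNATIVES    = codepoints_to_pattern_sketch(LEADERS_AND_TRAILERS)
LEADERS_SKETCH  = /\A(#{ALTERNATIVES})+/u
TRAILERS_SKETCH = /(#{ALTERNATIVES})+\Z/u

# Remove leading and trailing whitespace/BOM; interior whitespace is kept.
def strip_sketch(string)
  string.sub(LEADERS_SKETCH, '').sub(TRAILERS_SKETCH, '')
end

strip_sketch("\uFEFF  Caf\u00E9\u3000\n") # => "Café"
```

Unlike `String#strip`, this approach also removes the BOM and characters such as IDEOGRAPHIC SPACE, because the set of characters comes from the Unicode White_Space list rather than from a fixed ASCII set.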
default_normalization_form | [RW] |
The default normalization used for operations that require normalization.
It can be set to any of the normalizations in NORMALIZATION_FORMS.
Example: Mail::Multibyte::Unicode.default_normalization_form = :c |
Unpacks the string at grapheme boundaries. Returns a list of codepoint lists, one inner list per grapheme cluster.
Example:
Unicode.g_unpack('क्षि') # => [[2325, 2381], [2359], [2367]]
Unicode.g_unpack('Café') # => [[67], [97], [102], [233]]
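For comparison, a similar result can be sketched with Ruby's built-in `String#grapheme_clusters` (Ruby 2.5 and later). This is not the library's implementation, and `g_unpack_sketch` is a hypothetical name. Cluster boundaries follow the Unicode version shipped with the running Ruby, so for some scripts (the Devanagari example above, for instance) the grouping may differ from what this module returns.

```ruby
# Hypothetical helper: split into grapheme clusters using Ruby's own rules,
# then unpack each cluster into its codepoints.
def g_unpack_sketch(string)
  string.grapheme_clusters.map { |cluster| cluster.unpack('U*') }
end

# Precomposed "é" (U+00E9) is a single codepoint in its own cluster.
g_unpack_sketch("Caf\u00E9")  # => [[67], [97], [102], [233]]

# "e" followed by COMBINING ACUTE ACCENT (U+0301) forms one cluster
# that holds two codepoints.
g_unpack_sketch("Cafe\u0301") # => [[67], [97], [102], [101, 769]]
```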
Detects whether the codepoint is in a certain character class. Returns true when it is in the specified character class and false otherwise. Valid character classes are: :cr, :lf, :l, :v, :lv, :lvt and :t.
Primarily used by the grapheme cluster support.
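To give an idea of what the :lv and :lvt classes mean, the sketch below decomposes a precomposed Hangul syllable with the HANGUL_* constants listed earlier and derives the class from whether a trailing consonant is present. It relies on the standard Unicode Hangul syllable arithmetic; the helper names are hypothetical, and how this module actually looks up character classes is not documented here, so it may well use a different mechanism.

```ruby
HANGUL_SBASE  = 0xAC00
HANGUL_LBASE  = 0x1100
HANGUL_VBASE  = 0x1161
HANGUL_TBASE  = 0x11A7
HANGUL_VCOUNT = 21
HANGUL_TCOUNT = 28
HANGUL_NCOUNT = HANGUL_VCOUNT * HANGUL_TCOUNT # 588 syllables per leading consonant
HANGUL_SCOUNT = 11172
HANGUL_SLAST  = HANGUL_SBASE + HANGUL_SCOUNT  # one past the last syllable

# Hypothetical helper: decompose a precomposed syllable into its jamo
# codepoints (leading consonant, vowel, optional trailing consonant).
# Returns nil for codepoints outside the precomposed syllable block.
def hangul_jamo_sketch(codepoint)
  return nil unless (HANGUL_SBASE...HANGUL_SLAST).cover?(codepoint)

  s_index = codepoint - HANGUL_SBASE
  l = HANGUL_LBASE + s_index / HANGUL_NCOUNT
  v = HANGUL_VBASE + (s_index % HANGUL_NCOUNT) / HANGUL_TCOUNT
  t_index = s_index % HANGUL_TCOUNT # 0 means "no trailing consonant"
  t_index.zero? ? [l, v] : [l, v, HANGUL_TBASE + t_index]
end

# Hypothetical helper: :lv for syllables without a trailing consonant,
# :lvt for syllables with one, nil for anything else.
def hangul_syllable_class_sketch(codepoint)
  jamo = hangul_jamo_sketch(codepoint)
  return nil if jamo.nil?

  jamo.length == 2 ? :lv : :lvt
end

hangul_jamo_sketch(0xD55C)           # => [0x1112, 0x1161, 0x11AB]
hangul_syllable_class_sketch(0xAC00) # => :lv
hangul_syllable_class_sketch(0xD55C) # => :lvt
```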
Returns the KC normalization of the string by default. NFKC is considered the best normalization form for passing strings to databases and validations.
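The effect of the four forms can be tried out with Ruby's built-in `String#unicode_normalize`. The sketch below is an illustration only: `normalize_sketch` and `FORM_MAP_SKETCH` are hypothetical names, the mapping from this module's form symbols to Ruby's is an assumption, and the standard library may target a different Unicode version than the 5.2.0 listed above.

```ruby
# Assumed mapping from the symbols in NORMALIZATION_FORMS to the symbols
# accepted by String#unicode_normalize.
FORM_MAP_SKETCH = { c: :nfc, kc: :nfkc, d: :nfd, kd: :nfkd }.freeze

# Hypothetical helper: normalize with KC as the default form.
def normalize_sketch(string, form = :kc)
  string.unicode_normalize(FORM_MAP_SKETCH.fetch(form))
end

# KC applies compatibility mappings: the "fi" ligature becomes two letters.
normalize_sketch("\uFB01")          # => "fi"

# C only composes canonically, so the ligature is left alone.
normalize_sketch("\uFB01", :c)      # => "\uFB01"

# C composes "e" + COMBINING ACUTE ACCENT; D decomposes it again.
normalize_sketch("Cafe\u0301", :c)  # => "Caf\u00E9"
normalize_sketch("Caf\u00E9", :d)   # => "Cafe\u0301"
```

The ligature case shows why the choice of form matters for validations: after KC normalization, visually equivalent input compares equal to its plain-letter form.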
Replaces all ISO-8859-1 or CP1252 characters with their UTF-8 equivalents, resulting in a valid UTF-8 string.
Passing true will forcibly tidy all bytes, assuming that the string's encoding is entirely CP1252 or ISO-8859-1.
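One way to get this behavior with the Ruby standard library is to let `String#scrub` find the invalid byte sequences and transcode only those from CP1252. This is a minimal sketch under that assumption, not necessarily how this module does it; `tidy_bytes_sketch` and `recode_cp1252_sketch` are hypothetical names.

```ruby
# Hypothetical helper: treat the bytes as Windows-1252 (CP1252) and
# transcode them to UTF-8. Bytes that CP1252 leaves undefined are replaced.
def recode_cp1252_sketch(bytes)
  bytes.encode(Encoding::UTF_8, Encoding::Windows_1252,
               invalid: :replace, undef: :replace)
end

# Hypothetical helper: without force, only byte sequences that are not
# valid UTF-8 are recoded; with force, every byte is read as CP1252.
def tidy_bytes_sketch(string, force = false)
  return string if string.empty? || string.ascii_only?
  return recode_cp1252_sketch(string) if force

  string.scrub { |bad| recode_cp1252_sketch(bad) }
end

# "Caf" followed by the single Latin-1 byte 0xE9 is not valid UTF-8.
broken = "Caf\xE9".b.force_encoding(Encoding::UTF_8)
tidy_bytes_sketch(broken)            # => "Café"

# Forcing a string that is already valid UTF-8 mangles it, because the two
# bytes of "é" are each read as a separate CP1252 character.
tidy_bytes_sketch("Caf\u00E9", true) # => "CafÃ©"
```

The second call shows why the forced mode should only be used when the input is known to be entirely CP1252 or ISO-8859-1.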
Unpacks the string at codepoint boundaries. Raises an EncodingError when the encoding of the string isn't valid UTF-8.
Example:
Unicode.u_unpack('Café') # => [67, 97, 102, 233]
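The documented behavior, including the error case, can be approximated with `String#unpack`. In this sketch `u_unpack_sketch` is a hypothetical name, and translating Ruby's ArgumentError into an EncodingError is an assumption about how the documented exception could be produced.

```ruby
# Hypothetical helper: unpack into codepoints, turning the ArgumentError
# that String#unpack raises on malformed UTF-8 into an EncodingError.
def u_unpack_sketch(string)
  string.unpack('U*')
rescue ArgumentError
  raise EncodingError, 'malformed UTF-8 character'
end

u_unpack_sketch("Caf\u00E9") # => [67, 97, 102, 233]

# A lone Latin-1 byte 0xE9 is not valid UTF-8, so unpacking fails.
begin
  u_unpack_sketch("Caf\xE9".b.force_encoding(Encoding::UTF_8))
rescue EncodingError => e
  warn "could not unpack: #{e.message}"
end
```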