public class ChineseUtils
extends java.lang.Object
Modifier and Type | Field and Description |
---|---|
static int | ASCII |
static int | DELETE |
static int | DELETE_EXCEPT_BETWEEN_ASCII |
static int | FULLWIDTH |
static int | LEAVE |
static int | MAX_LEGAL |
static java.lang.String | MID_DOT_REGEX_STR |
static int | NORMALIZE |
static java.lang.String | NUMBERS |
static java.lang.String | ONEWHITE |
static java.lang.String | WHITE |
static java.lang.String | WHITEPLUS |
Modifier and Type | Method and Description |
---|---|
static boolean | isNumber(char c) |
static void | main(java.lang.String[] args): Mainly for testing. |
static java.lang.String | normalize(java.lang.String in) |
static java.lang.String | normalize(java.lang.String in, int ascii, int spaceChar) |
static java.lang.String | normalize(java.lang.String in, int ascii, int spaceChar, int midDot): This will normalize a Unicode String in various ways. |
static java.lang.String | shapeOf(java.lang.CharSequence input, boolean augmentedDateChars, boolean useMidDotShape) |
public static final java.lang.String ONEWHITE
public static final java.lang.String WHITE
public static final java.lang.String WHITEPLUS
public static final java.lang.String NUMBERS
public static final java.lang.String MID_DOT_REGEX_STR
public static final int LEAVE
public static final int ASCII
public static final int NORMALIZE
public static final int FULLWIDTH
public static final int DELETE
public static final int DELETE_EXCEPT_BETWEEN_ASCII
public static final int MAX_LEGAL
public static boolean isNumber(char c)
public static java.lang.String normalize(java.lang.String in)
public static java.lang.String normalize(java.lang.String in, int ascii, int spaceChar)
public static java.lang.String normalize(java.lang.String in, int ascii, int spaceChar, int midDot)
in
- The String to be normalized
ascii
- For characters conceptually in the ASCII range of ! through ~ (U+0021 through U+007E or U+FF01 through U+FF5E): if this is ChineseUtils.LEAVE, then do nothing; if it is ASCII, then map them from the Chinese Full Width range to ASCII values; and if it is FULLWIDTH, then do the reverse.
spaceChar
- For characters that satisfy Character.isSpaceChar(): if this is ChineseUtils.LEAVE, then do nothing; if it is ASCII, then map them to the space character U+0020; and if it is FULLWIDTH, then map them to U+3000.
midDot
- For a set of 7 characters that are roughly middle dot characters: if this is ChineseUtils.LEAVE, then do nothing; if it is NORMALIZE, then map them to the extended Latin character U+00B7; and if it is FULLWIDTH, then map them to U+30FB.
public static void main(java.lang.String[] args) throws java.io.IOException
Usage: ChineseUtils ascii spaceChar word*
ascii and spaceChar are integers: 0 = leave, 1 = ascii, 2 = fullwidth.
The words listed are then normalized and sent to stdout.
If no words are given, the program reads from and normalizes stdin.
Input is assumed to be in UTF-8.
args
- Command line arguments as above
java.io.IOException
- If there are any problems accessing command-line files
public static java.lang.String shapeOf(java.lang.CharSequence input, boolean augmentedDateChars, boolean useMidDotShape)
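The ascii and spaceChar options described for normalize above can be illustrated with a small standalone sketch. This is not the ChineseUtils implementation: the class name WidthNormalizerSketch is hypothetical, the constant values 0, 1 and 2 are assumed from the command-line description of main, and the DELETE and midDot options are deliberately left out. It relies only on the fact that U+FF01 through U+FF5E sit at a fixed offset from U+0021 through U+007E.

```java
// Standalone illustration only; it does not call or reproduce ChineseUtils.
final class WidthNormalizerSketch {
    // Assumed values, following "0 = leave, 1 = ascii, 2 = fullwidth".
    static final int LEAVE = 0;
    static final int ASCII = 1;
    static final int FULLWIDTH = 2;

    // Fixed distance between the full-width block and printable ASCII.
    private static final int OFFSET = 0xFF01 - 0x0021;

    static String normalize(String in, int ascii, int spaceChar) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); ) {
            int cp = in.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isSpaceChar(cp)) {
                // Space characters: map to U+0020 or U+3000, or leave alone.
                if (spaceChar == ASCII) {
                    cp = 0x0020;
                } else if (spaceChar == FULLWIDTH) {
                    cp = 0x3000;
                }
            } else if (ascii == ASCII && cp >= 0xFF01 && cp <= 0xFF5E) {
                cp -= OFFSET; // full-width form to its ASCII counterpart
            } else if (ascii == FULLWIDTH && cp >= 0x0021 && cp <= 0x007E) {
                cp += OFFSET; // ASCII character to its full-width form
            }
            out.appendCodePoint(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Full-width "ABC", an ideographic space, then full-width "123".
        String input = "\uFF21\uFF22\uFF23\u3000\uFF11\uFF12\uFF13";
        System.out.println(normalize(input, ASCII, ASCII)); // prints "ABC 123"
    }
}
```

Passing LEAVE for both options returns the input unchanged, which mirrors the "do nothing" case in the parameter descriptions.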