Internationalization in Java

 

This is a short document about how to use non-English character-set encodings, and Unicode, in Java.  It’s biased toward people working on Chinese, but applies to everyone.

 

Java internally

 

Internally, Java ALWAYS uses Unicode.  The char primitive is a 16-bit Unicode (UTF-16) code unit, and the same holds for the text content of the following classes:

 

String

java.util.regex.Pattern

java.util.regex.Matcher

all Readers (StringReader, BufferedReader, ...)

all Writers (StringWriter, PrintWriter, ...)

 

So once you have text in any of these classes, you should always treat it as Unicode.  For Chinese, this means one char per Chinese character for all commonly used characters; only rare characters outside the Basic Multilingual Plane (for example, those in CJK Extension B) take two chars, a surrogate pair.  So internal to Java there is no distinction between, say, Big5, GB, and Unicode representations of Chinese characters.
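
As a small illustration of how Java counts Chinese text internally, here is a minimal sketch.  The class and method names (CharCountDemo, charCount, codePointCount) are made up for this example; String.length() and String.codePointCount() are the standard library calls.  The strings are written with \u escapes so that the encoding of the source file itself does not matter.

```java
public class CharCountDemo {
    // Number of Java chars (UTF-16 code units) in the string.
    static int charCount(String s) {
        return s.length();
    }

    // Number of Unicode code points, i.e. abstract characters.
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String common = "\u4e2d\u6587"; // two everyday Chinese characters
        String rare = "\ud840\udc00";   // U+20000, a rare CJK Extension B character

        // Everyday characters: one char each.
        System.out.println(charCount(common) + " " + codePointCount(common)); // prints "2 2"
        // A rare character outside the Basic Multilingual Plane: two chars, one code point.
        System.out.println(charCount(rare) + " " + codePointCount(rare));     // prints "2 1"
    }
}
```

For everyday text the two counts agree, so treating one char as one Chinese character is usually safe; the code point methods matter only if rare characters can appear in your data.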

 

Internal to its own world, Java handles input and output using the Reader and Writer classes (plus their relevant subclasses), which read from and write to character streams.  As an NLP person, this is generally the world you want to be thinking in.

 

Connecting Java with the outside world

 

In the world external to Java, not all characters are created equal.  The mapping from an abstract character to a readable or writable byte sequence is specified by a character set encoding.  Apart from single-byte encodings such as ASCII and its minimal extensions, most character set encodings do not map every character to a byte sequence of the same length.  For example, Chinese characters in Big5 (Taiwan; traditional characters) or GB (PRC; simplified characters) are typically two bytes apiece, while ASCII characters in the same encodings are one byte.  Java interacts directly with this outside world in terms of byte streams, through the InputStream and OutputStream classes, which respectively let you read bytes from and write bytes to various sources.
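
To make the variable-length point concrete, the following sketch counts the bytes that a single character occupies under several encodings.  The class and method names (ByteLengthDemo, byteLength) are hypothetical.  Big5 and GBK are looked up by name and skipped if the running Java installation does not provide them, since only a handful of charsets are guaranteed to exist on every platform.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ByteLengthDemo {
    // Number of bytes needed to represent the text in the given encoding.
    static int byteLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String zhong = "\u4e2d"; // one Chinese character
        String a = "a";          // one ASCII character

        // The same character needs a different number of bytes per encoding.
        System.out.println("UTF-8:    " + byteLength(zhong, StandardCharsets.UTF_8));    // 3 bytes
        System.out.println("UTF-16BE: " + byteLength(zhong, StandardCharsets.UTF_16BE)); // 2 bytes

        // Within one encoding, different characters need different lengths.
        System.out.println("UTF-8 'a': " + byteLength(a, StandardCharsets.UTF_8));       // 1 byte

        // Legacy Chinese encodings, if this runtime supports them.
        for (String name : new String[] {"Big5", "GBK"}) {
            if (Charset.isSupported(name)) {
                System.out.println(name + ": " + byteLength(zhong, Charset.forName(name)));
            }
        }
    }
}
```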

 

But as someone working on natural language, you normally want to avoid thinking of the outside world in terms of byte streams.  You want to think of it in terms of character streams.  You can do this worry-free by using two crucial classes that connect the two worlds:

 

  InputStreamReader, which connects an InputStream to a Reader

 

  OutputStreamWriter, which connects a Writer to an OutputStream

 

There are constructors for these two classes that allow you to specify a character encoding:

 

  InputStreamReader(InputStream in, String charsetName)

 

  OutputStreamWriter(OutputStream out, String charsetName)
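
A minimal sketch of these two constructors working together is shown below, using in-memory byte arrays in place of real files.  The class and method names (RoundTripDemo, encode, decode) are invented for illustration, and UTF-8 is used only because every Java runtime supports it; any supported encoding name could be substituted.  Checked I/O exceptions are wrapped in UncheckedIOException purely to keep the example short.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;

public class RoundTripDemo {
    // Characters -> bytes: a Writer on top of an OutputStream.
    static byte[] encode(String text, String charsetName) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(bytes, charsetName)) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Bytes -> characters: a Reader on top of an InputStream.
    static String decode(byte[] data, String charsetName) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(new ByteArrayInputStream(data), charsetName)) {
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "\u4e2d\u6587"; // two Chinese characters
        byte[] data = encode(text, "UTF-8");

        System.out.println(data.length);                         // prints 6
        System.out.println(decode(data, "UTF-8").equals(text));  // prints true
        // Decoding with the wrong encoding silently yields different characters.
        System.out.println(decode(data, "ISO-8859-1").length()); // prints 6
    }
}
```

The last line shows why the encoding name matters: the bytes are read without any error, but they come out as six unrelated characters instead of the original two.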

 

So if I wanted to read GB (simplified Chinese) text from a file <filename>, I would use

 

  (1) a FileInputStream, together with

 

  (2) an InputStreamReader

 

and I would do

 

  InputStream inputStream = new FileInputStream("<filename>");

  Reader r = new InputStreamReader(inputStream, "GB18030");

 

and then I would just read characters from the Reader r however I wanted. (I might also want to wrap r in a BufferedReader, both for buffering and to gain access to convenience methods such as readLine().)
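
Putting the pieces together, a complete version of this recipe might look like the sketch below.  The class and method names (ReadEncodedFile, readLines, writeTempFile) are hypothetical, and the temporary file exists only so that the example has something to read.  The demo assumes the runtime supports GB18030 and falls back to UTF-8 if it does not.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

public class ReadEncodedFile {
    // Reads all lines of a file, decoding its bytes with the named encoding.
    static List<String> readLines(String filename, String charsetName) {
        List<String> lines = new ArrayList<>();
        try (InputStream in = new FileInputStream(filename);
             BufferedReader reader = new BufferedReader(new InputStreamReader(in, charsetName))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    // Demo helper: writes text to a temporary file in the named encoding.
    static String writeTempFile(String text, String charsetName) {
        try {
            File f = File.createTempFile("encoding-demo", ".txt");
            f.deleteOnExit();
            try (Writer w = new OutputStreamWriter(new FileOutputStream(f), charsetName)) {
                w.write(text);
            }
            return f.getPath();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Assumption: GB18030 is available; otherwise use UTF-8 for the demo.
        String charsetName = Charset.isSupported("GB18030") ? "GB18030" : "UTF-8";
        String path = writeTempFile("\u4e2d\u6587\n\u5b57\u7b26\n", charsetName);
        for (String line : readLines(path, charsetName)) {
            System.out.println(line.length() + " chars in line");
        }
    }
}
```

The try-with-resources statement closes the underlying file even if reading fails, which the two-line version above leaves to the caller.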

 

This webpage lists the character-set encodings supported by Java 1.5.