Call it Java 1.5, 2.0, Java 5, Tiger, or what have you -- this version of Java has a lot to offer. This article covers just some of the new features. It is excerpted from chapter one of Java 1.5 Tiger: A Developer's Notebook, written by Brett McLaughlin and David Flanagan (O'Reilly, 2004; ISBN: 0596007388).
What's New in Java 1.5 Tiger? - Taking Advantage of Better Unicode
While most of this chapter and the rest of the book focus on entirely new features, there are places where Tiger has simply evolved. The most significant of these is Unicode support. Pre-Tiger versions of Java supported Unicode 3.0, and all of its characters fit into 16 bits (and therefore into a char). Things are different now, so you’ll need to understand a bit more.
How do I do that?
In Tiger, Java has moved to support Unicode 4.0, which defines several characters that don’t fit into 16 bits. This means that they won’t fit into a char, and that has some far-reaching consequences. You’ll have to use int to represent these characters, and as a result methods like Character.isUpperCase() and Character.isWhitespace() now have variants that accept int arguments. So if you need characters in Unicode 4.0 that are not available in Unicode 3.0, you’ll need to use these new methods.
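As a minimal sketch of those int-accepting variants, the class below passes a supplementary codepoint to a few Character methods. The class name and constant are illustrative only (they are not from the book), and U+10400 (DESERET CAPITAL LETTER LONG I) is just one convenient example of a character that lies outside the 16-bit range.

```java
public class IntCharacterMethods {
    // Illustrative constant: U+10400 DESERET CAPITAL LETTER LONG I,
    // a supplementary character that cannot be stored in a single char.
    public static final int DESERET_CAPITAL_LONG_I = 0x10400;

    public static void main(String[] args) {
        int cp = DESERET_CAPITAL_LONG_I;

        // These calls resolve to the int overloads added in Tiger.
        System.out.println(Character.isUpperCase(cp));   // true
        System.out.println(Character.isLetter(cp));      // true
        System.out.println(Character.isWhitespace(cp));  // false

        // The original char overloads still work for 16-bit characters.
        System.out.println(Character.isUpperCase('A'));  // true
    }
}
```

Note that the value is held in an int from start to finish. Casting it to char would silently discard the high bits and leave you testing a different character.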
What just happened?
To really grasp all this, you have to understand a few basic terms:
codepoint
A codepoint is a number that represents a specific character. As an example, 0x3C0 is the codepoint for the symbol π.
Basic Multilingual Plane (BMP)
The BMP is all Unicode codepoints from \u0000 through \uFFFF. All of these codepoints fit into a Java char.
supplementary characters
These are the Unicode codepoints that fall outside of the BMP. They are 21-bit codepoints, with hex values from 0x010000 through 0x10FFFF, and must be represented by an int.
A char, then, represents a BMP Unicode codepoint. To get all the supplementary characters in addition to the BMP, you need to use an int. Of course, only the lowest 21 bits are used, as that’s all that is needed; the upper 11 bits of the 32-bit int are zeroed out.
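The terms above map onto a handful of static helpers on Character. The following is a rough sketch, with a hypothetical describe() method of my own naming, that sorts an int into "BMP", "supplementary", or "invalid" and shows how many char values each codepoint needs.

```java
public class CodePointRanges {
    // Hypothetical helper: classify an int using the ranges defined above.
    public static String describe(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            return "invalid";            // outside 0x0000 through 0x10FFFF
        }
        return Character.isSupplementaryCodePoint(codePoint)
                ? "supplementary"        // 0x010000 through 0x10FFFF
                : "BMP";                 // 0x0000 through 0xFFFF
    }

    public static void main(String[] args) {
        System.out.println(describe(0x3C0));      // BMP (the symbol pi)
        System.out.println(describe(0xFFFF));     // BMP
        System.out.println(describe(0x10000));    // supplementary
        System.out.println(describe(0x10FFFF));   // supplementary
        System.out.println(describe(0x110000));   // invalid

        // charCount() reports how many char values a codepoint occupies.
        System.out.println(Character.charCount(0x3C0));    // 1
        System.out.println(Character.charCount(0x10000));  // 2

        // The largest codepoint has nothing set above bit 20.
        System.out.println((Character.MAX_CODE_POINT >>> 21) == 0);  // true
    }
}
```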
All this assumes that you’re dealing with these characters in isolation, though, and that’s hardly the only use case. More often, you’ve got to use these characters within the context of a larger String. A String is a sequence of char values, so an int doesn’t fit; instead, a supplementary character is encoded as two char values, called a surrogate pair. The first char is from the high-surrogates range (\uD800-\uDBFF), and the second char is from the low-surrogates range (\uDC00-\uDFFF). The net effect is that the number of chars in a String is not guaranteed to be the number of codepoints. Sometimes two chars represent a single supplementary codepoint (new with Unicode 4.0 support), and sometimes they represent two separate BMP codepoints (as was always the case under Unicode 3.0).
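To see the difference between chars and codepoints, here is a small sketch that builds a String containing one supplementary character and then walks it codepoint by codepoint. The class and method names are made up for illustration; the Character and String calls are the standard ones introduced in Tiger.

```java
import java.util.ArrayList;
import java.util.List;

public class SurrogatePairs {
    // Illustrative sample: 'A', then U+10400 (a supplementary character), then 'B'.
    public static String sample() {
        return "A" + new String(Character.toChars(0x10400)) + "B";
    }

    // Walk a String by codepoint rather than by char.
    public static List<Integer> codePoints(String s) {
        List<Integer> result = new ArrayList<Integer>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);     // reads one or two chars as needed
            result.add(cp);
            i += Character.charCount(cp);  // advance by 1 (BMP) or 2 (surrogate pair)
        }
        return result;
    }

    public static void main(String[] args) {
        String s = sample();
        System.out.println(s.length());                      // 4 chars
        System.out.println(s.codePointCount(0, s.length())); // 3 codepoints

        // The middle character is stored as a high and a low surrogate.
        System.out.println(Integer.toHexString(s.charAt(1))); // d801
        System.out.println(Integer.toHexString(s.charAt(2))); // dc00
        System.out.println(Character.isHighSurrogate(s.charAt(1))); // true
        System.out.println(Character.isLowSurrogate(s.charAt(2)));  // true

        System.out.println(codePoints(s)); // [65, 66560, 66]
    }
}
```

A plain charAt() loop over the same String would visit four values and hand you the two surrogates separately, which is why the loop above advances by Character.charCount() instead of by one.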