Formatters and Java Print Streams - Character Sets
So far, I haven’t paid a lot of attention to character set issues. As long as you stick to the ASCII character set, a single computer, and System.out, character sets aren’t likely to be a problem. However, as data begins to move between different systems, it becomes important to consider what happens when the other systems use different character sets. For example, suppose I use a Formatter or a PrintStream on a typical U.S. or Western European PC to write the sentence “Au cours des dernières années, XML a été adapte dans des domaines aussi diverse que l’aéronautique, le multimédia, la gestion de hôpitaux, les télécommunications, la théologie, la vente au détail, et la littérature médiévale” in a file. Say I then send this file to a Macintosh user, who opens it up and sees “Au cours des derniËres annÈes, XML a ÈtÈ adapte dans des domaines aussi diverse que l’aÈronautique, le multimÈdia, la gestion de hÙpitaux, les tÈlÈcommunications, la thÈologie, la vente au dÈtail, et la littÈrature mÈdiÈvale.” This is not the same thing at all! The confusion is even worse if you go in the other direction.
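The corruption described above can be reproduced in a few lines. The following is a minimal sketch, not code from the book: the class name MojibakeDemo and the helper reinterpret are hypothetical names chosen for illustration. It encodes a string with one charset and decodes the same bytes with another. MacRoman is an optional charset (registered in the JDK as x-MacRoman), so the sketch checks for it before using it and also shows a mismatch between UTF-8 and ISO-8859-1, two charsets every Java runtime is required to support.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {

    // Encode the text with the writer's charset, then decode the same
    // bytes with the reader's charset, as a mismatched receiver would.
    static String reinterpret(String text, Charset writer, Charset reader) {
        byte[] bytes = text.getBytes(writer);
        return new String(bytes, reader);
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file pure ASCII.
        String original = "derni\u00e8res ann\u00e9es";

        // Matching charsets round-trip cleanly.
        System.out.println(reinterpret(original,
                StandardCharsets.ISO_8859_1, StandardCharsets.ISO_8859_1));

        // Assumption: MacRoman is optional, so check before using it.
        if (Charset.isSupported("x-MacRoman")) {
            System.out.println(reinterpret(original,
                    StandardCharsets.ISO_8859_1, Charset.forName("x-MacRoman")));
        }

        // A mismatch between two charsets that are always available.
        System.out.println(reinterpret(original,
                StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1));
    }
}
```

What the console actually displays depends on the console's own encoding, so it is safer to compare the returned strings in code than to trust what is printed.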
If you’re writing to the console (i.e., System.out), you don’t really need to worry about character set issues. The default character set Java writes in is usually the same one the console uses.
Actually, you may need to worry a little. On Windows, the console encoding is usually not the same as the system encoding found in the file.encoding system property. In particular, the console uses a DOS character set such as Cp850 that includes box drawing characters such as ╚ and ╬, while the rest of the system uses an encoding such as Cp1252 that maps these same code points to alphabetic characters like È and Î. To be honest, the console is reliable enough for ASCII, but anything beyond that requires a GUI.
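To see why a console that assumes a different encoding shows the wrong characters, it helps to look at the bytes a PrintStream actually emits. The sketch below is illustrative only: ConsoleEncodingCheck and bytesWritten are hypothetical names, and it uses ISO-8859-1 and UTF-8 rather than Cp850 and Cp1252 because those DOS and Windows charsets are not guaranteed to be installed in every runtime. It prints the file.encoding property and the default charset, then shows that the same character becomes different bytes under different encodings.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;

public class ConsoleEncodingCheck {

    // Print text through a PrintStream that uses the named encoding and
    // return the raw bytes it produced.
    static byte[] bytesWritten(String text, String encoding) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try {
            PrintStream out = new PrintStream(buffer, true, encoding);
            out.print(text);
            out.flush();
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unsupported encoding: " + encoding, e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        // What Java believes the platform encoding is.
        System.out.println("file.encoding   = " + System.getProperty("file.encoding"));
        System.out.println("default charset = " + Charset.defaultCharset());

        String sample = "\u00c8"; // LATIN CAPITAL LETTER E WITH GRAVE
        for (String encoding : new String[] {"ISO-8859-1", "UTF-8"}) {
            StringBuilder hex = new StringBuilder();
            for (byte b : bytesWritten(sample, encoding)) {
                hex.append(String.format("%02X ", b & 0xFF));
            }
            System.out.println(encoding + " -> " + hex.toString().trim());
        }
    }
}
```

A console decodes whatever bytes arrive using its own table, so a byte that one encoding assigns to an accented letter shows up as something else when the console assumes a different encoding.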
However, there’s more than one character set, and when transmitting files between systems and programs, it pays to be specific. In the previous example, if we knew the file was going to be read on a Macintosh, we might have specified that it be written with the MacRoman encoding:
Formatter formatter = new Formatter("data.txt", "MacRoman");
More likely, we’d just agree on both the sending and receiving ends to use some neutral format such as ISO-8859-1 or UTF-8. In some cases, encoding details can be embedded in the file you write (HTML, XML) or sent as out-of-band metadata (HTTP, SMTP). However, you do need some way of specifying and communicating the character set in which any given document is written. When you’re writing to anything other than the console or a string, you should almost always specify an encoding explicitly. Three of the Formatter constructors take character set names as their second argument:
public Formatter(String fileName, String characterSet)
  throws FileNotFoundException, UnsupportedEncodingException
public Formatter(File file, String characterSet)
  throws FileNotFoundException, UnsupportedEncodingException
public Formatter(OutputStream out, String characterSet)
  throws UnsupportedEncodingException
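As a rough illustration of the OutputStream variant, the sketch below formats into an in-memory buffer with an explicitly named encoding and returns the resulting bytes. The class FormatterEncodingDemo and its format helper are hypothetical names introduced here, not part of the java.util.Formatter API, and the helper wraps the checked UnsupportedEncodingException in an unchecked exception purely for convenience.

```java
import java.io.ByteArrayOutputStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;
import java.util.Formatter;

public class FormatterEncodingDemo {

    // Format into a byte buffer using the named encoding. Closing the
    // Formatter flushes its internal writer into the buffer.
    static byte[] format(String encoding, String template, Object... args) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (Formatter formatter = new Formatter(buffer, encoding)) {
            formatter.format(template, args);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unsupported encoding: " + encoding, e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        String word = "\u00e9t\u00e9"; // the French word for "summer"

        byte[] latin1 = format("ISO-8859-1", "%s", word);
        byte[] utf8 = format("UTF-8", "%s", word);

        // The same three characters occupy a different number of bytes.
        System.out.println("ISO-8859-1 bytes: " + latin1.length);
        System.out.println("UTF-8 bytes:      " + utf8.length);

        // Decoding with the encoding that was used to write recovers the text.
        System.out.println(new String(utf8, StandardCharsets.UTF_8).equals(word));
    }
}
```

The file-based constructors take the encoding name in the same position, so the same pattern applies when writing to disk, provided the reader is told which encoding was used.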
I’ll have more to say about character sets in Chapter 19.
This article is excerpted from chapter seven of Java I/O, Second Edition, written by Elliotte Rusty Harold (O'Reilly, 2006; ISBN: 0596527500).