Jim McBeath: A Character Encoding Problem

A while back I helped a colleague track down a character encoding problem in a Java application that he had been pulling his hair out over for two days. It was a fun little diversion that took the two of us a couple of hours to solve. I thought I would share it here in case someone else has a similar problem.

The situation was this: a customer had sent us a set of files and a CSV-formatted spreadsheet containing the names of the files. The application read the CSV file and used that data to look up the appropriate file from the set of names specified. The application was working fine on MacOSX, but was failing on Linux with a file-not-found error for some files. My colleague had written a simple test program that read the CSV file and looked up the problem file, and it displayed the same behavior, working on MacOSX and failing on Linux. The name of the file in question contained the e-acute (é) character, so it was pretty clear the problem had something to do with character encoding, but the exact problem was not obvious.

The test program would read in the CSV file and display the filename, which looked right. Doing an "ls" on the directory containing the file, and likewise using listFiles() in Java to get and then print the filename, also looked right. But when the test program was modified to compare the String from the CSV file with the String from listFiles(), they compared false, even though visually they looked identical.

It turns out that the e-acute character has two separate representations in Unicode: as the single precomposed character U+00E9, or as the two-code sequence of U+0065 (plain e) followed by the combining character U+0301 (combining acute accent) (see the example about composite characters in Wikipedia). The CSV file contained the single-code precomposed character, but listFiles() was returning the two-code composite sequence, so the string comparison returned false.
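A quick way to expose a difference like this is to print each string's code points instead of its rendered text. The following is only a minimal sketch, not the original test program: the class name CodePointDump, the method dump, and the file name café.txt are all made up for illustration.

```java
// Hypothetical diagnostic helper; class, method, and file names are illustrative.
class CodePointDump {
    /** Returns the code points of s as space-separated U+XXXX values. */
    static String dump(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp); // step over surrogate pairs correctly
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String precomposed = "caf\u00E9.txt";   // e-acute as one code point
        String decomposed  = "cafe\u0301.txt";  // plain e + combining acute

        System.out.println(dump(precomposed));
        // U+0063 U+0061 U+0066 U+00E9 U+002E U+0074 U+0078 U+0074
        System.out.println(dump(decomposed));
        // U+0063 U+0061 U+0066 U+0065 U+0301 U+002E U+0074 U+0078 U+0074
        System.out.println(precomposed.equals(decomposed)); // false
    }
}
```

Both strings render as the same file name on most terminals, but the dump makes the extra code point visible, and String.equals() compares code units, not rendered glyphs.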

I thought we had it figured out then, but that wasn't quite it. Even though the CSV file string was comparing as not equal to the filename returned by listFiles(), the test program was still able to open the file on MacOSX. Apparently the filesystem code was a bit smarter than Java and recognized that the two forms of e-acute were in fact the same. On Linux, however, opening the file failed.

It turned out that the data files had been delivered to us packaged in a RAR archive, which my colleague had unpacked (using unrar) on MacOSX, then copied over to the Linux system. When he instead unpacked the original RAR archive on the Linux system, lo and behold, the application (and the test app) worked! Apparently the unrar program did the right thing when handling the e-acute character on Linux, whereas simply copying the file over from the Mac system did not.

Java 6 has support (class java.text.Normalizer) for Unicode text normalization. We were still using Java 5, so this was not available to us. IBM has an open-source library called ICU (International Components for Unicode) that contains the class com.ibm.icu.text.Normalizer, which might have solved the problem for us. But once we realized that unpacking the files directly on the target machine resolved the issue, that was a satisfactory solution, so we did not pursue any others.
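For anyone on Java 6 or later, here is a rough sketch of how java.text.Normalizer could be used to make the lookup insensitive to the two forms of e-acute. We never ran this approach on the project, so treat it as an untested idea under stated assumptions: the class NormalizedLookup, its methods nfc and find, and the file name café.txt are hypothetical, and I have arbitrarily picked NFC (the precomposed form) as the common form to compare in.

```java
import java.io.File;
import java.text.Normalizer;

// Hypothetical helper; not the code from the application described above.
class NormalizedLookup {
    /** Normalizes s to NFC so that equivalent sequences compare equal. */
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    /**
     * Returns the entry in dir whose name equals wanted after both are
     * normalized to NFC, or null if there is no such entry.
     */
    static File find(File dir, String wanted) {
        File[] entries = dir.listFiles();
        if (entries == null) {
            return null; // dir is not a directory, or it could not be read
        }
        String target = nfc(wanted);
        for (File f : entries) {
            if (nfc(f.getName()).equals(target)) {
                return f;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String precomposed = "caf\u00E9.txt";   // as in the CSV file
        String decomposed  = "cafe\u0301.txt";  // as returned by listFiles()

        System.out.println(precomposed.equals(decomposed));           // false
        System.out.println(nfc(precomposed).equals(nfc(decomposed))); // true
    }
}
```

The point of find() is to open the File object that the directory listing returned, rather than building a path from the CSV string, so the name handed to the filesystem is exactly the one it reported.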

Lessons:

Just because it looks the same on the screen doesn't mean it is the same string.
Just because it's Unicode doesn't mean each character has a unique encoding. It still has to be normalized.
Java's 16-bit-character Unicode encoding does not magically solve all character encoding problems. You still need to understand character encoding issues and deal with those problems.

Jim McBeath

Monday, August 25, 2008
