Jim McBeath: August 2008

Monday, August 25, 2008

A Character Encoding Problem

A while back I helped a colleague track down a character encoding problem in a Java application; he had been pulling his hair out over it for two days. It was a fun little diversion that took the two of us a couple of hours to solve. I thought I would share it here in case someone else runs into a similar problem.

The situation was this: a customer had sent us a set of files and a CSV-formatted spreadsheet containing the names of the files. The application read the CSV file and used that data to look up the appropriate file from the set of names specified. The application was working fine on MacOSX, but was failing on Linux with a file-not-found error for some files. My colleague had written a simple test program that read the CSV file and looked up the problem file, and it displayed the same behavior, working on MacOSX and failing on Linux. The name of the file in question contained the e-acute (é) character, so it was pretty clear the problem had something to do with character encoding, but the exact problem was not obvious.

The test program would read in the CSV file and display the filename, which looked right. Doing an "ls" on the directory containing the file, and likewise using listFiles() in Java to get and then print the filename, also looked right. But when the test program was modified to compare the String from the CSV file with the String from listFiles(), they compared false, even though visually they looked identical.

It turns out that the e-acute character has two separate representations in Unicode: as the precomposed character U+00E9, or as the two-code sequence of U+0065 (plain e) followed by the combining code U+0301 (combining acute accent) (see the example about composite characters in Wikipedia). The CSV file contained the single-code precomposed character, but listFiles() was returning the two-code composite sequence, so the string comparison returned false.
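
The mismatch is easy to reproduce. Here is a minimal Java sketch (the class and method names are made up for illustration, and "café" is just a sample word) that builds both forms and shows that String.equals treats them as different strings even though they look the same when printed:

```java
// Illustrative sketch; the class and method names are hypothetical.
class EAcuteForms {
    // Precomposed form: ends with the single code point U+00E9.
    static final String PRECOMPOSED = "caf\u00e9";
    // Decomposed form: plain 'e' (U+0065) followed by combining acute (U+0301).
    static final String DECOMPOSED = "cafe\u0301";

    // Formats each UTF-16 code unit of s as U+XXXX, separated by spaces.
    static String codeUnits(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) sb.append(' ');
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(PRECOMPOSED.equals(DECOMPOSED)); // false
        System.out.println(codeUnits(PRECOMPOSED)); // U+0063 U+0061 U+0066 U+00E9
        System.out.println(codeUnits(DECOMPOSED));  // U+0063 U+0061 U+0066 U+0065 U+0301
    }
}
```

Dumping the code units this way is a quick check whenever two strings look identical but refuse to compare equal.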

I thought we had it figured out then, but that wasn't quite it. Even though the CSV file string was comparing as not equal to the filename returned by listFiles(), the test program was still able to open the file on MacOSX. Apparently the filesystem code was a bit smarter than Java and was able to know that the two forms of e-acute were in fact the same. But it failed when run on Linux.

It turned out that the data files had been delivered to us packaged in a RAR file, which my colleague had unpacked (using unrar) on MacOSX, then copied over to the Linux system. When he instead unpacked the original rar archive on the Linux system, lo and behold the application (and the test app) worked! Apparently the unrar program did the right thing when handling the e-acute character on Linux, whereas simply copying the file over from the Mac system did not.

Java 6 has support (class java.text.Normalizer) for Unicode text normalization. We were still using Java 5, so this was not available to us. IBM has an open-source library called ICU (International Components for Unicode) that contains the class com.ibm.icu.text.Normalizer, which might have solved the problem for us. But once we realized that unpacking the files directly on the target machine resolved the issue, that was good enough, and we did not pursue other solutions.
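
For anyone on Java 6 or later, here is a rough sketch of how java.text.Normalizer could be used to match a name from the CSV file against the names on disk. This is not what we did; the class, method names and sample filenames are hypothetical, and it assumes that normalizing both sides to NFC (the precomposed form) is acceptable for the comparison:

```java
import java.text.Normalizer;

// Illustrative sketch; the class and method names are hypothetical.
class NormalizedNames {
    // Converts a name to NFC (precomposed) form so both spellings of e-acute agree.
    static String nfc(String name) {
        return Normalizer.normalize(name, Normalizer.Form.NFC);
    }

    // Returns the entry of candidates that matches wanted after normalization,
    // or null if there is none. In the scenario above, candidates would be the
    // names returned by listFiles() and wanted the name read from the CSV file.
    static String findMatch(String wanted, String[] candidates) {
        String target = nfc(wanted);
        for (String candidate : candidates) {
            if (nfc(candidate).equals(target)) {
                return candidate;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String fromCsv = "r\u00e9sum\u00e9.txt";                     // precomposed
        String[] onDisk = { "notes.txt", "re\u0301sume\u0301.txt" }; // decomposed
        System.out.println(fromCsv.equals(onDisk[1]));          // false
        System.out.println(findMatch(fromCsv, onDisk) != null); // true
    }
}
```

The sketch returns the name as it appears in the directory listing rather than the normalized one, on the assumption that the unnormalized name is what the Linux filesystem will accept when the file is opened.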

Lessons:

Just because it looks the same on the screen doesn't mean it is the same string.
Just because it's Unicode doesn't mean it is a unique encoding. It still has to be normalized.
Java's 16-bit-character Unicode encoding does not magically solve all character encoding problems. You still need to understand character encoding issues and deal with those problems.

Monday, August 18, 2008

StringArt Scala Applet

After writing a Scala application from scratch and converting a Java application over to Scala, I wanted to see how easy it would be to write a Scala applet. It turned out to be very easy.

For my applet, I selected a small project I had wanted to do for a long time: a string art drawing program. My StringArt program is a simple applet that recreates an art form I did as a little kid: pound a bunch of nails into a board in a simple pattern, then stretch pieces of string between the nails. In order to allow maximum flexibility, I wanted to allow the user to enter functions describing the endpoints of the lines. Scala made this particularly easy to do with Parser Combinators in the standard library, but that's a different issue.

Creating an applet in Scala was exactly like creating an applet in Java with the exception that I needed to provide access to the standard Scala library. There were thus three steps involved:

Create a suitable HTML file.
Write the Scala code.
Provide access to the Scala library.

The HTML File

Setting up an applet requires adding an <applet> element to an HTML file. Here's the <applet> element as used in the HTML file for my StringArt applet:

  <applet
   code="net.jimmc.stringart.Main"
   archive="stringart.jar,scala-lib-stringart.jar"
   width="660"
   height="500"
  >
  Sorry, this browser does not understand applets or they are not enabled.
  </applet>

As with any applet, the "code" attribute specifies the name of the main class in the applet, and the "archive" attribute specifies the jar file containing the applet code. In this case the "archive" attribute additionally specifies a second jar file that contains the standard Scala library classes used by my applet. More on this below.

The Scala Code

My main applet class extends javax.swing.JApplet. I chose to implement the body of the applet in a separate class (StringArt) that extends JPanel to make it simpler to use it later as a standalone app.

Here is the main applet class:

package net.jimmc.stringart

import javax.swing.JApplet

class Main extends JApplet {
    val sa = new StringArt()

    override def init() {
        val pane = getContentPane()
        pane.add("Center", sa)
    }
}

The StringArt class and everything from there on down uses standard Swing components. My Main applet class also includes the standard applet start() and stop() methods, but they don't do anything other than print out a status message, so I did not include them in the above listing.

The Scala Library

When I first started working on my StringArt applet, the "archive" attribute in my <applet> element referenced the standard Scala library jar file, scala-library.jar. But that jar file is 3.3MB, and my little applet jar file was only 200KB, so the initial download of the applet was much slower than it needed to be. To improve on this situation, I created a custom version of the Scala library jar file that contained only the classes used by my applet. I had come across a Minesweeper applet written in Scala that used this idea. It referenced a scala-minimal.jar that was only 220KB, but the author did not say how he created it.

I posed the question to the Scala mailing list and was directed to the free program ProGuard that, among many other things, can remove unused classes from jar files. My build.xml file for the StringArt applet now includes a target that produces my minimal Scala library jar file with this Ant task:

    <java   jar="${proguard.jar}"
            fork="true"
            failonerror="true"
    >
        <arg line="-injars ${stringart.jar}"/>
        <arg line="-outjars ${stringart.jar}.ignore"/>
        <arg line="-injars ${scala.library.jar}"/>
        <arg line="-outjars ${scala-lib-stringart.jar}"/>
        <arg line="-libraryjars ${java-rt.jar}"/>
        <arg line="-keep public class ${main.class}"/>
        <arg line="-keep public class scala.ScalaObject"/>
        <arg line="-keep class net.jimmc.**"/>
        <arg line="-keep public class java.**"/>
        <arg line="-keep public class javax.**"/>
        <arg line="-keepnames class **"/>
        <arg line="-dontoptimize -dontobfuscate -dontpreverify"/>
    </java>

After building my applet stringart.jar file, I run the above command, throw away the processed stringart.jar.ignore file and use the processed scala-library.jar file as my scala-lib-stringart.jar file. The reduced library is under 290KB, less than 1/10th the size of the standard scala-library.jar. The total jar file download size (stringart.jar plus scala-lib-stringart.jar) is about 500KB, rather than 3.5MB as it was with the complete scala-library.jar, so there is only 1/7th as much data to download to be able to run the applet.

Updated 2009-01-07: added links to initial bullet list, changed Parser Combinator link to point to my blog entry.

Monday, August 11, 2008

Scala Traits

One of my favorite Scala features is traits. To a Java programmer, a trait is like an interface with implementation. The potential ambiguity of multiple inheritance is solved by defining a linearization order for all traits and superclasses. Given that ordering, you can figure out which classes or traits override which others. Diamond inheritance is not a problem.

Traits for Shared Code

For my Java/Swing program JRaceman, I defined a set of GUI classes to simplify my programmatic construction of screens. Each class took as constructor parameters a ResourceSource that provided access to resources (for localization), a resource key prefix, and an action to take when that GUI element was selected (such as a button push or menu item selection). For example, my MenuAction class looked something like this:

//Java code
public class MenuAction extends JMenuItem implements ActionListener {
    public MenuAction(ResourceSource resourceSource, String resourcePrefix) {
        //Look up resources for label, toolTip and set those values
        addActionListener(this);
    }
    public void actionPerformed(ActionEvent ev) {
        action();
    }
    public void action() { /* do nothing */ }
}

I have left out all the details from the above listing. The total file size for this class is about 140 lines.

The application would instantiate a menu item something like this:

//Java code
    MenuAction b = new MenuAction(resourceSource,"menu.Foo.Open") {
        public void action() { fooOpen(); }
    };

I have similar Java classes for ButtonAction, CheckBoxAction, and CheckBoxMenuAction. Each of these has a lot of similar code to set labels and toolTips, as well as other things such as handling exceptions while executing the action. Because each of my classes extended a different Swing class, I was unable to effectively share this code, so there is a bunch of duplicated code in these classes.

For my Mimprint photo printing application, which I converted from Java to Scala, my equivalent class to the MenuAction Java class looks something like this:

class SMenuItem(rSource:SResources, rPrefix:String)(action: =>Unit)
        extends JMenuItem(rSource.getResourceString(rPrefix+".label"))
        with SComponent with SCompToolPrompt {
    setupToolTip(rSource, rPrefix)
    setupToolPrompt(rSource, rPrefix)
    setupIcon(rSource, rPrefix)
    setupActionListener(rSource, action)
}

That's it. That's the entire class. Fewer than ten lines of code. The actual file, with comments, is 18 lines long. Almost the entire contents of that 140-line MenuAction.java file have been moved into the two traits SComponent and SCompToolPrompt, where the code can be shared with my Scala implementations of Button, CheckBox, etc. The SMenuItem class and the SComponent and SCompToolPrompt traits in Scala are together about the same size as the MenuAction class in Java, so by itself this is not a win for code size, but now I can use SComponent and SCompToolPrompt for my other Swing wrapper classes, so each of those classes shrinks from around 100 lines of code to fewer than 20 with no corresponding increase in any other files.

Even without the improvement in lines of code, I like the ability to break up one class into multiple files for organizational purposes.

The application invokes SMenuItem something like this:

    val mi = new SMenuItem(rSource, "menu.Foo.Open")(fooOpen)

With the shared code for my GUI components now housed in one place, such that I no longer had to maintain multiple copies of the code, I was also motivated to add more features to that file, such as better handling of an exception during the action. It also made it much easier to more uniformly apply the features supported by SComponent. For example, my Java GUI class did not support icons; but SComponent.scala does include icon support, so I can add icon support to my Scala GUI classes just by adding the one line call to setupIcon.

Traits as Facade

I have also used traits as a nice way to implement the Facade pattern. In the above examples, the SResources type is actually a trait (shown here without the boilerplate and comments that appear in the real source file):

trait SResources {
    def getResourceString(key:String) : String
    def getResourceStringOption(key: String) : Option[String]
    def getResourceFormatted(key: String, arg: Any) : String
    def getResourceFormatted(key: String, args: Array[Any]) : String
}

I then implemented a SResourcesFacade trait to allow a class to act as an SResources by delegating all calls to a member object that in turn implements the SResources trait. In this way, I can have one real source for resources, then easily turn other classes into proxies for that resource source.

trait SResourcesFacade extends SResources {
    //Extending class must define this value
    protected val sResourcesBase : SResources

    def getResourceString(key:String) =
            sResourcesBase.getResourceString(key)
    def getResourceStringOption(key: String) =
            sResourcesBase.getResourceStringOption(key)
    def getResourceFormatted(key: String, arg: Any) =
            sResourcesBase.getResourceFormatted(key, arg)
    def getResourceFormatted(key: String, args: Array[Any]) =
            sResourcesBase.getResourceFormatted(key, args)
}

Now to make a class into an SResources class, I only need to make the class extend SResourcesFacade and define a value for sResourcesBase, like this:

class Foo extends SResourcesFacade {
    val sResourcesBase : SResources = new RealResources()
}

where RealResources is a class that implements the SResources trait.

This particular facade only has four methods, but this same technique would work for a facade with any number of methods.
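
For Java readers, a rough analogue of this facade can be written with interface default methods, which arrived in Java 8. The sketch below is not part of Mimprint; every name in it is a hypothetical stand-in for the Scala one, and it carries only one of the four methods to keep it short:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative Java 8+ sketch; all names are hypothetical stand-ins.
interface Resources {
    String getResourceString(String key);
}

// Plays the role of the facade trait: the default method forwards to the base.
interface ResourcesFacade extends Resources {
    // Implementing class must supply the real resource source.
    Resources resourcesBase();

    @Override
    default String getResourceString(String key) {
        return resourcesBase().getResourceString(key);
    }
}

// A simple map-backed resource source standing in for RealResources.
class MapResources implements Resources {
    private final Map<String, String> values = new HashMap<>();

    MapResources put(String key, String value) {
        values.put(key, value);
        return this;
    }

    @Override
    public String getResourceString(String key) {
        return values.get(key);
    }
}

// Becomes a Resources just by implementing the facade and naming its base.
class FacadeDemo implements ResourcesFacade {
    private final Resources base;

    FacadeDemo(Resources base) {
        this.base = base;
    }

    @Override
    public Resources resourcesBase() {
        return base;
    }

    public static void main(String[] args) {
        Resources real = new MapResources().put("menu.Foo.Open.label", "Open");
        Resources proxy = new FacadeDemo(real);
        System.out.println(proxy.getResourceString("menu.Foo.Open.label")); // Open
    }
}
```

One difference from the Scala version: interface methods in Java are public, so the base accessor is exposed to callers, whereas the Scala trait can keep sResourcesBase protected.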

You can find the complete Scala code for the above examples (which is not exactly the same as the code given here) in the packages net.jimmc.swing (for the GUI examples) and net.jimmc.util (for the facade examples) in the sources to Mimprint, which is distributed under the GPL.

Thursday, August 7, 2008

The Best Policy

I am a bit of a stickler for truth. In fact, I have had people tell me that I am "too honest". I try to say only what I mean and mean what I say. I read contracts before signing them, and I don't sign if I don't intend to follow through. I don't make commitments that I don't intend to meet, and I try very hard to meet all the commitments that I make. I try to objectively evaluate new evidence, even when it initially appears inconsistent with my current beliefs, keeping in mind the possibility that those beliefs might be wrong.

I believe keeping to the truth, even on the small things, is important. Even without taking into account the moral issues, I find the practical evaluation of being truthful to be straightforward:

Reputation is important and valuable. If you say things that are untrue, even small things, and people learn that this is the case, they will not be as trusting of your statements in general, and it will be harder to maintain a good relationship with them.
The real world is complicated enough to keep track of without the additional burden of remembering the details of the imaginary world that includes all of the little lies you tell. The more times you tell a lie, even "little white lies", the harder it is to keep all of your lies consistent with each other and the real world, and the more likely that you will slip up and reveal that some of your statements were lies, after which see the previous point.
Even if you don't slip up to the point where someone else recognizes it, you may internally confuse the real world with your imaginary world that includes your lies and end up reaching conclusions or making decisions that are incorrect, inappropriate, or at the least suboptimal for the real world.
Keeping track of your imaginary world and ensuring that you don't make the mistakes of the previous point uses up mental energy that could be used for more constructive pursuits.

In the long run, honesty is easier as well as more rewarding. In my book it is clearly the best policy.

Jim McBeath