Monday, March 28, 2011

Java NIO for Character Decoding in Scala

The Java NIO package includes some handy character encoding and decoding methods that can be used from Scala.


Background

In my previous post I described a simple Scala server using NIO and continuations, and mentioned in the Limitations section that the example did not convert the data bytes to characters. In this post I show how that can easily be added by using another feature of the Java NIO package: character-set encoders and decoders.

Java NIO Character Coders

The java.nio.charset package includes a Charset class that represents a mapping between the 16-bit Unicode code units that Java uses internally for characters and strings and the sequences of bytes that are stored in a file or transmitted through a socket connection. Each such mapping is represented by a separate instance of the Charset class. Standard character mappings such as "UTF-8" and "ISO-8859-1" can be retrieved using the static Charset.forName method.

Given an instance of Charset, a CharsetEncoder for that character mapping can be created by calling the newEncoder method on that instance. That encoder can then be used to convert a Java string into a sequence of bytes suitable for writing to a file or connection.

Similarly, the newDecoder method on Charset creates a CharsetDecoder that can be used for the complementary task of converting bytes from a file or connection into a Java string.

The encoding and decoding methods convert data between a CharBuffer and a ByteBuffer. Since the java.nio socket I/O calls we are using read and write their data to and from ByteBuffers, it is convenient for the encoding and decoding to use those objects.
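As a minimal sketch of these classes in use, the following round-trips a string through an encoder and a decoder. The CharsetDemo object and its method names are hypothetical and are not part of the server example.

```scala
import java.nio.{ByteBuffer, CharBuffer}
import java.nio.charset.Charset

//Hypothetical demo object, not part of the server example.
object CharsetDemo {
    //Convert a string to bytes using the named character mapping.
    //The returned buffer is positioned at 0 and ready to be read.
    def encode(s: String, charsetName: String): ByteBuffer =
        Charset.forName(charsetName).newEncoder().encode(CharBuffer.wrap(s))

    //Convert bytes back to a string using the named character mapping.
    def decode(b: ByteBuffer, charsetName: String): String =
        Charset.forName(charsetName).newDecoder().decode(b).toString
}
```

For example, the single character "\u00e9" (e with an acute accent) encodes to two bytes in UTF-8 but only one byte in ISO-8859-1, and decoding either buffer with the matching charset name gives back the original string.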

LineDecoder

Using the java.nio.charset classes described above, we write a LineDecoder class containing a processBytes method that takes as input a ByteBuffer (which is what we have to read into when using a SocketChannel) and converts that byte data to Java characters. For this example, we also break up that character data into separate lines when we see line break characters, converting each line of characters to a Java String. One buffer of data might contain multiple lines of character data, so rather than returning a set of lines, our method accepts a callback to which we pass each line as we decode it.

import java.nio.{ByteBuffer,CharBuffer}
import java.nio.charset.{Charset,CharsetDecoder,CharsetEncoder,CoderResult}
import scala.annotation.tailrec

class LineDecoder {

    //Encoders and decoders are not multi-thread safe, so create one
    //for each connection in case we are using multiple threads.
    val utf8Charset = Charset.forName("UTF-8")
    val utf8Encoder = utf8Charset.newEncoder
    val utf8Decoder = utf8Charset.newDecoder

    def processBytes(b:ByteBuffer, lineHandler:(String)=>Unit):Unit =
        processChars(utf8Decoder.decode(b),lineHandler)

    @tailrec
    private def processChars(cb:CharBuffer, lineHandler:(String)=>Unit) {
        val len = lengthOfFirstLine(cb)
        if (len>=0) {
            val ca = new Array[Char](len)
            cb.get(ca,0,len)
            eatLineEnding(cb)
            val line = new String(ca)
            lineHandler(line)
            processChars(cb, lineHandler)       //handle multiple lines
        }
    }

    //Assuming the first character in the buffer is an eol char,
    //consume it and a possible matching CR or LF in case the EOL is 2 chars.
    private def eatLineEnding(cb:CharBuffer) {
        //Eat the first character and see what it is
        cb.get match {
            case '\n' => if (cb.remaining>0 && cb.charAt(0)=='\r') cb.get
            case '\r' => if (cb.remaining>0 && cb.charAt(0)=='\n') cb.get
            case _ => //ignore everything else
        }
    }

    private def lengthOfFirstLine(cb:CharBuffer):Int = {
        (0 until cb.remaining) find { i =>
            List('\n','\r').indexOf(cb.charAt(i))>=0 } getOrElse -1
    }
}

Here is an imperative version of lengthOfFirstLine that does the same thing as the functional version above.

    private def lengthOfFirstLine(cb:CharBuffer):Int = {
        val cbLen = cb.remaining    //never reassigned, so use val
        for (i <- 0 until cbLen) {
            val ch = cb.charAt(i)
            if (ch == '\n' || ch == '\r')
                return i
        }
        return -1
    }

NioConnection

One of the classes shown in my previous post was the NioConnection class, whose responsibilities include processing input data from the client. It does this in the method readAction, which initially looks like this:

    //The old version
    private def readAction(b:ByteBuffer) {
        b.flip()
        socket.write(b)
        b.clear()
    }

We replace the direct call to socket.write with a call to LineDecoder.processBytes, which is responsible for decoding the input data, and we pass it our new writeLine method, which accepts a line of characters and writes it back to the client. We also drop the call to b.clear: readAction runs at what is effectively the bottom of our readWhile loop, and that loop already calls b.clear at the top of each iteration.

    private val lineDecoder = new LineDecoder

    private def readAction(b:ByteBuffer) {
        b.flip()
        lineDecoder.processBytes(b, writeLine)
    }

    def writeLine(line:String) {
        socket.write(ByteBuffer.wrap((line+"\n").getBytes("UTF-8")))
    }

Now when we receive some input data, it gets passed to LineDecoder.processBytes, which converts it to characters, breaks it up into separate lines, and calls our writeLine method for each line. The writeLine method uses String.getBytes to convert the characters in the line back to bytes, wraps those bytes into a ByteBuffer and writes them directly to the output channel.

Compared to the example in the previous post, this example should behave the same externally, but we are now passing around Java strings rather than NIO buffers. Assuming we want to deal with string data rather than binary data, this will make it simpler to write the rest of the real application.
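The conversion that writeLine performs with String.getBytes and ByteBuffer.wrap could also be done with a CharsetEncoder. The following is only a sketch of that alternative; the LineEncoder class and its encodeLine method are hypothetical names that do not appear in the server code.

```scala
import java.nio.{ByteBuffer, CharBuffer}
import java.nio.charset.Charset

//Hypothetical helper, not part of the original server code.
class LineEncoder {
    //Encoders are not thread-safe, so each connection would own one instance.
    private val utf8Encoder = Charset.forName("UTF-8").newEncoder()

    //Append a newline and encode the line as UTF-8.
    //The returned buffer is ready to pass to a channel's write method.
    def encodeLine(line: String): ByteBuffer =
        utf8Encoder.encode(CharBuffer.wrap(line + "\n"))
}
```

With this in place, the body of writeLine would presumably reduce to a single socket.write call on the buffer returned by encodeLine.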

Limitations

  • As with the example in the previous post, the current example only shows how to use the NIO calls on the read side of the connection. We could use a CharsetEncoder on the write side rather than using String.getBytes and ByteBuffer.wrap.
  • Partial input lines (characters not terminated by an EOL character) are ignored by this implementation.
  • The example uses the convenience method version of decode, which assumes that the input ByteBuffer contains complete character sequences. It is possible that a multi-byte character sequence will be split such that only the first part of that sequence appears at the end of the input buffer, with the remainder of the sequence appearing at the start of the next buffer of input data. The above implementation will not properly handle this situation. The underlying decode method does handle this situation properly, but the remaining code in this example is not set up for this situation.
  • The decode convenience method throws exceptions rather than returning a status code as the full decode method does. Since these exceptions are not caught anywhere in the code, such an exception would cause that task to abort. A more robust solution would have a mechanism to catch exceptions or restart an aborted task.
  • The example assumes UTF-8 encoding.
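
The split multi-byte sequence limitation above could be addressed with the full three-argument decode method, which leaves the bytes of an incomplete sequence unread in the input buffer. The following is a rough sketch under that approach, not a drop-in replacement for LineDecoder: the StreamingDecoder class and its decodeChunk method are hypothetical names, it does not buffer partial lines, and it omits the final decode with endOfInput set to true and the flush call that would be needed when the connection closes.

```scala
import java.nio.{ByteBuffer, CharBuffer}
import java.nio.charset.Charset

//Hypothetical helper, not part of the original example.
class StreamingDecoder(charsetName: String = "UTF-8") {
    private val decoder = Charset.forName(charsetName).newDecoder()

    //Bytes of an incomplete character sequence carried over from the last call.
    private var leftover = ByteBuffer.allocate(0)

    //Decode as many complete characters as possible from the leftover
    //bytes followed by the new input; keep any trailing partial sequence.
    def decodeChunk(in: ByteBuffer): String = {
        val bytes = ByteBuffer.allocate(leftover.remaining + in.remaining)
        bytes.put(leftover)
        bytes.put(in)
        bytes.flip()

        val maxChars = math.ceil(bytes.remaining * decoder.maxCharsPerByte).toInt
        val out = CharBuffer.allocate(maxChars)

        //endOfInput=false tells the decoder that more bytes may follow.
        val result = decoder.decode(bytes, out, false)
        if (result.isError) result.throwException()   //malformed or unmappable input

        //Whatever the decoder did not consume is saved for the next call.
        leftover = ByteBuffer.allocate(bytes.remaining)
        leftover.put(bytes)
        leftover.flip()

        out.flip()
        out.toString
    }
}
```

If the two UTF-8 bytes of a character such as "\u00e9" arrive in separate buffers, the first call returns only the characters before it, and the second call returns the completed character followed by the rest of that buffer.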
