Thursday, July 28, 2011

Debugging Scala Parser Combinators

Two simple mechanisms for debugging parsers written using Scala's parser combinators.

Introduction

In a recent comment on my 2008 blog post about Scala's parser combinators, a reader asked how one might go about debugging such a parser. As one post says, "Debugging a parser implemented with the help of a combinator library has its special challenges." You may have trouble setting breakpoints, and stack traces can be difficult to interpret.

The two techniques I show here may not give you the kind of visibility you are used to when single-stepping through problem code, but I hope they provide at least a little more insight than you would otherwise have.

Example Parser

As an example parser I will use an integer-only version of the four-function arithmetic parser I built for my 2008 parser combinator post. The code consists of a set of case classes to represent the parsed results and a parser class that contains the parsing rules and a few helper methods. You can copy this code into a file and either compile it or load it into the Scala REPL.

import scala.util.parsing.combinator.syntactical.StandardTokenParsers

sealed abstract class Expr {
    def eval():Int
}

case class EConst(value:Int) extends Expr {
    def eval():Int = value
}

case class EAdd(left:Expr, right:Expr) extends Expr {
    def eval():Int = left.eval + right.eval
}

case class ESub(left:Expr, right:Expr) extends Expr {
    def eval():Int = left.eval - right.eval
}

case class EMul(left:Expr, right:Expr) extends Expr {
    def eval():Int = left.eval * right.eval
}

case class EDiv(left:Expr, right:Expr) extends Expr {
    def eval():Int = left.eval / right.eval
}

case class EUMinus(e:Expr) extends Expr {
    def eval():Int = -e.eval
}

object ExprParser extends StandardTokenParsers {
    lexical.delimiters ++= List("+","-","*","/","(",")")

    def value = numericLit ^^ { s => EConst(s.toInt) }

    def parens:Parser[Expr] = "(" ~> expr <~ ")"

    def unaryMinus:Parser[EUMinus] = "-" ~> term ^^ { EUMinus(_) }

    def term = ( value |  parens | unaryMinus )

    def binaryOp(level:Int):Parser[((Expr,Expr)=>Expr)] = {
        level match {
            case 1 =>
                "+" ^^^ { (a:Expr, b:Expr) => EAdd(a,b) } |
                "-" ^^^ { (a:Expr, b:Expr) => ESub(a,b) }
            case 2 =>
                "*" ^^^ { (a:Expr, b:Expr) => EMul(a,b) } |
                "/" ^^^ { (a:Expr, b:Expr) => EDiv(a,b) }
            case _ => throw new RuntimeException("bad precedence level "+level)
        }
    }
    val minPrec = 1
    val maxPrec = 2

    // The * combinator is chainl1: it parses one or more operands at the next
    // precedence level, separated by operators at this level, left-associatively.
    def binary(level:Int):Parser[Expr] =
        if (level>maxPrec) term
        else binary(level+1) * binaryOp(level)

    def expr = ( binary(minPrec) | term )

    def parse(s:String) = {
        val tokens = new lexical.Scanner(s)
        phrase(expr)(tokens)
    }

    def apply(s:String):Expr = {
        parse(s) match {
            case Success(tree, _) => tree
            case e: NoSuccess =>
                   throw new IllegalArgumentException("Bad syntax: "+s)
        }
    }

    def test(exprstr: String) = {
        parse(exprstr) match {
            case Success(tree, _) =>
                println("Tree: "+tree)
                val v = tree.eval()
                println("Eval: "+v)
            case e: NoSuccess => Console.err.println(e)
        }
    }
    
    //A main method for testing
    def main(args: Array[String]) = test(args(0))
}
In the ExprParser object, the lines up to and including the definition of the expr method define the parsing rules, whereas the methods from parse onwards are helper methods.

Calling Individual Parsers

We can easily ask our example parser to parse a string by calling the ExprParser.test method, which parses the string using our parse method, prints the parse result, and (if the parse was successful) evaluates the parse tree and prints that value.

The last line of parse parses a string using our expression parser:
phrase(expr)(tokens)
phrase is a method that StandardTokenParsers inherits from Parsers; it runs the specified parser on an input stream and succeeds only if that parser consumes the entire input. The only thing special about our expr method is that we happen to have selected it as our top-level parser - we could just as easily have picked any of our other parsers for that role.

Let's add another version of the test method that lets us specify which parser to use as the top-level parser. We want to print out the results in the same way as for the existing test method, so we first refactor that existing method:
    def test(exprstr: String) =
        printParseResult(parse(exprstr))

    def printParseResult(pr:ParseResult[Expr]) = {
        pr match {
            case Success(tree, _) =>
                println("Tree: "+tree)
                val v = tree.eval()
                println("Eval: "+v)
            case e: NoSuccess => Console.err.println(e)
        }
    }
Now we add a new parse method that accepts a parser as an argument, and we call that from our new test method:
    def parse(p:Parser[Expr], s:String) = {
        val tokens = new lexical.Scanner(s)
        phrase(p)(tokens)
    }

    def test(p:Parser[Expr], exprstr: String) =
        printParseResult(parse(p,exprstr))
We can run the Scala REPL, load our modified file using the ":load" command, then manually invoke the top-level parser by calling our test method. To reduce typing, we import everything from ExprParser. In the examples below, the text after each scala> prompt is what we type; the rest is printed by the REPL.
scala> import ExprParser._
import ExprParser._

scala> test("1+2")
Tree: EAdd(EConst(1),EConst(2))
Eval: 3

scala> test("1+2*3")
Tree: EAdd(EConst(1),EMul(EConst(2),EConst(3)))
Eval: 7

scala> test("(1+2)*3")
Tree: EMul(EAdd(EConst(1),EConst(2)),EConst(3))
Eval: 9
We can also call the test method that takes a parser as an argument, allowing us to specifically test one particular parsing rule at a time. If we pass in expr as the parser, we will get the same results as above; but if we pass in a different parser, we may get different results.
scala> test(expr,"1+2*3")
Tree: EAdd(EConst(1),EMul(EConst(2),EConst(3)))
Eval: 7

scala> test(binary(1),"1+2*3")
Tree: EAdd(EConst(1),EMul(EConst(2),EConst(3)))
Eval: 7

scala> test(binary(2),"1+2*3")
[1.2] failure: ``/'' expected but `+' found

1+2*3
 ^

scala> test(parens,"1+2")
[1.1] failure: ``('' expected but 1 found

1+2
^

scala> test(parens,"(1+2)")
Tree: EAdd(EConst(1),EConst(2))
Eval: 3

scala> test(parens,"(1+2)*3")
[1.6] failure: end of input expected

(1+2)*3
     ^

Tracing

If you have a larger parser that is misbehaving and you are not quite sure where the problem lies, it can be tedious to call individual parsers directly until you find the culprit. Being able to trace the progress of the whole parser running on an input known to cause the problem can be helpful, but sprinkling println statements throughout a combinator-based parser can be tricky. This section provides an approach that allows you to do some tracing with minimal changes to your code. The output can get pretty verbose, but at least it gives you a starting point from which you may be able to devise your own improved debugging.

The idea behind this approach is to wrap some or all of the individual parsers in a debugging parser that delegates its apply action to the wrapped parser, but that also prints out some debugging information. The apply method is called during the act of parsing.

Note: this code relies on the fact that the code for the various combinators in the Parser class in Scala's StandardTokenParsers (which is implemented as an inner class in scala.util.parsing.combinator.Parsers) does not override any Parser method other than apply.

This code could be added directly to the ExprParser object, but it is presented here as a separate trait to make it easier to reuse. Add this DebugStandardTokenParsers trait to the file containing ExprParser.
trait DebugStandardTokenParsers extends StandardTokenParsers {
    class Wrap[+T](name:String,parser:Parser[T]) extends Parser[T] {
        def apply(in: Input): ParseResult[T] = {
            val first = in.first
            val pos = in.pos
            val offset = in.offset
            val t = parser.apply(in)
            println(name+".apply for token "+first+
                    " at position "+pos+" offset "+offset+" returns "+t)
            t
        }
    }
}
The Wrap class provides the hook into the apply method that we need in order to print our trace information as the parser runs. Once this trait is in place, we modify ExprParser to extend DebugStandardTokenParsers rather than StandardTokenParsers:
object ExprParser extends DebugStandardTokenParsers { ... }
So far we have not changed the behavior of the parser, since we have not yet wired in the Wrap class. To do so, we take any of the existing parsers and wrap it in a new Wrap. For example, with the top-level expr parser we could do this:
def expr = new Wrap("expr", ( binary(minPrec) | term ) )
We can make this a bit easier to edit and read by using implicits. In DebugStandardTokenParsers we add this method:
    implicit def toWrapped(name:String) = new {
        def !!![T](p:Parser[T]) = new Wrap(name,p)
    }
Now we can wrap our expr method like this:
def expr = "expr" !!! ( binary(minPrec) | term )
If you don't like using !!! as an operator, you are free to pick something more to your taste, or you can leave out the implicit and just use the new Wrap approach.

At this point you must modify your source code by adding the above syntax to each parsing rule that you want to trace. You can go through and do them all, or you can just pick out the ones you think are the most likely culprits and wrap those. Note that you can wrap any parser this way, including those that appear as pieces in the middle of other parsers. The following example shows how some of the parsers in the term and binaryOp methods can be wrapped:
    def term = "term" !!! ( value |  "term-parens" !!! parens | unaryMinus )

    def binaryOp(level:Int):Parser[((Expr,Expr)=>Expr)] = {
        level match {
            case 1 =>
                "add" !!! "+" ^^^ { (a:Expr, b:Expr) => EAdd(a,b) } |
                "sub" !!! "-" ^^^ { (a:Expr, b:Expr) => ESub(a,b) }
            case 2 =>
                "mul" !!! "*" ^^^ { (a:Expr, b:Expr) => EMul(a,b) } |
                "div" !!! "/" ^^^ { (a:Expr, b:Expr) => EDiv(a,b) }
            case _ => throw new RuntimeException("bad precedence level "+level)
        }
    }

Assuming we have wrapped the expr, term and binaryOp methods as in the above examples, here is what the output looks like for a few tests. As in the previous REPL example, what we type follows the scala> prompt. If you are using the REPL and reload the file, remember to run import ExprParser._ again to pick up the newer definitions.
scala> test("1")
term.apply for token 1 at position 1.1 offset 0 returns [1.2] parsed: EConst(1)
plus.apply for token EOF at position 1.2 offset 1 returns [1.2] failure: ``+'' expected but EOF found

1
 ^
minus.apply for token EOF at position 1.2 offset 1 returns [1.2] failure: ``-'' expected but EOF found

1
 ^
expr.apply for token 1 at position 1.1 offset 0 returns [1.2] parsed: EConst(1)
Tree: EConst(1)
Eval: 1

scala> test("(1+2)*3")
term.apply for token 1 at position 1.2 offset 1 returns [1.3] parsed: EConst(1)
plus.apply for token `+' at position 1.3 offset 2 returns [1.4] parsed: +
term.apply for token 2 at position 1.4 offset 3 returns [1.5] parsed: EConst(2)
plus.apply for token `)' at position 1.5 offset 4 returns [1.5] failure: ``+'' expected but `)' found

(1+2)*3
    ^
minus.apply for token `)' at position 1.5 offset 4 returns [1.5] failure: ``-'' expected but `)' found

(1+2)*3
    ^
expr.apply for token 1 at position 1.2 offset 1 returns [1.5] parsed: EAdd(EConst(1),EConst(2))
term-parens.apply for token `(' at position 1.1 offset 0 returns [1.6] parsed: EAdd(EConst(1),EConst(2))
term.apply for token `(' at position 1.1 offset 0 returns [1.6] parsed: EAdd(EConst(1),EConst(2))
term.apply for token 3 at position 1.7 offset 6 returns [1.8] parsed: EConst(3)
plus.apply for token EOF at position 1.8 offset 7 returns [1.8] failure: ``+'' expected but EOF found

(1+2)*3
       ^
minus.apply for token EOF at position 1.8 offset 7 returns [1.8] failure: ``-'' expected but EOF found

(1+2)*3
       ^
expr.apply for token `(' at position 1.1 offset 0 returns [1.8] parsed: EMul(EAdd(EConst(1),EConst(2)),EConst(3))
Tree: EMul(EAdd(EConst(1),EConst(2)),EConst(3))
Eval: 9

scala> test(parens,"(1+2)")
term.apply for token 1 at position 1.2 offset 1 returns [1.3] parsed: EConst(1)
mul.apply for token `+' at position 1.3 offset 2 returns [1.3] failure: ``*'' expected but `+' found

(1+2)
  ^
div.apply for token `+' at position 1.3 offset 2 returns [1.3] failure: ``/'' expected but `+' found

(1+2)
  ^
add.apply for token `+' at position 1.3 offset 2 returns [1.4] parsed: +
term.apply for token 2 at position 1.4 offset 3 returns [1.5] parsed: EConst(2)
mul.apply for token `)' at position 1.5 offset 4 returns [1.5] failure: ``*'' expected but `)' found

(1+2)
    ^
div.apply for token `)' at position 1.5 offset 4 returns [1.5] failure: ``/'' expected but `)' found

(1+2)
    ^
add.apply for token `)' at position 1.5 offset 4 returns [1.5] failure: ``+'' expected but `)' found

(1+2)
    ^
sub.apply for token `)' at position 1.5 offset 4 returns [1.5] failure: ``-'' expected but `)' found

(1+2)
    ^
expr.apply for token 1 at position 1.2 offset 1 returns [1.5] parsed: EAdd(EConst(1),EConst(2))
Tree: EAdd(EConst(1),EConst(2))
Eval: 3
As you can see, even for these very short input strings the output is pretty verbose. It does, however, show you what token it is trying to parse and where in the input stream that token is, so by paying attention to the position and offset numbers you can see where it is backtracking.

When you have found the problem and are done debugging, you can remove the DebugStandardTokenParsers trait and take out all of the !!! wrapping operations, or you can leave everything in place and disable the wrapper output by changing the definition of the implicit !!! operator to this:
def !!![T](p:Parser[T]) = p
Or, if you want to make it possible to enable debugging output later, change !!! to return either p or new Wrap(name,p) depending on some debugging configuration value.
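
A minimal sketch of that idea, assuming a mutable flag added to DebugStandardTokenParsers (the name debugParsing is hypothetical, not part of the code above):
    // Hypothetical flag; flip it (for example from the REPL) to turn tracing on or off.
    var debugParsing = true

    implicit def toWrapped(name:String) = new {
        def !!![T](p:Parser[T]):Parser[T] =
            if (debugParsing) new Wrap(name,p) else p
    }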

Updated Example

Below is the complete program with all of the above changes.
import scala.util.parsing.combinator.syntactical.StandardTokenParsers

sealed abstract class Expr {
    def eval():Int
}

case class EConst(value:Int) extends Expr {
    def eval():Int = value
}

case class EAdd(left:Expr, right:Expr) extends Expr {
    def eval():Int = left.eval + right.eval
}

case class ESub(left:Expr, right:Expr) extends Expr {
    def eval():Int = left.eval - right.eval
}

case class EMul(left:Expr, right:Expr) extends Expr {
    def eval():Int = left.eval * right.eval
}

case class EDiv(left:Expr, right:Expr) extends Expr {
    def eval():Int = left.eval / right.eval
}

case class EUMinus(e:Expr) extends Expr {
    def eval():Int = -e.eval
}

trait DebugStandardTokenParsers extends StandardTokenParsers {
    class Wrap[+T](name:String,parser:Parser[T]) extends Parser[T] {
        def apply(in: Input): ParseResult[T] = {
            val first = in.first
            val pos = in.pos
            val offset = in.offset
            val t = parser.apply(in)
            println(name+".apply for token "+first+
                    " at position "+pos+" offset "+offset+" returns "+t)
            t
        }
    }

    implicit def toWrapped(name:String) = new {
        def !!![T](p:Parser[T]) = new Wrap(name,p) //for debugging
        //def !!![T](p:Parser[T]) = p              //for production
    }
}

object ExprParser extends DebugStandardTokenParsers {
    lexical.delimiters ++= List("+","-","*","/","(",")")

    def value = numericLit ^^ { s => EConst(s.toInt) }

    def parens:Parser[Expr] = "(" ~> expr <~ ")"

    def unaryMinus:Parser[EUMinus] = "-" ~> term ^^ { EUMinus(_) }

    def term = "term" !!! ( value |  "term-parens" !!! parens | unaryMinus )

    def binaryOp(level:Int):Parser[((Expr,Expr)=>Expr)] = {
        level match {
            case 1 =>
                "add" !!! "+" ^^^ { (a:Expr, b:Expr) => EAdd(a,b) } |
                "sub" !!! "-" ^^^ { (a:Expr, b:Expr) => ESub(a,b) }
            case 2 =>
                "mul" !!! "*" ^^^ { (a:Expr, b:Expr) => EMul(a,b) } |
                "div" !!! "/" ^^^ { (a:Expr, b:Expr) => EDiv(a,b) }
            case _ => throw new RuntimeException("bad precedence level "+level)
        }
    }
    val minPrec = 1
    val maxPrec = 2

    // The * combinator is chainl1: it parses one or more operands at the next
    // precedence level, separated by operators at this level, left-associatively.
    def binary(level:Int):Parser[Expr] =
        if (level>maxPrec) term
        else binary(level+1) * binaryOp(level)

    def expr = "expr" !!! ( binary(minPrec) | term )

    def parse(s:String) = {
        val tokens = new lexical.Scanner(s)
        phrase(expr)(tokens)
    }

    def parse(p:Parser[Expr], s:String) = {
        val tokens = new lexical.Scanner(s)
        phrase(p)(tokens)
    }

    def apply(s:String):Expr = {
        parse(s) match {
            case Success(tree, _) => tree
            case e: NoSuccess =>
                   throw new IllegalArgumentException("Bad syntax: "+s)
        }
    }

    def test(exprstr: String) =
        printParseResult(parse(exprstr))

    def test(p:Parser[Expr], exprstr: String) =
        printParseResult(parse(p,exprstr))

    def printParseResult(pr:ParseResult[Expr]) = {
        pr match {
            case Success(tree, _) =>
                println("Tree: "+tree)
                val v = tree.eval()
                println("Eval: "+v)
            case e: NoSuccess => Console.err.println(e)
        }
    }
    
    //A main method for testing
    def main(args: Array[String]) = test(args(0))
}

5 comments:

Dobes said...

Thanks. Actually the fact that the toString() method on failure outputs something useful was very good. You could consider changing the parse method so that when it throws IllegalArgumentException it includes e.toString in the message.
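
A sketch of that suggestion, applied to the apply helper (which is where the IllegalArgumentException is thrown), might look like this:
    def apply(s:String):Expr = {
        parse(s) match {
            case Success(tree, _) => tree
            case e: NoSuccess =>
                // Include the parser's own failure message in the exception.
                throw new IllegalArgumentException("Bad syntax: "+s+"\n"+e)
        }
    }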

Dobes said...

Turns out my issue was related to using rep1sep - a warning for those that follow is that rep1sep will still match even if the separator is not present at all, it just returns a 1-element list.

Thus if you are using rep1sep with | to provide alternatives your alternatives will never match.

i.e. rep1sep(term, "*") ... | rep1sep(term, "/") will never check for a "/" if it finds a single term.

The solution is to make your own version of rep1sep that requires at least one instance of the separator and put term as the fallback.
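
One way to write such a combinator, sketched here under the hypothetical name rep2sep (inside any Parsers subclass, such as ExprParser), is to require one element followed by at least one separator/element pair:
    // Like rep1sep, but requires at least one separator, i.e. at least two elements.
    def rep2sep[T](p: => Parser[T], sep: => Parser[Any]): Parser[List[T]] =
        p ~ rep1(sep ~> p) ^^ { case first ~ rest => first :: rest }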

TechNeilogy said...

Thanks. I'm just learning Scala and one of the first things I want to do is port an existing parser I wrote in F#. This blog post should help a lot.

Vassil Dichev said...

That's a thoughtful post and you've apparently spent quite some time writing Scala parsers, but I'm surprised you haven't mentioned the "log" parser - its output has helped me on more than one occasion:

http://stackoverflow.com/questions/2387892/parser-combinator-not-terminating-how-to-log-what-is-going-on

Jim McBeath said...

Vassil Dichev: An embarrassing oversight on my part! Thanks for pointing it out.

Readers, if you are happy with the format of the standard log parser, you can throw out the entire DebugStandardTokenParsers class and just add this little method to your parser class (ExprParser in my example) to enable the use of the !!! method:

implicit def toLogged(name:String) = new {
   def !!![T](p:Parser[T]) = log(p)(name)
}

My approach with DebugStandardTokenParsers allows for a bit of customization of the logging output, but in most situations I suspect that would not be necessary, and the much simpler expedient of adding the above implicit (or just using the log parser directly) would be satisfactory and far simpler.
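
For example, applying the log combinator directly to the top-level rule (a sketch, without the !!! implicit) might look like:
    def expr = log(binary(minPrec) | term)("expr")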