Friday, September 26, 2008

Scala Parser Combinators

One of the reasons I chose to do my StringArt applet as my first Scala applet was because Scala includes a nifty capability called parser combinators.

A parser is a function that converts a stream of input tokens into a data structure that can be more easily digested by the application. A combinator is a way of combing two functions to produce a composed function. A parser combinator is a way of combining two parsers to produce another parser.

The Parser classes in Scala include methods to combine Parsers in various ways and to add functions to be executed when a parse succeeds, to build the resulting datastructure.

Parser combinators simplify the creation of some parsers, including the simple expression parsing I wanted. When using Scala, the specification of the syntax to be parsed, which might normally be done in a separate file which is read by a separate parser (as when using a parser generator such as yacc), can be done directly in a Scala source file.

Scala's syntax has two features that make it possible to write a syntax specification directly in Scala that still looks a lot like a BNF description.
  1. (Almost) any operator string can be used as a method name.
  2. A method that takes a single argument can be written without parentheses.
The parser combinator class is designed to be able to generate an abstract syntax tree (AST) representing the parse, which can then be easily manipulated by the application. In the following example, we create a parser that builds an AST which we then evaluate to produce a value.

Contents

Basic Parser

We start with a simple Expr base class for our AST elements, with case classes for constant values and an add operator. The base Expr class is a sealed class, so all of its case classes, including whatever we add below, must go into the same source file as the base class.
sealed abstract class Expr { def eval():Double } case class EConst(value:Double) extends Expr { def eval():Double = value } case class EAdd(left:Expr, right:Expr) extends Expr { def eval():Double = left.eval + right.eval }
Now we create a simple parser that understands expressions consisting either of a single integer or two integers with a plus between them.
import scala.util.parsing.combinator.syntactical._ object ExprParser extends StandardTokenParsers { lexical.delimiters ++= List("+") def value = numericLit ^^ { s => EConst(s.toDouble) } def sum = value ~ "+" ~ value ^^ { case left ~ "+" ~ right => EAdd(left, right) } def expr = ( sum | value ) //top level expression def parse(s:String) = { val tokens = new lexical.Scanner(s) phrase(expr)(tokens) } def apply(s:String):Expr = { parse(s) match { case Success(tree, _) => tree case e: NoSuccess => throw new IllegalArgumentException("Bad syntax: "+s) } } //Simplify testing def test(exprstr: String) = { parse(exprstr) match { case Success(tree, _) => println("Tree: "+tree) val v = tree.eval() println("Eval: "+v) case e: NoSuccess => Console.err.println(e) } } //A main method for testing def main(args: Array[String]) = test(args(0)) }
The parse, apply, test and main methods are all helper methods so that we can run our parser and see what it is doing. So far our actual parser consists only of the value, sum and expr definitions.

In the above code, the tilda character ~ is a Parser method that composes two parsers into another parser that sequentially calls the two composed parsers, and the vertical bar character | creates a parser which calls the second parser only if the first parser fails. The double carat ^^ passes the parsed value to a function that builds the AST element. When we want to match against a parse sequence and extract more than one value, we can use a case statement after the ^^. That case pattern should match the pattern of the parsers (such as can be seen in the sum parser and its case pattern).

Strings in the middle of these sequences of parsers, such as "+" in the sum parser, are implicitly converted into keyword parsers that accept only that literal string.

You can cut and paste the above two code fragments into a file, load it into the Scala interpreter with the :load command, and test it by entering commands such as the following:
ExprParser.test("123") ExprParser.test("12 + 34")

Repeated Terms

Our parser only handles additions with two terms. In order to handle additions with three or more terms, we redefine ExprParser.sum using the repsep method:
def sum = repsep(value,"+") ^^ { a:List[Expr] => ESum(a) }
We also add the ESum case class:
case class ESum(a:List[Expr]) extends Expr { def eval():Double = a.foldLeft(0.0)(_ + _.eval) }
In order to add a subtract operator, we once again modify our ExprParser.sum method, this time using the * operator. We also add our subtract operator - to the list of delimiters:
lexical.delimiters ++= List("+", "-") ... def sum = value * ( "+" ^^^ { (a:Expr, b:Expr) => EAdd(a,b) } | "-" ^^^ { (a:Expr, b:Expr) => ESub(a,b) } )
Note that we are using a triple carat ^^^ here rather than the double carat ^^ that we used earlier. The triple carat operator does not use any data from the parser in the return value. In our use, we are returning a function which will be used by the * operator to collapse the values in the list.

We also add the ESub case class:
case class ESub(left:Expr, right:Expr) extends Expr { def eval():Double = left.eval - right.eval }
Now we can enter expressions like these:
ExprParser.test("2-1") ExprParser.test("1+2+3-4")

Precedence

Next we add multiply and divide operators. As when we added the subtract operator, we add the * and / delimiters to ExprParser.lexical.delimiters and we add case classes EMul and EDiv:
case class EMul(left:Expr, right:Expr) extends Expr { def eval():Double = left.eval * right.eval } case class EDiv(left:Expr, right:Expr) extends Expr { def eval():Double = left.eval / right.eval } class ExprParser { .... lexical.delimiters ++= List("+", "-", "*", "/") .... }
However, since multiplication and division have higher precedence than add and subtract, we can't just add those operators to the sum parser like we added subtract. Instead, we create a product parser to handle the two new operators. We insert this logically between the sum and value parsers:
class ExprParser { .... def product = value * ( "*" ^^^ { (a:Expr, b:Expr) => EMul(a,b) } | "/" ^^^ { (a:Expr, b:Expr) => EDiv(a,b) } ) def sum = product * ( ... .... }
We can now enter expressions containing multiplication as well as addition and have them properly evaluated:
ExprParser.test("3+4*5")

Parentheses

We would like to be able to use parentheses, so we add those two characters to ExprParser.lexical.delimiters. We add a parens parser, we add a term parser that accepts either a number or a value in parentheses, and we modify the product parser to call for a term rather than a value:
class ExprParser { ... lexical.delimiters ++= List("+", "-", "*", "/", "(", ")") ... def parens:Parser[Expr] = "(" ~> expr <~ ")" def term = ( value | parens ) def product = term * ... def expr = ( sum | term ) ... }
We don't need to provide a value expression for the term parser because the default value for the | operator is the value of whichever expression was parsed, which is what we want. We don't need to provide a value expression for the parens parser because the ~> operator throws away the value on the left and the <~ operator throws away the value on the right, so we are left with the value of the expr in the middle.

Now we can enter expressions with parentheses when we want to change the order of operations:
ExprParser.test("(3+4)*5")

Unary Operators

We want to be able to specify negative numbers. We can't just make a leading negative sign an optional part of our numericLit token because then we can't parse a string such as 2-1. Instead, we implement a unary minus operation with a precendence higher than any of our binary operators.

The minus character is already in our set of delimiters, so we don't need to change that list. We add a EUMinus case class, add a unaryMinus parser definition, and modify the term parser to include the unaryMinus parser:
case class EUMinus(e:Expr) extends Expr { def eval():Double = -e.eval } class ExprParser { ... def unaryMinus:Parser[EUMinus] = "-" ~> term ^^ { EUMinus(_) } def term = ( value | parens | unaryMinus) ... }
We need to specify the return type for unaryMinus because it is recursive (through the call to term), so Scala does not automatically infer its type.

Now we can parse expressions that include unary minus:
ExprParser.test("3*-(2+2)")

Lexical

The standard Scala lexical analyzer implementation of numericLit only parses integers, but we want floating point notation, so we extend the standard lexical class to do that. Note the first line of the class, where it overrides the definition of the token method to be either our own floatingToken or a token from our superclass. The rest of ExprLexical is the definition of the floatingToken token, complete with optional signed exponent.
import scala.util.parsing.combinator.lexical.StdLexical class ExprLexical extends StdLexical { override def token: Parser[Token] = floatingToken | super.token def floatingToken: Parser[Token] = rep1(digit) ~ optFraction ~ optExponent ^^ { case intPart ~ frac ~ exp => NumericLit( (intPart mkString "") :: frac :: exp :: Nil mkString "")} def chr(c:Char) = elem("", ch => ch==c ) def sign = chr('+') | chr('-') def optSign = opt(sign) ^^ { case None => "" case Some(sign) => sign } def fraction = '.' ~ rep(digit) ^^ { case dot ~ ff => dot :: (ff mkString "") :: Nil mkString "" } def optFraction = opt(fraction) ^^ { case None => "" case Some(fraction) => fraction } def exponent = (chr('e') | chr('E')) ~ optSign ~ rep1(digit) ^^ { case e ~ optSign ~ exp => e :: optSign :: (exp mkString "") :: Nil mkString "" } def optExponent = opt(exponent) ^^ { case None => "" case Some(exponent) => exponent } }
We then modify our ExprParser class to use our own lexical parser so that we can parse floats or ints:
object ExprParser extends StandardTokenParsers { override val lexical = new ExprLexical ... }
Now we can enter expressions with either integer or floating point numbers:
ExprParser.test("123.456") ExprParser.test("1.2+3") ExprParser.test("5e-1 + 2")

Precedence Revisited

If we wanted to add some more binary operators at a different level of precedence (such as the integer shift operators << and >>), we could handle that the same way as we did when we added multiply and divide, by adding a new rule and modifying the existing rules to insert it at the right location. It would be nicer if we could factor out the precedence rules so that we don't need multiple binary operator rules. Scala's parser classes do not directly support precedence rules, but we can add them relatively easily.

We replace the product and sum parsers with a binary parser and we call that instead of sum from our expr parser. We factor out the binary operations into a binaryOp method where for each precedence level we list each binary operator and its corresponding AST node. The binary parser takes an argument which is the current precedence level.
class ExprParser { ... def binaryOp(level:Int):Parser[((Expr,Expr)=>Expr)] = { level match { case 1 => "+" ^^^ { (a:Expr, b:Expr) => EAdd(a,b) } | "-" ^^^ { (a:Expr, b:Expr) => ESub(a,b) } case 2 => "*" ^^^ { (a:Expr, b:Expr) => EMul(a,b) } | "/" ^^^ { (a:Expr, b:Expr) => EDiv(a,b) } case _ => throw new RuntimeException("bad precedence level "+level) } } val minPrec = 1 val maxPrec = 2 def binary(level:Int):Parser[Expr] = if (level>maxPrec) term else binary(level+1) * binaryOp(level) def expr = ( binary(minPrec) | term ) ... }
Now if we want to add binary operators at a new precedence level, we only need to modify the binaryOp method, update the value of maxPrec, and add our case classes.

Wrap Up

Below is the complete example in its final form. You should be able to copy and paste this one code fragment and have a working four-function floating point expression parser.
sealed abstract class Expr { def eval():Double } case class EConst(value:Double) extends Expr { def eval():Double = value } case class EAdd(left:Expr, right:Expr) extends Expr { def eval():Double = left.eval + right.eval } case class ESub(left:Expr, right:Expr) extends Expr { def eval():Double = left.eval - right.eval } case class EMul(left:Expr, right:Expr) extends Expr { def eval():Double = left.eval * right.eval } case class EDiv(left:Expr, right:Expr) extends Expr { def eval():Double = left.eval / right.eval } case class EUMinus(e:Expr) extends Expr { def eval():Double = -e.eval } import scala.util.parsing.combinator.lexical.StdLexical class ExprLexical extends StdLexical { override def token: Parser[Token] = floatingToken | super.token def floatingToken: Parser[Token] = rep1(digit) ~ optFraction ~ optExponent ^^ { case intPart ~ frac ~ exp => NumericLit( (intPart mkString "") :: frac :: exp :: Nil mkString "")} def chr(c:Char) = elem("", ch => ch==c ) def sign = chr('+') | chr('-') def optSign = opt(sign) ^^ { case None => "" case Some(sign) => sign } def fraction = '.' ~ rep(digit) ^^ { case dot ~ ff => dot :: (ff mkString "") :: Nil mkString "" } def optFraction = opt(fraction) ^^ { case None => "" case Some(fraction) => fraction } def exponent = (chr('e') | chr('E')) ~ optSign ~ rep1(digit) ^^ { case e ~ optSign ~ exp => e :: optSign :: (exp mkString "") :: Nil mkString "" } def optExponent = opt(exponent) ^^ { case None => "" case Some(exponent) => exponent } } import scala.util.parsing.combinator.syntactical._ object ExprParser extends StandardTokenParsers { override val lexical = new ExprLexical lexical.delimiters ++= List("+","-","*","/","(",")") def value = numericLit ^^ { s => EConst(s.toDouble) } def parens:Parser[Expr] = "(" ~> expr <~ ")" def unaryMinus:Parser[EUMinus] = "-" ~> term ^^ { EUMinus(_) } def term = ( value | parens | unaryMinus ) def binaryOp(level:Int):Parser[((Expr,Expr)=>Expr)] = { level match { case 1 => "+" ^^^ { (a:Expr, b:Expr) => EAdd(a,b) } | "-" ^^^ { (a:Expr, b:Expr) => ESub(a,b) } case 2 => "*" ^^^ { (a:Expr, b:Expr) => EMul(a,b) } | "/" ^^^ { (a:Expr, b:Expr) => EDiv(a,b) } case _ => throw new RuntimeException("bad precedence level "+level) } } val minPrec = 1 val maxPrec = 2 def binary(level:Int):Parser[Expr] = if (level>maxPrec) term else binary(level+1) * binaryOp(level) def expr = ( binary(minPrec) | term ) def parse(s:String) = { val tokens = new lexical.Scanner(s) phrase(expr)(tokens) } def apply(s:String):Expr = { parse(s) match { case Success(tree, _) => tree case e: NoSuccess => throw new IllegalArgumentException("Bad syntax: "+s) } } def test(exprstr: String) = { parse(exprstr) match { case Success(tree, _) => println("Tree: "+tree) val v = tree.eval() println("Eval: "+v) case e: NoSuccess => Console.err.println(e) } } //A main method for testing def main(args: Array[String]) = test(args(0)) }
In my StringArt applet I also implemented variables and function calls, but those did not demonstrate any additional features of Scala's parser combinators, so I have left that as an exercise for the reader.

For additional reading check out these other parser combinator pages:
  • Debasish Ghosh blog entry in which he developers a parser for a stock (shares) trading language that looks very much like written English.
  • Daniel Spiewak blog entry in which he builds a parser for a simple language. Look in the comments for a link to an on-line version of Wadler's 1985 paper.
  • Matt Malone blog entry in which be builds a parser for the Subversion client-server message protocol syntax, using lots of RegexParsers.
  • Part 4 of Stefan Zeiger blog on parser combinators, where he creates a parser for interger expressions using * and +. The three previous parts deal more generally with parser combinators, with the promise of additional parts going into more detail about Scala parser combinators.
  • Philip Wadler's 1985 paper on parser combinators.
  • Adriaan Moors's Parsing in Scala paper. One of Adriaan's examples is a parser that handles the four basic arithmetic functions. In that example, the evaluation of the expression is done by the parser, whereas in my example I create an AST for later evaluation. I did this because in my real usage, I evaluate the expression multiple times, each time with different values for the variables which can be referenced by the expression.
  • The "Programming in Scala" book has a chapter on Parser Combinators.
  • The scaladoc page for the Parsers.Parser class in Scala lists the parser operators, and the scaladoc for the Parsers trait lists the parser method names.
  • This Parser combinator in Java was done by using parser combinators from Haskell and converting them to Java.
  • A 2001 paper about the Haskell Parsec parser.

13 comments:

Channing Walton said...

Thanks for this Jim, this is the clearest explanation on Scala's parser combinators I have found.

Unknown said...

Hi,

The entry seems simple enough, but unfortunately it doesn't work for me.

I have this code:

object ArithmeticParser extends
StdTokenParsers {
type Tokens = StdLexical
val lexical = new StdLexical

lexical.delimiters ++= List("+")

def value = numericLit ^^ { s => EConst(s.toDouble) }


def main(args: Array[String]) = {
Console.println(value(new lexical.Scanner("2")))
}
}

(and Expr, EConst, EAdd)

This works.

Now I try to add the "sum" definition, i.e. these lines:

def sum = value ~ "+" ~ value ^^ { case left ~ "+" ~ right =>
EAdd(left, right) }


I get three compile errors:

ArithmeticParser.scala:14: error: not found: value ~
def sum = value ~ "+" ~ value ^^ { case left ~ "+" ~ right =>

^
ArithmeticParser.scala:15: error: not found: value left
EAdd(left, right) }

^
ArithmeticParser.scala:15: error: not found: value right
EAdd(left, right) }

^

I have no clue why this doesn't work. Any idea?

Thanks.

Jim McBeath said...

Dimitris: I tried your code and it worked fine for me. I added your sum method just after the value method. It compiled without error, I changed the line in your main to use sum instead of value and 2+3 instead of 2, and it gave me a proper parse containing EAdd with nested EConst values of 2.0 and 3.0. I tested this using 2.7.3.final, with the code in a file loaded with the :load command at the REPL prompt.

Unknown said...

Thanks! (hmm, I hoped I would get an email when you responded, strange). I was mistakenly using Scala 2.6.0 (tried maven for the first time :P) and created a nice confusion for myself :)

Dobes said...

The scaladoc links in this article seem to be broken now :-(

Why did you get rid of repsep()? I shows up once and then disappears without explanation.

Dobes said...

Could you explain more about ^^^ ? The scaladoc I am looking at doesn't say anything at all about it:

http://www.scala-lang.org/api/beta/scala/util/parsing/combinator/Parsers$Parser.html

Dobes said...

I think you can replace something like:

"+" ^^^ { (a:Expr, b:Expr) => EAdd(a,b) }

With simply:

"+" ^^^ EAdd

... I haven't tried it yet, but I think it'll work and might be simpler. Or am I missing something?

Jim McBeath said...

Dobes: The scaladoc location has changed since I posted this. It should be pretty easy to find in the API documentation at scala-lang.org.

In the post, when I added the subtraction operator I changed from using repsep to using * because there were now two possible separators (+ or -), and I needed to be able to create different parse trees based on that.

The ^^^ operator throws away the result of the parse and replaces it with the specified value. In this case, the parsed item is the operator character (+ or -), which is discarded and replaced by the corresponding operator function (EAdd or ESub). If you use the "current" rather than "beta" documentation, you should see a bit more info for the ^^^ operator.

Yes, you can replace "(a:Expr, b:Expr) => EAdd(a,b)" in this situation by just "EAdd" and it will still compile and work the same way, but perhaps not for the reason you might expect. If EAdd were a function, then you would have to write "EAdd _" to make that work; but because EAdd is a case class, it happens to work when supplying only the case class name. If you want something with the same type signature here, you would use "EAdd.apply _".

Dobes said...

Any tips on how to debug your grammars? My grammar isn't working and I can't figure out why. Doesn't seem to be any tooling to trace parsing activity or anything nice like that.

Jim McBeath said...

Dobes: In response to your question I have written a post with a couple of techniques you might be able to use.

agrin said...

Thanks Jim you explained great. I have created my own parser for identifying path but I have a problem with that would you please help me to solve it. Here is a link to the problem: http://stackoverflow.com/questions/30902235/how-to-solve-an-error-related-to-creating-parser-from-regex

Unknown said...

Good Morning,

I want to use [ and ] too to parse, for example "[]" or "[Int]" to get EmptyList() or List(TypeInt()) but when I add "[" and "]" to the delimiters every time that I write for example "[]" in the entrance I get illegal token error, so I think that I need to add some other thing but I don't know what...

My delimiter list is:

lexical1.delimiters ++= List("λ", ".", ",","[","]","(",")","\\","|","/","=")

And the rule, that I think that's correct is:

lazy val list: P1[List] = "[" ~> type <~ "]" ^^ {case x => List(x)}

Could you help me please?

Thanks in advance

Jim McBeath said...

Oriol: I can't tell without seeing more of your code, but I suspect the problem is that your list expression requires a type between the square brackets, so that "[]" really is an illegal input because it does not specify a type. One of the disadvantages of using these parser combinators is that the error messages are not always very helpful.

You might want to check out my later post on debugging Scala parser combinators (and be sure to read the comments there).