A parser is a function that converts a stream of input tokens into a data structure that can be more easily digested by the application. A combinator is a way of combining two functions to produce a composed function. A parser combinator is a way of combining two parsers to produce another parser.
The Parser classes in Scala include methods to combine Parsers in various ways and to attach functions that are executed when a parse succeeds in order to build the resulting data structure.
Parser combinators simplify the creation of some parsers, including the simple expression parsing I wanted. When using Scala, the specification of the syntax to be parsed, which might normally be done in a separate file that is read by a separate parser (as when using a parser generator such as yacc), can be done directly in a Scala source file.
Scala's syntax has two features that make it possible to write a syntax specification directly in Scala that still looks a lot like a BNF description.
- (Almost) any operator string can be used as a method name.
- A call to a method that takes a single argument can be written in operator (infix) notation, without a dot or parentheses.
Basic Parser
We start with a simple Expr base class for our AST elements, with case classes for constant values and an add operator. The base Expr class is a sealed class, so all of its case classes, including whatever we add below, must go into the same source file as the base class.

    sealed abstract class Expr {
      def eval():Double
    }
    case class EConst(value:Double) extends Expr {
      def eval():Double = value
    }
    case class EAdd(left:Expr, right:Expr) extends Expr {
      def eval():Double = left.eval + right.eval
    }

Now we create a simple parser that understands expressions consisting either of a single integer or two integers with a plus between them.

    import scala.util.parsing.combinator.syntactical._

    object ExprParser extends StandardTokenParsers {
      lexical.delimiters ++= List("+")

      def value = numericLit ^^ { s => EConst(s.toDouble) }
      def sum = value ~ "+" ~ value ^^ { case left ~ "+" ~ right => EAdd(left, right) }
      def expr = ( sum | value )   //top level expression

      def parse(s:String) = {
        val tokens = new lexical.Scanner(s)
        phrase(expr)(tokens)
      }

      def apply(s:String):Expr = {
        parse(s) match {
          case Success(tree, _) => tree
          case e: NoSuccess =>
            throw new IllegalArgumentException("Bad syntax: "+s)
        }
      }

      //Simplify testing
      def test(exprstr: String) = {
        parse(exprstr) match {
          case Success(tree, _) =>
            println("Tree: "+tree)
            val v = tree.eval()
            println("Eval: "+v)
          case e: NoSuccess => Console.err.println(e)
        }
      }

      //A main method for testing
      def main(args: Array[String]) = test(args(0))
    }

The parse, apply, test and main methods are all helper methods so that we can run our parser and see what it is doing. So far our actual parser consists only of the value, sum and expr definitions.
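The apply method also lets us invoke the parser object directly as a function; here is a quick usage sketch, assuming the two fragments above have been loaded:

    // Parse a string into an AST and evaluate it.
    val tree = ExprParser("12 + 34")   // EAdd(EConst(12.0),EConst(34.0))
    println(tree.eval())               // prints 46.0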
In the above code, the tilde character ~ is a Parser method that composes two parsers into another parser that sequentially calls the two composed parsers, and the vertical bar character | creates a parser which calls the second parser only if the first parser fails. The double caret ^^ passes the parsed value to a function that builds the AST element.

When we want to match against a parse sequence and extract more than one value, we can use a case statement after the ^^. That case pattern should match the pattern of the parsers (as can be seen in the sum parser and its case pattern). Strings in the middle of these sequences of parsers, such as "+" in the sum parser, are implicitly converted into keyword parsers that accept only that literal string.
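To make those shapes concrete, here is the same sum parser again with its result type written out and a comment showing the nested pair structure that the case pattern deconstructs. The annotations are only illustrative; they are not required in the original code.

    // Inside object ExprParser: the sum parser with explicit types.
    // value ~ "+" ~ value yields nested pairs of type (EConst ~ String ~ EConst),
    // i.e. ~(~(EConst, String), EConst), which the case pattern takes apart.
    def sum: Parser[EAdd] =
      value ~ "+" ~ value ^^ { case left ~ "+" ~ right => EAdd(left, right) }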
You can cut and paste the above two code fragments into a file, load it into the Scala interpreter with the :load command, and test it by entering commands such as the following:

    ExprParser.test("123")
    ExprParser.test("12 + 34")
Repeated Terms
Our parser only handles additions with two terms. In order to handle additions with three or more terms, we redefine ExprParser.sum using the repsep method:

    def sum = repsep(value,"+") ^^ { a:List[Expr] => ESum(a) }

We also add the ESum case class:

    case class ESum(a:List[Expr]) extends Expr {
      def eval():Double = a.foldLeft(0.0)(_ + _.eval)
    }

In order to add a subtract operator, we once again modify our ExprParser.sum method, this time using the * operator. We also add our subtract operator - to the list of delimiters:

    lexical.delimiters ++= List("+", "-")
    ...
    def sum = value * (
      "+" ^^^ { (a:Expr, b:Expr) => EAdd(a,b) } |
      "-" ^^^ { (a:Expr, b:Expr) => ESub(a,b) } )

Note that we are using a triple caret ^^^ here rather than the double caret ^^ that we used earlier. The triple caret operator does not use any data from the parser in the return value. In our use, we are returning a function which will be used by the * operator to collapse the values in the list.
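The * method used here is essentially the library's chainl1 ("chain left") combinator: p * q parses one or more ps separated by qs and combines the parsed values from left to right using the (Expr,Expr)=>Expr functions that q produces. A small illustration of the tree I would expect the new sum parser to build:

    // For the input "1+2-3", sum parses value ("+"|"-" value)* and folds left,
    // producing ESub(EAdd(EConst(1.0), EConst(2.0)), EConst(3.0)),
    // which evaluates left-associatively: (1 + 2) - 3 = 0.
    ExprParser.test("1+2-3")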
We also add the ESub case class:

    case class ESub(left:Expr, right:Expr) extends Expr {
      def eval():Double = left.eval - right.eval
    }

Now we can enter expressions like these:

    ExprParser.test("2-1")
    ExprParser.test("1+2+3-4")
Precedence
Next we add multiply and divide operators. As when we added the subtract operator, we add the * and / delimiters to ExprParser.lexical.delimiters, and we add case classes EMul and EDiv:

    case class EMul(left:Expr, right:Expr) extends Expr {
      def eval():Double = left.eval * right.eval
    }
    case class EDiv(left:Expr, right:Expr) extends Expr {
      def eval():Double = left.eval / right.eval
    }

    object ExprParser extends StandardTokenParsers {
      ....
      lexical.delimiters ++= List("+", "-", "*", "/")
      ....
    }

However, since multiplication and division have higher precedence than add and subtract, we can't just add those operators to the sum parser as we did with subtract. Instead, we create a product parser to handle the two new operators. We insert this logically between the sum and value parsers:

    object ExprParser extends StandardTokenParsers {
      ....
      def product = value * (
        "*" ^^^ { (a:Expr, b:Expr) => EMul(a,b) } |
        "/" ^^^ { (a:Expr, b:Expr) => EDiv(a,b) } )
      def sum = product * ( ...
      ....
    }

We can now enter expressions containing multiplication as well as addition and have them properly evaluated:

    ExprParser.test("3+4*5")
Parentheses
We would like to be able to use parentheses, so we add those two characters to ExprParser.lexical.delimiters. We add a parens parser, we add a term parser that accepts either a number or a value in parentheses, and we modify the product parser to call for a term rather than a value:

    object ExprParser extends StandardTokenParsers {
      ...
      lexical.delimiters ++= List("+", "-", "*", "/", "(", ")")
      ...
      def parens:Parser[Expr] = "(" ~> expr <~ ")"
      def term = ( value | parens )
      def product = term * ...
      def expr = ( sum | term )
      ...
    }

We don't need to provide a value expression for the term parser because the default value for the | operator is the value of whichever expression was parsed, which is what we want. We don't need to provide a value expression for the parens parser because the ~> operator throws away the value on the left and the <~ operator throws away the value on the right, so we are left with the value of the expr in the middle.
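For comparison, here is what parens would look like if we used plain ~ sequencing and discarded the parentheses ourselves. This is just a sketch to show what ~> and <~ are saving us; the name parensVerbose is mine, for illustration only, and the version above is the one we actually use.

    // Equivalent to the parens parser above, but keeping all three results
    // and throwing away the parentheses by hand in the ^^ function:
    def parensVerbose:Parser[Expr] = "(" ~ expr ~ ")" ^^ { case "(" ~ e ~ ")" => e }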
Now we can enter expressions with parentheses when we want to change the order of operations:

    ExprParser.test("(3+4)*5")
Unary Operators
We want to be able to specify negative numbers. We can't just make a leading negative sign an optional part of our numericLit token, because then we couldn't parse a string such as 2-1. Instead, we implement a unary minus operation with a precedence higher than any of our binary operators.

The minus character is already in our set of delimiters, so we don't need to change that list. We add an EUMinus case class, add a unaryMinus parser definition, and modify the term parser to include the unaryMinus parser:

    case class EUMinus(e:Expr) extends Expr {
      def eval():Double = -e.eval
    }

    object ExprParser extends StandardTokenParsers {
      ...
      def unaryMinus:Parser[EUMinus] = "-" ~> term ^^ { EUMinus(_) }
      def term = ( value | parens | unaryMinus )
      ...
    }

We need to specify the return type for unaryMinus because it is recursive (through the call to term), so Scala does not automatically infer its type.
Now we can parse expressions that include unary minus:
ExprParser.test("3*-(2+2)")
Lexical
The standard Scala lexical analyzer implementation of numericLit only parses integers, but we want floating point notation, so we extend the standard lexical class to do that. Note the first line of the class, where it overrides the definition of the token method to be either our own floatingToken or a token from our superclass. The rest of ExprLexical is the definition of the floatingToken token, complete with optional signed exponent.

    import scala.util.parsing.combinator.lexical.StdLexical

    class ExprLexical extends StdLexical {
      override def token: Parser[Token] = floatingToken | super.token

      def floatingToken: Parser[Token] =
        rep1(digit) ~ optFraction ~ optExponent ^^ {
          case intPart ~ frac ~ exp =>
            NumericLit((intPart mkString "") :: frac :: exp :: Nil mkString "")
        }

      def chr(c:Char) = elem("", ch => ch==c )
      def sign = chr('+') | chr('-')
      def optSign = opt(sign) ^^ {
        case None => ""
        case Some(sign) => sign
      }
      def fraction = '.' ~ rep(digit) ^^ {
        case dot ~ ff => dot :: (ff mkString "") :: Nil mkString ""
      }
      def optFraction = opt(fraction) ^^ {
        case None => ""
        case Some(fraction) => fraction
      }
      def exponent = (chr('e') | chr('E')) ~ optSign ~ rep1(digit) ^^ {
        case e ~ optSign ~ exp =>
          e :: optSign :: (exp mkString "") :: Nil mkString ""
      }
      def optExponent = opt(exponent) ^^ {
        case None => ""
        case Some(exponent) => exponent
      }
    }

We then modify our ExprParser object to use our own lexical parser so that we can parse floats or ints:

    object ExprParser extends StandardTokenParsers {
      override val lexical = new ExprLexical
      ...
    }

Now we can enter expressions with either integer or floating point numbers:

    ExprParser.test("123.456")
    ExprParser.test("1.2+3")
    ExprParser.test("5e-1 + 2")
Precedence Revisited
If we wanted to add some more binary operators at a different level of precedence (such as the integer shift operators << and >>), we could handle that the same way as we did when we added multiply and divide, by adding a new rule and modifying the existing rules to insert it at the right location. It would be nicer if we could factor out the precedence rules so that we don't need multiple binary operator rules.

Scala's parser classes do not directly support precedence rules, but we can add them relatively easily. We replace the product and sum parsers with a binary parser and we call that instead of sum from our expr parser. We factor out the binary operations into a binaryOp method where for each precedence level we list each binary operator and its corresponding AST node. The binary parser takes an argument which is the current precedence level:

    object ExprParser extends StandardTokenParsers {
      ...
      def binaryOp(level:Int):Parser[((Expr,Expr)=>Expr)] = {
        level match {
          case 1 =>
            "+" ^^^ { (a:Expr, b:Expr) => EAdd(a,b) } |
            "-" ^^^ { (a:Expr, b:Expr) => ESub(a,b) }
          case 2 =>
            "*" ^^^ { (a:Expr, b:Expr) => EMul(a,b) } |
            "/" ^^^ { (a:Expr, b:Expr) => EDiv(a,b) }
          case _ => throw new RuntimeException("bad precedence level "+level)
        }
      }
      val minPrec = 1
      val maxPrec = 2

      def binary(level:Int):Parser[Expr] =
        if (level>maxPrec) term
        else binary(level+1) * binaryOp(level)

      def expr = ( binary(minPrec) | term )
      ...
    }

Now if we want to add binary operators at a new precedence level, we only need to modify the binaryOp method, update the value of maxPrec, and add our case classes.
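As a sketch of that procedure (not part of the original parser), here is roughly what adding a hypothetical modulo operator % at a new, higher precedence level 3 would look like:

    // Hypothetical: a modulo operator at its own precedence level.
    case class EMod(left:Expr, right:Expr) extends Expr {
      def eval():Double = left.eval % right.eval
    }

    object ExprParser extends StandardTokenParsers {
      ...
      lexical.delimiters ++= List("+","-","*","/","(",")","%")   // "%" added
      ...
      def binaryOp(level:Int):Parser[((Expr,Expr)=>Expr)] = {
        level match {
          case 1 =>
            "+" ^^^ { (a:Expr, b:Expr) => EAdd(a,b) } |
            "-" ^^^ { (a:Expr, b:Expr) => ESub(a,b) }
          case 2 =>
            "*" ^^^ { (a:Expr, b:Expr) => EMul(a,b) } |
            "/" ^^^ { (a:Expr, b:Expr) => EDiv(a,b) }
          case 3 =>                                               // new level
            "%" ^^^ { (a:Expr, b:Expr) => EMod(a,b) }
          case _ => throw new RuntimeException("bad precedence level "+level)
        }
      }
      val minPrec = 1
      val maxPrec = 3                                             // was 2
      ...
    }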
Wrap Up
Below is the complete example in its final form. You should be able to copy and paste this one code fragment and have a working four-function floating point expression parser.

In my StringArt applet I also implemented variables and function calls, but those did not demonstrate any additional features of Scala's parser combinators, so I have left that as an exercise for the reader.

    sealed abstract class Expr {
      def eval():Double
    }
    case class EConst(value:Double) extends Expr {
      def eval():Double = value
    }
    case class EAdd(left:Expr, right:Expr) extends Expr {
      def eval():Double = left.eval + right.eval
    }
    case class ESub(left:Expr, right:Expr) extends Expr {
      def eval():Double = left.eval - right.eval
    }
    case class EMul(left:Expr, right:Expr) extends Expr {
      def eval():Double = left.eval * right.eval
    }
    case class EDiv(left:Expr, right:Expr) extends Expr {
      def eval():Double = left.eval / right.eval
    }
    case class EUMinus(e:Expr) extends Expr {
      def eval():Double = -e.eval
    }

    import scala.util.parsing.combinator.lexical.StdLexical

    class ExprLexical extends StdLexical {
      override def token: Parser[Token] = floatingToken | super.token

      def floatingToken: Parser[Token] =
        rep1(digit) ~ optFraction ~ optExponent ^^ {
          case intPart ~ frac ~ exp =>
            NumericLit((intPart mkString "") :: frac :: exp :: Nil mkString "")
        }

      def chr(c:Char) = elem("", ch => ch==c )
      def sign = chr('+') | chr('-')
      def optSign = opt(sign) ^^ {
        case None => ""
        case Some(sign) => sign
      }
      def fraction = '.' ~ rep(digit) ^^ {
        case dot ~ ff => dot :: (ff mkString "") :: Nil mkString ""
      }
      def optFraction = opt(fraction) ^^ {
        case None => ""
        case Some(fraction) => fraction
      }
      def exponent = (chr('e') | chr('E')) ~ optSign ~ rep1(digit) ^^ {
        case e ~ optSign ~ exp =>
          e :: optSign :: (exp mkString "") :: Nil mkString ""
      }
      def optExponent = opt(exponent) ^^ {
        case None => ""
        case Some(exponent) => exponent
      }
    }

    import scala.util.parsing.combinator.syntactical._

    object ExprParser extends StandardTokenParsers {
      override val lexical = new ExprLexical
      lexical.delimiters ++= List("+","-","*","/","(",")")

      def value = numericLit ^^ { s => EConst(s.toDouble) }
      def parens:Parser[Expr] = "(" ~> expr <~ ")"
      def unaryMinus:Parser[EUMinus] = "-" ~> term ^^ { EUMinus(_) }
      def term = ( value | parens | unaryMinus )

      def binaryOp(level:Int):Parser[((Expr,Expr)=>Expr)] = {
        level match {
          case 1 =>
            "+" ^^^ { (a:Expr, b:Expr) => EAdd(a,b) } |
            "-" ^^^ { (a:Expr, b:Expr) => ESub(a,b) }
          case 2 =>
            "*" ^^^ { (a:Expr, b:Expr) => EMul(a,b) } |
            "/" ^^^ { (a:Expr, b:Expr) => EDiv(a,b) }
          case _ => throw new RuntimeException("bad precedence level "+level)
        }
      }
      val minPrec = 1
      val maxPrec = 2

      def binary(level:Int):Parser[Expr] =
        if (level>maxPrec) term
        else binary(level+1) * binaryOp(level)

      def expr = ( binary(minPrec) | term )

      def parse(s:String) = {
        val tokens = new lexical.Scanner(s)
        phrase(expr)(tokens)
      }

      def apply(s:String):Expr = {
        parse(s) match {
          case Success(tree, _) => tree
          case e: NoSuccess =>
            throw new IllegalArgumentException("Bad syntax: "+s)
        }
      }

      def test(exprstr: String) = {
        parse(exprstr) match {
          case Success(tree, _) =>
            println("Tree: "+tree)
            val v = tree.eval()
            println("Eval: "+v)
          case e: NoSuccess => Console.err.println(e)
        }
      }

      //A main method for testing
      def main(args: Array[String]) = test(args(0))
    }
For additional reading check out these other parser combinator pages:
- Debasish Ghosh's blog entry in which he develops a parser for a stock (shares) trading language that looks very much like written English.
- Daniel Spiewak's blog entry in which he builds a parser for a simple language. Look in the comments for a link to an on-line version of Wadler's 1985 paper.
- Matt Malone's blog entry in which he builds a parser for the Subversion client-server message protocol syntax, using lots of RegexParsers.
- Part 4 of Stefan Zeiger's blog on parser combinators, where he creates a parser for integer expressions using * and +. The three previous parts deal more generally with parser combinators, with the promise of additional parts going into more detail about Scala parser combinators.
- Philip Wadler's 1985 paper on parser combinators.
- Adriaan Moors's Parsing in Scala paper. One of Adriaan's examples is a parser that handles the four basic arithmetic functions. In that example, the evaluation of the expression is done by the parser, whereas in my example I create an AST for later evaluation. I did this because in my real usage, I evaluate the expression multiple times, each time with different values for the variables which can be referenced by the expression.
- The "Programming in Scala" book has a chapter on Parser Combinators.
- The scaladoc page for the Parsers.Parser class in Scala lists the parser operators, and the scaladoc for the Parsers trait lists the parser method names.
- This Parser combinator in Java was done by using parser combinators from Haskell and converting them to Java.
- A 2001 paper about the Haskell Parsec parser.
13 comments:
Thanks for this Jim, this is the clearest explanation on Scala's parser combinators I have found.
Hi,
The entry seems simple enough, but unfortunately it doesn't work for me.
I have this code:
object ArithmeticParser extends
StdTokenParsers {
type Tokens = StdLexical
val lexical = new StdLexical
lexical.delimiters ++= List("+")
def value = numericLit ^^ { s => EConst(s.toDouble) }
def main(args: Array[String]) = {
Console.println(value(new lexical.Scanner("2")))
}
}
(and Expr, EConst, EAdd)
This works.
Now I try to add the "sum" definition, i.e. these lines:
def sum = value ~ "+" ~ value ^^ { case left ~ "+" ~ right =>
EAdd(left, right) }
I get three compile errors:
ArithmeticParser.scala:14: error: not found: value ~
def sum = value ~ "+" ~ value ^^ { case left ~ "+" ~ right =>
^
ArithmeticParser.scala:15: error: not found: value left
EAdd(left, right) }
^
ArithmeticParser.scala:15: error: not found: value right
EAdd(left, right) }
^
I have no clue why this doesn't work. Any idea?
Thanks.
Dimitris: I tried your code and it worked fine for me. I added your sum method just after the value method. It compiled without error, I changed the line in your main to use sum instead of value and 2+3 instead of 2, and it gave me a proper parse containing EAdd with nested EConst values of 2.0 and 3.0. I tested this using 2.7.3.final, with the code in a file loaded with the :load command at the REPL prompt.
Thanks! (hmm, I hoped I would get an email when you responded, strange). I was mistakenly using Scala 2.6.0 (tried maven for the first time :P) and created a nice confusion for myself :)
The scaladoc links in this article seem to be broken now :-(
Why did you get rid of repsep()? It shows up once and then disappears without explanation.
Could you explain more about ^^^ ? The scaladoc I am looking at doesn't say anything at all about it:
http://www.scala-lang.org/api/beta/scala/util/parsing/combinator/Parsers$Parser.html
I think you can replace something like:
"+" ^^^ { (a:Expr, b:Expr) => EAdd(a,b) }
With simply:
"+" ^^^ EAdd
... I haven't tried it yet, but I think it'll work and might be simpler. Or am I missing something?
Dobes: The scaladoc location has changed since I posted this. It should be pretty easy to find in the API documentation at scala-lang.org.
In the post, when I added the subtraction operator I changed from using repsep to using * because there were now two possible separators (+ or -), and I needed to be able to create different parse trees based on that.
The ^^^ operator throws away the result of the parse and replaces it with the specified value. In this case, the parsed item is the operator character (+ or -), which is discarded and replaced by the corresponding operator function (EAdd or ESub). If you use the "current" rather than "beta" documentation, you should see a bit more info for the ^^^ operator.
Yes, you can replace "(a:Expr, b:Expr) => EAdd(a,b)" in this situation by just "EAdd" and it will still compile and work the same way, but perhaps not for the reason you might expect. If EAdd were a function, then you would have to write "EAdd _" to make that work; but because EAdd is a case class, it happens to work when supplying only the case class name. If you want something with the same type signature here, you would use "EAdd.apply _".
Any tips on how to debug your grammars? My grammar isn't working and I can't figure out why. Doesn't seem to be any tooling to trace parsing activity or anything nice like that.
Dobes: In response to your question I have written a post with a couple of techniques you might be able to use.
Thanks Jim, you explained it great. I have created my own parser for identifying a path but I have a problem with it; would you please help me to solve it? Here is a link to the problem: http://stackoverflow.com/questions/30902235/how-to-solve-an-error-related-to-creating-parser-from-regex
Good Morning,
I want to use [ and ] too, to parse for example "[]" or "[Int]" to get EmptyList() or List(TypeInt()), but when I add "[" and "]" to the delimiters, every time I write for example "[]" as input I get an illegal token error, so I think I need to add something else but I don't know what...
My delimiter list is:
lexical1.delimiters ++= List("λ", ".", ",","[","]","(",")","\\","|","/","=")
And the rule, that I think that's correct is:
lazy val list: P1[List] = "[" ~> type <~ "]" ^^ {case x => List(x)}
Could you help me please?
Thanks in advance
Oriol: I can't tell without seeing more of your code, but I suspect the problem is that your list expression requires a type between the square brackets, so that "[]" really is an illegal input because it does not specify a type. One of the disadvantages of using these parser combinators is that the error messages are not always very helpful.
You might want to check out my later post on debugging Scala parser combinators (and be sure to read the comments there).