(Haskell in Haskell) 3. Parsing
This is the third post in the Haskell in Haskell series.
In this post, we’ll go over creating a parser for our subset of Haskell. This stage of the compiler takes the tokens that our lexer produced in the previous part, and turns them into a representation of the program’s structure.
As usual, both the code for this part, and the entire compiler are available.
Because of all the work we did last time, this part should actually be simpler. We’ve already broken up our source code into bite-size tokens, using whitespace to infer semicolons and braces. Having this clean representation makes our parser’s job much easier.
Parsing in Theory
Before we start coding our parser, let’s try and figure out what a parser is supposed to do, and why we need one anyways.
The goal of parsing is to go from our source code, to a representation that we can actually work with. We want to transform Haskell code like:
foo = 2 + 3 * 4
into some kind of data structure that we can progressively analyze and transform into C code. We’ll need to analyze things like the scoping of names, and the usage of types. We’ll also be gradually transforming and simplifying our representation, with the goal of getting closer and closer to C.
We’ve already written one chunk of this parsing: the lexer. The lexer takes some source code, and groups it up into small tokens. Each token doesn’t make sense as separate chunks. Our parser needs to take these tokens, analyze their structure, and produce a data structure representing the source code.
More than Tokens?
Our lexer already gives us a data structure: a list of tokens. Why not just use this to represent Haskell code?
Unfortunately, this isn’t going to work; the tokens aren’t expressive enough to really reflect the structure of Haskell.
Take this example:
x = 2 + 2
y = 2
After lexing, we get a flat stream of tokens:
{ x = 2 + 2 ; y = 2 }
But this flat representation fails to capture the nested nature of this snippet. A better representation would be something like:
(
(x = (2 + 2))
(y = 2)
)
The expression 2 + 2 “belongs” to the definition of x, and the expression 2 “belongs” to the definition of y. This kind of nested structure is ubiquitous throughout Haskell. A flat list of tokens is simply a bad fit here.
Trees
On the other hand, this kind of nesting is perfect for a tree.
Let’s take our previous example once more:
x = 2 + 2
y = 2
This piece of code can be represented by a tree that looks something like this:
Each node in the tree has some kind of label, and can contain other subtrees as children.
We have a node at the top level, containing all the top level definitions. Definitions use an = node here, with the name as one child, and the expression as the other. Addition uses a + node, with its arguments as children.
This approach scales nicely to expressions with even more nesting:
x =
let y = 2 + 2
in y
We now have two nested expressions for x, giving us a tree that looks like:
We can imagine how this might work for even more complicated expressions, since nodes can contain arbitrary trees as their children.
Valid Syntax vs Valid Semantics
This tree of nodes built up from source code is called an Abstract Syntax Tree, or an AST, for short. It’s a syntax tree, since it represents the structure of our programming language: what syntax is valid or not. I don’t actually know why it’s called an abstract tree, but I like to think that it’s abstract, because it provides a representation of trees that are syntactically valid, but that don’t necessarily make any sense.
For example, take this snippet:
x =
let y = "34" + 3
in "x" foo
This bit of code is syntactically valid, and our parser shouldn’t complain. It still has a few issues though:
- You can’t add strings and numbers
- You can’t use a string like "x" as a function
- The name foo is not defined in this scope
Because of these issues, our program is semantically invalid, even though it manages to parse. The goal of the type checker is to find all of these issues. We’ll get to this soon enough, but not in this part.
It’s not our parser’s job to figure these things out yet. Our parser is just concerned with producing this Abstract Syntax Tree. The parser will be taking the tokens produced by the lexer, and transforming them into this tree.
Grammars
One common tool used to describe these syntax rules is a Context Free Grammar, or just Grammar for short. Grammars allow us to think about the structure of a programming language, at least at the level of syntax.
As a motivating example, let’s take a very simplified subset of Haskell.
Our example language consists of a single expression, which might be one of:
- A string literal: "foo"
- An int literal: 345
- A variable name: x3
- Or, a let expression, with explicit braces and semicolons:
let {
x = 3;
y = 2;
} in x
A Grammar for this language would look something like this:
<expr> ::= <int> | <string> | <name> | <letexpr>
<letexpr> ::= let { <definitions> } in <expr>
<definitions> ::= <singleDefinition> ; <definitions> | <empty>
<singleDefinition> ::= <name> = <expr>
Each <rule> references some type of syntactic element. For example, <expr> references an expression in our little language. After the ::=, we declare each way we can build up this element. Everything outside of angle brackets is an actual token we expect to see.
For <expr>, this matches the overview we gave earlier: an expression is either an integer literal, a string literal, a name, or a let expression. And then, <letexpr> consists of the token let, followed by some definitions inside braces, the token in, and an expression. <definitions> is either empty, or a single definition followed by a ;, and some more definitions. Finally, <singleDefinition> represents things like x = 3. This is just a name, the token =, and then an expression.
Intuitively, this grammar describes the rules that define what structure our toy language has. These kinds of rules should actually be familiar to us, since our lexer used a similar approach.
For example, to define what a token was, we built up a large lexer from smaller possibilities:
token :: Lexer (Token, String)
token = keyword <|> operator <|> literal <|> name
This matches up naturally with a grammar like:
<token> ::= <keyword> | <operator> | <literal> | <name>
<keyword> ::= let | do | ...
...
The big difference is that our tokens were completely flat. Tokens never contained other tokens. Grammars, on the other hand, allow us to talk about recursive structures as well. Our toy language has a recursive structure: expressions can contain let expressions, which also contain expressions. Because of this, you can nest expressions inside of each other:
let {
x = 2
} in let {
y = x
} in let {
z = y
} in z
Grammars should also seem eerily familiar to you as a Haskeller. In Haskell, we have algebraic data types.
For example, let’s define a type for colors. A color can be either red, blue, green, or an arbitrary hex string like “#ABFE00”.
We could represent this with an ADT, like this:
data Color = Red | Blue | Green | Hex String
But, if you squint a bit, this looks kind of like a grammar:
<color> ::= red | blue | green | # <string>
If we revisit the grammar for our toy language:
<expr> ::= <int> | <string> | <name> | <letexpr>
<letexpr> ::= let { <definitions> } in <expr>
<definitions> ::= <singleDefinition> ; <definitions> | <empty>
<singleDefinition> ::= <name> = <expr>
We can go in the other direction, from grammar to ADT, modelling this structure using Haskell types:
data Expr
= IntExpr Int
| StringExpr String
| NameExpr Name
| LetExpr [Definition] Expr
data Definition = Definition Name Expr
There’s not a one-to-one correspondence here, since the grammar is also concerned with syntactic elements, like semicolons, and braces, and we don’t care about those here. But the rough structure is the same, especially in terms of how the data type references itself recursively.
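To make the correspondence concrete, here’s the let expression from earlier, written out as a value of these types. This is just an illustration; it assumes Name is a synonym for String:

-- let { x = 3; y = 2; } in x
LetExpr
  [ Definition "x" (IntExpr 3),
    Definition "y" (IntExpr 2)
  ]
  (NameExpr "x")

Notice how the semicolons and braces have disappeared: they guide the parser, but they don’t need to be part of the tree.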
Because Haskell has these nice recursive ADTs, we can have our syntax tree look very similar to the grammar for our language.
We won’t be using grammars directly, but we’ll be doing awfully similar things. I also think it’s important to touch on the concept of grammars, at least briefly, since most of the other resources you’ll read about parsing will use them heavily.
Parser Combinators
Alright, now it’s time to start writing some code. Before we start defining the structure of our syntax tree, or writing the parser itself, we first need to write our tool of choice for making parsers: the Parser Combinator! This is going to be a souped-up version of the Lexer Combinator tool we made last time, extended to handle recursive structures!
Let’s first create a module for our parser in our cabal file:
exposed-modules: Ourlude
, Lexer
, Parser
Now let’s go set up the standard module and imports in src/Parser.hs:
{-# LANGUAGE LambdaCase #-}
module Parser where
import Control.Applicative (Alternative (..), liftA2)
import Data.List (foldl')
import Lexer (Token (..))
import Ourlude
We have LambdaCase as a convenient extension in this module. We also import Alternative, as well as some other utilities.
From Lexer Combinator to Parser Combinator
With Lexer Combinators, a lexer was essentially just a function:
input -> Maybe (a, input)
We take in some input, and then either fail or succeed, returning a value, and the remaining input. This works well for lexing, since a lexer can only produce one result, and it’s always obvious which of two options we need to choose: just use the longest match rule.
This won’t work for parsing. Our structure is much more complicated, so it’s not always obvious that the longest match rule will be correct. What we do instead is to let ambiguity happen, at least temporarily. Instead of either failing, or returning a single way of consuming the input, we return all possible ways to proceed forward.
We now have:
input -> [(a, input)]
By the time we’ve parsed the entire source code, there should be only a single element in this list. Otherwise, there are multiple ways of parsing some code. For example, a parser that doesn’t handle operator precedence might see 2 + 3 * 4 and produce both:

(2 + 3) * 4
2 + (3 * 4)

Of course, the second option is correct, because * has higher precedence than +, as we’ll see later.
Ambiguity in the parsing process is fine, as long as we resolve it later. We might need more than one token to resolve this ambiguity, which is ok. As a prime example, take where expressions. These are expressions with the definitions coming after the body they’re used in:
x + y
where
x = 2
y = 3
If in our parser for expressions, we have some rule like:
expr = notWhereExpr <|> whereExpr
Then we’ll have some ambiguity here. This is because we can’t tell which of these parsers is correct until we see the where token. For the first 3 tokens, we have multiple options floating around.
By the time we finish parsing the file, we shouldn’t have any ambiguity left. If we do, then that means that there are multiple ways of interpreting some code. This is basically never a problem in the code itself, but a problem in the implementation of the language. There are certain languages that are inherently ambiguous, like C++, in which case the parser needs to make an arbitrary choice, but Haskell is not ambiguous. If we fail to parse something because of ambiguity, that’s probably a bug in our parser.
The Concrete Type
With all of this in mind, we can go ahead and define a concrete type for Parser:
newtype Parser a = Parser {runParser :: [Token] -> [(a, [Token])]}
This is very similar to our Lexer type from the last part. Instead of working over String, we now work over [Token] instead. We also return multiple possible ways of parsing, as explained earlier.
Example Building Blocks
To get a better feel for how this works, let’s go ahead and make a couple of basic helpers.
First, let’s make a little helper that accepts a single token:
satisfies :: (Token -> Bool) -> Parser Token
satisfies p =
Parser <| \case
t : ts | p t -> [(t, ts)]
_ -> []
This is analogous to Lexer.satisfies, which accepted a single character. Our implementation looks at the remaining list of tokens, and checks if there is a token matching our predicate at the front of the list. If so, we return that token, along with the remaining input. Otherwise, we return an empty list, since there’s no way for this parser to succeed.
An immediate application of this satisfies function is going to be:
token :: Token -> Parser Token
token = (==) >>> satisfies
This just matches against a specific token that we provide.
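As a quick illustration (not code we’ll add to the compiler), we’d expect behavior like this, using the If and Then tokens from our lexer:

runParser (token If) [If, Then]
-- [(If, [Then])]: the If token is consumed, and Then remains
runParser (token If) [Then, If]
-- []: the first token doesn't match, so there's no way to succeed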
A similar function to satisfies is going to be pluck:
pluck :: (Token -> Maybe a) -> Parser a
The behavior is going to be the same as satisfies, except now we can return a value if the predicate matches. This is useful, because we have a few tokens, like IntLit i and StringLit s, which also have an argument. This helper lets us match against a token like this, extracting the argument. For example, we can do:
pluck <| \case
IntLit i -> Just i
_ -> Nothing
to get a parser that accepts an int literal, but produces an Int.
The implementation of pluck is similar to satisfies:
pluck :: (Token -> Maybe a) -> Parser a
pluck f =
Parser <| \case
t : ts -> case f t of
Just res -> [(res, ts)]
_ -> []
_ -> []
Just replaces the True case we had previously, but also provides us with the value we want to return.
Functor and Applicative
Like with Lexer, we’ll be providing instances of Functor and Applicative for Parser. For Functor, we have:
instance Functor Parser where
fmap f (Parser p) = Parser (p >>> fmap (first f))
This just maps over the results that we’ve produced. In fact, this definition is exactly the same as for Lexer, since Maybe and [] are both Functors!
For Applicative, the operations will mean the same thing as for Lexer. pure :: a -> Parser a will give us a parser that always succeeds, consuming no input, and (<*>) :: Parser (a -> b) -> Parser a -> Parser b will give us a parser that runs the first parser, and then runs the second parser. The return value will be the function produced by the first, applied to the argument produced by the second. The difference with Lexer is that now a parser may return multiple results. We’re going to use all the combinations of first and second results:
instance Applicative Parser where
pure a = Parser (\input -> [(a, input)])
Parser pF <*> Parser pA =
Parser <| \input -> do
(f, rest) <- pF input
(a, s) <- pA rest
return (f a, s)
pure does exactly what we said it does before, so it’s not too surprising. What’s neat with <*> is that to implement the “combine all possible functions and arguments” behavior, we can use the fact that [] implements the Monad class in the same way that Maybe does, allowing us to use do notation. For [], instead of allowing us to handle failure, we are now able to handle “non-determinism”. What ends up happening is that for each possible pair (f, rest) of a function and remaining input, we return each possible way of using the second parser on that input, and then applying the function to the argument produced.
In practice, most of the time there won’t be any ambiguity, so:

parser1 *> parser2

will simply mean “run parser1, and then run parser2”. On the other hand, if we have:

(parser1A <|> parser1B) *> (parser2A <|> parser2B)

then we’ll get 2 * 2 = 4 paths available to us. We can parse 1A and then 2A, 1A and then 2B, etc.
Alternative
As you might expect, we’ll also be implementing Alternative for Parser, to get that sweet sweet <|> operator, and all the goodies that come along with it! We also need an empty :: Parser a value, which gives us a Parser that always fails.
instance Alternative Parser where
empty = Parser (const [])
Parser pA <|> Parser pB =
Parser <| \input -> pA input ++ pB input
To get a parser that always fails, we return no valid ways to continue parsing.
For <|>, we run both parsers, and then combine both of their results using ++. Unlike Lexer, we don’t try to resolve any ambiguity if they both succeed. As explained earlier, we let the ambiguity exist, by returning all possible ways of parsing. Our final parser should only end up returning one result though, even if sub-parsers might temporarily have some ambiguity.
Like last time, the Alternative instance gives us plenty of goodies for free. For example, we now have many :: Parser a -> Parser [a], which allows us to turn a parser for one item into a parser for many items, placed one after the other. There’s also some, which does the same thing, but requires at least one item. For example, if intParser accepts 3, then some intParser would accept 3 4 5 6, producing [3, 4, 5, 6] as our output.
Integrating a Parser
We now have the combinators we need to build our parser. Before we actually start working on parsing our subset of Haskell, let’s go ahead and add a new stage to our compiler. We can get all of this boilerplate out of the way, letting us focus on parsing.
Let’s define a stub type for our syntax tree, and for our errors:
data ParseError = UnimplementedError deriving (Show)
data AST = AST deriving (Show)
Like last time, we just have a single error: the error telling users that our parser is not implemented yet!
Our AST is just a stub data type as well. We’ll be completing both of these soon enough. For now, we just want to get this stage of the compiler integrated into the command line program.
We also need to write a simple parser function that we can expose as our implementation of this stage:
parser :: [Token] -> Either ParseError AST
parser _ = Left UnimplementedError
This function will take in the list of tokens produced by the lexer, and return either the parsed syntax tree, or an error, indicating that parsing failed.
Finally, we can add the AST type and the parser function to the exports for this module:
module Parser (AST(..), parser)
Adding a new Stage
We now need to go into Main.hs, and add a Stage for the parser. First, let’s add the Parser module to our list of imports.
-- ...
import qualified Parser
-- ...
Right next to lexerStage, we can go ahead and add a new Stage for our parser:
parserStage :: Stage [Lexer.Token] Parser.AST
parserStage = makeStage "Parser" Parser.parser
The definition of this is essentially the same as we had for lexerStage. Stages were introduced in great detail last time, so feel free to take a quick refresher if you’d like.
We also need to modify our argument parsing to accept “parse” as a stage. This way, users can do:
haskell-in-haskell parse file.hs
in order to print out the AST they get from parsing file.hs.
Of course, they’ll still be able to do:
haskell-in-haskell lex file.hs
to just print out the tokens they get after lexing file.hs.
All we need to do is modify readStage, giving us:
readStage :: String -> Maybe (String -> IO ())
readStage "lex" =
lexerStage |> printStage |> Just
readStage "parse" =
lexerStage >-> parserStage |> printStage |> Just
readStage _ = Nothing
Here we get to use the fancy >-> operator we defined last time, in order to compose the lexing stage with the parsing stage. This works out nicely, since the parsing stage consumes the tokens that the lexing stage produced. We end up with a stage taking in a String for the source code, and producing the AST that we want to print out.
We can now run all of this, and see something like:
> cabal run haskell-in-haskell -- parse foo.hs
Parser Error:
UnimplementedError
which makes sense, since we haven’t implemented the parser yet. Let’s fix that!
Structure of the AST
Before we can write a parser to convert our tokens into a syntax tree, we need to define the shape of that syntax tree. We’re going to be describing the exact subset of Haskell that we implement, by creating various data types for each construct in the language.
For example, we’re going to have a data type for expressions. One of the variants will be a binary expression, for things like 1 + 2, or 4 * 33. Another variant might be function application, for things like f 1 2 x. Etc.
A Rough Idea
Our subset of Haskell cleaves almost neatly into two distinct parts:
- The part of the language dealing with defining types
- The part of the language dealing with defining values
These 2 parts interact through annotations, which allow us to declare that a given value has a certain type.
For example, if you have:
x :: Int -> Int
x = \x -> x + 1
The Int -> Int is squarely in the domain of types, and we’ll need some data structures to describe function types like Int -> Int. The lambda expression \x -> x + 1 is clearly in the domain of values. And the declaration x :: ... is a point of interaction between both domains.
Because of this, we can think of our language as being roughly groupable into 3 sections:
- Types: describing them, defining them, creating synonyms
- Expressions
- Definitions: assigning types and expressions to names
There’s a lot of interaction between these categories, of course. For example, let x = 3 in x is an expression that also uses a definition, inside of the let. With that said, it’s still useful to have a bit of categorization in our head before we start defining everything we need.
Types
So, the first thing we can do is define how we’ll work with types in this subset of Haskell. We need ways to refer to certain types, like Int, or String. We also need to be able to use more complicated types, like (Int -> String) -> String. We’ll need ways to define new types, either through custom types, like data Color = Red | Blue | Green, or type synonyms, like type MyInt = Int.
The Fundamental types
Before we get to the definitions of new types, let’s first define our data structures representing types that exist. For example, when you have x :: Int -> Int, you need some way of representing the type that comes after the ::. This is what we’ll be defining now. Essentially, what we’re doing is defining what a type “is” in our subset of Haskell. In fact, this definition will be useful beyond just parsing. We can also use this representation of types in our type checker, for example. Because of this, we won’t be creating this definition of what a type is in the Parser module, but instead creating a new module, in src/Types.hs, to hold fundamental data structures for working with types. We’ll expand on this module in further parts, as we need to do more things with types.
As usual, let’s edit our .cabal file to add the Types module:
exposed-modules: Ourlude
, ...
, Types
Now, we can go ahead and define this module in src/Types.hs:
module Types where
import Ourlude
For now, we just export everything in the module, but we’ll come back and make the exports explicit once we’ve defined our representation of types.
First, let’s think about what kinds of types you should be able to work with in our subset of Haskell. As mentioned already, we’re going to have a primitive Int type, the standard integer type we’re used to. Integer literals will have that type, so 3 :: Int, for example. We’ve already lost the element of surprise by defining boolean and string literals in the lexer, so we’re going to also have primitive boolean and string types in our language, departing from Haskell slightly.
So we have String and Bool as builtin types as well. In addition to builtin types, we need a way for users to define new types, and then reference those types. If a user has type X = Int, then they should be able to say:

y :: X
y = 3

So we need to reference custom types by name, as well.
Our subset will also have support for polymorphism. You can define functions like:
id x = x
The type of this function is polymorphic:
id :: a -> a
Here, a is not a type per se, but rather a type variable. We need a way to reference type variables by name as well.
This also means that custom data types can be polymorphic, and have variables of their own. For example, you might define a list data type as:
data List a = Cons a (List a) | Nil
This will be valid in our language, and introduces the type List _, where the underscore needs to be filled in. So you might have List Int as one type, List String as another, and List a, if a happens to be a type variable in scope. In brief, custom types are referenced by name, and also have multiple type variables as arguments.
Finally, we need a way to talk about the function type A -> B, which is also built into the language.
With all of this in mind, let’s just define our representation of types:
type TypeVar = String
type TypeName = String
infixr 2 :->
data Type
= StringT
| IntT
| BoolT
| CustomType TypeName [Type]
| TVar TypeVar
| Type :-> Type
deriving (Eq, Show)
So, we have the basic primitive types as hardcoded options:
- Int corresponds to IntT in this representation
- String corresponds to StringT
- Bool corresponds to BoolT
We have 2 type synonyms, to clarify the difference between TypeVar, which is a lower case name, like a, used for the name of a type variable, and TypeName, which is an upper case name, like List or Foo, used to reference the name of a newly defined type.
Type variables have the variant TVar, and take a name as argument. So a in Haskell would be TVar "a" in our representation.
For newly defined types, we have the name of that type, as well as a list of arguments, which are also types. List Int would be CustomType "List" [IntT] in this representation. You can also nest things, so List (List Int) would become:

CustomType "List" [CustomType "List" [IntT]]
Note that type synonyms are also under this category. If you have type X = Int, then you’d use CustomType "X" [] to refer to this type synonym.
Finally, we need a way to represent function types. We create a new operator :-> for this. (We need the : so that it can be used in types.) With this in place, Int -> Int becomes IntT :-> IntT, as expected. Note that we declare this operator infixr, which means it’s right associative, like -> is in Haskell. This means that:

IntT :-> StringT :-> BoolT

is parsed by GHC as:

IntT :-> (StringT :-> BoolT)

which is exactly the behavior that we want.
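As another example, the higher-order type (Int -> String) -> String we mentioned earlier needs explicit grouping on the left, precisely because :-> associates to the right:

-- (Int -> String) -> String
(IntT :-> StringT) :-> StringT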
We’ll be using this representation of types throughout the compiler. Let’s go ahead and explicitly export it as well:
module Types
( Type (..),
TypeVar,
TypeName
)
where
Now, let’s use these types to start working on our syntax tree, back in src/Parser.hs.
Defining new types
The next thing we need to do is add the data structures representing the definitions of new types in our language. There are going to be two ways to define new types:
- Type Synonyms, things like type X = Int
- Data Definitions, things like data Animal = Cat | Dog
In our language, these are both considered a kind of definition that can appear at the top level. They’ll both be variants of a common Definition type.
We’ll be needing the items we just defined in the Types module inside of the parser, so let’s add an import in src/Parser.hs:
...
import Ourlude
import Types (Type (..), TypeName, TypeVar)
First, let’s define type synonyms. Note that we don’t allow polymorphic things like type List2 a = List a, only type ListInt = List Int. Because of this, we can define type synonyms as a simple variant of Definition:
data Definition
= TypeSynonym TypeName Type
A type synonym has a name, which is an upper case name like X, and a type that it’s equal to.
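So, type X = Int would be represented as:

TypeSynonym "X" IntT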
And then, we need to define the structure of data definitions. The basic idea is that we have a name and some list of constructors:
data MyType = VariantA | VariantB | VariantC
We also need the ability to have arguments in each of the constructors:
data MyType = VariantA String | VariantB Int Int | VariantC
We can also make the data type polymorphic, by specifying some type variables next to the name:
data MyType a b = VariantA String | VariantB Int Int | VariantC
Of course, we can use these type variables inside of the variants:
data MyType a b = VariantA String | VariantB Int Int | VariantC a b
And that’s about it for new data types.
Let’s first define a structure for each of the constructors / variants:
type ConstructorName = String
data ConstructorDefinition
= ConstructorDefinition ConstructorName [Type]
deriving (Eq, Show)
A constructor has a given name, like VariantA, or Cons, as well as a list of types, which are the arguments for the constructor. For clarity, we’ve created a separate synonym, ConstructorName, for this kind of name. For example, the VariantB constructor in our MyType example would be represented as:
ConstructorDefinition "VariantB" [IntT, IntT]
Now, we can add an extra variant in Definition for data definitions:
data Definition
= DataDefinition TypeName [TypeVar] [ConstructorDefinition]
| TypeSynonym TypeName Type
deriving (Eq, Show)
A data definition has a name, a list of type variables, and then a list of constructors. Our MyType example becomes:
DataDefinition "MyType" ["a", "b"]
[ ConstructorDefinition "VariantA" [StringT],
ConstructorDefinition "VariantB" [IntT, IntT],
ConstructorDefinition "VariantC" [TVar "a", TVar "b"]
]
in this representation.
And now we have a complete way of representing the syntax for defining new types! We’ve already started getting a feel for the tree like structure we’re using to represent our language, and we’re just getting started!
Expressions
In this section, we’ll be defining most of the different kinds of expressions in our language,
except for let
and where
expressions. These will use definitions, which we’ll define
a bit later.
Literals
The first expressions we have are literals. We have int literals, like 2 and 245, string literals, like "foo" and "bar", and finally, boolean literals, namely True and False. We can define a type Literal representing these:
data Literal
= IntLiteral Int
| StringLiteral String
| BoolLiteral Bool
deriving (Eq, Ord, Show)
We have one type of literal for each of the primitive types in our language.
Let’s also add a type of expression for literals:
data Expr
= LitExpr Literal
deriving (Eq, Show)
As an example, the literal 3 becomes the much more verbose:
LitExpr (IntLiteral 3)
Names
We can also reference names inside of expressions. For example, if we have some value x :: Int, we have x + x as an expression. We can also reference the constructors for different data types, like Nil, Red, etc.
We’ll be representing these names with strings, but for clarity, let’s create a type synonym:
type Name = String
We can then use it to define a kind of expression that just use a name:
data Expr
...
| NameExpr Name
Semantically, the meaning of NameExpr "foo" is whatever expression "foo" is bound to in a given scope. So if we have foo = 3, then NameExpr "foo" should evaluate to LitExpr (IntLiteral 3), using our expression syntax.
If Expressions
Next we have if expressions, things like:
if y then 3 else 4
if 2 + 3 > 4 then 4 else x + x
These are straightforwardly defined as:
data Expr
...
| IfExpr Expr Expr Expr
We can have arbitrary expressions inside of the condition, if-branch, and else-branch.
So, if y then 3 else 4 becomes:
IfExpr
(NameExpr "y")
(LitExpr (IntLiteral 3))
(LitExpr (IntLiteral 4))
Lambda Expressions
The next kind of expression are lambda expressions, allowing us to define anonymous functions. These look like:
\x -> x + 1
\x -> \y -> x + y
Note that in Haskell, you can define multiple names at once in a lambda. For example:
\x y -> x + y
is valid, and syntax sugar for:
\x -> \y -> x + y
The simplifier will deal with this syntax sugar; our parser will accept this verbatim:
data Expr
...
| LambdaExpr [ValName] Expr
We have a list of named parameters, and then an arbitrary expression using them. So \x y -> x becomes:
LambdaExpr ["x", "y"] (NameExpr "x")
This is the first time we use the ValName synonym, which is defined as:

type ValName = String

This will be used in many places for names that exclusively refer to values. So x is a ValName, but something like Nil would rather be typed as ConstructorName.
Function Application
What would our functional language be without function application? In Haskell, function application just takes arguments separated by whitespace:
f 1 2 3
We can also have some arbitrary expression as the function as well:
(f . g) 1 2
This gives us:
data Expr
...
| ApplyExpr Expr [Expr]
Because of currying, f 1 2 is really sugar for (f 1) 2. But, once again, we’ll be removing syntax sugar in the simplifier; the parser needs to accept all of this. So, for now, f 1 2 would be represented as:
ApplyExpr
(NameExpr "f")
[LitExpr (IntLiteral 1), LitExpr (IntLiteral 2)]
Negation
We have a single unary operator in our language: negation. Instead of
having negative numbers, like -3
, this is instead the negation operator
-
applied to the expression 3
. We can also do more complicated things like:
-((\x -> x + 1) 3)
So, we add negation expressions:
data Expr
...
| NegateExpr Expr
This makes the simple -3 into the much more verbose:
NegateExpr (LitExpr (IntLiteral 3))
Binary Operators
With the only unary operator in the language out of the way, let’s add the many binary operators we need. These cover everything from arithmetic like a + b + c, to function composition f . g . h, to comparisons with x > y, etc.
We have:
data Expr
...
| BinExpr BinOp Expr Expr
If we have some complex arithmetic expression, like a + a + b * (x + y), this can still be written as a tree, with each node only having two children:
Or, using the structure we’ve just defined, as:
BinExpr
Add
(BinExpr
Add
(NameExpr "a")
(NameExpr "a")
)
(BinExpr
Mul
(NameExpr "b")
(BinExpr
Add
(NameExpr "x")
(NameExpr "y")
)
)
Of course, we haven’t defined Add or Mul, or any of the other members of BinOp yet. Let’s briefly go over all of the operators:
data BinOp
= ...
deriving (Eq, Show)
The + operator is used to add two numbers:

= Add
The - operator subtracts the right number from the left number:

| Sub
The * operator is for multiplication:

| Mul
The / operator is for division:

| Div
We have ++ for string concatenation:

| Concat
We have the comparison operators <, <=, >, and >=, which become:

| Less
| LessEqual
| Greater
| GreaterEqual
And the comparison operators == and /= become:

| EqualTo
| NotEqualTo
We also have the two boolean operators || and &&, which are:

| Or
| And
Finally, we have the function “operators”, namely $ and .. These are defined as standard functions in normal Haskell, but since we don’t have custom operators, these are done here as well:

| Cash
| Compose
Pattern Matching
The next kind of expression we’re going to deal with are case expressions, things like:
case x of
2 -> x + 2
3 -> x + 3
y -> x + y
or
case x of
Nil -> 0
Cons A (Cons B Nil) -> 1
Cons B Nil -> 2
Cons y Nil -> y
We have some kind of expression that we’re scrutinizing, followed by a series of branches. Each branch has some kind of pattern that we’re matching against, followed by an expression. The value of the case expression is the value of whatever branch happens to match.
So, let’s first define what kind of patterns we have in our language:
data Pattern
= ...
deriving (Eq, Show)
The first kind of pattern is the wildcard: _. This pattern matches against everything, but creates no new name:
data Pattern
= WildcardPattern
The next kind of pattern is similar. We can also have a name as a pattern, like:
case 3 of
x -> x
This is like the wildcard, but also assigns the value of whatever it managed to match to the given name. Here, x would have the value 3 inside of the branch. We have:
data Pattern
...
| NamePattern ValName
We use ValName, since only lower case names like x are valid here.
Another kind of pattern is for literals. We can match against literal ints, like 3, literal strings, like "foo", or literal bools, like True or False:
data Pattern
...
| LiteralPattern Literal
We also want to be able to match against custom types. If we define a custom type like data Color = Red | Blue | Green, we need to be able to do things like:
case color of
Red -> ...
Blue -> ...
Green -> ...
If our data type has constructors with arguments, like data Pair = Pair Int String, we want to be able to match against the arguments as well, giving us constructions like:
case pair of
Pair 3 "foo" -> ...
Pair x "bar" -> ...
Pair _ _ -> ...
With this in mind, we need to add a variant for these constructor patterns:
data Pattern
...
| ConstructorPattern ConstructorName [Pattern]
We have the name of the constructor, followed by a sequence of arbitrarily complex patterns. So, given a data type like data List a = Cons a (List a) | Nil, and a pattern like:

Cons _ (Cons _ (Cons _ Nil))
we would represent this as:
ConstructorPattern
"Cons"
[ WildcardPattern,
ConstructorPattern "Cons"
[ WildcardPattern,
ConstructorPattern "Cons"
[ WildcardPattern,
ConstructorPattern "Nil" []
]
]
]
And now that we’ve defined patterns, we can immediately define case expressions!
data Expr
...
| CaseExpr Expr [(Pattern, Expr)]
We first have the expression we’re scrutinizing, followed by a list of branches. Each branch contains a pattern, and the expression to use when that pattern matches.
Given an expression like:
case scrut of
0 -> 0
1 -> 1
_ -> 2
we would represent this as:
CaseExpr (NameExpr "scrut")
[ (LiteralPattern (IntLiteral 0), LitExpr (IntLiteral 0)),
(LiteralPattern (IntLiteral 1), LitExpr (IntLiteral 1)),
(WildcardPattern, LitExpr (IntLiteral 2))
]
Definitions
Now that we’ve defined expressions as well as the structure of types, we can combine the two, in order to define the structure of value definitions. This is about the structure of defining new values. In Haskell, there are two parts to a value definition:
- The type annotation, like x :: Int
- The name definition itself, like x = 3
The type annotation and name definition don’t have to appear together, and the type annotation isn’t necessary at all. Regardless, our parser will need to represent both.
So, let’s create a structure for both of these:
data Definition
= ValueDefinition ValueDefinition
| DataDefinition TypeName [TypeVar] [ConstructorDefinition]
| TypeSynonym TypeName Type
deriving (Eq, Show)
data ValueDefinition
= ...
deriving (Eq, Show)
We’ve just introduced the notion of a ValueDefinition, which will contain both the type annotations, and the name definitions we’ve just touched upon. We’ve also added this to our concept of Definition. A definition now includes the two constructs related to types, namely data definitions and type synonyms, as well as the definitions for values.
For type annotations, the definition is pretty simple:
data ValueDefinition
= TypeAnnotation ValName Type
A type annotation like x :: Int is represented as the name of the value being annotated, and the type we’re providing after the ::.
For name definitions, our first try might be something like:
NameDefinition ValName Expr
The idea is that something like x = 3 would become:
NameDefinition "x" (LitExpr (IntLiteral 3))
This would work for a large class of Haskell definitions, but Haskell has some more syntax sugar around definitions. The first piece of sugar is that instead of having to write:
id = \x -> x
You can instead move the parameter x to the left side of the =:
id x = x
Of course, you can have multiple parameters as well:
f a b = a + b
So, our structure for name definitions will need to accommodate these named parameters.
Another bit of sugar is that you can pattern match inside of these definitions:
f 0 1 = 1
f 0 _ = 2
f a b = a + b
A function can be defined with multiple “heads”, each of which has patterns for its arguments. Because of this, our full structure for name definitions becomes:
data ValueDefinition
...
| NameDefinition ValName [Pattern] Expr
We have the name used for this value, a list of patterns for each of the arguments, and an expression for the body of this definition.
As an example, this function:
f :: Int -> Int
f 0 = 0
f a = a
Would become 3 separate definitions:
TypeAnnotation "f" (IntT :-> IntT :-> IntT)
NameDefinition "f" [LiteralPattern (IntLiteral 0)]
(LiteralPattern (IntLiteral 0))
NameDefinition "f" [NamePattern "a"] (NameExpr "a")
We have one definition for the type annotation, and 2 more for each of the 2 heads of the function.
Now that we have a structure for definitions, we can add the two missing types of expressions: let and where expressions. These are really two ways to express the same thing: local definitions inside an expression. So:
let x :: Int
x = 2
in x
and
x
where
x :: Int
x = 2
have exactly the same meaning. For clarity, we’ll just create two different types of expression for each of these:
data Expr
...
| LetExpr [ValueDefinition] Expr
| WhereExpr Expr [ValueDefinition]
The structure is slightly different, to reflect the different order of the two variants. let has the definitions before the expression, and where has the definitions after the expression.
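Concretely, the let snippet above becomes:

LetExpr
  [ TypeAnnotation "x" IntT,
    NameDefinition "x" [] (LitExpr (IntLiteral 2))
  ]
  (NameExpr "x")

and the where snippet becomes:

WhereExpr
  (NameExpr "x")
  [ TypeAnnotation "x" IntT,
    NameDefinition "x" [] (LitExpr (IntLiteral 2))
  ]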
Finally, now that we have definitions finished, we can actually define our real syntax tree:
newtype AST = AST [Definition] deriving (Eq, Show)
This just says that a Haskell program consists of a sequence of top level definitions. So, something like:
data Color = Red | Blue | Green
type C = Color
x :: C
x = Red
would be a series of 4 definitions, represented as:
AST
  [ DataDefinition "Color" []
      [ ConstructorDefinition "Red" [],
        ConstructorDefinition "Blue" [],
        ConstructorDefinition "Green" []
      ],
    TypeSynonym "C" (CustomType "Color" []),
    ValueDefinition (TypeAnnotation "x" (CustomType "C" [])),
    ValueDefinition (NameDefinition "x" [] (NameExpr "Red"))
  ]
Whew! We now have a complete AST representing our subset of Haskell! Let’s go ahead and add the necessary data types that our parser needs to export:
module Parser
( AST (..),
ValName,
ConstructorName,
Name,
Definition (..),
ConstructorDefinition (..),
ValueDefinition (..),
Expr (..),
Literal (..),
BinOp (..),
Pattern (..),
parser,
)
where
The next step in this post is to write a parser for our AST, using the little combinator framework we created earlier.
Parsing in Practice
So, now it’s time to use all of the little tools we’ve made in order to build up a parser for our AST. We’re going to be building up a parser in a bottom up fashion, starting with simple parsers for simple parts of the language, and combining them piece by piece to make up a parser for the whole language.
Names and Literals
We’re going to be starting with the simplest forms of expression: names and literals. There were two types of names in the lexer: LowerName, for names starting with a lower case letter, and UpperName, for names starting with an upper case letter. We can make a parser for LowerName pretty easily:
lowerName :: Parser ValName
lowerName =
pluck <| \case
LowerName n -> Just n
_ -> Nothing
Here we use the pluck :: (Token -> Maybe a) -> Parser a function we defined previously. pluck lets us match against a certain type of token, and also extract some information about that token. This is useful here, since we want to extract the string contained in a LowerName token.
We can use the same trick for UpperName:
upperName :: Parser TypeName
upperName =
pluck <| \case
UpperName n -> Just n
_ -> Nothing
The only difference here is that we’re extracting from UpperName instead of LowerName.
We’ve created a few different synonyms for String, to clarify the different kinds of names we use throughout our AST: things like TypeVar, TypeName, ValName, etc. We can immediately put the two parsers we’ve just defined to use, by creating a parser for each of these syntactic classes:
typeVar :: Parser TypeVar
typeVar = lowerName
typeName :: Parser TypeName
typeName = upperName
valName :: Parser ValName
valName = lowerName
constructorName :: Parser ConstructorName
constructorName = upperName
Each of these kinds of name is either a lower case name, or an upper case name. In some contexts, for example function application, we might need to accept both, which is why we created the Name synonym. We can add a parser for that as well:
name :: Parser Name
name = valName <|> constructorName
Here we use <|> for the first time, to indicate that name is either a valName, or a constructorName, i.e. that we should accept either the LowerName token, or the UpperName token here.
The next thing we can add in this section is literals. We want to be able to parse ints, strings and bools, which make up the built-in values in our subset of Haskell:
literal :: Parser Literal
literal = intLit <|> stringLit <|> boolLit
where
Once again, we use the handy <|> operator, to say that a literal is either an integer literal, a string literal, or a bool literal. The lexer has already done the job of creating a token for each of these, which makes defining these alternatives very easy:
where
intLit =
pluck <| \case
IntLit i -> Just (IntLiteral i)
_ -> Nothing
stringLit =
pluck <| \case
StringLit s -> Just (StringLiteral s)
_ -> Nothing
boolLit =
pluck <| \case
BoolLit b -> Just (BoolLiteral b)
_ -> Nothing
We use the very handy pluck once again, in order to match against the right kind of literal token, and pick out the data it contains. We then make sure to wrap this data in the right kind of literal for our AST. Int literals become IntLiteral i, strings StringLiteral s, and bools BoolLiteral b, as we went over before.
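As a quick sanity check, we’d expect:

runParser literal [BoolLit True]
-- [(BoolLiteral True, [])]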
Parsing Operators
The next thing we’ll add support for is parsing operators. This requires a bit of thought before we get to writing out our parser though.
Operator Precedence
A first approach towards operators would be to treat them as a kind of punctuation. If you have something like:
x + y - a + 3
The names and literals are like words composing this arithmetic sentence, and the arithmetic operators are like the punctuation between the words. A first approach would start with some parser for these bottom level words. Then you’d use this parser, separated by the arithmetic tokens:
One problem with this approach is precedence. For example, multiplication with * has higher precedence than addition with +. This means that an expression like:
1 + 2 * 3 + 3 * 5
should parse as:
(1 + (2 * 3)) + (3 * 5)
and not:
((((1 + 2) * 3) + 3) * 5)
We would get the latter if we parsed addition and multiplication with the same precedence. One way to fix this is to introduce multiple layers, with one layer for each level of precedence:
Addition would parse multiplication, separated by +, and multiplication would parse literal factors, separated by *. This paradigm allows us to skip levels too. A single item like 3 falls under the “at least one number separated by *” category: we just have the single number.
Function application falls under this paradigm as well. We can treat function application as a very high precedence kind of operator. Instead of parsing “at least one <sub-factor>, separated by <operator>”, we just parse “at least two <sub-factor>s”. (This makes sure that only f x is considered as a function application, since f alone would just be a name.)
With multiplication, addition, and function application, our tower of operators now looks like this:
One aspect is missing though: parentheses! In languages with arithmetic expressions like these, you can manually insert parentheses to override the grouping rules.
For example, this expression:
(1 + 2) * 3
is not the same thing as:
1 + 2 * 3
precisely because the parentheses have been inserted.
There’s a simple rule to make this work. Instead of having the bottom of our tower only have dead-end rules like literals, or simple names, we can instead add a rule that goes all the way to the top, parsing any expression surrounded with parentheses:
Parsing some expression like f (1 + 2) will work, because we end up with a transition like this:
This is also useful beyond arithmetic, since we wanted to allow arbitrary expressions inside of parentheses anyways. As bizarre as it may seem, something like:
function (
case x of
1 -> 2
_ -> 3
)
is a valid bit of Haskell. This precedence looping trick will allow us to parse this kind of thing once we expand our expressions to accept more than just operators.
Using this approach, a full precedence tower including all of the operators in our language looks like this:
(See this link for a good overview of the precedence of different operators).
With that in mind, we’ve done enough thinking to start implementing this precedence tower.
Operator Utilities
The parsing for each layer of operator will be very similar, so we can go ahead and define a handful of utilities we’ll be using to build them up.
The first utility is to take a parser for <item> and turn it into a parser for ( <item> ). This is useful for implementing the looping in our tower of precedence. This utility just requires matching the ( and ) tokens before and after the original parser:
parensed :: Parser a -> Parser a
parensed p = token OpenParens *> p <* token CloseParens
Here we use the convenient *> and <* operators:

(*>) :: Applicative f => f a -> f b -> f b
(<*) :: Applicative f => f a -> f b -> f a

These sequence their arguments in the same order as <*>, but pick a different result. For our parensed parser, this does the right thing, parsing the ( before our original parser, and the ) after, but leaving us with just the original result.
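For example, using the valName parser from earlier, we’d expect:

runParser (parensed valName) [OpenParens, LowerName "x", CloseParens]
-- [("x", [])]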
When thinking about parsing operators, we had the nice idea of doing something like “parse at least one factor, separated by an operator”. We can make a utility that does this:
sepBy1 :: Parser a -> Parser sep -> Parser [a]
Given a parser for a, and a parser for sep, this creates a parser that parses one or more occurrences of a, separated by sep. So, something like sepBy1 literal (token Semicolon) would parse things like 3;1;"foo".
The implementation makes straightforward use of a few utilities for Alternative:
sepBy1 :: Parser a -> Parser sep -> Parser [a]
sepBy1 p sep = liftA2 (:) p (many (sep *> p))
Semantically, we first parse p, and then zero or more occurrences of sep followed by p. So:

<p>
<p> <sep> <p>
<p> <sep> <p> <sep> <p>

are all going to be accepted by this parser. many :: Alternative f => f a -> f [a] is provided to us, since parsers implement the Alternative class. We use sep *> p each time, since we just want the item parsed, and not the separator. Finally, we use liftA2 (:) to cons the first item parsed with the many items produced by the rest of the parser.
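Using the literal parser from earlier, the 3;1;"foo" example works out like this:

runParser (sepBy1 literal (token Semicolon))
  [IntLit 3, Semicolon, IntLit 1, Semicolon, StringLit "foo"]
-- includes ([IntLiteral 3, IntLiteral 1, StringLiteral "foo"], [])
-- among its results (shorter prefixes appear too, with leftover input)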
Now, this would work for something simple like multiplication, since we could parse out many multiplication factors, separated by *. One slight problem is that some operators have the same precedence, for example + and -. We need a modified version of sepBy1 where we remember the separators we see. In fact, it would be even more useful if, given something like:
1 + 2 - 3 + 4
our parser could recognize this as:
1 (+, 2) (-, 3) (+, 4)
and then build up the correct parse tree, using each operator:
((1 + 2) - 3) + 4
This inspires the following type for parsing operators:

opsL :: Parser (a -> a -> a) -> Parser a -> Parser a

The first parser is used for separators, and the second parser is used for the factors between separators. Semantically, this acts like sepBy1, with the arguments reversed. The key difference is that our separator parser produces a function, used to combine the items to its left and right. For example, for parsing addition and subtraction, we could use this as our separator:
(BinExpr Add <$ token Plus) <|> (BinExpr Sub <$ token Dash)
If we see a +, we combine things as an addition expression, and if we see a -, then we combine things with a subtraction expression.
The implementation is similar to sepBy1:
opsL :: Parser (a -> a -> a) -> Parser a -> Parser a
opsL sep p = liftA2 squash p (many (liftA2 (,) sep p))
where
squash = foldl' (\acc (combine, a) -> combine acc a)
Instead of discarding the separators, we now keep them alongside the tokens:
Then, we fold from the left, using the combinator that’s now attached to each token we encounter:
In practice, this works out nicely. Given some arithmetic expression like:
a + b + c - d
this groups things in the way we’d expect:
((a + b) + c) - d
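If it helps, here’s the fold inside squash pulled out on its own, applied to plain arithmetic functions instead of parse results. squashL is a name invented just for this illustration, and foldl' is already imported at the top of the module:

squashL :: a -> [(a -> a -> a, a)] -> a
squashL = foldl' (\acc (combine, a) -> combine acc a)

-- squashL 1 [((+), 2), ((-), 3), ((+), 4)]
-- == ((1 + 2) - 3) + 4
-- == 4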
One problem is that this only works for left associative operators, since we have our large expression on the left, and add items as we scan towards the right.
Some operators are not left associative though! For example, the function arrow -> between types associates to the right. This snippet:
Int -> String -> Int
means:
Int -> (String -> Int)
and not:
(Int -> String) -> Int
We need to create an opsR function, that does the same thing as opsL, but associates to the right instead:
opsR :: Parser (a -> a -> a) -> Parser a -> Parser a
The idea is that given some sequence, like:

A -> B -> C -> D

we first group the items and the separators, like we did in opsL.
Next, let’s flip this sequence around, before moving the separator to the right one step:
Now we also need to flip the operator, to make sure that the arguments are passed in the right order.
And then our scanning approach from last time works:
With this in mind, here’s the implementation of opsR:
opsR :: Parser (a -> a -> a) -> Parser a -> Parser a
opsR sep p = liftA2 squash p (many (liftA2 (,) sep p))
where
shift (oldStart, stack) (combine, a) =
(a, (combine, oldStart) : stack)
squash start annotated =
let (start', annotated') = foldl' shift (start, []) annotated
in foldl' (\acc (combine, a) -> combine a acc) start' annotated'
Our starting point is the same as with opsL: we’ve parsed our first item, and the following items along with their separators. From there, we do a first scan, where we want to end up with each separator attached to the item on its left, and the last item in our starting list alone. We do this with foldl', keeping track of the last item we’ve seen, and the tail of items along with their new separators. When we see a new (sep, a) pair, a becomes our new last item, and the previous item is paired up with sep, and put onto the tail.
Finally, we can scan over this list as if we were handling a left associative operator, with the caveat that we swap the arguments to combine. This is because we need to apply flip combine in practice, and not just combine, to accomplish the “operator flipping” we went over in the previous diagram. In practice, this means that if we have:

A -> B

which we’ve separated into:

(B, [(->, A)])

the -> function is still expecting the accumulator on its right, to make A -> B.
This took me a while to get a hold of as well, but thinking through the illustrations really helps make sense of how this operation works.
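If the diagrams aren’t enough, here’s the same computation traced with subtraction standing in for ->, since subtraction is also visibly not associative (plain arithmetic, no parsers involved):

-- parsing 1 - 2 - 3 right associatively, i.e. 1 - (2 - 3):
--   we start with (1, [((-), 2), ((-), 3)])
--   the shift fold leaves us with the last item, 3,
--     and the stack [((-), 2), ((-), 1)]
--   folding over the stack with flipped arguments:
--     (-) 2 3    == -1
--     (-) 1 (-1) == 2
-- so we get 1 - (2 - 3) == 2, and not ((1 - 2) - 3) == -4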
Operators in Practice
At the top of our precedence hierarchy, we have expressions:
expr :: Parser Expr
expr = binExpr
For now, all the expressions we have are binary expressions. This is where we’ll have all of our operator parsing.
We have:
binExpr :: Parser Expr
binExpr = cashExpr
where
cashExpr = opsR (BinExpr Cash <$ token Dollar) orExpr
The lowest precedence operator is at the top: the ubiquitous $, of course.
Next we have ||:
orExpr = opsR (BinExpr Or <$ token VBarVBar) andExpr
And then we have &&:
andExpr =
opsR (BinExpr And <$ token AmpersandAmpersand) comparisonExpr
Next we have comparisons, i.e. <, <=, >, >=, ==, and /=:
comparisonExpr = opsL comparisonOperator concatExpr
where
comparisonOperator =
(BinExpr Less <$ token LeftAngle)
<|> (BinExpr LessEqual <$ token LeftAngleEqual)
<|> (BinExpr Greater <$ token RightAngle)
<|> (BinExpr GreaterEqual <$ token RightAngleEqual)
<|> (BinExpr EqualTo <$ token EqualEqual)
<|> (BinExpr NotEqualTo <$ token FSlashEqual)
Here we have different possibilities, since all of these operators have the same precedence. For each of the possible tokens, we make sure to associate the correct binary expression. Each operator parser always yields exactly the combining function that it represents, thanks to our use of <$.
Continuing on, we have ++, for string concatenation:
concatExpr = opsL (BinExpr Concat <$ token PlusPlus) addSubExpr
Followed by addition and subtraction, with + and -:
addSubExpr = opsL addSubOperator mulDivExpr
where
addSubOperator =
(BinExpr Add <$ token Plus)
<|> (BinExpr Sub <$ token Dash)
Multiplication and division, with * and /, are a bit stronger:
mulDivExpr = opsL mulDivOperator composeExpr
where
mulDivOperator =
(BinExpr Mul <$ token Asterisk)
<|> (BinExpr Div <$ token FSlash)
And finally, the highest precedence binary operator is function composition, with .:
composeExpr = opsR (BinExpr Compose <$ token Dot) unaryExpr
Next come unary operators, of which we only have -x, for integer negation. Note that -(+1) 3 is valid Haskell, and evaluates to -4. This still looks a bit weird to me, but is pretty simple to handle:
unaryExpr :: Parser Expr
unaryExpr = negateExpr <|> appExpr
where
negateExpr = (token Dash *> appExpr) |> fmap NegateExpr
To parse a unary expression, we either find a - followed by a function application expression, or we just skip the - entirely.
appExpr is going to parse something like f a b c, and decide whether or not it’s a function application, based on how many things it sees:
appExpr :: Parser Expr
appExpr = some factor |> fmap extract
where
extract [] = error "appExpr: No elements produced after some"
extract [e] = e
extract (e : es) = ApplyExpr e es
If we see a single expression, then we leave it alone. Something like 3 by itself just means 3. If we see f 3, i.e. two or more items, then the first of those items is a function, and the rest are its arguments.
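For example, f 3 4 would come out as something like:
ApplyExpr (NameExpr "f") [LitExpr (IntLiteral 3), LitExpr (IntLiteral 4)]
while a lone 3 just stays LitExpr (IntLiteral 3).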
And finally, we have factor, which is at the bottom of our precedence hierarchy:
factor :: Parser Expr
factor = littExpr <|> nameExpr <|> parensed expr
where
littExpr = fmap LitExpr literal
nameExpr = fmap NameExpr name
A factor is either a simple literal, a name, or it pulls off the beautiful trick we went over earlier: it expects some parentheses, and jumps all the way back to the top of the expression hierarchy!
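As a reminder, parensed just surrounds a parser with the parenthesis tokens; it presumably looks something like this (modulo the exact token names):
parensed :: Parser a -> Parser a
parensed p = token OpenParens *> p <* token CloseParens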
Other Expressions
The next thing we're going to tackle is parsing expressions that don't contain definitions, i.e. if, case, and lambdas.
If expressions and lambdas are pretty simple, so let’s get them out of the way:
expr :: Parser Expr
expr = binExpr <|> ifExpr <|> lambdaExpr
where
For if, it's pretty much what it says on the tin:
ifExpr =
IfExpr
<$> (token If *> expr)
<*> (token Then *> expr)
<*> (token Else *> expr)
We parse the condition after an if token, the first branch after a then token, and the other branch after an else token. We use the standard <$> and <*> operators to plumb these results into the IfExpr constructor, which takes these three operands.
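So something like if b then 1 else 2 would parse to:
IfExpr (NameExpr "b") (LitExpr (IntLiteral 1)) (LitExpr (IntLiteral 2))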
For lambdas, it’s also not that complicated. A lambda like:
\x y -> x + y
is just the \
token, some names, a ->
token, and then an expression. In code,
we get:
lambdaExpr = token BSlash *>
liftA2 LambdaExpr (some valName) (token ThinArrow *> expr)
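Our example \x y -> x + y would then parse to something like:
LambdaExpr ["x", "y"] (BinExpr Add (NameExpr "x") (NameExpr "y"))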
For case expressions, we first need a way to parse patterns. Parsing patterns is kind of similar to parsing operators. We can have a constructor with simple arguments:
C 1 2 3 4
but in order to use more complicated patterns as arguments, we need parentheses:
C (A 2) 2 3 B
Notice how the B constructor takes no arguments, and can appear as a simple argument here, without any parentheses.
Our definition for a single pattern looks like this:
onePattern :: Parser Pattern
onePattern = unspacedPattern <|> argfulPattern
where
argfulPattern =
liftA2 ConstructorPattern constructorName (some unspacedPattern)
A pattern is either a constructor taking some arguments, or a single argument pattern. unspacedPattern will parse any kind of pattern that can appear as an argument to a constructor, perhaps requiring parentheses:
unspacedPattern :: Parser Pattern
unspacedPattern = simplePattern <|> parensed onePattern
where
So, an unspacedPattern is either an entire pattern, wrapped in parentheses, or a simple pattern, which can appear as an argument without any parentheses:
simplePattern =
singleConstructor
<|> wildCardPattern
<|> varPattern
<|> litPattern
singleConstructor = fmap (`ConstructorPattern` []) constructorName
wildCardPattern = WildcardPattern <$ token Underscore
varPattern = fmap NamePattern valName
litPattern = fmap LiteralPattern literal
A pattern like this is either a constructor with no arguments, like B in the example before, a wildcard pattern _, a single named pattern, like x, or a literal, like 3 or "foo".
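For instance, a pattern like C (A 2) _ x would come out as:
ConstructorPattern "C" [ConstructorPattern "A" [LiteralPattern (IntLiteral 2)], WildcardPattern, NamePattern "x"]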
With patterns defined, we can start working on case expressions. Note that from the point of view of the parser, case expressions use braces and semicolons:
case x of {
1 -> 2;
_ -> 3
}
This kind of situation will appear with definitions as well, so we can go ahead and make a utility for parsing braces and semicolons:
braced :: Parser a -> Parser [a]
braced p =
token OpenBrace *> sepBy1 p (token Semicolon) <* token CloseBrace
braced turns a parser for a single item into a parser for one or more items, separated by semicolons, all inside of braces.
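The sepBy1 combinator parses one or more occurrences of a parser, separated by some separator; if we didn't already have it on hand, the classic definition would be something like:
sepBy1 :: Parser a -> Parser sep -> Parser [a]
sepBy1 p sep = liftA2 (:) p (many (sep *> p))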
We can then immediately use braced to create a parser for case expressions:
caseExpr :: Parser Expr
caseExpr =
liftA2 CaseExpr (token Case *> expr <* token Of) (braced patternDef)
where
patternDef = liftA2 (,) onePattern (token ThinArrow *> expr)
We have the case <scrutinee> of part, followed by the patterns, which are in braces and semicolons. We then have the <pat> -> <branch> part: each branch is just the onePattern we defined before, followed by the -> token, and the expression for that branch.
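The earlier example would then parse to something like:
CaseExpr (NameExpr "x")
  [ (LiteralPattern (IntLiteral 1), LitExpr (IntLiteral 2))
  , (WildcardPattern, LitExpr (IntLiteral 3))
  ]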
We could add this to the definition of expr, but we're going to be adding a few more expressions shortly, so we can just include it when we get around to those.
Types
The remaining expressions are those related to definitions: things like x = 2, or x :: Int. We can parse the expression part of these value definitions just fine, but we haven't added any parsing for types yet! We'll take care of this now, before moving on to definitions.
Parsing types is going to be very similar to parsing patterns. We have some types taking in arguments, like Pair Int String, or List (List Int). We need to distinguish the types that are basic enough to appear as arguments without parentheses, like we did with patterns. We'll also need to add in a bit of operator parsing, to handle function types like Int -> String -> Int.
Our goal is to define:
typeExpr :: Parser Type
which parses an arbitrary type.
Starting from the bottom, we have:
singleType :: Parser Type
singleType = fmap TVar typeVar <|> primType <|> parensed typeExpr
where
primType =
(IntT <$ token IntTypeName)
<|> (StringT <$ token StringTypeName)
<|> (BoolT <$ token BoolTypeName)
The singleType parser accepts a type variable, a primitive type that isn't any kind of custom type, or an arbitrary type, so long as it's between parentheses.
We can then use this to define a full parser for type expressions:
typeExpr :: Parser Type
typeExpr = opsR ((:->) <$ token ThinArrow) baseType
where
baseType = singleType <|> customType
customType = liftA2 CustomType typeName (many typeArgument)
typeArgument :: Parser Type
typeArgument = namedType <|> singleType
where
namedType = fmap (`CustomType` []) typeName
We have different type “factors”, separated by the function arrow ->. This becomes the constructor :->, representing the same thing in our syntax tree. Of course, we make sure that we parse this operator in a right associative way, so that:
Int -> String -> Int
is parsed as:
IntT :-> (StringT :-> IntT)
The base type that can appear between function arrows is either the single type we defined earlier, or a custom type, which we excluded from the definition of singleType. A custom type is parsed as some type name, followed by zero or more arguments.
For the arguments, we do something interesting. We accept singleType, as expected, but we also accept namedType, which is just a single custom type, without any arguments. This allows us to have a type like T as an argument, giving us something like Pair T T, without needing to wrap T in parentheses.
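As a concrete example, Pair T (List Int) would parse to something like:
CustomType "Pair" [CustomType "T" [], CustomType "List" [IntT]]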
Value Definitions
With that in place, we now have everything we need to parse value definitions. As a reminder, these look like:
x :: Int
x = 3
We have either a type annotation, or a name definition. The parser for this is pretty straightforward:
valueDefinition :: Parser ValueDefinition
valueDefinition = nameDefinition <|> typeAnnotation
where
nameDefinition =
NameDefinition
<$> valName
<*> many unspacedPattern
<*> (token Equal *> expr)
typeAnnotation =
liftA2 TypeAnnotation valName (token DoubleColon *> typeExpr)
A valueDefinition is either a nameDefinition, or a typeAnnotation. A typeAnnotation is a name, followed by a :: and some type. A nameDefinition is also a name, but this time followed by multiple patterns as arguments, the = token, and then an expression. We allow multiple patterns, since Haskell provides the syntactic sugar:
f a b = a + b
for:
f = \a b -> a + b
These are unspacedPattern, since if we need complex patterns, we need them to appear with parentheses. Something like:
f Pair a b =
will be parsed as:
f (Pair) (a) (b) =
instead of:
f (Pair a b) =
which requires explicit parentheses.
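So, with the parentheses in place, f (Pair a b) = a would give us something like:
NameDefinition "f" [ConstructorPattern "Pair" [NamePattern "a", NamePattern "b"]] (NameExpr "a")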
Remaining Expressions
With these value definitions in place, we're now ready to finish defining all of our expressions. The remaining expressions we need to add to expr are where and let expressions:
expr :: Parser Expr
expr = notWhereExpr <|> whereExpr
where
notWhereExpr =
letExpr <|> ifExpr <|> lambdaExpr <|> binExpr <|> caseExpr
ifExpr =
IfExpr
<$> (token If *> expr)
<*> (token Then *> expr)
<*> (token Else *> expr)
lambdaExpr = token BSlash *>
liftA2 LambdaExpr (some valName) (token ThinArrow *> expr)
letExpr = liftA2 LetExpr
(token Let *> braced valueDefinition) (token In *> expr)
whereExpr = liftA2 WhereExpr
notWhereExpr (token Where *> braced valueDefinition)
It should be clear that notWhereExpr <|> whereExpr includes every kind of expression, unless you're a constructive mathematician.
For let expressions, we just parse the let token, followed by some value definitions in braces and semicolons, the in token, and then an expression.
Remember that let introduces a layout, so:
let
x :: Int
x = 2
in x
is shorthand for:
let {
x :: Int;
x = 2
} in x
Now, for where expressions, it's tempting to do something like:
expr *> token Where *> braced valueDefinition
but the problem is that this parser recurses infinitely. This is an issue we haven't touched upon yet, called left recursion. It arises for top-down parsing methods like ours, and for parser combinators in particular, even in a lazy language like Haskell.
The fundamental problem is that if we have a rule like this, then to parse expr, we try parsing expr, which means that we try to parse expr, and so we see if we can parse expr, and we give a shot at parsing expr, and then, out of curiosity, see if maybe parsing expr might work, etc. We aren't able to make any progress, because we have no concrete rules to latch on to.
The solution is to not have expr on the left here, but notWhereExpr instead. This way, if our parser starts “exploring” this path, it can't immediately loop back onto itself, but instead has to parse a different rule that will allow it to make progress. This is why we define notWhereExpr to include everything except whereExpr: to prevent this kind of immediate loop, with no tokens in between.
Anyhow, with this problem resolved, we have now finished parsing all of the expressions in our language. The end is in sight!
Other Definitions
The only items left to parse are custom type definitions and type synonyms, which make up the remaining kinds of definition in our language.
For type definitions, we’ll need a parser for each of the constructors that make up definitions like:
data VariantA = A Int String | B String Int
This gives us:
constructorDefinition :: Parser ConstructorDefinition
constructorDefinition =
liftA2 ConstructorDefinition constructorName (many typeArgument)
We can reuse the typeArgument we defined earlier, which was used for the simple types that appear as arguments to a custom type, like Pair Int (List String). Of course, if we have a constructor:
PairConstructor Int (List String)
then it's natural to reuse the same parsing scheme for its arguments. A constructor doesn't necessarily have any arguments, hence the many, instead of some.
Finally, we can use this to parse the remaining definitions:
definition :: Parser Definition
definition = valueDefinition' <|> dataDefinition <|> typeSynonym
where
valueDefinition' = fmap ValueDefinition valueDefinition
dataDefinition =
DataDefinition
<$> (token Data *> typeName)
<*> many typeVar
<*> (token Equal *> sepBy1 constructorDefinition (token VBar))
typeSynonym = liftA2 TypeSynonym
(token Type *> typeName) (token Equal *> typeExpr)
A definition is either a value definition, a data definition, or a type synonym.
Type synonyms are pretty simple. We have a type token, the name of the synonym, and then =, followed by the type we're making an alias of.
For data definitions, we first have the data token, followed by the name of the new custom type, and any type variables it might have. We then have an = token, followed by its constructors. There must be one or more constructors, each of them separated by a | token.
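For example, data Pair a b = MkPair a b would parse to something like:
DataDefinition "Pair" ["a", "b"] [ConstructorDefinition "MkPair" [TVar "a", TVar "b"]]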
And now we can parse all of the definitions in our language. We just have a bit of plumbing left to do to connect all of this together.
Gluing it all together
A Haskell program consists of a series of top level definitions. The whole file is actually in a braced environment, which means that we have the following parser for the entire AST:
ast :: Parser AST
ast = fmap AST (braced definition)
These definitions include value definitions, which define values by assigning them expressions or annotating their types, as well as definitions of new types, through type synonyms and data definitions.
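As a reminder, the AST type itself is just a wrapper around this list of definitions, presumably along these lines:
data AST = AST [Definition] deriving (Show)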
We also need to amend our ParseError type to be a bit more informative:
data ParseError
= FailedParse
| AmbiguousParse [(AST, [Token])]
deriving (Show)
While still not as informative as you might like, we now have two kinds of errors. The first occurs if our parser found no way to parse the tokens given to us, and the second occurs if multiple possible ASTs have been produced after going through all of the tokens.
I’ve mentioned ambiguity being a concern when designing some aspects of the parser, and avoiding this kind of error is the reasoning behind some of the design.
With these errors in mind, we can amend our parser function to use the ast parser we just defined:
parser :: [Token] -> Either ParseError AST
parser input = case runParser ast input of
[] -> Left FailedParse
[(res, _)] -> Right res
tooMany -> Left (AmbiguousParse tooMany)
We use runParser :: Parser a -> [Token] -> [(a, [Token])] to get a list of parse results, after running the parser for our entire syntax tree on the input tokens. If we have no possible parses, then that's the first error we defined above. If we have too many parses, that's the ambiguous error. Finally, if we have just a single result, then that's the AST that we want our parser to return.
And with that, we’ve finished our parser! It’s time to pat ourselves on the back, and try it out with a few examples.
Examples
If we run
cabal run haskell-in-haskell -- parse file.hs
we’ll be able to see the parse tree for a given source program.
For something simple like:
x :: Int
x = 2
we get:
AST
[ ValueDefinition
( TypeAnnotation "x" IntT )
, ValueDefinition
( NameDefinition "x" []
( LitExpr
( IntLiteral 2 )
)
)
]
which is quite a bit more verbose, but matches what we expect. We see the definition of x pop up, first as a type annotation, and then once more, with the actual expression.
We can handle more complex arithmetic, of course:
x :: Int
x = 1 + 2 * 3
This program gives us:
AST
[ ValueDefinition
( TypeAnnotation "x" IntT )
, ValueDefinition
( NameDefinition "x" []
( BinExpr Add
( LitExpr
( IntLiteral 1 )
)
( BinExpr Mul
( LitExpr
( IntLiteral 2 )
)
( LitExpr
( IntLiteral 3 )
)
)
)
)
]
We can also do let expressions:
x :: Int
x =
let y = 2
in y
AST
[ ValueDefinition
( TypeAnnotation "x" IntT )
, ValueDefinition
( NameDefinition "x" []
( LetExpr
[ NameDefinition "y" []
( LitExpr
( IntLiteral 2 )
)
]
( NameExpr "y" )
)
)
]
This works for even more complicated programs, of course, but I'll spare you the very large syntax trees.
As a final example, let’s do some basic pattern matching with a custom type:
data Color = Red | Blue | Green
f Red = 0
f Blue = 1
f Green = 2
AST
[ DataDefinition "Color" []
[ ConstructorDefinition "Red" []
, ConstructorDefinition "Blue" []
, ConstructorDefinition "Green" []
]
, ValueDefinition
( NameDefinition "f"
[ ConstructorPattern "Red" [] ]
( LitExpr
( IntLiteral 0 )
)
)
, ValueDefinition
( NameDefinition "f"
[ ConstructorPattern "Blue" [] ]
( LitExpr
( IntLiteral 1 )
)
)
, ValueDefinition
( NameDefinition "f"
[ ConstructorPattern "Green" [] ]
( LitExpr
( IntLiteral 2 )
)
)
]
We first have the definition for this custom type, with its three different constructors. We then have three separate definitions for the function f, since it has three different heads. Each of those heads has a pattern it matches against, which we see in the NameDefinition node as well.
Anyways, it’s a lot of fun to play around with this for different programs, and I’ve gotten distracted playing around with it again, even through I wrote this parser months before I got around to writing this post.
Conclusion
Hopefully this was an interesting post, as we get deeper and deeper into the guts of the compiler. The little sprinkling of theory should be enough to make the implementation understandable, although this post did end up being longer than I expected it to be.
The full code for this part, and the entire compiler are available, as a reference.
As usual, Crafting Interpreters has a great chapter on parsing, which might be a good read if you’re still not sure how exactly parsers work, even if the combinator approach works intuitively.
If you want a meatier text that goes into different theoretical methods of parsing, including thornier issues like left recursion, I like the book Engineering a Compiler. The third chapter of this book is about parsing, and covers all of this theory in great detail.
In the next post, we'll be going over the simplifier, to prepare our syntax tree for consumption by our type checker. The simplifier will have some drudge work to do, like resolving information about the type synonyms used in our program, and the signatures of each constructor. Another super interesting part is removing all of the nesting from pattern matching, which makes the rest of the compiler much simpler.
Anyways, I’m getting ahead of myself, but that’s a bit of a preview for the next part in this series. At the rate it takes me to write these, this’ll most likely be coming around next year.
So, I wish you all a (belated) Merry Christmas, and a Happy New Year!