(Haskell in Haskell) 3. Parsing
This is the third post in the Haskell in Haskell series.
In this post, we’ll go over creating a parser for our subset of Haskell. This stage of the compiler takes the tokens that our lexer produced in the previous part, and turns them into a representation of the program’s structure.
As usual, both the code for this part, and the entire compiler are available.
Because of all the work we did last time, this part should actually be simpler. We’ve already broken up our source code into bite-size tokens, using whitespace to infer semicolons and braces. Having this clean representation makes our parser’s job much easier.
Parsing in Theory
Before we start coding our parser, let’s try and figure out what a parser is supposed to do, and why we need one anyways.
The goal of parsing is to go from our source code, to a representation that we can actually work with. We want to transform Haskell code like:
foo = 2 + 3 * 4
into some kind of data structure that we can progressively analyze and transform into C code. We’ll need to analyze things like the scoping of names, and the usage of types. We’ll also be gradually transforming and simplifying our representation, with the goal of getting closer and closer to C.
We’ve already written one chunk of this parsing: the lexer. The lexer takes some source code, and groups it up into small tokens. Each token doesn’t make sense as separate chunks. Our parser needs to take these tokens, analyze their structure, and produce a data structure representing the source code.
More than Tokens?
Our lexer already gives us a data structure: a list of tokens. Why not just use this to represent Haskell code?
Unfortunately, this isn’t going to work; the tokens aren’t expressive enough to really reflect the structure of Haskell.
Take this example:
x = 2 + 2
y = 2
After lexing, we get a flat stream of tokens:
{ x = 2 + 2 ; y = 2 }
But this flat representation fails to capture the nested nature of this snippet. A better representation would be something like:
(
(x = (2 + 2))
(y = 2)
)
The expression 2 + 2 “belongs” to the definition of x, and the expression 2 “belongs” to the definition of y. This kind of nested structure is ubiquitous throughout Haskell. A flat list of tokens is simply a bad fit here.
Trees
On the other hand, this kind of nesting is perfect for a tree.
Let’s take our previous example once more:
x = 2 + 2
y = 2
This piece of code can be represented by a tree that looks something like this:
Each node in the tree has some kind of label, and can contain other subtrees as children.
We have a node at the top level, containing all the top level definitions. Definitions use an = node here, with the name as one child, and the expression as the other. Addition uses a + node, with its arguments as children.
This approach scales nicely to expressions with even more nesting:
x =
let y = 2 + 2
in y
We now have two nested expressions for x, giving us a tree that looks like:
We can imagine how this might work for even more complicated expressions, since nodes can contain arbitrary trees as their children.
Valid Syntax vs Valid Semantics
This tree of nodes built up from source code is called an Abstract Syntax Tree, or an AST, for short. It’s a syntax tree, since it represents the structure of our programming language: what syntax is valid or not. I don’t actually know why it’s called an abstract tree, but I like to think that it’s abstract, because it provides a representation of trees that are syntactically valid, but that don’t necessarily make any sense.
For example, take this snippet:
x =
let y = "34" + 3
in "x" foo
This bit of code is syntactically valid, and our parser shouldn’t complain. It still has a few issues though:
- You can’t add strings and numbers
- You can’t use a string like "x" as a function
- The name foo is not defined in this scope
Because of these issues, our program is semantically invalid, even though it manages to parse. The goal of the type checker is to find all of these issues. We’ll get to this soon enough, but not in this part.
It’s not our parser’s job to figure these things out yet. Our parser is just concerned with producing this Abstract Syntax Tree. The parser will be taking the tokens produced by the lexer, and transforming them into this tree.
Grammars
One common tool used to describe these syntax rules is a Context Free Grammar, or just Grammar for short. Grammars allow us to think about the structure of a programming language, at least at the level of syntax.
As a motivating example, let’s take a very simplified subset of Haskell.
Our example language consists of a single expression, which might be one of:
- A string literal: "foo"
- An int literal: 345
- A variable name: x3
- Or, a let expression, with explicit braces and semicolons:
let {
x = 3;
y = 2;
} in x
A Grammar for this language would look something like this:
<expr> ::= <int> | <string> | <name> | <letexpr>
<letexpr> ::= let { <definitions> } in <expr>
<definitions> ::= <singleDefinition> ; <definitions> | <empty>
<singleDefinition> ::= <name> = <expr>
Each <rule> references some type of syntactic element. For example, <expr> references an expression in our little language. After the ::=, we declare each way we can build up this element. Everything outside of angle brackets is an actual token we expect to see.
For <expr>, this matches the overview we gave earlier: an expression is either an integer literal, a string literal, a name, or a let expression. And then, <letexpr> consists of the token let, followed by some definitions inside braces, the token in, and an expression. <definitions> is either empty, or a single definition followed by a ;, and some more definitions. Finally, <singleDefinition> represents things like x = 3. This is just a name, the token =, and then an expression.
Intuitively, this grammar describes the rules that define what structure our toy language has. These kinds of rules should actually be familiar to us, since our lexer used a similar approach.
For example, to define what a token was, we built up a large lexer from smaller possibilities:
token :: Lexer (Token, String)
token = keyword <|> operator <|> literal <|> name
This matches up naturally with a grammar like:
<token> ::= <keyword> | <operator> | <literal> | <name>
<keyword> ::= let | do | ...
...
The big difference is that our tokens were completely flat. Tokens never contained other tokens. Grammars, on the other hand, allow us to talk about recursive structures as well. Our toy language has a recursive structure: expressions can contain let expressions, which also contain expressions. Because of this, you can nest expressions inside of each other:
let {
x = 2
} in let {
y = x
} in let {
z = y
} in z
Grammars should also seem eerily familiar to you as a Haskeller. In Haskell, we have algebraic data types.
For example, let’s define a type for colors. A color can be either red, blue, green, or an arbitrary hex string like “#ABFE00”.
We could represent this with an ADT, like this:
data Color = Red | Blue | Green | Hex String
But, if you squint a bit, this looks kind of like a grammar:
<color> ::= red | blue | green | # <string>
If we revisit the grammar for our toy language:
<expr> ::= <int> | <string> | <name> | <letexpr>
<letexpr> ::= let { <definitions> } in <expr>
<definitions> ::= <singleDefinition> ; <definitions> | <empty>
<singleDefinition> ::= <name> = <expr>
We can go in the other direction, from grammar to ADT, modelling this structure using Haskell types:
data Expr
= IntExpr Int
| StringExpr String
| NameExpr Name
| LetExpr [Definition] Expr
data Definition = Definition Name Expr
There’s not a one-to-one correspondence here, since the grammar is also concerned with syntactic elements, like semicolons, and braces, and we don’t care about those here. But the rough structure is the same, especially in terms of how the data type references itself recursively.
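To make the correspondence concrete, here’s the let expression from earlier, written out as a value of these types. This is just an illustration; it assumes Name is a synonym for String:

-- let { x = 3; y = 2; } in x
LetExpr
  [ Definition "x" (IntExpr 3),
    Definition "y" (IntExpr 2)
  ]
  (NameExpr "x")

Notice how the semicolons and braces have disappeared: they guide the parser, but they don’t need to be part of the tree.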
Because Haskell has these nice recursive ADTs, we can have our syntax tree look very similar to the grammar for our language.
We won’t be using grammars directly, but we’ll be doing awfully similar things. I also think it’s important to touch on the concept of grammars, at least briefly, since most of the other resources you’ll read about parsing will use them heavily.
Parser Combinators
Alright, now it’s time to start writing some code. Before we start defining the structure of our syntax tree, or writing the parser itself, we first need to write our tool of choice for making parsers: the Parser Combinator! This is going to be a souped-up version of the Lexer Combinator tool we made last time, extended to handle recursive structures!
Let’s first create a module for our parser in our cabal file:
exposed-modules: Ourlude
, Lexer
, Parser
Now let’s go set up the standard module and imports in src/Parser.hs:
{-# LANGUAGE LambdaCase #-}
module Parser where
import Control.Applicative (Alternative (..), liftA2)
import Data.List (foldl')
import Lexer (Token (..))
import Ourlude
We have LambdaCase as a convenient extension in this module. We also import Alternative, as well as some other utilities.
From Lexer Combinator to Parser Combinator
With Lexer Combinators, a lexer was essentially just a function:
input -> Maybe (a, input)
We take in some input, and then either fail or succeed, returning a value, and the remaining input. This works well for lexing, since a lexer can only produce one result, and it’s always obvious which of two options we need to choose: just use the longest match rule.
This won’t work for parsing. Our structure is much more complicated, so it’s not always obvious that the longest match rule will be correct. What we do instead is to let ambiguity happen, at least temporarily. Instead of either failing, or returning a single way of consuming the input, we return all possible ways to proceed forward.
We now have:
input -> [(a, input)]
By the time we’ve parsed the entire source code, there should be only a single element in this list. Otherwise, there are multiple ways of parsing some code. For example, a parser that doesn’t handle operator precedence might see 2 + 3 * 4 and produce both:

(2 + 3) * 4
2 + (3 * 4)

Of course, the second option is correct, because * has higher precedence than +, as we’ll see later.
Ambiguity in the parsing process is fine, as long as we resolve it later. We might need more than one token to resolve this ambiguity, which is ok. As a prime example, take where expressions. These are expressions with the definitions coming after the body they’re used in:
x + y
where
x = 2
y = 3
If in our parser for expressions, we have some rule like:
expr = notWhereExpr <|> whereExpr
Then we’ll have some ambiguity here. This is because we can’t tell which of these parsers is correct until we see the where token. For the first 3 tokens, we have multiple options floating around.
By the time we finish parsing the file, we shouldn’t have any ambiguity left. If we do, then that means that there are multiple ways of interpreting some code. This is basically never a problem in the code itself, but a problem in the implementation of the language. There are certain languages that are inherently ambiguous, like C++, in which case the parser needs to make an arbitrary choice, but Haskell is not ambiguous. If we fail to parse something because of ambiguity, that’s probably a bug in our parser.
The Concrete Type
With all of this in mind, we can go ahead and define a concrete type for Parser:
newtype Parser a = Parser {runParser :: [Token] -> [(a, [Token])]}
This is very similar to our Lexer type from the last part. Instead of working over String, we now work over [Token] instead. We also return multiple possible ways of parsing, as explained earlier.
Example Building Blocks
To get a better feel for how this works, let’s go ahead and make a couple of basic helpers.
First, let’s make a little helper that accepts a single token:
satisfies :: (Token -> Bool) -> Parser Token
satisfies p =
Parser <| \case
t : ts | p t -> [(t, ts)]
_ -> []
This is analogous to Lexer.satisfies, which accepted a single character. Our implementation looks at the remaining list of tokens, and checks if there is a token matching our predicate at the front of the list. If so, we return that token, along with the remaining input. Otherwise, we return an empty list, since there’s no way for this parser to succeed.
An immediate application of this satisfies function is going to be:
token :: Token -> Parser Token
token = (==) >>> satisfies
This just matches against a specific token that we provide.
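As a quick illustration (not code we’ll add to the compiler), we’d expect behavior like this, using the If and Then tokens from our lexer:

runParser (token If) [If, Then]
-- [(If, [Then])]: the If token is consumed, and Then remains
runParser (token If) [Then, If]
-- []: the first token doesn't match, so there's no way to succeed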
A similar function to satisfies is going to be pluck:
pluck :: (Token -> Maybe a) -> Parser a
The behavior is going to be the same as satisfies, except now we can return a value if the predicate matches. This is useful, because we have a few tokens, like IntLit i and StringLit s, which also have an argument. This helper lets us match against a token like this, extracting the argument. For example, we can do:
pluck <| \case
IntLit i -> Just i
_ -> Nothing
to get a parser that accepts an int literal, but produces an Int.
The implementation of pluck is similar to satisfies:
pluck :: (Token -> Maybe a) -> Parser a
pluck f =
Parser <| \case
t : ts -> case f t of
Just res -> [(res, ts)]
_ -> []
_ -> []
Just replaces the True case we had previously, but also provides us with the value we want to return.
Functor and Applicative
Like with Lexer, we’ll be providing instances of Functor and Applicative for Parser. For Functor, we have:
instance Functor Parser where
fmap f (Parser p) = Parser (p >>> fmap (first f))
This just maps over the results that we’ve produced. In fact, this definition is exactly the same as for Lexer, since Maybe and [] are both Functors!
For Applicative, the operations will mean the same thing as for Lexer. pure :: a -> Parser a will give us a parser that always succeeds, consuming no input, and (<*>) :: Parser (a -> b) -> Parser a -> Parser b will give us a parser that runs the first parser, and then runs the second parser. The return value will be the function produced by the first, applied to the argument produced by the second. The difference with Lexer is that now a parser may return multiple results. We’re going to use all the combinations of first and second results:
instance Applicative Parser where
pure a = Parser (\input -> [(a, input)])
Parser pF <*> Parser pA =
Parser <| \input -> do
(f, rest) <- pF input
(a, s) <- pA rest
return (f a, s)
pure does exactly what we said it does before, so it’s not too surprising. What’s neat with <*> is that to implement the “combine all possible functions and arguments” behavior, we can use the fact that [] implements the Monad class in the same way that Maybe does, allowing us to use do notation. For [], instead of allowing us to handle failure, we are now able to handle “non-determinism”. What ends up happening is that for each possible pair (f, rest) of a function and remaining input, we return each possible way of using the second parser on that input, and then applying the function to the argument produced.
In practice, most of the time there won’t be any ambiguity, so:

parser1 *> parser2

will simply mean “run parser1, and then run parser2”. On the other hand, if we have:

(parser1A <|> parser1B) *> (parser2A <|> parser2B)

then we’ll get 2 * 2 = 4 paths available to us. We can parse 1A and then 2A, 1A and then 2B, etc.
Alternative
As you might expect, we’ll also be implementing Alternative for Parser, to get that sweet sweet <|> operator, and all the goodies that come along with it! We also need an empty :: Parser a value, which gives us a Parser that always fails.
instance Alternative Parser where
empty = Parser (const [])
Parser pA <|> Parser pB =
Parser <| \input -> pA input ++ pB input
To get a parser that always fails, we return no valid ways to continue parsing.
For <|>, we run both parsers, and then combine both of their results using ++. Unlike Lexer, we don’t try to resolve any ambiguity if they both succeed. As explained earlier, we let the ambiguity exist, by returning all possible ways of parsing. Our final parser should only end up returning one result though, even if sub-parsers might temporarily have some ambiguity.
Like last time, the Alternative instance gives us plenty of goodies for free. For example, we now have many :: Parser a -> Parser [a], which allows us to turn a parser for one item into a parser for many items, placed one after the other. There’s also some, which does the same thing, but requires at least one item. For example, if intParser accepts 3, then some intParser would accept 3 4 5 6, producing [3, 4, 5, 6] as our output.
Integrating a Parser
We now have the combinators we need to build our parser. Before we actually start working on parsing our subset of Haskell, let’s go ahead and add a new stage to our compiler. We can get all of this boilerplate out of the way, letting us focus on parsing.
Let’s define a stub type for our syntax tree, and for our errors:
data ParseError = UnimplementedError deriving (Show)
data AST = AST deriving (Show)
Like last time, we just have a single error: the error telling users that our parser is not implemented yet!
Our AST is just a stub data type as well. We’ll be completing both of these soon enough. For now, we just want to get this stage of the compiler integrated into the command line program.
We also need to write a simple parser function that we can expose as our implementation of this stage:
parser :: [Token] -> Either ParseError AST
parser _ = Left UnimplementedError
This function will take in the list of tokens produced by the lexer, and return either the parsed syntax tree, or an error, indicating that parsing failed.
Finally, we can add the AST type and the parser function to the exports for this module:
module Parser (AST(..), parser)
Adding a new Stage
We now need to go into Main.hs, and add a Stage for the parser. First, let’s add the Parser module to our list of imports.
-- ...
import qualified Parser
-- ...
Right next to lexerStage, we can go ahead and add a new Stage for our parser:
parserStage :: Stage [Lexer.Token] Parser.AST
parserStage = makeStage "Parser" Parser.parser
The definition of this is essentially the same as we had for lexerStage. Stages were introduced in great detail last time, so feel free to take a quick refresher if you’d like.
We also need to modify our argument parsing to accept “parse” as a stage. This way, users can do:
haskell-in-haskell parse file.hs
in order to print out the AST they get from parsing file.hs.
Of course, they’ll still be able to do:
haskell-in-haskell lex file.hs
to just print out the tokens they get after lexing file.hs.
All we need to do is modify readStage, giving us:
readStage :: String -> Maybe (String -> IO ())
readStage "lex" =
lexerStage |> printStage |> Just
readStage "parse" =
lexerStage >-> parserStage |> printStage |> Just
readStage _ = Nothing
Here we get to use the fancy >-> operator we defined last time, in order to compose the lexing stage with the parsing stage. This works out nicely, since the parsing stage consumes the tokens that the lexing stage produced. We end up with a stage taking in a String for the source code, and producing the AST that we want to print out.
We can now run all of this, and see something like:
> cabal run haskell-in-haskell -- parse foo.hs
Parser Error:
UnimplementedError
which makes sense, since we haven’t implemented the parser yet. Let’s fix that!
Structure of the AST
Before we can write a parser to convert our tokens into a syntax tree, we need to define the shape of that syntax tree. We’re going to be describing the exact subset of Haskell that we implement, by creating various data types for each construct in the language.
For example, we’re going to have a data type for expressions. One of the variants will be a binary expression, for things like 1 + 2, or 4 * 33. Another variant might be function application, for things like f 1 2 x. Etc.
A Rough Idea
Our subset of Haskell cleaves almost neatly into two distinct parts:
- The part of the language dealing with defining types
- The part of the language dealing with defining values
These 2 parts interact through annotations, which allow us to declare that a given value has a certain type.
For example, if you have:
x :: Int -> Int
x = \x -> x + 1
The Int -> Int is squarely in the domain of types, and we’ll need some data structures to describe function types like Int -> Int. The lambda expression \x -> x + 1 is clearly in the domain of values. And the declaration x :: ... is a point of interaction between both domains.
Because of this, we can think of our language as being roughly groupable into 3 sections:
- Types: describing them, defining them, creating synonyms
- Expressions
- Definitions: assigning types and expressions to names
There’s a lot of interaction between these categories, of course. For example, let x = 3 in x is an expression that also uses a definition, inside of the let. With that said, it’s still useful to have a bit of categorization in our head before we start defining everything we need.
Types
So, the first thing we can do is define how we’ll work with types in this subset of Haskell. We need ways to refer to certain types, like Int, or String. We also need to be able to use more complicated types, like (Int -> String) -> String. We’ll need ways to define new types, either through custom types, like data Color = Red | Blue | Green, or type synonyms, like type MyInt = Int.
The Fundamental types
Before we get to the definitions of new types, let’s first define our data structures representing types that exist. For example, when you have x :: Int -> Int, you need some way of representing the type that comes after the ::. This is what we’ll be defining now. Essentially, what we’re doing is defining what a type “is” in our subset of Haskell. In fact, this definition will be useful beyond just parsing. We can also use this representation of types in our type checker, for example. Because of this, we won’t be creating this definition of what a type is in the Parser module, but instead creating a new module, in src/Types.hs, to hold fundamental data structures for working with types. We’ll expand on this module in further parts, as we need to do more things with types.
As usual, let’s edit our .cabal file to add the Types module:
exposed-modules: Ourlude
, ...
, Types
Now, we can go ahead and define this module in src/Types.hs:
module Types where
import Ourlude
For now, we just export everything in the module, but we’ll come back and make the exports explicit once we’ve defined our representation of types.
First, let’s think about what kinds of types you should be able to work with in our subset of Haskell. As mentioned already, we’re going to have a primitive Int type, the standard integer type we’re used to. Integer literals will have that type, so 3 :: Int, for example. We’ve already lost the element of surprise by defining boolean and string literals in the lexer, so we’re going to also have primitive boolean and string types in our language, departing from Haskell slightly.
So we have String and Bool as builtin types as well. In addition to builtin types, we need a way for users to define new types, and then reference those types. If a user has type X = Int, then they should be able to say:

y :: X
y = 3

So we need to reference custom types by name, as well.
Our subset will also have support for polymorphism. You can define functions like:
id x = x
The type of this function is polymorphic:
id :: a -> a
Here, a is not a type per se, but rather a type variable. We need a way to reference type variables by name as well.
This also means that custom data types can be polymorphic, and have variables of their own. For example, you might define a list data type as:
data List a = Cons a (List a) | Nil
This will be valid in our language, and introduces the type List _, where the underscore needs to be filled in. So you might have List Int as one type, List String as another, and List a, if a happens to be a type variable in scope. In brief, custom types are referenced by name, and also have multiple type variables as arguments.
Finally, we need a way to talk about the function type A -> B, which is also built into the language.
With all of this in mind, let’s just define our representation of types:
type TypeVar = String
type TypeName = String
infixr 2 :->
data Type
= StringT
| IntT
| BoolT
| CustomType TypeName [Type]
| TVar TypeVar
| Type :-> Type
deriving (Eq, Show)
So, we have the basic primitive types as hardcoded options:
- Int corresponds to IntT in this representation
- String corresponds to StringT
- Bool corresponds to BoolT
We have 2 type synonyms, to clarify the difference between TypeVar, which is a lower case name, like a, used for the name of a type variable, and TypeName, which is an upper case name, like List or Foo, used to reference the name of a newly defined type.
Type variables have the variant TVar, and take a name as argument. So a in Haskell would be TVar "a" in our representation.
For newly defined types, we have the name of that type, as well as a list of arguments, which are also types. List Int would be CustomType "List" [IntT] in this representation. You can also nest things, so List (List Int) would become:

CustomType "List" [CustomType "List" [IntT]]
Note that type synonyms are also under this category. If you have type X = Int, then you’d use CustomType "X" [] to refer to this type synonym.
Finally, we need a way to represent function types. We create a new operator :-> for this. (We need the : so that it can be used in types.) With this in place, Int -> Int becomes IntT :-> IntT, as expected. Note that we declare this operator infixr, which means it’s right associative, like -> is in Haskell. This means that:

IntT :-> StringT :-> BoolT

is parsed by GHC as:

IntT :-> (StringT :-> BoolT)

which is exactly the behavior that we want.
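As another example, the higher-order type (Int -> String) -> String we mentioned earlier needs explicit grouping on the left, precisely because :-> associates to the right:

-- (Int -> String) -> String
(IntT :-> StringT) :-> StringT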
We’ll be using this representation of types throughout the compiler. Let’s go ahead and explicitly export it as well:
module Types
( Type (..),
TypeVar,
TypeName
)
where
Now, let’s use these types to start working on our syntax tree, back in src/Parser.hs.
Defining new types
The next thing we need to do is add the data structures representing the definitions of new types in our language. There are going to be two ways to define new types:
- Type Synonyms, things like type X = Int
- Data Definitions, things like data Animal = Cat | Dog
In our language, these are both considered a kind of definition that can appear at the top level. They’ll both be variants of a common Definition type.
We’ll be needing the items we just defined in the Types module inside of the parser, so let’s add an import in src/Parser.hs:
...
import Ourlude
import Types (Type (..), TypeName, TypeVar)
First, let’s define type synonyms. Note that we don’t allow polymorphic things like type List2 a = List a, only type ListInt = List Int. Because of this, we can define type synonyms as a simple variant of Definition:
data Definition
= TypeSynonym TypeName Type
A type synonym has a name, which is an upper case name like X, and a type that it’s equal to.
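So, type X = Int would be represented as:

TypeSynonym "X" IntT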
And then, we need to define the structure of data definitions. The basic idea is that we have a name and some list of constructors:
data MyType = VariantA | VariantB | VariantC
We also need the ability to have arguments in each of the constructors:
data MyType = VariantA String | VariantB Int Int | VariantC
We can also make the data type polymorphic, by specifying some type variables next to the name:
data MyType a b = VariantA String | VariantB Int Int | VariantC
Of course, we can use these type variables inside of the variants:
data MyType a b = VariantA String | VariantB Int Int | VariantC a b
And that’s about it for new data types.
Let’s first define a structure for each of the constructors / variants:
type ConstructorName = String
data ConstructorDefinition
= ConstructorDefinition ConstructorName [Type]
deriving (Eq, Show)
A constructor has a given name, like VariantA, or Cons, as well as a list of types, which are the arguments for the constructor. For clarity, we’ve created a separate synonym, ConstructorName, for this kind of name. For example, the VariantB constructor in our MyType example would be represented as:
ConstructorDefinition "VariantB" [IntT, IntT]
Now, we can add an extra variant in Definition for data definitions:
data Definition
= DataDefinition TypeName [TypeVar] [ConstructorDefinition]
| TypeSynonym TypeName Type
deriving (Eq, Show)
A data definition has a name, a list of type variables, and then a list of constructors. Our MyType example becomes:
DataDefinition "MyType" ["a", "b"]
[ ConstructorDefinition "VariantA" [StringT],
ConstructorDefinition "VariantB" [IntT, IntT],
ConstructorDefinition "VariantC" [TVar "a", TVar "b"]
]
in this representation.
And now we have a complete way of representing the syntax for defining new types! We’ve already started getting a feel for the tree like structure we’re using to represent our language, and we’re just getting started!
Expressions
In this section, we’ll be defining most of the different kinds of expressions in our language,
except for let
and where
expressions. These will use definitions, which we’ll define
a bit later.
Literals
The first expressions we have are literals. We have int literals, like 2 and 245, string literals, like "foo" and "bar", and finally, boolean literals, namely True and False. We can define a type Literal representing these:
data Literal
= IntLiteral Int
| StringLiteral String
| BoolLiteral Bool
deriving (Eq, Ord, Show)
We have one type of literal for each of the primitive types in our language.
Let’s also add a type of expression for literals:
data Expr
= LitExpr Literal
deriving (Eq, Show)
As an example, the literal 3 becomes the much more verbose:
LitExpr (IntLiteral 3)
Names
We can also reference names inside of expressions. For example, if we have some value x :: Int, we have x + x as an expression. We can also reference the constructors for different data types, like Nil, Red, etc.
We’ll be representing these names with strings, but for clarity, let’s create a type synonym:
type Name = String
We can then use it to define a kind of expression that just use a name:
data Expr
...
| NameExpr Name
Semantically, the meaning of NameExpr "foo" is whatever expression "foo" is bound to in a given scope. So if we have foo = 3, then NameExpr "foo" should evaluate to LitExpr (IntLiteral 3), using our expression syntax.
If Expressions
Next we have if expressions, things like:
if y then 3 else 4
if 2 + 3 > 4 then 4 else x + x
These are straightforwardly defined as:
data Expr
...
| IfExpr Expr Expr Expr
We can have arbitrary expressions inside of the condition, if-branch, and else-branch.
So, if y then 3 else 4 becomes:
IfExpr
(NameExpr "y")
(LitExpr (IntLiteral 3))
(LitExpr (IntLiteral 4))
Lambda Expressions
The next kind of expression are lambda expressions, allowing us to define anonymous functions. These look like:
\x -> x + 1
\x -> \y -> x + y
Note that in Haskell, you can define multiple names at once in a lambda. For example:
\x y -> x + y
is valid, and syntax sugar for:
\x -> \y -> x + y
The simplifier will deal with this syntax sugar; our parser will accept this verbatim:
data Expr
...
| LambdaExpr [ValName] Expr
We have a list of named parameters, and then an arbitrary expression using them. So \x y -> x becomes:
LambdaExpr ["x", "y"] (NameExpr "x")
This is the first time we use the ValName synonym, which is defined as:

type ValName = String

This will be used in many places for names that exclusively refer to values. So x is a ValName, but something like Nil would rather be typed as ConstructorName.
Function Application
What would our functional language be without function application? In Haskell, function application just takes arguments separated by whitespace:
f 1 2 3
We can also have some arbitrary expression as the function as well:
(f . g) 1 2
This gives us:
data Expr
...
| ApplyExpr Expr [Expr]
Because of currying, f 1 2 is really sugar for (f 1) 2. But, once again, we’ll be removing syntax sugar in the simplifier; the parser needs to accept all of this. So, for now, f 1 2 would be represented as:
ApplyExpr
(NameExpr "f")
[LitExpr (IntLiteral 1), LitExpr (IntLiteral 2)]
Negation
We have a single unary operator in our language: negation. Instead of
having negative numbers, like -3
, this is instead the negation operator
-
applied to the expression 3
. We can also do more complicated things like:
-((\x -> x + 1) 3)
So, we add negation expressions:
data Expr
...
| NegateExpr Expr
This makes the simple -3 into the much more verbose:
NegateExpr (LitExpr (IntLiteral 3))
Binary Operators
With the only unary operator in the language out of the way, let’s add the many binary operators we need. These cover everything from arithmetic like a + b + c, to function composition f . g . h, to comparisons with x > y, etc.
We have:
data Expr
...
| BinExpr BinOp Expr Expr
If we have some complex arithmetic expression, like a + a + b * (x + y), this can still be written as a tree, with each node only having two children:
Or, using the structure we’ve just defined, as:
BinExpr
Add
(BinExpr
Add
(NameExpr "a")
(NameExpr "a")
)
(BinExpr
Mul
(NameExpr "b")
(BinExpr
Add
(NameExpr "x")
(NameExpr "y")
)
)
Of course, we haven’t defined Add or Mul, or any of the other members of BinOp yet. Let’s briefly go over all of the operators:
data BinOp
= ...
deriving (Eq, Show)
The + operator is used to add two numbers:

= Add
The - operator subtracts the right number from the left number:

| Sub
The * operator is for multiplication:

| Mul
The / operator is for division:

| Div
We have ++ for string concatenation:

| Concat
We have the comparison operators <, <=, >, and >=, which become:

| Less
| LessEqual
| Greater
| GreaterEqual
And the comparison operators == and /= become:

| EqualTo
| NotEqualTo
We also have the two boolean operators || and &&, which are:

| Or
| And
Finally, we have the function “operators”, namely $ and .. These are defined as standard functions in normal Haskell, but since we don’t have custom operators, these are done here as well:

| Cash
| Compose
Pattern Matching
The next kind of expression we’re going to deal with are case expressions, things like:
case x of
2 -> x + 2
3 -> x + 3
y -> x + y
or
case x of
Nil -> 0
Cons A (Cons B Nil) -> 1
Cons B Nil -> 2
Cons y Nil -> y
We have some kind of expression that we’re scrutinizing, followed by a series of branches. Each branch has some kind of pattern that we’re matching against, followed by an expression. The value of the case expression is the value of whatever branch happens to match.
So, let’s first define what kind of patterns we have in our language:
data Pattern
= ...
deriving (Eq, Show)
The first kind of pattern is the wildcard: _. This pattern matches against everything, but creates no new name:
data Pattern
= WildcardPattern
The next kind of pattern is similar. We can also have a name as a pattern, like:
case 3 of
x -> x
This is like the wildcard, but also assigns the value of whatever it managed to match to the given name. Here, x would have the value 3 inside of the branch. We have:
data Pattern
...
| NamePattern ValName
We use ValName, since only lower case names like x are valid here.
Another kind of pattern is for literals. We can match against literal ints, like 3, literal strings, like "foo", or literal bools, like True or False:
data Pattern
...
| LiteralPattern Literal
We also want to be able to match against custom types. If we define a custom type like data Color = Red | Blue | Green, we need to be able to do things like:
case color of
Red -> ...
Blue -> ...
Green -> ...
If our data type has constructors with arguments, like data Pair = Pair Int String, we want to be able to match against the arguments as well, giving us constructions like:
case pair of
Pair 3 "foo" -> ...
Pair x "bar" -> ...
Pair _ _ -> ...
With this in mind, we need to add a variant for these constructor patterns:
data Pattern
...
| ConstructorPattern ConstructorName [Pattern]
We have the name of the constructor, followed by a sequence of arbitrarily complex patterns. So, given a data type like data List a = Cons a (List a) | Nil, and a pattern like:

Cons _ (Cons _ (Cons _ Nil))
we would represent this as:
ConstructorPattern
"Cons"
[ WildcardPattern,
ConstructorPattern "Cons"
[ WildcardPattern,
ConstructorPattern "Cons"
[ WildcardPattern,
ConstructorPattern "Nil" []
]
]
]
And now that we’ve defined patterns, we can immediately define case expressions!
data Expr
...
| CaseExpr Expr [(Pattern, Expr)]
We first have the expression we’re scrutinizing, followed by a list of branches. Each branch contains a pattern, and the expression to use when that pattern matches.
Given an expression like:
case scrut of
0 -> 0
1 -> 1
_ -> 2
we would represent this as:
CaseExpr (NameExpr "scrut")
[ (LiteralPattern (IntLiteral 0), LitExpr (IntLiteral 0)),
(LiteralPattern (IntLiteral 1), LitExpr (IntLiteral 1)),
(WildcardPattern, LitExpr (IntLiteral 2))
]
Definitions
Now that we’ve defined expressions as well as the structure of types, we can combine the two, in order to define the structure of value definitions. This is about the structure of defining new values. In Haskell, there are two parts to a value definition:
- The type annotation, like x :: Int
- The name definition itself, like x = 3
The type annotation and name definition don’t have to appear together, and the type annotation isn’t necessary at all. Regardless, our parser will need to represent both.
So, let’s create a structure for both of these:
data Definition
= ValueDefinition ValueDefinition
| DataDefinition TypeName [TypeVar] [ConstructorDefinition]
| TypeSynonym TypeName Type
deriving (Eq, Show)
data ValueDefinition
= ...
deriving (Eq, Show)
We’ve just introduced the notion of a ValueDefinition, which will contain both the type annotations, and the name definitions we’ve just touched upon. We’ve also added this to our concept of Definition. A definition now includes the two constructs related to types, namely data definitions and type synonyms, as well as the definitions for values.
For type annotations, the definition is pretty simple:
data ValueDefinition
= TypeAnnotation ValName Type
A type annotation like x :: Int is represented as the name of the value being annotated, and the type we’re providing after the ::.
For name definitions, our first try might be something like:
NameDefinition ValName Expr
The idea is that something like x = 3 would become:
NameDefinition "x" (LitExpr (IntLiteral 3))
This would work for a large class of Haskell definitions, but Haskell has some more syntax sugar around definitions. The first piece of sugar is that instead of having to write:
id = \x -> x
You can instead move the parameter x to the left side of the =:
id x = x
Of course, you can have multiple parameters as well:
f a b = a + b
So, our structure for name definitions will need to accommodate these named parameters.
Another bit of sugar is that you can pattern match inside of these definitions:
f 0 1 = 1
f 0 _ = 2
f a b = a + b
A function can be defined with multiple “heads”, each of which has patterns for its arguments. Because of this, our full structure for name definitions becomes:
data ValueDefinition
...
| NameDefinition ValName [Pattern] Expr
We have the name used for this value, a list of patterns for each of the arguments, and an expression for the body of this definition.
As an example, this function:
f :: Int -> Int
f 0 = 0
f a = a
Would become 3 separate definitions:
TypeAnnotation "f" (IntT :-> IntT :-> IntT)
NameDefinition "f" [LiteralPattern (IntLiteral 0)]
(LiteralPattern (IntLiteral 0))
NameDefinition "f" [NamePattern "a"] (NameExpr "a")
We have one definition for the type annotation, and 2 more for each of the 2 heads of the function.
Now that we have a structure for definitions, we can add the two missing types of expressions: let and where expressions. These are really two ways to express the same thing: local definitions inside an expression. So:
let x :: Int
x = 2
in x
and
x
where
x :: Int
x = 2
have exactly the same meaning. For clarity, we’ll just create two different types of expression for each of these:
data Expr
...
| LetExpr [ValueDefinition] Expr
| WhereExpr Expr [ValueDefinition]
The structure is slightly different, to reflect the different order of the two variants. let has the definitions before the expression, and where has the definitions after the expression.
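Concretely, the let snippet above becomes:

LetExpr
  [ TypeAnnotation "x" IntT,
    NameDefinition "x" [] (LitExpr (IntLiteral 2))
  ]
  (NameExpr "x")

and the where snippet becomes:

WhereExpr
  (NameExpr "x")
  [ TypeAnnotation "x" IntT,
    NameDefinition "x" [] (LitExpr (IntLiteral 2))
  ]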
Finally, now that we have definitions finished, we can actually define our real syntax tree:
newtype AST = AST [Definition] deriving (Eq, Show)
This just says that a Haskell program consists of a sequence of top level definitions. So, something like:
data Color = Red | Blue | Green
type C = Color
x :: C
x = Red
would be a series of 4 definitions, represented as:
AST
  [ DataDefinition "Color" []
      [ ConstructorDefinition "Red" [],
        ConstructorDefinition "Blue" [],
        ConstructorDefinition "Green" []
      ],
    TypeSynonym "C" (CustomType "Color" []),
    ValueDefinition (TypeAnnotation "x" (CustomType "C" [])),
    ValueDefinition (NameDefinition "x" [] (NameExpr "Red"))
  ]
Whew! We now have a complete AST representing our subset of Haskell! Let’s go ahead and add the necessary data types that our parser needs to export:
module Parser
( AST (..),
ValName,
ConstructorName,
Name,
Definition (..),
ConstructorDefinition (..),
ValueDefinition (..),
Expr (..),
Literal (..),
BinOp (..),
Pattern (..),
parser,
)
where
The next step in this post is to write a parser for our AST, using the little combinator framework we created earlier.
Parsing in Practice
So, now it’s time to use all of the little tools we’ve made in order to build up a parser for our AST. We’re going to be building up a parser in a bottom up fashion, starting with simple parsers for simple parts of the language, and combining them piece by piece to make up a parser for the whole language.
Names and Literals
We’re going to be starting with the simplest forms of expression: names and literals. There were two types of names in the lexer: LowerName, for names starting with a lower case letter, and UpperName, for names starting with an upper case letter. We can make a parser for LowerName pretty easily:
lowerName :: Parser ValName
lowerName =
pluck <| \case
LowerName n -> Just n
_ -> Nothing
Here we use the pluck :: (Token -> Maybe a) -> Parser a function we defined previously. pluck lets us match against a certain type of token, and also extract some information about that token. This is useful here, since we want to extract the string contained in a LowerName token.
We can use the same trick for UpperName:
upperName :: Parser TypeName
upperName =
pluck <| \case
UpperName n -> Just n
_ -> Nothing
The only difference here is that we’re extracting from UpperName instead of LowerName.
We’ve created a few different synonyms for String, to clarify the different kinds of names we use throughout our AST: things like TypeVar, TypeName, ValName, etc. We can immediately put the two parsers we’ve just defined to use, by creating a parser for each of these syntactic classes:
typeVar :: Parser TypeVar
typeVar = lowerName
typeName :: Parser TypeName
typeName = upperName
valName :: Parser ValName
valName = lowerName
constructorName :: Parser ConstructorName
constructorName = upperName
Each of these kinds of name is either a lower case name, or an upper case name. In some contexts, for example function application, we might need to accept both, which is why we created the Name synonym. We can add a parser for that as well:
name :: Parser Name
name = valName <|> constructorName
Here we use <|> for the first time, to indicate that name is either a valName, or a constructorName, i.e. that we should accept either the LowerName token, or the UpperName token here.
The next thing we can add in this section is literals. We want to be able to parse ints, strings and bools, which make up the built-in values in our subset of Haskell:
literal :: Parser Literal
literal = intLit <|> stringLit <|> boolLit
where
Once again, we use the handy <|> operator, to say that a literal is either an integer literal, a string literal, or a bool literal. The lexer has already done the job of creating a token for each of these, which makes defining these alternatives very easy:
where
intLit =
pluck <| \case
IntLit i -> Just (IntLiteral i)
_ -> Nothing
stringLit =
pluck <| \case
StringLit s -> Just (StringLiteral s)
_ -> Nothing
boolLit =
pluck <| \case
BoolLit b -> Just (BoolLiteral b)
_ -> Nothing
We use the very handy pluck once again, in order to match against the right kind of literal token, and pick out the data it contains. We then make sure to wrap this data in the right kind of literal for our AST. Int literals become IntLiteral i, strings StringLiteral s, and bools BoolLiteral b, as we went over before.
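As a quick sanity check, we’d expect:

runParser literal [BoolLit True]
-- [(BoolLiteral True, [])]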
Parsing Operators
The next thing we’ll add support for is parsing operators. This requires a bit of thought before we get to writing out our parser though.
Operator Precedence
A first approach towards operators would be to treat them as a kind of punctuation. If you have something like:
x + y - a + 3
The names and literals are like words composing this arithmetic sentence, and the arithmetic operators are like the punctuation between the words. A first approach would start with some parser for these bottom level words. Then you’d use this parser, separated by the arithmetic tokens:
One problem with this approach is precedence. For example, multiplication with * has higher precedence than addition with +. This means that an expression like:
1 + 2 * 3 + 3 * 5
should parse as:
(1 + (2 * 3)) + (3 * 5)
and not:
((((1 + 2) * 3) + 3) * 5)
We would get the latter if we parsed addition and multiplication with the same precedence. One way to fix this is to introduce multiple layers, with one layer for each level of precedence:
Addition would parse multiplication, separated by +, and multiplication would parse literal factors, separated by *. This paradigm allows us to skip levels too. A single item like 3 falls under the “at least one number separated by *” category: we just have the single number.
Function application falls under this paradigm as well. We can treat function application as a very high precedence kind of operator. Instead of parsing “at least one <sub-factor>, separated by <operator>”, we just parse “at least two <sub-factor>s”. (This makes sure that only f x is considered as a function application, since f alone would just be a name.)
With multiplication, addition, and function application, our tower of operators now looks like this:
One aspect is missing though: parentheses! In languages with arithmetic expressions like these, you can manually insert parentheses to override the grouping rules.
For example, this expression:
(1 + 2) * 3
is not the same thing as:
1 + 2 * 3
precisely because the parentheses have been inserted.
There’s a simple rule to make this work. Instead of having the bottom of our tower only have dead-end rules like literals, or simple names, we can instead add a rule that goes all the way to the top, parsing any expression surrounded with parentheses:
Parsing some expression like f (1 + 2) will work, because we end up with a transition like this:
This is also useful beyond arithmetic, since we wanted to allow arbitrary expressions inside of parentheses anyways. As bizarre as it may seem, something like:
function (
case x of
1 -> 2
_ -> 3
)
is a valid bit of Haskell. This precedence looping trick will allow us to parse this kind of thing once we expand our expressions to accept more than just operators.
Using this approach, a full precedence tower including all of the operators in our language looks like this:
(See this link for a good overview of the precedence of different operators).
With that in mind, we’ve done enough thinking to start implementing this precedence tower.
Operator Utilities
The parsing for each layer of operator will be very similar, so we can go ahead and define a handful of utilities we’ll be using to build them up.
The first utility is to take a parser for <item> and turn it into a parser for ( <item> ). This is useful for implementing the looping in our tower of precedence. This utility just requires matching the ( and ) tokens before and after the original parser:
parensed :: Parser a -> Parser a
parensed p = token OpenParens *> p <* token CloseParens
Here we use the convenient *> and <* operators:

(*>) :: Applicative f => f a -> f b -> f b
(<*) :: Applicative f => f a -> f b -> f a

These sequence their arguments in the same order as <*>, but pick a different result. For our parensed parser, this does the right thing, parsing the ( before our original parser, and the ) after, but leaving us with just the original result.
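For example, using the valName parser from earlier, we’d expect:

runParser (parensed valName) [OpenParens, LowerName "x", CloseParens]
-- [("x", [])]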
When thinking about parsing operators, we had the nice idea of doing something like “parse at least one factor, separated by an operator”. We can make a utility that does this:
sepBy1 :: Parser a -> Parser sep -> Parser [a]
Given a parser for a, and a parser for sep, this creates a parser that parses one or more occurrences of a, separated by sep. So, something like sepBy1 literal (token Semicolon) would parse things like 3;1;"foo".
The implementation makes straightforward use of a few utilities for Alternative:
sepBy1 :: Parser a -> Parser sep -> Parser [a]
sepBy1 p sep = liftA2 (:) p (many (sep *> p))
Semantically, we first parse p, and then zero or more occurrences of sep followed by p. So:

<p>
<p> <sep> <p>
<p> <sep> <p> <sep> <p>

are all going to be accepted by this parser. many :: Alternative f => f a -> f [a] is provided to us, since parsers implement the Alternative class. We use sep *> p each time, since we just want the item parsed, and not the separator. Finally, we use liftA2 (:) to cons the first item parsed with the many items produced by the rest of the parser.
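Using the literal parser from earlier, the 3;1;"foo" example works out like this:

runParser (sepBy1 literal (token Semicolon))
  [IntLit 3, Semicolon, IntLit 1, Semicolon, StringLit "foo"]
-- includes ([IntLiteral 3, IntLiteral 1, StringLiteral "foo"], [])
-- among its results (shorter prefixes appear too, with leftover input)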
Now, this would work for something simple like multiplication, since we could parse out many multiplication factors, separated by *. One slight problem is that some operators have the same precedence, for example + and -. We need a modified version of sepBy1 where we remember the separators we see. In fact, it would be even more useful if, given something like:
1 + 2 - 3 + 4
our parser could recognize this as:
1 (+, 2) (-, 3) (+, 4)
and then build up the correct parse tree, using each operator:
((1 + 2) - 3) + 4
This inspires the following type for parsing operators:

opsL :: Parser (a -> a -> a) -> Parser a -> Parser a

The first parser is used for separators, and the second parser is used for the factors between separators. Semantically, this acts like sepBy1, with the arguments reversed. The key difference is that our separator parser produces a function, used to combine the items to its left and right. For example, for parsing addition and subtraction, we could use this as our separator:
(BinExpr Add <$ token Plus) <|> (BinExpr Sub <$ token Dash)
If we see a +, we combine things as an addition expression, and if we see a -, then we combine things with a subtraction expression.
The implementation is similar to sepBy1:
opsL :: Parser (a -> a -> a) -> Parser a -> Parser a
opsL sep p = liftA2 squash p (many (liftA2 (,) sep p))
where
squash = foldl' (\acc (combine, a) -> combine acc a)
Instead of discarding the separators, we now keep them alongside the tokens:
Then, we fold from the left, using the combinator that’s now attached to each token we encounter:
In practice, this works out nicely. Given some arithmetic expression like:
a + b + c - d
this groups things in the way we’d expect:
((a + b) + c) - d
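If it helps, here’s the fold inside squash pulled out on its own, applied to plain arithmetic functions instead of parse results. squashL is a name invented just for this illustration, and foldl' is already imported at the top of the module:

squashL :: a -> [(a -> a -> a, a)] -> a
squashL = foldl' (\acc (combine, a) -> combine acc a)

-- squashL 1 [((+), 2), ((-), 3), ((+), 4)]
-- == ((1 + 2) - 3) + 4
-- == 4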
One problem is that this only works for left associative operators, since we have our large expression on the left, and add items as we scan towards the right.
Some operators are not left associative though! For example, the function arrow -> between types associates to the right. This snippet:
Int -> String -> Int
means:
Int -> (String -> Int)
and not:
(Int -> String) -> Int
We need to create an opsR function, that does the same thing as opsL, but associates to the right instead:
opsR :: Parser (a -> a -> a) -> Parser a -> Parser a
The idea is that given some sequence, like:

A -> B -> C -> D

we first group the items and the separators, like we did in opsL.
Next, let’s flip this sequence around, before moving the separator to the right one step:
Now we also need to flip the operator, to make sure that the arguments are passed in the right order.
And then our scanning approach from last time works:
With this in mind, here’s the implementation of opsR:
opsR :: Parser (a -> a -> a) -> Parser a -> Parser a
opsR sep p = liftA2 squash p (many (liftA2 (,) sep p))
where
shift (oldStart, stack) (combine, a) =
(a, (combine, oldStart) : stack)
squash start annotated =
let (start', annotated') = foldl' shift (start, []) annotated
in foldl' (\acc (combine, a) -> combine a acc) start' annotated'
Our starting point is the same as with opsL: we’ve parsed our first item, and the following items along with their separators. From there, we do a first scan, where we want to end up with each separator attached to the item on its left, and the last item in our starting list alone. We do this with foldl', keeping track of the last item we’ve seen, and the tail of items along with their new separators. When we see a new (sep, a) pair, a becomes our new last item, and the previous item is paired up with sep, and put onto the tail.
Finally, we can scan over this list as if we were handling a left associative operator, with the caveat that we swap the arguments to combine. This is because we need to apply flip combine in practice, and not just combine, to accomplish the “operator flipping” we went over in the previous diagram. In practice, this means that if we have:

A -> B

which we’ve separated into:

(B, [(->, A)])

the -> function is still expecting the accumulator on its right, to make A -> B.
This took me a while to get a hold of as well, but thinking through the illustrations really helps make sense of how this operation works.
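If the diagrams aren’t enough, here’s the same computation traced with subtraction standing in for ->, since subtraction is also visibly not associative (plain arithmetic, no parsers involved):

-- parsing 1 - 2 - 3 right associatively, i.e. 1 - (2 - 3):
--   we start with (1, [((-), 2), ((-), 3)])
--   the shift fold leaves us with the last item, 3,
--     and the stack [((-), 2), ((-), 1)]
--   folding over the stack with flipped arguments:
--     (-) 2 3    == -1
--     (-) 1 (-1) == 2
-- so we get 1 - (2 - 3) == 2, and not ((1 - 2) - 3) == -4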
Operators in Practice
At the top of our precedence hierarchy, we have expressions:
expr :: Parser Expr
expr = binExpr
For now, all the expressions we have are binary expressions. This is where we’ll have all of our operator parsing.
We have:
binExpr :: Parser Expr
binExpr = cashExpr
where
cashExpr = opsR (BinExpr Cash <$ token Dollar) orExpr
The lowest precedence operator is at the top: the ubiquitous $, of course.
Next we have ||:
orExpr = opsR (BinExpr Or <$ token VBarVBar) andExpr
And then we have &&:
andExpr =
opsR (BinExpr And <$ token AmpersandAmpersand) comparisonExpr
Next we have comparisons, i.e. <, <=, >, >=, ==, and /=:
comparisonExpr = opsL comparisonOperator concatExpr
where
comparisonOperator =
(BinExpr Less <$ token LeftAngle)
<|> (BinExpr LessEqual <$ token LeftAngleEqual)
<|> (BinExpr Greater <$ token RightAngle)
<|> (BinExpr GreaterEqual <$ token RightAngleEqual)
<|> (BinExpr EqualTo <$ token EqualEqual)
<|> (BinExpr NotEqualTo <$ token FSlashEqual)
Here we have different possibilities, since all of these operators have the same precedence. For each of the possible tokens, we make sure to associate the correct binary expression. Each operator parser always yields exactly the combining function that it represents, thanks to our use of <$.
Continuing on, we have ++, for string concatenation:
concatExpr = opsL (BinExpr Concat <$ token PlusPlus) addSubExpr
Followed by addition and subtraction, with + and -:
addSubExpr = opsL addSubOperator mulDivExpr
where
addSubOperator =
(BinExpr Add <$ token Plus)
<|> (BinExpr Sub <$ token Dash)
Multiplication and division, with * and /, are a bit stronger:
mulDivExpr = opsL mulDivOperator composeExpr
where
mulDivOperator =
(BinExpr Mul <$ token Asterisk)
<|> (BinExpr Div <$ token FSlash)
And finally, the highest precedence binary operator is function composition, with .:
composeExpr = opsR (BinExpr Compose <$ token Dot) unaryExpr
Next come unary operators, of which we only have -x, for integer negation. Note that -(+1) 3 is valid Haskell, and evaluates to -4. This still looks a bit weird to me, but is pretty simple to handle:
unaryExpr :: Parser Expr
unaryExpr = negateExpr <|> appExpr
where
negateExpr = (token Dash *> appExpr) |> fmap NegateExpr
To parse a unary expression, we either find a - followed by a function application expression, or we just skip the - entirely.
appExpr is going to parse something like f a b c, and decide whether or not it’s a function application, based on how many things it sees:
appExpr :: Parser Expr
appExpr = some factor |> fmap extract
where
extract [] = error "appExpr: No elements produced after some"
extract [e] = e
extract (e : es) = ApplyExpr e es
If we see a single expression, then we leave it alone. Something like 3 by itself just means 3. If we see f 3, i.e. two or more items, then the first of those items is a function, and the rest are its arguments.
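For example, f 3 4 would come out as something like:
ApplyExpr (NameExpr "f") [LitExpr (IntLiteral 3), LitExpr (IntLiteral 4)]
while a lone 3 just stays LitExpr (IntLiteral 3).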
And finally, we have factor, which is at the bottom of our precedence hierarchy:
factor :: Parser Expr
factor = littExpr <|> nameExpr <|> parensed expr
where
littExpr = fmap LitExpr literal
nameExpr = fmap NameExpr name
A factor is either a simple literal, a name, or it pulls off the beautiful trick we went over earlier: it expects some parentheses, and jumps all the way back to the top of the expression hierarchy!
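As a reminder, parensed just surrounds a parser with the parenthesis tokens; it presumably looks something like this (modulo the exact token names):
parensed :: Parser a -> Parser a
parensed p = token OpenParens *> p <* token CloseParens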
Other Expressions
The next thing we're going to tackle is parsing expressions that don't contain definitions, i.e. if, case, and lambdas.
If expressions and lambdas are pretty simple, so let’s get them out of the way:
expr :: Parser Expr
expr = binExpr <|> ifExpr <|> lambdaExpr
where
For if, it's pretty much what it says on the tin:
ifExpr =
IfExpr
<$> (token If *> expr)
<*> (token Then *> expr)
<*> (token Else *> expr)
We parse the condition after an if token, the first branch after a then token, and the other branch after an else token. We use the standard <$> and <*> operators to plumb these results into the IfExpr constructor, which takes these three operands.
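So something like if b then 1 else 2 would parse to:
IfExpr (NameExpr "b") (LitExpr (IntLiteral 1)) (LitExpr (IntLiteral 2))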
For lambdas, it’s also not that complicated. A lambda like:
\x y -> x + y
is just the \
token, some names, a ->
token, and then an expression. In code,
we get:
lambdaExpr = token BSlash *>
liftA2 LambdaExpr (some valName) (token ThinArrow *> expr)
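Our example \x y -> x + y would then parse to something like:
LambdaExpr ["x", "y"] (BinExpr Add (NameExpr "x") (NameExpr "y"))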
For case expressions, we first need a way to parse patterns. Parsing patterns is kind of similar to parsing operators. We can have a constructor with simple arguments:
C 1 2 3 4
but in order to use more complicated patterns as arguments, we need parentheses:
C (A 2) 2 3 B
Notice how the B constructor takes no arguments, and can appear as a simple argument here, without any parentheses.
Our definition for a single pattern looks like this:
onePattern :: Parser Pattern
onePattern = unspacedPattern <|> argfulPattern
where
argfulPattern =
liftA2 ConstructorPattern constructorName (some unspacedPattern)
A pattern is either a constructor taking some arguments, or a single argument pattern. unspacedPattern will parse any kind of pattern that can appear as an argument to a constructor, perhaps requiring parentheses:
unspacedPattern :: Parser Pattern
unspacedPattern = simplePattern <|> parensed onePattern
where
So, an unspacedPattern is either an entire pattern, wrapped in parentheses, or a simple pattern, which can appear as an argument without any parentheses:
simplePattern =
singleConstructor
<|> wildCardPattern
<|> varPattern
<|> litPattern
singleConstructor = fmap (`ConstructorPattern` []) constructorName
wildCardPattern = WildcardPattern <$ token Underscore
varPattern = fmap NamePattern valName
litPattern = fmap LiteralPattern literal
A pattern like this is either a constructor with no arguments, like B in the example before, a wildcard pattern _, a single named pattern, like x, or a literal, like 3 or "foo".
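For instance, a pattern like C (A 2) _ x would come out as:
ConstructorPattern "C" [ConstructorPattern "A" [LiteralPattern (IntLiteral 2)], WildcardPattern, NamePattern "x"]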
With patterns defined, we can start working on case expressions. Note that from the point of view of the parser, case expressions use braces and semicolons:
case x of {
1 -> 2;
_ -> 3
}
This kind of situation will appear with definitions as well, so we can go ahead and make a utility for parsing braces and semicolons:
braced :: Parser a -> Parser [a]
braced p =
token OpenBrace *> sepBy1 p (token Semicolon) <* token CloseBrace
braced turns a parser for a single item into a parser for one or more items, separated by semicolons, all inside of braces.
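The sepBy1 combinator parses one or more occurrences of a parser, separated by some separator; if we didn't already have it on hand, the classic definition would be something like:
sepBy1 :: Parser a -> Parser sep -> Parser [a]
sepBy1 p sep = liftA2 (:) p (many (sep *> p))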
We can then immediately use braced to create a parser for case expressions:
caseExpr :: Parser Expr
caseExpr =
liftA2 CaseExpr (token Case *> expr <* token Of) (braced patternDef)
where
patternDef = liftA2 (,) onePattern (token ThinArrow *> expr)
We have the case <scrutinee> of part, followed by the patterns, which are in braces and semicolons. We then have the <pat> -> <branch> part: each branch is just the onePattern we defined before, followed by the -> token, and the expression for that branch.
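The earlier example would then parse to something like:
CaseExpr (NameExpr "x")
  [ (LiteralPattern (IntLiteral 1), LitExpr (IntLiteral 2))
  , (WildcardPattern, LitExpr (IntLiteral 3))
  ]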
We could add this to the definition of expr, but we're going to be adding a few more expressions shortly, so we can just include it when we get around to those.
Types
The remaining expressions are those related to definitions: things like x = 2, or x :: Int. We can parse the expression part of these value definitions just fine, but we haven't added any parsing for types yet! We'll take care of this now, before moving on to definitions.
Parsing types is going to be very similar to parsing patterns. We have some types taking in arguments, like Pair Int String, or List (List Int). We need to distinguish the types that are basic enough to appear as arguments without parentheses, like we did with patterns. We'll also need to add in a bit of operator parsing, to handle function types like Int -> String -> Int.
Our goal is to define:
typeExpr :: Parser Type
which parses an arbitrary type.
Starting from the bottom, we have:
singleType :: Parser Type
singleType = fmap TVar typeVar <|> primType <|> parensed typeExpr
where
primType =
(IntT <$ token IntTypeName)
<|> (StringT <$ token StringTypeName)
<|> (BoolT <$ token BoolTypeName)
The singleType parser accepts a type variable, a primitive type that isn't any kind of custom type, or an arbitrary type, so long as it's between parentheses.
We can then use this to define a full parser for type expressions:
typeExpr :: Parser Type
typeExpr = opsR ((:->) <$ token ThinArrow) baseType
where
baseType = singleType <|> customType
customType = liftA2 CustomType typeName (many typeArgument)
typeArgument :: Parser Type
typeArgument = namedType <|> singleType
where
namedType = fmap (`CustomType` []) typeName
We have different type “factors”, separated by the function arrow ->. This becomes the constructor :->, representing the same thing in our syntax tree. Of course, we make sure that we parse this operator in a right associative way, so that:
Int -> String -> Int
is parsed as:
IntT :-> (StringT :-> IntT)
The base type that can appear between function arrows is either the single type we defined earlier, or a custom type, which we excluded from the definition of singleType. A custom type is parsed as some type name, followed by zero or more arguments.
For the arguments, we do something interesting. We accept singleType, as expected, but we also accept namedType, which is just a single custom type, without any arguments. This allows us to have a type like T as an argument, giving us something like Pair T T, without needing to wrap T in parentheses.
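As a concrete example, Pair T (List Int) would parse to something like:
CustomType "Pair" [CustomType "T" [], CustomType "List" [IntT]]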
Value Definitions
With that in place, we now have everything we need to parse value definitions. As a reminder, these look like:
x :: Int
x = 3
We have either a type annotation, or a name definition. The parser for this is pretty straightforward:
valueDefinition :: Parser ValueDefinition
valueDefinition = nameDefinition <|> typeAnnotation
where
nameDefinition =
NameDefinition
<$> valName
<*> many unspacedPattern
<*> (token Equal *> expr)
typeAnnotation =
liftA2 TypeAnnotation valName (token DoubleColon *> typeExpr)
A valueDefinition is either a nameDefinition, or a typeAnnotation. A typeAnnotation is a name, followed by a :: and some type. A nameDefinition is also a name, but this time followed by multiple patterns as arguments, the = token, and then an expression. We allow multiple patterns, since Haskell provides the syntactic sugar:
f a b = a + b
for:
f = \a b -> a + b
These are unspacedPattern, since if we need complex patterns, we need them to appear with parentheses. Something like:
f Pair a b =
will be parsed as:
f (Pair) (a) (b) =
instead of:
f (Pair a b) =
which requires explicit parentheses.
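So, with the parentheses in place, f (Pair a b) = a would give us something like:
NameDefinition "f" [ConstructorPattern "Pair" [NamePattern "a", NamePattern "b"]] (NameExpr "a")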
Remaining Expressions
With these value definitions in place, we're now ready to finish defining all of our expressions. The remaining expressions we need to add to expr are where and let expressions:
expr :: Parser Expr
expr = notWhereExpr <|> whereExpr
where
notWhereExpr =
letExpr <|> ifExpr <|> lambdaExpr <|> binExpr <|> caseExpr
ifExpr =
IfExpr
<$> (token If *> expr)
<*> (token Then *> expr)
<*> (token Else *> expr)
lambdaExpr = token BSlash *>
liftA2 LambdaExpr (some valName) (token ThinArrow *> expr)
letExpr = liftA2 LetExpr
(token Let *> braced valueDefinition) (token In *> expr)
whereExpr = liftA2 WhereExpr
notWhereExpr (token Where *> braced valueDefinition)
It should be clear that notWhereExpr <|> whereExpr includes every kind of expression, unless you're a constructive mathematician.
For let expressions, we just parse the let token, followed by some value definitions in braces and semicolons, the in token, and then an expression.
Remember that let introduces a layout, so:
let
x :: Int
x = 2
in x
is shorthand for:
let {
x :: Int;
x = 2
} in x
Now, for where expressions, it's tempting to do something like:
expr *> token Where *> braced valueDefinition
but the problem is that this parser recurses infinitely. This is an issue we haven't touched upon yet, called left recursion. It arises for top-down parsing methods like ours, and for parser combinators in particular, even in a lazy language like Haskell.
The fundamental problem is that if we have a rule like this, then to parse expr, we try parsing expr, which means that we try to parse expr, and so we see if we can parse expr, and we give a shot at parsing expr, and then, out of curiosity, see if maybe parsing expr might work, etc. We aren't able to make any progress, because we have no concrete rules to latch on to.
The solution is to not have expr on the left here, but notWhereExpr instead. This way, if our parser starts “exploring” this path, it can't immediately loop back onto itself, but instead has to parse a different rule that will allow it to make progress. This is why we define notWhereExpr to include everything except whereExpr: to prevent this kind of immediate loop, with no tokens in between.
Anyhow, with this problem resolved, we have now finished parsing all of the expressions in our language. The end is in sight!
Other Definitions
The only items left to parse are custom type definitions and type synonyms, which make up the remaining kinds of definition in our language.
For type definitions, we’ll need a parser for each of the constructors that make up definitions like:
data VariantA = A Int String | B String Int
This gives us:
constructorDefinition :: Parser ConstructorDefinition
constructorDefinition =
liftA2 ConstructorDefinition constructorName (many typeArgument)
We can reuse the typeArgument we defined earlier, which was used for the simple types that appear as arguments to a custom type, like Pair Int (List String). Of course, if we have a constructor:
PairConstructor Int (List String)
then it's natural to reuse the same parsing scheme for its arguments. A constructor doesn't necessarily have any arguments, hence the many, instead of some.
Finally, we can use this to parse the remaining definitions:
definition :: Parser Definition
definition = valueDefinition' <|> dataDefinition <|> typeSynonym
where
valueDefinition' = fmap ValueDefinition valueDefinition
dataDefinition =
DataDefinition
<$> (token Data *> typeName)
<*> many typeVar
<*> (token Equal *> sepBy1 constructorDefinition (token VBar))
typeSynonym = liftA2 TypeSynonym
(token Type *> typeName) (token Equal *> typeExpr)
A definition is either a value definition, a data definition, or a type synonym.
Type synonyms are pretty simple. We have a type token, the name of the synonym, and then =, followed by the type we're making an alias of.
For data definitions, we first have the data token, followed by the name of the new custom type, and any type variables it might have. We then have an = token, followed by its constructors. There must be one or more constructors, each of them separated by a | token.
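For example, data Pair a b = MkPair a b would parse to something like:
DataDefinition "Pair" ["a", "b"] [ConstructorDefinition "MkPair" [TVar "a", TVar "b"]]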
And now we can parse all of the definitions in our language. We just have a bit of plumbing left to do to connect all of this together.
Gluing it all together
A Haskell program consists of a series of top level definitions. The whole file is actually in a braced environment, which means that we have the following parser for the entire AST:
ast :: Parser AST
ast = fmap AST (braced definition)
These definitions include value definitions, which define values by assigning them expressions or annotating their types, as well as definitions of new types, through type synonyms and data definitions.
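As a reminder, the AST type itself is just a wrapper around this list of definitions, presumably along these lines:
data AST = AST [Definition] deriving (Show)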
We also need to amend our ParseError type to be a bit more informative:
data ParseError
= FailedParse
| AmbiguousParse [(AST, [Token])]
deriving (Show)
While still not as informative as you might like, we now have two kinds of errors. The first occurs if our parser found no way to parse the tokens given to us, and the second occurs if multiple possible ASTs have been produced after going through all of the tokens.
I’ve mentioned ambiguity being a concern when designing some aspects of the parser, and avoiding this kind of error is the reasoning behind some of the design.
With these errors in mind, we can amend our parser function to use the ast parser we just defined:
parser :: [Token] -> Either ParseError AST
parser input = case runParser ast input of
[] -> Left FailedParse
[(res, _)] -> Right res
tooMany -> Left (AmbiguousParse tooMany)
We use runParser :: Parser a -> [Token] -> [(a, [Token])] to get a list of parse results, after running the parser for our entire syntax tree on the input tokens. If we have no possible parses, then that's the first error we defined above. If we have too many parses, that's the ambiguous error. Finally, if we have just a single result, then that's the AST that we want our parser to return.
And with that, we’ve finished our parser! It’s time to pat ourselves on the back, and try it out with a few examples.
Examples
If we run
cabal run haskell-in-haskell -- parse file.hs
we’ll be able to see the parse tree for a given source program.
For something simple like:
x :: Int
x = 2
we get:
AST
[ ValueDefinition
( TypeAnnotation "x" IntT )
, ValueDefinition
( NameDefinition "x" []
( LitExpr
( IntLiteral 2 )
)
)
]
which is quite a bit more verbose, but matches what we expect. We see the definition of x pop up, first as a type annotation, and then once more, with the actual expression.
We can handle more complex arithmetic, of course:
x :: Int
x = 1 + 2 * 3
This program gives us:
AST
[ ValueDefinition
( TypeAnnotation "x" IntT )
, ValueDefinition
( NameDefinition "x" []
( BinExpr Add
( LitExpr
( IntLiteral 1 )
)
( BinExpr Mul
( LitExpr
( IntLiteral 2 )
)
( LitExpr
( IntLiteral 3 )
)
)
)
)
]
We can also do let expressions:
x :: Int
x =
let y = 2
in y
AST
[ ValueDefinition
( TypeAnnotation "x" IntT )
, ValueDefinition
( NameDefinition "x" []
( LetExpr
[ NameDefinition "y" []
( LitExpr
( IntLiteral 2 )
)
]
( NameExpr "y" )
)
)
]
This works for even more complicated programs, of course, but I'll spare you the very large syntax trees.
As a final example, let’s do some basic pattern matching with a custom type:
data Color = Red | Blue | Green
f Red = 0
f Blue = 1
f Green = 2
AST
[ DataDefinition "Color" []
[ ConstructorDefinition "Red" []
, ConstructorDefinition "Blue" []
, ConstructorDefinition "Green" []
]
, ValueDefinition
( NameDefinition "f"
[ ConstructorPattern "Red" [] ]
( LitExpr
( IntLiteral 0 )
)
)
, ValueDefinition
( NameDefinition "f"
[ ConstructorPattern "Blue" [] ]
( LitExpr
( IntLiteral 1 )
)
)
, ValueDefinition
( NameDefinition "f"
[ ConstructorPattern "Green" [] ]
( LitExpr
( IntLiteral 2 )
)
)
]
We first have the definition for this custom type, with its three different constructors. We then have three separate definitions for the function f, since it has three different heads. Each of those heads has a pattern it matches against, which we see in the NameDefinition node as well.
Anyways, it’s a lot of fun to play around with this for different programs, and I’ve gotten distracted playing around with it again, even through I wrote this parser months before I got around to writing this post.
Conclusion
Hopefully this was an interesting post, as we get deeper and deeper into the guts of the compiler. The little sprinkling of theory should be enough to make the implementation understandable, although this post did end up being longer than I expected it to be.
The full code for this part, and the entire compiler are available, as a reference.
As usual, Crafting Interpreters has a great chapter on parsing, which might be a good read if you’re still not sure how exactly parsers work, even if the combinator approach works intuitively.
If you want a meatier text that goes into different theoretical methods of parsing, including thornier issues like left recursion, I like the book Engineering a Compiler. The third chapter of this book is about parsing, and covers all of this theory in great detail.
In the next post, we'll be going over the simplifier, to prepare our syntax tree for consumption by our type checker. The simplifier will have some drudge work to do, like resolving information about the type synonyms used in our program, and the signatures of each constructor. Another super interesting part is removing all of the nesting from pattern matching, which makes the rest of the compiler much simpler.
Anyways, I’m getting ahead of myself, but that’s a bit of a preview for the next part in this series. At the rate it takes me to write these, this’ll most likely be coming around next year.
So, I wish you all a (belated) Merry Christmas, and a Happy New Year!