(Haskell in Haskell) 2. Lexing
This is the second post in the Haskell in Haskell series.
In this post, we’ll go over creating a lexer for our subset of Haskell. This stage of the compiler goes from the raw characters in a source file to a more structured stream of tokens.
We’ll finally be digging into some serious code this time, so get your editor ready if you’re following along!
The full code for this part is available here, for reference.
1 mile overview
Before we dive into the code, let’s first get an idea of what this part is about. What job is the lexing phase trying to accomplish? What is it responsible for? What kind of output will we produce?
Making Tokens
The first goal of the lexer is to take our raw source code, a blob of characters (String, in Haskell), and figure out how to group these characters to make the Parser’s job easier.
As a reminder, the Parser is going to take our tokens, and then produce a complete representation of the source code. The Parser could work directly on characters, but this would be more complicated. We would be worrying about character level concerns, at the same time as higher level structure. Having a separate lexing phase gives us a nice separation of concerns here.
The Parser will need to use context to do its job. For example, when parsing an expression, it knows not to try and read out things related to types. On the other hand, the lexer (at least the part that spits out tokens) can be made to work without context.
The lexer just reads in characters, and spits out tokens. After spitting out a token, the lexer is in the exact same “state” that it started with. The lexer has minimal knowledge about context.
What tokens?
So far we’ve talked about “tokens”, or “clusters”. A good analogy is that our source code is like a sentence in English: the sentence is composed of raw characters, ditto for source code. The words in the sentence make up the tokens in our source code. And finally, the parse tree gives us the meaning of the sentence, or the source code.
So, we want to split up our source code into the tokens that compose it. What kind of tokens will we end up using?
As an example, consider this simple definition:
myDefinition = 232 + y2
In terms of raw characters, we have:
m y D e f i n i t i o n = 2 3 2 + y 2
In terms of tokens, we’d like to end up with:
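Something like the following, to borrow the constructor names of the Token type we’ll define later in this post:

LowerName "myDefinition"
Equal
IntLit 232
Plus
LowerName "y2"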
Here, we’ve first separated out an identifier, that must be taken as a whole. It makes no sense to process myDefinition as anything but a single unit in the parser. We also have two operators, = and +. We also have a literal, 232. It doesn’t make sense to treat a literal as anything but a unit.
We’ll be going over the details of what tokens we’d like to define, and how exactly we’re going to go about lexing them out of our source code later. But keep this idea in mind: we’re trying to separate out our source code into little units, all to make our Parser much simpler.
Generating Whitespace
Our lexer will also have a job much more specific to Haskell. Haskell has one critical feature that makes parsing it a bit tricky: whitespace.
We’re quite used to the whitespace sensitivity as Haskell programmers. Most programming languages nowadays don’t even require semicolons, even if they do still use braces for layout. On the other hand, most people don’t know how Haskell actually uses braces and semicolons under the hood.
For example
let { x = 2; y = 3 } in x + y
Is a perfectly valid expression. However, most people would write this as:
let x = 2
    y = 3
in x + y
and they may not even be aware that you could use the braces and semicolons instead. In fact, not only is the braced version an option, but the whitespaced version is nothing more than syntax sugar for this way of doing things.
When Haskell sees:
let x = 2
    y = 3
in x + y
It understands that you really mean:
let {
  x = 2;
  y = 3
}
in x + y
and infers those pesky braces and semicolons accordingly.
In fact, when you have multiple definitions at the top level, like:
x = 2
y = 3
z = 4
This is also sugar for a braced version:
{
  x = 2;
  y = 3;
  z = 4
}
What we’re going to be doing is adding an extra mini-stage that looks at the whitespace in our program, and inserts the correct semicolons and braces to reflect the meaning of the indentation.
We could do this as part of the parser, but this would make the parser much more complicated than it otherwise needs to be. Doing this as an extra step between the more traditional lexer and the parser makes things much cleaner, and much more maintainable.
If you don’t understand exactly what constructs end up using indentation to insert braces and semicolons, that’s normal: I haven’t explained it yet :). We’ll be going over this whole business in much more detail when we implement this system, so that it’s fresh in our heads.
Our First Stage
So, having seen what we’re going to be accomplishing in this part, let’s actually start the scaffolding for what we want to do.
A trivial Lexer
The first thing we want to do is to create a trivial lexer. This will serve as a placeholder for now. Our goal here is to write the command line wrapper around this code first, and then expand it into an actual lexer later.
Anyhow, we need a Lexer module, so let’s add that to our exposed-modules in our .cabal file:
library
...
exposed-modules: Ourlude
, Lexer
...
And then, let’s create a file for this module, in src/Lexer.hs:
{-# LANGUAGE LambdaCase #-}
{-# LANGUAGE TupleSections #-}
module Lexer (Token(..), lexer) where
import Ourlude
We’re using two extensions here to add some extra features to the code in this module.
The first, LambdaCase, allows us to replace expressions like:
\x -> case x of
Foo -> ...
Bar -> ...
with:
\case
Foo -> ...
Bar -> ...
which explains the naming of this feature.
The second, TupleSections, allows us to partially apply tuples, in the same way you can partially apply operators. For example, you can do:
map (,3) ["A", "B"]
-- [("A", 3), ("B", 3)]
Without the extension, you’d have to write the less terse \x -> (x, 3) instead.
Both of these extensions are included in this module because they simplify some code here and there. LambdaCase, in particular, will even be used pretty heavily throughout the project.
We also have import Ourlude, in order to import the custom prelude we defined last time.
For now, let’s create a “stub” implementation of our lexer:
data LexerError = UnimplementedError deriving (Eq, Show)
data Token = NoTokensYet deriving (Eq, Show)
lexer :: String -> Either LexerError [Token]
lexer _ = Left UnimplementedError
These are the types that our lexer will eventually end up producing. Given a string containing the source code we need to lex, we should either produce an error, of type LexerError, or the list of tokens we managed to lex.
Right now, we just have a dummy token type, and an error indicating that our lexer has not been implemented yet.
Making a CLI program
Before we actually implement these functions, we’re first going to integrate the lexer into our command line program, so that the user can run it on an actual file. We can reuse this boilerplate for the other stages of our compiler as well. With this out of the way now, we can focus on the guts of the compiler as we move forward.
First, let’s add the imports we’ll be needing in Main.hs:
import Control.Monad ((>=>))
import Data.Char (toLower)
import qualified Lexer
import Ourlude
import System.Environment (getArgs)
import System.Exit (exitFailure)
import Text.Pretty.Simple (pPrint, pPrintString)
We import our Lexer module, since we’ll be using that here.

We also import our prelude, of course, as well as some small helpers like >=>, and toLower.

We have some IO stuff we’ll be using, like getArgs :: IO [String] for getting the arguments passed to our program, as well as exitFailure :: IO (), to quit the program if an error happens.

pPrint and pPrintString come from the pretty-simple library we mentioned last time, which can format values and strings in a nice way.
Stages: The Idea
We’re going to make a bit of an abstraction over the concept of a stage. Our compiler will be composed of various stages, such as the lexer, the parser, etc. Each stage takes in some input, and produces some output, potentially failing with an error. Our lexer takes in a String as input, produces [Token] as output, and fails with LexerError. Our stage abstraction should represent this concept of input, output, and errors.
We also want to be able to produce some kind of function, or IO procedure to execute, knowing up to which stage we want to proceed. So if the user does haskell-in-haskell lex foo.hs, we know that they want to stop and see the output of the lexing stage, whereas haskell-in-haskell type foo.hs would mean stopping after the type-checking phase, etc.
We need to use our stage abstraction to construct some kind of IO action, and figure out which of these actions to run based on this first argument. So if we see lex, we run lexAction, if we see type, we run typeAction, etc.
Stages: The Implementation
The first thing we need is some kind of opaque error type. We want this so that our stage abstraction doesn’t have to keep track of the error type, for one, and also so that we can annotate the errors produced by that stage with some metadata, like the name of that stage.
So, in Main.hs, let’s create:
data StagedError = StagedError String String
Unfortunately, the two arguments are easily confused, but the first will be the name of the stage, and the second will be the error produced by a given stage, already converted to a String.
We want to be able to transform Either ThisStageError ThisStageOutput to use this error convention, so let’s create:
stageEither :: Show e => String -> Either e a -> Either StagedError a
stageEither name m = case m of
Left e -> Left (StagedError name (show e))
Right a -> Right a
Given a name (which is presumably that of a given stage), we can massage the error into a StagedError by showing the error, and putting the name beside it.
We want to be able to print this StagedError type in a nice way, so let’s create:
printStagedError :: StagedError -> IO ()
printStagedError (StagedError name err) = do
putStr name
putStrLn " Error:"
pPrintString err
This uses pPrintString from the pretty-simple library to print out the error in a nice way.
With this error type in hand, we can create our stage abstraction:
data Stage i o = Stage
{ name :: String,
runStage :: i -> Either StagedError o
}
So, Stage is parameterized over an input type i, and an output type o. We use this to create a function taking input i, and producing output o, potentially failing with a StagedError. We also have the name of the stage as an extra bit of metadata.
We might have Stage String [Token] for our Lexer, Stage [Token] SyntaxTree for our Parser, etc.
Our lexer is of the form String -> Either LexerError [Token]. If we squint at this a bit, we see i -> Either e o. We want to transform these kinds of functions into a Stage, so let’s create a little helper:
makeStage :: Show e => String -> (i -> Either e o) -> Stage i o
makeStage name r = Stage name (r >>> stageEither name)
We can work on any function i -> Either e o, so long as it has an error we can use show with, as explained earlier. We also include the name of the stage here.
We can go ahead and define a stage for our lexer:
lexerStage :: Stage String [Lexer.Token]
lexerStage = makeStage "Lexer" Lexer.lexer
Everything works out nicely, because we took care to design our little Stage abstraction around how the different parts of our compiler actually work.
We’d also like to be able to compose our stages together; that’s why we’re creating this whole abstraction. Let’s make a little helper operator:
(>->) :: Stage a b -> Stage b c -> Stage a c
(>->) (Stage _ r1) (Stage n2 r2) = Stage n2 (r1 >=> r2)
See the similarity here with:
(>>>) :: (a -> b) -> (b -> c) -> (a -> c)
One detail we do here is that the name of the composed stage is just the name of the second stage. We could’ve opted to make a new name, like Lexer >-> Parser or something, but keeping just the second name makes more sense: each stage depends on the previous one, and we never run one half of the stages without the other half. We only stop early, sometimes.
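For instance, once we have a (for now hypothetical) parserStage :: Stage [Lexer.Token] SyntaxTree in a later part, we’d be able to write:

lexerStage >-> parserStage :: Stage String SyntaxTree

and running that composed stage would stop at whichever stage fails first.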
Now, let’s add some code to create an IO action from a given Stage:
printStage :: Show b => Stage a b -> a -> IO ()
printStage (Stage name r) a = case r a of
Left err -> do
printStagedError err
exitFailure
Right b -> do
putStrLn (name ++ ":")
pPrint b
The idea is that we create a function accepting the Stage’s input, and then printing out whatever output it produces, or printing the error, and terminating the program with exitFailure, indicating that a failure occurred.
Parsing Arguments
We’re getting close to having a little program we can run! The next thing we’ll be doing is defining a type for the arguments our program accepts from the command line:
data Args = Args FilePath (String -> IO ())
The first part will be the path to the Haskell code we’re handling, but the second part is a bit unorthodox. The idea is that it represents the stage we’ve parsed. We could have created an enum for which stage the user wants to go up to, and then interpreted that to make an IO action, but it’s easier to just parse out an action directly. The reason we have this String -> IO () representation is that every single stage is able to hide behind it. We’ll be creating Stage String b for each of our “go up to this part” stages. They all take in the source code as input, but the output varies. We can use printStage to get around this, and store all of the stages in the same way.
Here’s how we’ll be reading the stages:
readStage :: String -> Maybe (String -> IO ())
readStage "lex" = lexerStage |> printStage |> Just
readStage _ = Nothing
For now, we just accept a single stage “lex”, which runs the only stage we have, and prints out the results.
Then, to use the args we’ve produced, we just read in the source file, and then run the action:
process :: Args -> IO ()
process (Args path stage) = do
content <- readFile path
stage content
Since we’ve already read in the stage as String -> IO (), this works out just fine.
We can now write a function to parse out all of the arguments:
parseArgs :: [String] -> Maybe Args
parseArgs (stageName : file : _) =
Args file <$> readStage (map toLower stageName)
parseArgs _ = Nothing
This takes in the list of arguments as passed in to the program, and returns the parsed arguments. The only trick here is that we lower case the stage name, so that “lex”, “LEX”, “lEx”, etc. refer to the same stage. For example for:
haskell-in-haskell lex file.hs
We would have:
["lex", "file.hs"]
as our arguments.
Remaining Glue
Finally, we can modify main to glue all of these functions together:
main :: IO ()
main = do
stringArgs <- getArgs
case parseArgs stringArgs of
Nothing -> putStrLn "Unrecognized arguments"
Just args -> process args
getArgs :: IO [String] returns the command line arguments passed to us. We then use these to parse out Args, printing out a failure message, or proceeding with the process function we defined earlier.
Now we can actually run the lexer, to see something like
> cabal run haskell-in-haskell -- lex foo.hs
Lexer Error:
UnimplementedError
Oof, that was a mouthful, but now all of this boilerplate is out of our way for the rest of the compiler. Adding new stages is going to be a breeze!
How do you make a Lexer?
That was a dive into some practical Haskell, although mostly boilerplate. Now let’s go back up a layer of abstraction, and think a bit about what a lexer really is, and how it works in theory.
At the end of the day, our lexer will be scanning over our input string character by character, outputting tokens as it goes.
For simple operators, it’s easy to understand how our lexer is going to work. For example, if we have some operators like:
+*-
the lexer can see the +, and then immediately spit out a Plus token, or something like that. When it sees the *, it spits out Times; with -, we get Sub, etc.
If our lexer sees a character it doesn’t know, it can blow up with an error.

For some other characters, the lexer will simply ignore them.
If we only need to lex out single characters, then an interesting thing to notice is that our lexer keeps track of no state whatsoever. After seeing a single character, the lexer is straight back to the same state it started with, ready to accept another character.
Character Classes
Our lexer will have to produce tokens composed of more than just a single character of input. Integers, for example. We want to accept things like 3, 234, 2239034. Integer literals can be arbitrary strings of digits. To handle this, our lexer needs to enter a different “mode”, or “state”, as soon as it sees a digit. Instead of immediately producing some output, it instead shifts to a mode where it tries to consume the digits that remain, and only “breaks out” once it sees something that isn’t a digit. At that point, it can produce the token for that integer literal, and return back to its initial state.
Many more of the tokens we produce will involve multiple characters. For example, to match out against a string literal like "Hello World", we need to realize that a string literal is happening when we see the opening ", and then keep on consuming characters for that literal until we see the closing ".
Quite a few tokens in our language will be keywords. For example, let, where, data, type. These can be consumed using “modes” or “states” as well. For example, when we see an l, our lexer could enter a mode expecting to see e, followed by t, to finish out the keyword.
So far we’re developing a vision of a lexer as a “finite state machine”. The idea is that at any point in time, our lexer is in a given state, and we can move to different states as we see different characters, potentially spitting out some tokens along the way. For example, a state machine that accepts arithmetic tokens like 2 + 23 * 44 would need just a starting state, plus a state for consuming the digits of an integer literal.
This state machine idea is also intimately connected to regular expressions, and in fact all the different classes of characters that make up our tokens correspond to regular expressions, and we should be able to recognize whether or not our source code is lexically valid with a giant regex.
There’s a rich theory around the connection between lexers, regular expressions, and state machines, so I’ll refer you to other resources if you’re interested in that.
There’s a great chapter on lexing in Crafting Interpreters, which explains these concerns in much more detail.
Engineering a Compiler also has a good section on lexing, if you really want a lot more detail.
Longest Match
Another issue we’ve completely glossed over is that of conflicts between different classes of characters. We’ve specifically looked at nice examples where everything is in a completely different bubble, with no interaction between the different characters.
Now, the biggest source of ambiguity is between identifiers and keywords. For example, if we see where, then that is the keyword where. No ambiguity whatsoever. But if we add just one character, to get where2, then suddenly that’s an identifier, and where2 = 3 is a valid bit of Haskell. So our previous idea of seeing a w, and jumping straight towards assuming we’ve seen the start of the keyword where, is not going to work. In fact, if we have some kind of rule like: “an identifier is a lower case alpha character followed by any number of alphas and digits”, then it conflicts with every single keyword!
How do we choose which to pick?
The idea is that we should keep applying a rule greedily. That is, if there are two possible ways of pulling out a token from some string, we pick out the one that consumes more input. So if we see wherefoo, we can parse this in two ways:

- where (keyword), followed by foo
- wherefoo (identifier)

But we should clearly accept the second one, because it consumed more of the input. This rule is enough to disambiguate keywords and identifiers, as long as we give priority to keywords, for when we have an exact match.
Lexer Combinators
So, that was a quick overview of some of the abstract ideas around lexing, and now let’s dive into creating a little framework in Haskell for building up our lexer. The idea is that we want to be able to build up a large lexer for Haskell source code from lexers for smaller parts.
To do this we’ll use Lexer Combinators, a topic I’ve already talked about previously. See that post if you want even more detail and explanation after what we see here.
The Basic Definition
The basic idea of a Lexer Combinator is that a lexer is nothing more than a function taking in some input, in our case a String, and then either throwing an error, because it does not recognize what it sees, or consuming part of the input, returning a value, along with the remaining part of the input it hasn’t touched.

So, something like String -> Either LexerError (a, String) in Haskell syntax. In fact, let’s go ahead and go back to src/Lexer.hs.
Imports
As is going to become a custom throughout this series, we’re adding all of the imports we need for the rest of this module right away:
import Control.Applicative (Alternative (..), liftA2)
import Control.Monad.Except (ExceptT, MonadError (throwError), runExceptT)
import Control.Monad.State (State, gets, modify', runState)
import Data.Char (isAlphaNum, isDigit, isLower, isSpace, isUpper)
import Data.List (foldl', foldl1')
import Data.Maybe (listToMaybe, maybeToList)
import Ourlude
We have some utilities for checking properties of characters, some convenient list folds that sadly aren’t in the default Prelude, and the helper listToMaybe. The other monadic stuff will be introduced in due time as we get to using these utilities.
Errors
Let’s go ahead and redefine the type of errors that our lexer will be producing:
data LexerError
= Unexpected Char
| UnexpectedEOF
| UnmatchedLayout
deriving (Eq, Show)
The first, Unexpected char, is used when we encounter a character that we don’t know how to handle. UnexpectedEOF is when we reach the end of the input string before we’re done lexing some token. For example, if you start a string literal, but never end it, like "abbbbb, then you’ll get an UnexpectedEOF when all of the input has been consumed.
The last error is related to whitespace inference, and will be explained more once we reach that part.
We can go ahead and create a utility function to produce the right error given the remaining input:
unexpected :: String -> LexerError
unexpected [] = UnexpectedEOF
unexpected (c : _) = Unexpected c
If there’s no input left, and we’re in an unexpected situation, then we weren’t expecting the input to end so early, and should produce an UnexpectedEOF error. Otherwise, seeing whatever character is at the front of our input was the problem.
The Core Type
Now let’s define the actual data type for lexers:
newtype Lexer a = Lexer {
runLexer :: String -> Either LexerError (a, String)
}
So a Lexer is nothing more than a wrapper around a function taking in some input, and either failing with a LexerError, or producing a value of type a, along with the remaining string that we have yet to consume.
We can go ahead and write a few basic “building blocks” that we’ll be needing later.
The first lexer we’ll make is one that accepts a single character matching a predicate, producing whatever character was accepted:
satisfies :: (Char -> Bool) -> Lexer Char
satisfies p =
Lexer <| \case
c : cs | p c -> Right (c, cs)
rest -> Left (unexpected rest)
This inspects the input we’ve been given, and only consumes the first character if it can pull it out of the string, and it satisfies the predicate we’ve been given.
An immediate application of this lexer is:
char :: Char -> Lexer Char
char target = satisfies (== target)
which will match an exact character. So we might do char '+', for example.
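As a quick sanity check, here’s roughly how this behaves if we run it by hand (an illustrative GHCi session, once the module is loaded):

runLexer (char '+') "+23"
-- Right ('+', "23")
runLexer (char '+') "x23"
-- Left (Unexpected 'x')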
Now we need to implement some fundamental typeclasses for Lexer, which will be used for composing them together.
Functor
The first is the venerable Functor class:
instance Functor Lexer where
fmap f (Lexer l) = Lexer (l >>> fmap (first f))
Here we have

fmap :: (a -> b) -> Lexer a -> Lexer b

as the concrete type for this function. All this really does is some plumbing to change the output our lexer produces; fmap has no effect whatsoever on what characters a lexer accepts.
This also gives us access to other functions that use Functor. For example, we could now transform fmap (const Plus) (char '+') into Plus <$ char '+', since <$ :: Functor f => a -> f b -> f a is defined for any type implementing Functor.
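To make that concrete, here’s an illustrative example (Plus is one of the token constructors we’ll define further down):

runLexer (Plus <$ char '+') "+23"
-- Right (Plus, "23")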
Applicative
The next type class is Applicative, which, in concrete terms, requires us to define two functions:
pure :: a -> Lexer a
(<*>) :: Lexer (a -> b) -> Lexer a -> Lexer b
The concrete idea here is that pure will create a Lexer that produces a value without consuming any input whatsoever. <*> will allow us to combine two lexers sequentially, using the function output by the first, and the value output by the second, to produce a final result.
So, let’s implement it:
instance Applicative Lexer where
pure a = Lexer (\input -> Right (a, input))
Lexer lF <*> Lexer lA =
Lexer <| \input -> do
(f, rest) <- lF input
(a, s) <- lA rest
return (f a, s)
pure is pretty self-explanatory: we simply pass along the input without consuming anything, as explained before.
For <*>, things are a bit more interesting. Notice how we’re making use of do notation for Either LexerError here. Another important detail is that we use the remaining input of the first lexer to feed to the second lexer. This might seem like a bit of an arbitrary choice, but it’s actually important. For example, (,) <$> char 'A' <*> char 'B' should accept the string "AB". To do this, we need to have consumed the character A before “moving on” to the second lexer.
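Concretely, the example from the previous paragraph would behave like this (illustrative):

runLexer ((,) <$> char 'A' <*> char 'B') "ABC"
-- Right (('A', 'B'), "C")
runLexer ((,) <$> char 'A' <*> char 'B') "AC"
-- Left (Unexpected 'C')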
In fact, now that we have Applicative, we can get a lexer recognizing an entire string “for free”:
string :: String -> Lexer String
string = traverse char
This uses the very cool traverse :: Applicative f => (a -> f b) -> [a] -> f [b], which in our case first makes a bunch of lexers recognizing each respective character in a given string, and then sequences them to recognize the entire string, one character at a time.
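For example (illustrative):

runLexer (string "let") "lettuce"
-- Right ("let", "tuce")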
Alternative
We can create somewhat sophisticated lexers parsing long strings by chaining them sequentially with <*>, but one thing that’s sorely missing is the ability to choose between different options. I can take two lexers and say “I want to lex A, followed by B”, but I can’t say “I want to lex A, or B”. Our tokens are going to need this mechanism. Each token is going to be one of the options we can choose from.
Thankfully, there’s a type class for this: Alternative.
Concretely, it requires us to implement:
empty :: Lexer a
(<|>) :: Lexer a -> Lexer a -> Lexer a
<|> implements the intuitive notion of choice we want, and empty acts as the identity for this operation, that is to say:
empty <|> lexerA = lexerA = lexerA <|> empty
Now, <|> is going to have to run both of the lexers. If one of them fails, then we have to rely on the result of the other one. If both succeed, then we have to implement the longest match rule we talked about earlier. If this results in a tie, then we want to give priority to the left one.
Let’s implement the class:
instance Alternative Lexer where
empty = Lexer (Left <<< unexpected)
Lexer lA <|> Lexer lB =
Lexer <| \input -> case (lA input, lB input) of
(res, Left _) -> res
(Left _, res) -> res
(a@(Right (_, restA)), b@(Right (_, restB))) ->
if length restA <= length restB then a else b
For empty, we have a lexer that will fail on any input.

For <|>, we pattern match on both the results given the same input, at the same time. If one of them fails, we rely on the other result. If both succeed, we need to dive in a bit more, and compare the lengths of the remaining inputs. We pick the result that has less input remaining, i.e. that has consumed more input, i.e. that is the longest match. On ties, when the remaining inputs have equal length, we still choose a. This means that we’re biased towards the left lexer, as we talked about earlier.
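Here’s the longest match rule in action (illustrative):

runLexer (string "where" <|> string "wherefoo") "wherefoo"
-- Right ("wherefoo", "")

The right lexer wins, since it leaves less input behind.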
We can once again put this to use, and create a helper letting us create a lexer choosing one out of many options:
oneOf :: Alternative f => [f a] -> f a
oneOf = foldl1' (<|>)
This will work with other Alternative types, and not just Lexer, although we won’t be exercising that option here.
With that done, we now have all of the building blocks and combinators we need to make a real lexer for Haskell!
Implementing the Lexer
Alright, we’ve created a bit of a framework for building lexers. This might seem a bit abstract right now, but will all fit into place very quickly as we put it into practice.
Tokens
The first thing we need to do is actually define the type of tokens we want our lexer to return.
We need a couple classes of tokens, basically:
- Keywords, like where, let, case
- Operators, like +, ->, ::
- Literals, which we have for Bool, Int, and String
- Hardcoded primitive types, in our case Bool, Int, and String
- Identifiers, which come in two flavors in Haskell.
I apologize for the information dump, but it’s easier to just write them all out, since most of them are just dumb labels:
data Token
= Let -- `let`
| Where -- `where`
| In -- `in`
| Data -- `data`
| Type -- `type`
| If -- `if`
| Then -- `then`
| Else -- `else`
| Case -- `case`
| Of -- `of`
| Underscore -- `_`
| OpenParens -- `(`
| CloseParens -- `)`
| OpenBrace -- `{`
| CloseBrace -- `}`
| Semicolon -- `;`
| DoubleColon -- `::`
| ThinArrow -- `->`
| VBar -- `|`
| BSlash -- `\`
| FSlash -- `/`
| Plus -- `+`
| PlusPlus -- `++`
| Dash -- `-`
| Asterisk -- `*`
| Equal -- `=`
| Dollar -- `$`
| LeftAngle -- `<`
| Dot -- `.`
| LeftAngleEqual -- `<=`
| RightAngle -- `>`
| RightAngleEqual -- `>=`
| EqualEqual -- `==`
| FSlashEqual -- `/=`
| VBarVBar -- `||`
| AmpersandAmpersand -- `&&`
| IntLit Int -- An Int literal
-- A String literal
| StringLit String
-- A Bool literal
| BoolLit Bool
-- The type `Int`
| IntTypeName
-- The type `String`
| StringTypeName
-- The type `Bool`
| BoolTypeName
-- A name starting with an uppercase letter
| UpperName String
-- A name starting with a lowercase letter
| LowerName String
deriving (Eq, Show)
The literals for Int, String, and Bool have values of the corresponding types, as we might expect. Haskell has this peculiarity where identifiers come in two flavors: the upper case identifiers, like Foo230 and Baz240, which are used to indicate types, and lower case, like fooA343, or bazB334.
A lexer for tokens
Now, we can write a Lexer that produces these tokens! We have:
token :: Lexer (Token, String)
token = keyword <|> operator <|> literal <|> name
where
with :: Functor f => b -> f a -> f (b, a)
with b = fmap (b,)
...
with is nothing more than a little helper we’ll be using shortly. Our lexer is defined in terms of a handful of sub-lexers we’ll also be defining in that where block. The high-level idea is that we’re saying that a token is either a keyword, an operator, a literal, or a name. Notice that we put keyword to the left of name, so that keywords have priority over identifiers!
Keywords are defined as lexers accepting exactly the right string:
keyword :: Lexer (Token, String)
keyword =
oneOf
[ Let `with` string "let",
Where `with` string "where",
In `with` string "in",
Data `with` string "data",
Type `with` string "type",
If `with` string "if",
Then `with` string "then",
Else `with` string "else",
Case `with` string "case",
Of `with` string "of",
Underscore `with` string "_"
]
Here we’re using with in infix notation, in order to include the string along with the token, which we’ll be needing later. Another fun detail is that we consider _ to be a keyword. Whether or not something is a keyword is more so about making the parser’s job easier, rather than reflecting the semantics of the language. We could’ve made _ just a normal identifier, but then we would have had to do extra work to separate out this token in the parser.
Anyhow, let’s look at operators:
operator :: Lexer (Token, String)
operator =
oneOf
[ OpenParens `with` string "(",
CloseParens `with` string ")",
OpenBrace `with` string "{",
CloseBrace `with` string "}",
Semicolon `with` string ";",
DoubleColon `with` string "::",
ThinArrow `with` string "->",
VBar `with` string "|",
BSlash `with` string "\\",
FSlash `with` string "/",
Plus `with` string "+",
PlusPlus `with` string "++",
Dash `with` string "-",
Asterisk `with` string "*",
Equal `with` string "=",
Dot `with` string ".",
Dollar `with` string "$",
LeftAngle `with` string "<",
LeftAngleEqual `with` string "<=",
RightAngle `with` string ">",
RightAngleEqual `with` string ">=",
FSlashEqual `with` string "/=",
EqualEqual `with` string "==",
VBarVBar `with` string "||",
AmpersandAmpersand `with` string "&&"
]
Once again, we’re just mapping strings to labels. All of the code to do the lexing work is taken care of by the framework we set up before!
Now let’s move on to something a bit more interesting: literals:
literal :: Lexer (Token, String)
literal = intLit <|> stringLit <|> boolLit
where
...
So, a literal is either an integer literal, a string literal, or a bool literal, as we went over above.
For integers, we have:
intLit :: Lexer (Token, String)
intLit =
some (satisfies isDigit)
|> fmap (\x -> (IntLit (read x), x))
So, some accepts one or more of a given lexer. It’s actually defined for any Alternative! We accept one or more digits, and then use read to parse that string of digits as an actual Int.
For strings, we have:
stringLit :: Lexer (Token, String)
stringLit =
char '"' *> many (satisfies (/= '"')) <* char '"'
|> fmap (\x -> (StringLit x, x))
The idea is that we require a ", followed by many characters that are not the closing ", and then that closing ". Our manipulation with *> and <* just makes sure that we don’t include the delimiters in our string literal.
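As a small aside on how *> and <* discard the delimiters, here’s a standalone illustrative example using our combinators:

runLexer (char '(' *> char 'x' <* char ')') "(x)!"
-- Right ('x', "!")

Both parentheses are consumed from the input, but only the x in the middle ends up in the result, which is exactly what we want for the quotes around a string literal.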
And finally, we have boolean literals:
boolLit :: Lexer (Token, String)
boolLit =
(BoolLit True `with` string "True")
<|> (BoolLit False `with` string "False")
This approach is the same one we used for keywords and operators before, and makes sense since we only have two values for Bool.
Now the only remaining “sub-lexer” is for names, where we’ll also be including the names of primitive types:
name :: Lexer (Token, String)
name = primName <|> upperName <|> lowerName
where
...
A name is either the name for a primitive type (with priority), a name starting with an upper case alpha, or a name starting with lower case alpha.
For primitive names, we use the same approach as with keywords:
primName :: Lexer (Token, String)
primName =
(IntTypeName `with` string "Int")
<|> (StringTypeName `with` string "String")
<|> (BoolTypeName `with` string "Bool")
For normal names, we want a few helpers as well:
continuesName :: Lexer Char
continuesName = satisfies isAlphaNum <|> char '\''
followedBy :: Lexer Char -> Lexer Char -> Lexer String
followedBy l1 l2 = liftA2 (:) l1 (many l2)
upperName :: Lexer (Token, String)
upperName =
(satisfies isUpper `followedBy` continuesName)
|> fmap (\x -> (UpperName x, x))
lowerName :: Lexer (Token, String)
lowerName =
(satisfies isLower `followedBy` continuesName)
|> fmap (\x -> (LowerName x, x))
continuesName is a class of characters that can appear after the first character of an identifier, allowing us to “continue” that identifier. This includes any alphanumeric character, as well as the ', which Haskell also allows.
We then have a helper, followedBy, which parses the first class of things, followed by zero or more occurrences of the second thing. many is like some, except it allows zero occurrences as well.
With these in hand, our two different kinds of name are simple to define. The difference is that one starts with an upper case character, and the other starts with a lower case character.
An actual lexer
Now, let’s actually replace our lexer function to use this framework:
lexer :: String -> Either LexerError [Token]
lexer = runLexer (some token) >>> fmap (fst >>> map fst)
Instead of token, which only parses a single token, we use some to parse one or more occurrences of a token, keeping just the tokens themselves, and discarding the string contents alongside them. At this point, we should be able to lex out some interesting things, although we won’t be handling whitespace between tokens just yet.
Let’s try out a few examples.
You can actually run these at this point, creating a file foo.hs to contain the examples, and then using:
cabal run haskell-in-haskell -- lex foo.hs
And then running this on:
x=2+2
should output
[ LowerName "x"
, Equal
, IntLit 2
, Plus
, IntLit 2
]
We can also accept things that make no sense syntactically:
x->::Int
which outputs:
[ LowerName "x"
, ThinArrow
, DoubleColon
, IntTypeName
]
We can already have a lot of fun doing things like this right now! It’s satisfying to start to see a bit of vim from our compiler. But we really need to be able to handle spaces, and also infer semicolons and braces!
Handling Whitespace
Right now we have a lexer for all of the basic building blocks of Haskell, but it can’t handle any whitespace at all! There are two aspects of whitespace that we’re going to be tackling in this section.
Haskell, like basically every language, allows you to put plenty of space between tokens. For example, what we can now lex:
x=2+2*2
could also be written in the prettier form:
x = 2 + 2 * 2
or even:
x    =    2  +  2   *   2
These should all generate the same tokens. The bits of whitespace contribute nothing, and should be ignored.
In C, for example, everything is delimited using braces and semicolons, so arbitrary amounts of both vertical and horizontal whitespaces are simply completely ignored when lexing. You could remove all whitespace from your program and get the same result.
In Haskell, on the other hand, whitespace can be significant, when it’s used to create layouts where braces and semicolons should be inferred. This layout inference is the second aspect that we’ll be working on, and is much more complicated than simply ignoring whitespace. In fact, because of this aspect, we can’t just take the easy approach of filtering out whitespace characters as a whole, and instead need to be quite picky about what whitespace we see, since vertical whitespace acts differently than horizontal whitespace.
Haskell’s layout structure
I’m assuming that you’ve programmed a reasonable amount in Haskell before, so you’re probably familiar on an intuitive level with how you can use whitespace to layout blocks. In fact, you probably haven’t used braces and semicolons in your programs, and maybe you didn’t even know they were an option. Let’s try to go from our intuition about how things should work, to a more robust understanding.
As an example, consider something like:
x = foo + bar
  where
    foo = 3
    bar = 4

y = 3
The first rule that you’ve likely internalized is that everything at the same level of indentation continues whatever the current block is. Because foo and bar are at the same level of indentation, we expect them to be in the same block.

Similarly, x and y are in the same block. This rule also makes it so that we have to indent the where, so that it belongs to x, and is not seen as continuing the same layout that x and y belong to.

We also require that foo and bar are further indented than the where.
With explicit braces and semicolons, we have:
{
x = foo + bar
  where {
    foo = 3;
    bar = 4
  }
;
y = 3
}
This is just the realization of our implicit intuition about where scopes should be, and how they work.
A final rule we’ve all internalized is that layouts only happen after certain keywords.
When we see:
x =
  2 + 3
  * f x
everything after the = is just a single expression, all bound to x. This all belongs to x because it’s further indented, but no layout has been introduced after the =.
If we generate semicolons like this:
x = {
  2 + 3;
  * f x
}
that would obviously not be correct.
On the other hand, from a lexical point of view, if we had:
x = a
  where
    2 + 3
    * f x
then we should insert a layout, and end up with:
{
x = a
  where {
    2 + 3;
    * f x
  }
}
The where keyword makes us ready to see this kind of whitespace sensitive layout of code. The other keywords like this (for our subset) include let, and of.
Position Information
With our intuitive understanding, we’ve realized that it’s very important to keep track of “where” a token is, both “vertically”, and “horizontally”.
For example:
x = y
does not yield the same layout as:
x =
  y
and
let
  x = 1
  y = 2
in
should not yield the same layout as:
let
  x = 1
    y = 2
in
What column a token appears at is quite important. Whether or not a token is at the start of a line also matters. We’ll need to annotate our tokens with this position information.
So, in Lexer.hs, let’s create some types to represent this information:
data LinePosition = Start | Middle deriving (Eq, Show)
data Positioned a = Positioned a LinePosition Int deriving (Eq, Show)
Positioned a adds position information to something of type a. We include the a itself, followed by its LinePosition, and the exact column at which it appears.
What we’ll be doing soon enough is making a function [Positioned Token] -> [Token], using that extra position information to infer the necessary braces and semicolons. Right now though, we just have [Token], without any position information, and no braces and semicolons.
Raw Tokens
In fact, our lexer simply blows up if it encounters any whitespace tokens, and we can’t handle comments either. The next order of business is being able to lex whitespace and comments along with normal tokens.
If we weren’t whitespace sensitive, we’d still need to do this. We would simply collect all of these tokens, and then filter them out from the final output. This way, we could allow arbitrary spacing between tokens, as well as comments in the source code, without affecting the rest of the tokens.
For our purposes, it’s important to keep these tokens, so we can annotate the other “normal” tokens with position information. We’ll be using the whitespace and comment tokens around them to figure out how they’re positioned within the file.
So, let’s define a new token type that adds these extra variants:
data RawToken
= Blankspace String
| Comment String
| Newline
| NormalToken Token String
Blankspace represents “horizontal” whitespace, i.e. whitespace containing no newlines. For newlines, we have the Newline token. We also have comments, with the Comment constructor, and then finally a little wrapper for normal tokens, along with their string contents.
Let’s go ahead and create a lexer for these raw tokens:
-- A Lexer for raw tokens
rawLexer :: Lexer [RawToken]
rawLexer = some (whitespace <|> comment <|> rawToken)
where
rawToken = fmap (uncurry NormalToken) token
...
So, our raw lexer parses at least one occurrence of a raw token, which is either whitespace, a comment, or a normal token.
For comments, we need to parse a --, and then the rest of the line:
comment = Comment <$> (string "--" *> many (satisfies (/= '\n')))
To allow for empty comments, we pick many instead of some.
Now, for whitespace, we can either have horizontal whitespace, or a single newline:
whitespace = blankspace <|> newline
If we have multiple newlines, we’ll end up generating multiple tokens.
Horizontal whitespace is just one or more whitespace characters that are not newlines:
blankspace =
Blankspace <$>
some (satisfies (\x -> isSpace x && x /= '\n'))
And a newline is generated when we see a \n:
newline = Newline <$ char '\n'
With that, we can now parse out the whitespaces and comments.
We could now make a lexer for “whitespace insensitive Haskell”, if we wanted to.
We would replace some token with rawLexer in our lexer function, and then keep only the normal tokens.
Positioning Tokens
The next step is using all of this information to annotate the normal tokens using the Positioned type we defined earlier.
We want to have a function [RawToken] -> [Positioned Token]. Given the list of raw tokens, we produce a list of tokens with positions. In the process, all of the whitespace and comment tokens are discarded.
The basic idea of the algorithm is going to be a somewhat imperative one.
We’ll keep track of (LinePosition, Int), which will serve as the position of the next token, when we end up producing it. As we see raw tokens, we’ll update this state, or actually produce a token positioned with this information.
In practice, we have this:
type PosState = (LinePosition, Int)
position :: [RawToken] -> [Positioned Token]
position = foldl' go ((Start, 0), []) >>> snd >>> reverse
where
go :: (PosState, [Positioned Token]) -> RawToken -> (PosState, [Positioned Token])
So, we’re folding from the left, and the accumulator will hold the current position information, as well as a list of tokens we’re producing. After the fold is done, we discard this position state, but keep the tokens we produced. We produce a new token by adding it to the front of the list, so we want to reverse this order after we’re finished.
Before we define go, we need a little helper function, eat:
eat :: PosState -> RawToken -> (PosState, Maybe (Positioned Token))
eat will take in the current position information, and a raw token, and then adjust the current position. It may also produce a token with position information, when we come across an actual token, and not one of the extra whitespace or comment tokens.
The implementation looks like this:
eat :: PosState -> RawToken -> (PosState, Maybe (Positioned Token))
eat (pos, col) = \case
Newline -> ((Start, 0), Nothing)
Comment _ -> ((Start, 0), Nothing)
Blankspace s -> ((pos, col + length s), Nothing)
NormalToken t s ->
let token = Positioned t pos col
in ((Middle, col + length s), Just token)
When we see a newline, regardless of what position we had earlier, the new position will be at the first column, at the start of the next line.
A comment acts in the same way, since a comment is effectively like a new line, in terms of positioning.
When we see blank space, we advance the current column, but whether we’re at the start or the middle of a line is left unchanged. If we were previously at the start of a line, and then see some blank space, we’re still at the start of the line, since we haven’t seen a “real” token yet.
When we see an actual token, we’re going to produce that token, along with the current position information. We then adjust the position information to now be in the middle of that line, and we move the column forward based on how long the string contents of that token were.
With that done, the implementation of go is straightforward:
go :: (PosState, [Positioned Token]) -> RawToken -> (PosState, [Positioned Token])
go (p, acc) raw =
let (p', produced) = eat p raw
in (p', maybeToList produced <> acc)
We adjust the position information according to eat, and add a token to our accumulator if eat happened to produce one.
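To make this concrete, here’s (by my reckoning, stepping through the code above) what positioning the raw tokens for this tiny two-line snippet:

x =
  2

would produce:

[ Positioned (LowerName "x") Start 0
, Positioned Equal Middle 2
, Positioned (IntLit 2) Start 2
]

The = sits in the middle of the first line at column 2, while x and 2 each start their respective lines, at columns 0 and 2.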
Alrighty, now we have a little function to produce the position information for our tokens, all we need is to make an algorithm that uses all of this information to insert braces and semicolons in the right places!
A Stateful Algorithm
The algorithm we’re going to be working on is an imperative one. The idea is that we’ll be iterating over the positioned tokens, while keeping track of some bits of state, and then yielding tokens as we go. In response to the position information, we might modify our current state, or yield some extra semicolon or brace tokens. We also want to yield all of the original tokens from the source code.
Layouts
The first thing we need to keep track of is the layouts:
data Layout = Explicit | Implicit Int
We distinguish between Explicit layouts, which are created by the user with {}, and Implicit layouts, which are inferred based on whitespace.
Let’s say we have some code like:
let {
  x = 3
} in x
When we see a {, we’ve entered an Explicit layout, and we need to keep track of this information, because we will not be inserting any semicolons while we remain in this layout, and we’ll also need an explicit closing brace.
On the other hand, if we have:
let
  x = 3
in x
then this is an Implicit layout, in fact Implicit 2, since the (zero-based) column where the layout starts is 2. This number matters, as we’ve seen before.
This is because:
let
  x = 3
  y = 2
in x + y
has a very different meaning from
let
  x = 3
    y = 2
in x + y
The former should see a semicolon inserted between the two definitions, whereas the second will have us end up with x = 3 y = 2, as if they were on one line. Syntactically, this will get rejected by our parser, but our lexer isn’t going to bat an eye yet.
We also want to be able to nest layouts, like this:
let {
  x = let
    y = 3
  in y
} in x
To do this, we’ll be keeping a stack of layouts. This will allow us to push and pop layouts as we enter and exit them.
Expecting Layouts
Another rule we mentioned in passing before is that let, where, and of are places where a layout can start. For example, when you have:
let
  x = 2
in x
After seeing let, we need to keep our eyes peeled for the next token, since that token might trigger the start of a new implicit layout. The column where that token is positioned becomes the column for that new layout as well.
On the other hand, if we see a { after a let, that indicates that we’re not going to be starting an implicit layout, in which case the indentation of that { doesn’t matter at all.
We’ll need to keep track of whether or not we’re expecting a layout.
When we see let, where, or of, we’ll set that to true, and then set it back once we’ve found the next token.
The Full State
The final thing we’ll need, mainly for convenience, is a way to yield tokens. Since we might want to yield extra tokens at certain points, it’d be nice to have a way to modify the state to have an extra few tokens in its “output buffer”. We’ll have a [Token], and then use : to add new items. Once we’ve traversed all of the positioned tokens, we can reverse this list, and return it, and this will become the final output for the lexer.
So, our full state looks like this:
data LayoutState = LayoutState
{ layouts :: [Layout],
tokens :: [Token],
expectingLayout :: Bool
}
layouts is the stack of layouts we mentioned earlier, expectingLayout is the flag for let, where, and of, and tokens is the “output buffer” we’ve just talked about.
Our code will be inspecting this state, and modifying it, as it scans through
the positioned tokens we feed it.
Contextual Computation
In this section, we have our first use of Monad Transformers to
create a little context, or DSL, in which to do some kind of operation.
We’ll be seeing plenty more of these throughout the rest of this series, so I think
it’s a good idea to explain the general idea here. If you’re already familiar
with some of the practical applications of Monad Transformers and
the mtl
library, then feel free to skim, or skip over this section.
In normal Haskell code, you’re just working with plain old values, and
function application. When you add some definitions with let, say:
let x = f a b c
    y = g a b c
in x + y
There’s no code “in between” those two definitions. The order in which we defined things doesn’t matter. This is somewhat obvious, but consider a slightly different version of this code:
do
  x <- f a b c
  y <- g a b c
  return (x + y)
This bit of code is quite similar to the first version, but something very important has changed: we’re now in a do block. Because we’re in such a block, it becomes important not only to consider each line, but also the order in which the lines occur, as well as what kind of Monad, or context, the do block is for. What do enables us to do is go from strictly pure computations, to computations in a given context, or augmented with some kind of extra capability.
One of the contexts we used recently was Maybe. Maybe allows us to do some computation with the possibility of failure. do does the job of sequencing together all of the code, using the >>= function that Maybe implements. In the above example, if f were to return Nothing, then the result of the whole block would be Nothing.
There are other kinds of contexts that will be useful to us quite soon.
One example would be State s. This allows us to have some computations that manipulate some state of type s. We can query for the current state, and also modify it. It’s as if we embedded a little imperative DSL into our Haskell code.
Of course, State s is essentially sugar for s -> (a, s), and >>= does the drudge work of passing the correct state around as new states are returned. No mutation actually happens, but plenty of mutation is certainly expressed in the DSL provided by State s.
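As a quick illustrative example of that DSL, using the same Control.Monad.State module we imported earlier:

bumpAndShow :: State Int String
bumpAndShow = do
  modify' (+ 1) -- "mutate" the state
  gets show     -- then read it back out

-- runState bumpAndShow 41 == ("42", 42)

All of the apparent mutation is just the state being threaded along by >>= behind the scenes.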
Where transformers come into play is in their ability to let us combine multiple contexts together. If we want the ability to fail with an error, and to manipulate a given state, we’ll need to combine both State s and Either e. This is where things like StateT and ExceptT are useful.
LayoutM
We can use these raw combinations directly, or encapsulate them behind a newtype. We’ll be using newtypes later on. The idea behind wrapping a stack of transformers behind a newtype is that the DSL we want to provide is quite different than that of the different transformers. The transformers are just an implementation detail for the little imperative language we want to embed. In this case, we’ll be using a “naked” stack, but later on we’ll mostly be using newtype stacks.
We’ll be needing a context / stack for our algorithm, which we’ll aptly call LayoutM:
type LayoutM a = ExceptT LexerError (State LayoutState) a
We’re combining two Monads here: ExceptT LexerError, which allows us to fail with errors of type LexerError, and State LayoutState, which allows us to inspect and manipulate the LayoutState type that we defined earlier.
We can go ahead and create a little helper that will run a computation in LayoutM, returning the tokens it produced, in the right order:
runLayoutM :: LayoutM a -> Either LexerError [Token]
runLayoutM =
runExceptT >>> (`runState` LayoutState [] [] True) >>> \case
(Left e, _) -> Left e
(Right _, LayoutState _ ts _) -> Right (reverse ts)
The idea behind running the stack is that you need to unwrap the layers from outside to inside. So first we unwrap the ExceptT layer, and then unwrap the State layer. We need to provide a starting state, of course. That starting state indicates that no layouts have been seen yet, no tokens have been produced, but we are expecting a layout at the start of a file!

Finally, we’ll look at the output result, which will be (Either LexerError a, LayoutState). We don’t care about the output itself, only the tokens in the state, which we need to reverse, as explained before.
We expect a layout at the start of a file, because our source code will consist of multiple definitions:
x :: Int
x = y
y :: Int
y = 2
and we want to be able to infer the semicolons between the different statements, and thus the braces surrounding all of the code.
Basic Operations
As a bit of a warmup, let’s write a few of the fundamental operations we’ll be using. These make up a bit more of the “DSL” we’re making with LayoutM.
Yielding Tokens
The first operation we need is the ability to produce a token:
yieldToken :: Token -> LayoutM ()
yieldToken t = modify' (\s -> s {tokens = t : tokens s})
We modify the current state, producing a new state with the token at the front of our output buffer.
Layout Operations
We’ll be needing an operation to a push a new layout onto the stack:
pushLayout :: Layout -> LayoutM ()
pushLayout l = modify' (\s -> s {layouts = l : layouts s})
This one operates in the same way as yieldToken, except on layouts, instead of tokens.
And then we’ll need an operation to remove a layout from the stack:
popLayout :: LayoutM ()
popLayout = modify' (\s -> s {layouts = drop 1 (layouts s)})
We also want to be able to inspect the current layout:
currentLayout :: LayoutM (Maybe Layout)
currentLayout = gets layouts |> fmap listToMaybe
If layouts ends up being empty, then we’ll end up returning Nothing here.
An immediate application of this little helper is a function:

compareIndentation :: Int -> LayoutM Ordering

which tells us whether an implicit layout starting at a given column would be considered less than (LT), equal to (EQ), or greater than (GT) the current layout:
compareIndentation :: Int -> LayoutM Ordering
compareIndentation col =
let cmp Nothing = GT
cmp (Just Explicit) = GT
cmp (Just (Implicit n)) = compare col n
in fmap cmp currentLayout
The idea is that any implicit layout is greater than no layout at all, or an explicit layout. On the other hand, if we have another implicit layout, we need to compare the columns. We’ll be seeing how we use this comparison soon enough.
The Algorithm Itself
Alrighty, we’ve built up the background knowledge and some of the basic building blocks, so we can actually start dissecting the layout algorithm itself. Here’s what it looks like:
layout :: [Positioned Token] -> Either LexerError [Token]
layout inputs =
  runLayoutM <| do
    mapM_ step inputs
    closeImplicitLayouts
  where
    ...
So layout
takes in the positioned tokens we made earlier, and then
converts them into a list of normal tokens, including the generated semicolons
and braces. We might also fail with a LexerError
.
What we do is run a layout action that loops over the inputs using the step function, modifying the state and producing tokens as it goes, and then finishes up with closeImplicitLayouts.
The idea behind this operation is that at the end of a file, if any implicit layouts are still open,
we need to generate the explicit closing braces for them.
For example, if we have something like:
x = y
  where
    y = z
      where
        z = 3
Then we want to end up with:
{
x = y where {
  y = z where {
    z = 3
  }
}
}
So we need to generate all of those closing braces when we reach
the end of the token stream. Thankfully, since we keep a stack of the layouts
we’ve entered, we’ll have 3 pending implicit layouts: one for the start of the file,
and two others for the wheres.
So, we have the following definition:
closeImplicitLayouts :: LayoutM ()
closeImplicitLayouts =
  currentLayout >>= \case
    Nothing -> return ()
    Just Explicit -> throwError UnmatchedLayout
    Just (Implicit _) -> do
      yieldToken CloseBrace
      popLayout
      closeImplicitLayouts
We loop until there are no layouts left on the stack. If at any point we see an explicit layout, then that’s an error. It means that we have something like:
{
x = 2
and we never close explicit braces on the user’s behalf. That leaves an unmatched { by the time we reach the end of the file, and that’s an error.
Whenever we see an implicit layout, we generate the necessary closing }
,
and then remove that layout from the stack.
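In terms of the lexer we’ll finish assembling at the end of this post, feeding it a file like the unterminated example above should end up reporting exactly that error. Roughly (the input string here is just that example written out, so this is only an illustration):

-- lexer "{\nx = 2\n"  ==>  Left UnmatchedLayout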
Another helper we’ll need is startsLayout
:
startsLayout :: Token -> Bool
startsLayout = (`elem` [Let, Where, Of])
This encodes the fact we’ve talked about a few times, where let
, where
,
and of
are the tokens that start a layout.
The Step
The crux of the algorithm is the step
function, which contains all of
the core rules behind how layouts work in our subset of Haskell:
step :: Positioned Token -> LayoutM ()
step (Positioned t linePos col) = do
  expectingLayout' <- gets expectingLayout
  case t of
    CloseBrace -> closeExplicitLayout
    OpenBrace | expectingLayout' -> startExplicitLayout
    _
      | startsLayout t -> modify' (\s -> s {expectingLayout = True})
      | expectingLayout' -> startImplicitLayout col
      | linePos == Start -> continueImplicitLayout col
      | otherwise -> return ()
  yieldToken t
In certain situations, whether or not we’re expecting a layout matters, so we first retrieve that from the state.
Regardless of how we process that token, we will add it to the output
buffer afterwards, hence the yieldToken t
at the end of the case
.
We want to look at the token to decide what extra handling needs to be done though.
If we see }
, that means that the user is intending to close
a layout that they created explicitly using a {
, and we handle that case
in the closeExplicitLayout
operation.
If we see a {
, and we’re at a point in the stream where a layout can be started,
because we’re after let
, where
or of
, then we start an explicit layout, using
startExplicitLayout
.
Otherwise, the token itself doesn’t matter, but other properties and conditions do.
If that token starts a layout, then we need to set this flag to true in our state.
Otherwise, if we’re expecting a layout, then we should start an implicit layout at the column where this token appears.
So if we have:
let
  x =
After seeing let
, we’re now expecting a layout. Then when we see x
, we decide
to start an implicit layout at column 2.
If we’re not expecting a layout, and we see the first token on a given line, then that token continues (or possibly closes) a layout that already exists, based on the column col where it appears. This actually handles a few different cases, as we’ll see.
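Before moving on, here’s a rough trace of step over the little let example above, to tie the guards together (the exact Positioned values are assumptions for illustration, and we ignore the layout for the top of the file):

-- step (Positioned Let Start 0)
--   startsLayout Let is True, so we just set expectingLayout to True,
--   and then yield the Let token.
--
-- step (Positioned (LowerName "x") Start 2)
--   We are now expecting a layout, so we call startImplicitLayout 2, which
--   (since column 2 is further indented than anything on the stack) yields
--   OpenBrace and pushes (Implicit 2); then we yield the token for x.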
Closing Explicit Layouts
For closeExplicitLayout
, the idea is that after seeing }
, this
must close an explicit layout started with {
, and this layout must be
at the top of the stack. The reason for this is that we don’t allow } to close implicit layouts on its own.
For example:
{
let
  x = 2 }
This would be an error. This is because, after the let, we start an implicit layout at the x, and } isn’t allowed to close that implicit layout.
Concretely, we have:
closeExplicitLayout :: LayoutM ()
closeExplicitLayout =
  currentLayout >>= \case
    Just Explicit -> popLayout
    _ -> throwError (Unexpected '}')
If the current layout is explicit, that’s what we expect, and we remove
it from the stack, since the }
has closed it. Otherwise, we throw an error,
for the reasons mentioned above.
Starting Explicit Layouts
This is sort of the reverse operation. We need to push an explicit
layout for the opening {
we’ve just seen onto the stack. We also
need to indicate that we’re no longer expecting a layout, since the {
starts the layout we were expecting:
startExplicitLayout :: LayoutM ()
startExplicitLayout = do
  modify' (\s -> s {expectingLayout = False})
  pushLayout Explicit
Starting Implicit Layouts
For startImplicitLayout
, let’s have a look at the code, and explain what’s going
on with some examples afterwards. This function will be called when we’ve seen our
first token (that isn’t a {
or a }
) after seeing a layout starting token.
startImplicitLayout :: Int -> LayoutM ()
startImplicitLayout col = do
  modify' (\s -> s {expectingLayout = False})
  compareIndentation col >>= \case
    GT -> do
      yieldToken OpenBrace
      pushLayout (Implicit col)
    _ -> do
      yieldToken OpenBrace
      yieldToken CloseBrace
      continueImplicitLayout col
Now, regardless of which branch we take, we’re no longer expecting a layout once this is over: we’ll have handled it one way or another.
What happens next depends on how the column we saw this token at compares with the current indentation.
If we’re strictly further indented than the current layout, then we start a new implicit layout, yielding the { token that goes with it.
Otherwise, we emit a complete but empty layout, a { immediately followed by a matching }, and then fall back to continueImplicitLayout.
This catches the case where we have something like:
let
  x = 2 where
  y = 3
in x + y
After the where
, we are indeed expecting a layout, and y
at column 2 is the token we expect to
handle that fact. But, since y = 3
is at the same level of indentation as x = 2
, it
belongs to the layout started after let
! Because of this, this where
block is actually empty,
and the tokens we produce are:
let {
  x = 2 where {};
  y = 3
} in x + y
Now, if the y = 3
had been further indented:
let
  x = 2 where
    y = 3
in x
Then this would need to produce:
let
  x = 2 where { y = 3 }
in x
Continuing Implicit Layout
So continueImplicitLayout
handles the case where we have some token at the start of a line,
or after a layout token, that needs to continue the current implicit layout in some way. Another
case we’ll handle here is closing that implicit layout automatically.
Layouts are closed when we see a token that’s less indented. For example:
x = y
  where
    y = 2
z = 3
The where
layout closes when we see z = 3
indented less than y = 2
, giving us:
{
x = y where { y = 2 };
z = 3
}
Concretely, we have:
continueImplicitLayout :: Int -> LayoutM ()
continueImplicitLayout col = do
  closeFurtherLayouts
  compareIndentation col >>= \case
    EQ -> yieldToken Semicolon
    _ -> return ()
  where
    closeFurtherLayouts =
      compareIndentation col >>= \case
        LT -> do
          yieldToken CloseBrace
          popLayout
          closeFurtherLayouts
        _ -> return ()
The first thing we do is close every implicit layout that we’re strictly less indented than, for the reasons we just went over. To close a layout, we emit a }, and then pop it off of the stack.
Then we check whether or not there’s a layout matching our indentation exactly, in which case we need to emit a semicolon.
This is so that we have the semicolons in something like:
let
  x = 2
  y = 3
  z = 4
in x + y + z
With this way of doing things, we generate the necessary semicolon right when we see a token continuing the layout, so we end up with something like:
let {
x = 2
;y = 3
;z = 4
} in x + y + z
The rest of the logic we had earlier makes it so that by the time we
call continueImplicitLayout
, we know that there’s some other token
before us, and we need a semicolon as a separator. That token will have used startImplicitLayout itself, producing the layout we find ourselves in.
Glue Code
We can now glue all of this together, producing a lexer function that runs the raw lexer, the positioning pass, and the layout algorithm in sequence:
lexer :: String -> Either LexerError [Token]
lexer input =
  runLexer rawLexer input >>= (fst >>> position >>> layout)
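If you want to poke at the lexer from GHCi, a tiny helper like this can be handy. This is purely hypothetical, not part of the post’s code, and it assumes that Token and LexerError have Show instances (which they need anyway for the lex command’s output):

-- A throwaway helper for experimenting in GHCi
tryLexer :: String -> IO ()
tryLexer input = case lexer input of
  Left err -> print err
  Right tokens -> mapM_ print tokens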
Some Examples
And that’s it! We’re done! We’ve made a complete lexer for our subset of Haskell, handling whitespace in all of its glory. For the sake of completeness, let’s see how it works in some examples.
As a reminder, we can run our lexer using:
cabal run haskell-in-haskell -- lex file.hs
First, if we just have some basic definitions:
x = 1
y = 2
z = 3
We see that the whole thing is surrounded in braces, and semicolons are inserted at the right spots:
[ OpenBrace
, LowerName "x"
, Equal
, IntLit 1
, Semicolon
, LowerName "y"
, Equal
, IntLit 2
, Semicolon
, LowerName "z"
, Equal
, IntLit 3
, CloseBrace
]
If we have a simple nested let:
z =
  let
    x = let
      y = 3
    in y
  in x
We end up with the following tokens:
[ OpenBrace
, LowerName "z"
, Equal
, Let
, OpenBrace
, LowerName "x"
, Equal
, Let
, OpenBrace
, LowerName "y"
, Equal
, IntLit 3
, CloseBrace
, Semicolon
, In
, LowerName "y"
, CloseBrace
, In
, LowerName "x"
, CloseBrace
]
If you’ve followed along so far, congratulations, I hope that wasn’t too much typing! You can have quite a bit of fun trying out different programs, and seeing what output your lexer produces!
Conclusion
So, this post was longer than I expected it to be! Hopefully I did a decent job of introducing lexing, as well as the lexer for our subset of Haskell. I hope that the layout rules are much clearer now that you’ve actually had to implement all of their details. I’ve tried to organize the rules in a way that’s reasonably understandable.
As always, here’s the full code for this part.
There’s a section in the Haskell ’98 report that talks about Haskell syntax, and the layout rules in particular. This might be an interesting read, and you can contrast the way the rules are defined there with the imperative algorithm we’ve set up here.
Crafting Interpreters is another fun and very approachable read if you want to know more about lexing.
In the next post, we’ll be going over parsing! By the end of that post, we’ll have gone from a simple stream of tokens to a full syntax tree for our subset of Haskell!