Protein: Lexer (Part 1)

As promised in a LinkedIn Poll, we are going to develop a Parser for proto files which will create an AST that is serializable with Protobuf itself. Obviously, this is going to be a series of articles because we need to write quite a lot of parsing code. However, I believe that this is worth doing since we are going to see another use of Protobuf outside of gRPC.

In this article, we are going to present the project and start writing a Lexer. This is not the most amusing part. However, it will lay the foundations for Part 2, in which we will start more complex tokenizing of the input.

The Project

The project is called protein. In itself, it is intended to be a project showing how to parse proto files in order to build tools for Protobuf. But the end goal is to provide an easy-to-use and efficient linter.

The project is mostly divided into three components:

  • The Lexer: it tokenizes the input and provides metadata about these tokens.
  • The Parser: it verifies the validity of our tokens and builds an AST.
  • The Linter: it analyzes the AST and provides feedback to users.

Note: There might be more components later but these are the major ones.

One thing worth mentioning is that the AST will be built with objects generated by Protobuf. This means that we will have an AST that we can serialize and maybe even cache for multiple runs of our Linter.

Finally, as mentioned, this project shows how to build tools for Protobuf. Thus I hope the readers will be inspired to create their own tools and/or contribute to this project. I'm going to document all the code and explain all the logic so that you can follow along and even participate in the development.

The Lexer

The protein Lexer has been inspired by one of the great pieces of code out there: lex.go, the lexer of the text/template package in the Go standard library. I was introduced to this code by an amazing talk given by Rob Pike, "Lexical Scanning in Go". So I want to thank all the contributors and Rob for the amazing talk. Thank you guys!

The essence of the Lexer is a Finite-state Machine (FSM). This basically means that we are going to go from state to state by applying functions to do the transition. For example, let's say that we have the following line:

syntax = "proto3";

We will go through the following states:

initial state -> Identifier("syntax") -> Equal -> Str("proto3") -> EOF

And each state will know how to parse its own value (e.g. "proto3" for the Str state).

Finally, the Lexer will be used by calling the NextToken function. This function, as its name suggests, returns the next token in the input. Under the hood, this amounts to calling the parsing function of each state in turn.
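
To make that more concrete, here is a sketch of how client code (eventually our Parser) could drive the Lexer. It assumes the New constructor we will define later on, and since lexing identifiers and strings only lands in Part 2, the exact tokens produced here are an assumption; this shows the intended usage rather than the current behavior.

  • package main
    
    import (
      "fmt"
    
      "github.com/Clement-Jean/protein/lexer"
    )
    
    func main() {
      l := lexer.New(`syntax = "proto3";`)
    
      // Pull tokens one by one until the Lexer reports the end of the input.
      for {
        tok := l.NextToken()
        if tok.Type == lexer.EOF {
          break
        }
        fmt.Printf("%-10s %q\n", tok.Type, tok.Literal)
      }
    }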

Let's Start Simple

Let's fulfill the first requirement. The user code, in our case the Parser, will interact with the Lexer by repeatedly calling the NextToken function. So let's define an interface, because that will be handy later in this series for testing the Parser against a fake Lexer.

  • package lexer
    
    // Lexer is protein's tokenizer
    type Lexer interface {
      // NextToken returns the following token in the input source.
      NextToken() Token
    }
    

For now, "input source" is still a little vague but we are going to deal with that later in the implementation.

Next, we need the definition of a Token. In our application, a token is a small collection of information about a piece of text in the input:

  • A type: a simple way to identify tokens (e.g., Identifier, Str, ...)
  • A literal: the value of a token as a string
  • A position: the position of this text in the input

Let's define all that, starting with the position. We want to know the offset relative to the beginning of the file, as well as the line and column at which the token shows up. This information is there to make error messages clearer for the user (at least we'll try our best!). Instead of having something like:

unterminated string

we want to have:

file.proto 1:10 unterminated string: "this is unterminated

Here, in this example, 1 is the line and 10 is the column.

So a position is pretty simple.

  • package lexer
    
    // Position is a position in the input
    type Position struct {
      // Offset is the position relative to the beginning of the file (starts at 0)
      Offset int
    
      // Line is the file line (starts at 1)
      Line int
    
      // Column is the offset relative to the beginning of the line (starts at 0)
      Column int
    }
    

And with that we can now define our Token.

  • package lexer
    
    // TokenType identifies the kind of a token
    type TokenType int
    
    // Token is a piece of the input
    type Token struct {
      Type    TokenType
      Literal string
      Position
    }
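
To see how these pieces fit together, here is a small hypothetical example (the file name and values are made up, and Type is left out since we only define the token types in the next section) that formats a Token into the kind of error message shown earlier. Thanks to the embedded Position, tok.Line and tok.Column are accessible directly on the Token.

  • package main
    
    import (
      "fmt"
    
      "github.com/Clement-Jean/protein/lexer"
    )
    
    func main() {
      // Hypothetical values for the "unterminated string" example above.
      tok := lexer.Token{
        Literal:  `"this is unterminated`,
        Position: lexer.Position{Offset: 9, Line: 1, Column: 9},
      }
    
      // Columns are 0-indexed internally; we display them 1-indexed here (an assumption).
      fmt.Printf("file.proto %d:%d unterminated string: %s\n",
        tok.Line, tok.Column+1, tok.Literal)
      // Output: file.proto 1:10 unterminated string: "this is unterminated
    }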
    

Token Types

Right now, we simply have the TokenType type but no values for it. We are going to define an enum-like list of constants for all the possible values that the Token.Type property can have.

  • //...
    // These are all the token types
    const (
      EOF          TokenType = iota - 1 // End Of File
      TokenIllegal                      // Illegal token
      TokenError                        // Error
      TokenSpace                        // Space (whitespace, '\n', '\r', '\t')
      TokenComment                      // Comment (single line or multiline)
    
      TokenIdentifier // Identifier
      TokenInt        // Integer
      TokenFloat      // Float
      TokenStr        // String ('...' or "...")
    
      TokenUnderscore  // _
      TokenEqual       // =
      TokenComma       // ,
      TokenSemicolon   // ;
      TokenDot         // .
      TokenLeftBrace   // {
      TokenRightBrace  // }
      TokenLeftSquare  // [
      TokenRightSquare // ]
      TokenLeftParen   // (
      TokenRightParen  // )
      TokenLeftAngle   // <
      TokenRightAngle  // >
    )
    

There are a few things to notice here. The first one is the use of iota. If you are not familiar with it, iota is a predeclared identifier that lets us define multiple constants with consecutive values. Here, we use iota - 1 because, by default, iota starts counting at zero, but we want to start at -1. The second thing to notice is that some of these tokens have a constant literal (e.g., TokenEqual is always '=') while others don't (e.g., a TokenIdentifier's literal is only known at runtime).
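
If iota is new to you, here is a tiny standalone sketch (unrelated to protein) showing what iota - 1 does:

  • package main
    
    import "fmt"
    
    const (
      A = iota - 1 // iota is 0 on this line, so A == -1
      B            // the expression repeats with iota == 1, so B == 0
      C            // iota == 2, so C == 1
    )
    
    func main() {
      fmt.Println(A, B, C) // prints: -1 0 1
    }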

For debugging and testing, we will need a way to display a TokenType as a string. In Go, we can do that simply by defining a method called String() on the TokenType type. To provide the values, we use an array of strings indexed by the type (remember that TokenType is just an int).

  • //...
    var tokenTypeStr = [...]string{
      "EOF",
      "Illegal",
      "Error",
      "Space",
      "Comment",
      "Identifier",
      "Integer",
      "Float",
      "String",
      "_",
      "=",
      ",",
      ";",
      ".",
      "{", "}",
      "[", "]",
      "(", ")",
      "<", ">",
    }
    
    func (t TokenType) String() string {
      return tokenTypeStr[t+1] // +1 because we start at iota - 1
    }
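
As a quick illustration (assuming fmt is imported), printing a TokenType now goes through this method:

  • fmt.Println(TokenEqual) // prints: =
    fmt.Println(EOF)        // prints: EOF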
    

Implementing a Basic Lexer

With all of the previous boilerplate, we can now start our implementation. Don't get too excited though, we are simply going to define an "empty shell" Lexer in order to be able to write a failing test (see Red/Green testing).

As we know, we need to implement the NextToken function.

  • package lexer
    
    type Impl struct{}
    
    func (l *Impl) NextToken() Token {
      return Token{
        Type: TokenIllegal,
      }
    }
    

Furthermore, we need a simple way to instantiate our Lexer. To do so, we are going to create a function called New whose return type is Lexer and which returns an instance of Impl.

  • //...
    
    // New creates a new instance of the Lexer
    func New() Lexer {
      return &Impl{}
    }
    

Writing a Failing Test

Now let's start with a simple test. We are going to test all the symbols (Underscore, Equal, ...). At the end of this article, we aim to make that test pass.

We are going to use a technique called table-driven tests. To do so, we are going to define a type that represents an expected Token.

  • type Check struct {
      expectedType     TokenType
      expectedLiteral  string
      expectedPosition Position
    }
    

I'm calling it Check because it will be used for a single check in a list of checks. We are going to iterate over that list and assert on the three properties.

  • func runChecks(t *testing.T, l Lexer, tests []Check) {
      for i, tt := range tests {
        tok := l.NextToken()
    
        if tok.Type != tt.expectedType {
          t.Fatalf("tests[%d] - tokentype wrong. expected=%q, got=%q", i, tt.expectedType, tok.Type)
        }
    
        if tok.Literal != tt.expectedLiteral {
          t.Fatalf("tests[%d] - literal wrong. expected='%s', got='%s'", i, tt.expectedLiteral, tok.Literal)
        }
    
        // asserts on Position for later
      }
    }
    

With that we can now write a compiling and failing test.

  • //...
    
    import (
      "testing"
    )
    
    //...
    
    func TestNextTokenOnSymbols(t *testing.T) {
      runChecks(t, New(), []Check{
        {TokenUnderscore, "_", Position{}},
        {TokenEqual, "=", Position{}},
        {TokenComma, ",", Position{}},
        {TokenSemicolon, ";", Position{}},
        {TokenDot, ".", Position{}},
        {TokenLeftBrace, "{", Position{}},
        {TokenRightBrace, "}", Position{}},
        {TokenLeftSquare, "[", Position{}},
        {TokenRightSquare, "]", Position{}},
        {TokenLeftParen, "(", Position{}},
        {TokenRightParen, ")", Position{}},
        {TokenLeftAngle, "<", Position{}},
        {TokenRightAngle, ">", Position{}},
        {EOF, "", Position{}},
      })
    }
    

and if we test:

$ go test ./...
--- FAIL: TestNextTokenOnSymbols (0.00s)
    lexer_test.go:19: tests[0] - tokentype wrong. expected="_", got="Illegal"
FAIL

That was expected.

Improving Impl

In case you didn't notice, our current implementation of the Lexer is useless. It's not storing any state and it doesn't even process text. Let's change that.

Let's first look at the state our Lexer will store.

  • // Impl is the implementation for the Lexer interface.
    type Impl struct {
      src             string // the input text
      start           int    // the start of the current token
      startLine       int    // the line at which the current token starts
      startLineOffset int    // the offset of the starting line relative to the beginning of the file
      line            int    // the current line being processed
      pos             int    // the reading position in the input
      atEOF           bool   // tells whether the Lexer reached the end of the input
      token           Token  // the token to return
    }
    

That seems like a lot but don't worry, this is actually pretty intuitive once we start interacting with these properties.

Now, we can improve our New function a little bit.

  • //...
    
    // New creates a new instance of the Lexer
    func New(input string) Lexer {
      return &Impl{
        src:       input,
        line:      1, // lines are 1-indexed
        startLine: 1,
      }
    }
    

We are finally storing text and some indices to keep track of where we are in it.
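
One detail to keep in mind: since New now takes the input, the call in TestNextTokenOnSymbols has to pass the text we want to tokenize. Assuming an input made of exactly the symbols from our checks, the call becomes:

  • runChecks(t, New("_=,;.{}[]()<>"), []Check{
      // ... same checks as before
    })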

We can now start to think about emitting tokens in the NextToken function. But before actually emitting anything, we need to think about the FSM.

As mentioned, we are going to create states and transitions between these states. These states will be represented as functions and the transitions will be made by returning another state after the processing needed for the current state.

We define a state as follows:

  • //...
    
    type stateFn func(*Impl) stateFn
    

You can notice that this is a function taking an *Impl as a parameter and returning a stateFn. Think about it as a list of states linked together.

With that, we are now able to define the NextToken logic.

  • //...
    
    // NextToken provides the following token in the input
    func (l *Impl) NextToken() Token {
      state := lexProto
      for {
        state = state(l)
        if state == nil {
          return l.token
        }
      }
    }
    

Here, lexProto is the initial state, and we iterate until state becomes nil.

lexProto

lexProto is the initial state of our FSM. It is a stateFn, which means that it interacts with the Impl and returns another stateFn.

For now, we can make this function really simple: it reads the next character and checks whether this character is a symbol. If it is, we emit a Token with the relevant TokenType; otherwise, we emit an Illegal Token.

  • func lexProto(l *Impl) stateFn {
      switch r := l.next(); {
      case r == '_':
        return l.emit(TokenUnderscore)
      case r == '=':
        return l.emit(TokenEqual)
      case r == ',':
        return l.emit(TokenComma)
      case r == ';':
        return l.emit(TokenSemicolon)
      case r == '.':
        return l.emit(TokenDot)
      case r == '{':
        return l.emit(TokenLeftBrace)
      case r == '}':
        return l.emit(TokenRightBrace)
      case r == '[':
        return l.emit(TokenLeftSquare)
      case r == ']':
        return l.emit(TokenRightSquare)
      case r == '(':
        return l.emit(TokenLeftParen)
      case r == ')':
        return l.emit(TokenRightParen)
      case r == '<':
        return l.emit(TokenLeftAngle)
      case r == '>':
        return l.emit(TokenRightAngle)
      }
    
      return l.emit(TokenIllegal)
    }
    

next()

next is also pretty simple. It first checks whether the reading position is still valid. If it isn't, it flags the end of file; otherwise, it decodes the next rune (UTF-8 character) with the unicode/utf8 package, updates pos, updates line if needed, and returns the rune it read.

  • func (l *Impl) next() rune {
      if int(l.pos) >= len(l.src) {
        l.atEOF = true
        return rune(EOF)
      }
    
      r, w := utf8.DecodeRuneInString(l.src[l.pos:])
      l.pos += w
    
      if r == '\n' {
        l.line++
      }
      return r
    }
    

emit(TokenType)

emit is a function that looks at our current location in the input text, creates a token, and returns nil to stop the loop in NextToken.

  • func (l *Impl) emit(tt TokenType) stateFn {
      t := Token{
        Type:    tt,
        Literal: l.src[l.start:l.pos],
        Position: Position{
          Offset: l.start,
          Line:   l.startLine,
          Column: l.start - l.startLineOffset,
        }}
      l.start = l.pos
      l.startLine = l.line
      if tt == TokenSpace && strings.Contains(t.Literal, "\n") {
        l.startLineOffset = l.start
      }
      l.token = t
      return nil
    }
    

Note: this is a lot of small operations. Take the time to go through them with the comments written in the Impl struct in mind; it is not that hard.

Handling EOF

If we run our test now, we will get the following:

$ go test ./...
--- FAIL: TestNextTokenOnSymbols (0.00s)
    lexer_test.go:20: tests[13] - tokentype wrong. expected="EOF", got="Illegal"
FAIL

This is basically saying: "we received an Illegal Token but we expected EOF".

This is pretty trivial to solve. Remember the atEOF property in Impl? Well, we just add a check for it to our switch statement.

  • //...
    
    func lexProto(l *Impl) stateFn {
      switch r := l.next(); {
      case l.atEOF:
        return l.emit(EOF)
      //...
      }
    }
    

and now:

$ go test ./...
ok      github.com/Clement-Jean/protein/lexer  0.809s

Boom!

Conclusion

We now have a Lexer that is able to go through a text and identify the symbols that we defined as TokenTypes. Admittedly, this is not exceptional, but we are now ready for the next part, where we are going to skip whitespace and comments, and lex Identifiers, Numbers (Int and Float), and Strings.

If you like this kind of content, let me know in the comment section, and feel free to ask for help on similar projects, suggest the subject of the next post, or simply send me your feedback.

Written on February 17, 2023