Protein: Lexer (Part 1)
As promised in a LinkedIn poll, we are going to develop a Parser for proto files that builds an AST serializable in Protobuf itself. Obviously, this is going to be a series of articles because we need to write quite a lot of parsing code. However, I believe it is worth doing since we will see another use of Protobuf outside of gRPC.
In this article, we are going to present the project and start writing a Lexer. This is not the most amusing part, but it lays the foundation for Part 2, in which we will start the more complex tokenizing of the input.
The Project
The project is called protein. In itself, it is intended to show how to parse proto files in order to build tooling for Protobuf. The end goal, though, is to provide an easy-to-use and efficient linter.
The project is mostly divided into three components:
- The Lexer: it tokenizes the input and provides metadata about these tokens.
- The Parser: it verifies the validity of our tokens and builds an AST.
- The Linter: it analyzes the AST and provides feedback to users.
Note: There might be more components later, but these are the major ones.
One thing worth mentioning is that the AST will be built with objects generated by Protobuf. This means that we will have an AST that we can serialize and maybe even cache for multiple runs of our Linter.
Finally, as mentioned, this project shows how to build tools for Protobuf. I therefore hope it will inspire readers to create their own tools and/or contribute to this project. I'm going to document all the code and explain all the logic so that you can follow along and even participate in the development.
The Lexer
The protein Lexer is inspired by one of the great pieces of code out there: lex.go from the text/template/parse package in the Go standard library. I was introduced to this code by an amazing talk given by Rob Pike (Lexical Scanning in Go), so I want to thank all the contributors and Rob for that talk. Thank you guys!
The essence of the Lexer is a finite-state machine (FSM). This basically means that we go from state to state by applying functions that perform the transitions. For example, let's say we have the following line:
syntax = "proto3";
We will go through the following states:
initial state -> Identifier("syntax") -> Equal -> Str("proto3") -> Semicolon -> EOF
And each state will know how to parse its own value (e.g. "proto3" for the Str state).
Finally, the Lexer will be used by calling the NextToken function. This function, as its name suggests, returns the next token in the input. It basically does so by calling the parsing functions of the states.
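To make this concrete, here is a rough sketch of how client code could drive the Lexer once everything from this article (and Part 2) is in place. This is my own preview, not code from the project; New, Token, and the token types are only defined further down.

package main

import (
    "fmt"

    "github.com/Clement-Jean/protein/lexer"
)

func main() {
    l := lexer.New(`syntax = "proto3";`)

    // Pull tokens one by one until we hit the end of the input.
    for tok := l.NextToken(); tok.Type != lexer.EOF; tok = l.NextToken() {
        fmt.Printf("%s %q\n", tok.Type, tok.Literal)
    }
}

By the end of this article the Lexer only recognizes symbols (everything else comes out as Illegal), but after Part 2 it should roughly follow the state sequence shown above.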
Let's Start Simple
Let's fulfill the first requirement. The user code, in our case the Parser, will interact with the Lexer by repeatedly calling the NextToken function. So let's define an interface; that will come in handy later in this series for testing the Parser against a fake Lexer.
package lexer

// Lexer is protein's tokenizer
type Lexer interface {
    // NextToken returns the following token in the input source.
    NextToken() Token
}
For now, "input source" is still a little vague but we are going to deal with that later in the implementation.
Next, we need a definition for Token. In our application, a token is a collection of a few pieces of information about some text in the input:
- A type: a simple way to identify tokens (e.g., Identifier, Str, ...)
- A literal: the value of a token as a string
- A position: the position of this text in the input
Let's define all that, starting with the position. We want to know the offset relative to the beginning of the file, as well as the line and column at which the token shows up. This information is there to make error messages clearer to the user (at least we'll try our best!). Instead of having something like:
unterminated string
we want to have:
file.proto 1:10 unterminated string: "this is unterminated
where, in this example, 1 is the line and 10 is the column.
So a position is pretty simple.
package lexer

// Position is a position in the input
type Position struct {
    // Offset is the position relative to the beginning of the file (starts at 0)
    Offset int

    // Line is the file line (starts at 1)
    Line int

    // Column is the offset relative to the beginning of the line (starts at 0)
    Column int
}
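Just to tie this back to the error message shown above, here is a tiny hypothetical helper (not part of protein's code) showing how a Position could be turned into that "file.proto 1:10 ..." prefix:

//... (assumes "fmt" is imported)

// formatError is a hypothetical helper producing messages like
// "file.proto 1:10 unterminated string: ...".
func formatError(filename string, p Position, msg string) string {
    return fmt.Sprintf("%s %d:%d %s", filename, p.Line, p.Column, msg)
}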
And with that we can now define our Token.
package lexer

// TokenType represents the kind of a token
type TokenType int

// Token is a piece of the input
type Token struct {
    Type    TokenType
    Literal string
    Position
}
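To make the embedding concrete, here is a throwaway example of a Token (mine, not the article's) for the `=` sign in the `syntax = "proto3";` line from earlier; TokenEqual is defined in the next section. Because Position is embedded, its fields are promoted onto Token.

// The '=' in `syntax = "proto3";` sits at byte offset 7 on line 1.
tok := Token{
    Type:     TokenEqual,
    Literal:  "=",
    Position: Position{Offset: 7, Line: 1, Column: 7},
}

fmt.Println(tok.Line, tok.Column) // 1 7 (fields promoted from the embedded Position)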
Token Types
Right now, we simply have the TokenType type but do not define any values for it. We are going to define an enum for all the possible values that the Token.Type property can have.
//...

// These are all the token types
const (
    EOF TokenType = iota - 1 // End Of File
    TokenIllegal             // Illegal token
    TokenError               // Error
    TokenSpace               // Space (whitespace, '\n', '\r', '\t')
    TokenComment             // Comment (single line or multiline)
    TokenIdentifier          // Identifier
    TokenInt                 // Integer
    TokenFloat               // Float
    TokenStr                 // String ('...' or "...")
    TokenUnderscore          // _
    TokenEqual               // =
    TokenColon               // ,
    TokenSemicolon           // ;
    TokenDot                 // .
    TokenLeftBrace           // {
    TokenRightBrace          // }
    TokenLeftSquare          // [
    TokenRightSquare         // ]
    TokenLeftParen           // (
    TokenRightParen          // )
    TokenLeftAngle           // <
    TokenRightAngle          // >
)
There are a few things to notice here. The first one is the use of iota. If you are not familiar with it, iota is a predeclared identifier that lets us define multiple constants with consecutive values. Here, we use iota - 1 because, by default, iota starts counting at zero, and we want EOF to be -1 (you will see later that next() returns rune(EOF) when there is no more input to read). The second thing to notice is that some of these token types have a constant literal (e.g., TokenEqual is always '=') while others don't (e.g., an Identifier's value will only be known at runtime).
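If iota is new to you, here is a minimal, standalone illustration of the iota - 1 trick (unrelated to protein's code):

const (
    A = iota - 1 // iota is 0 on the first line, so A == -1
    B            // the expression repeats with iota == 1, so B == 0
    C            // iota == 2, so C == 1
)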
For debugging and testing, we will need a way to display a TokenType as a string. In Go, we can simply do that by defining a String() method on the TokenType type. And to define the values, we are going to have an array mapping each type to its textual representation (remember, TokenType is just an int).
//...

var tokenTypeStr = [...]string{
    "EOF", "Illegal", "Error", "Space", "Comment",
    "Identifier", "Integer", "Float", "String",
    "_", "=", ",", ";", ".",
    "{", "}", "[", "]", "(", ")", "<", ">",
}

func (t TokenType) String() string {
    return tokenTypeStr[t+1] // +1 because we start at iota - 1
}
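As a quick sanity check (assuming fmt is imported somewhere, e.g. in a test), printing a TokenType now yields a readable name because fmt picks up the String() method:

fmt.Println(TokenEqual)      // =
fmt.Println(TokenIdentifier) // Identifier
fmt.Println(EOF)             // EOF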
Implementing a Basic Lexer
With all of the previous boilerplate in place, we can now start our implementation. Don't get too excited though: we are simply going to define an "empty shell" Lexer so that we can write a failing test (see Red/Green testing).
As we know, we need to implement the NextToken function.
package lexer

type Impl struct{}

func (l *Impl) NextToken() Token {
    return Token{
        Type: TokenIllegal,
    }
}
Furthermore, we need a simple way to instantiate our Lexer. To do so, we are going to create a function called New that returns a Lexer, which in practice will be an instance of Impl.
//...

// New creates a new instance of the Lexer
func New() Lexer {
    return &Impl{}
}
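There is not much to see yet, but we can already exercise the API end to end; with this empty shell, every call simply returns an Illegal token:

l := New()
tok := l.NextToken()
fmt.Println(tok.Type) // Illegal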
Writing a Failing Test
Now let's start with a simple test. We are going to test all the symbols (Underscore, Equal, ...). At the end of this article, we aim to make that test pass.
We are going to use a technique called table-driven tests. To do so, we are going to define a type that represents an expected Token.
type Check struct {
    expectedType     TokenType
    expectedLiteral  string
    expectedPosition Position
}
I'm calling it Check because each instance will be used as a single check in a list of checks. We are going to iterate over that list and assert on the three properties.
func runChecks(t *testing.T, l Lexer, tests []Check) {
    for i, tt := range tests {
        tok := l.NextToken()

        if tok.Type != tt.expectedType {
            t.Fatalf("tests[%d] - tokentype wrong. expected=%q, got=%q", i, tt.expectedType, tok.Type)
        }

        if tok.Literal != tt.expectedLiteral {
            t.Fatalf("tests[%d] - literal wrong. expected='%s', got='%s'", i, tt.expectedLiteral, tok.Literal)
        }

        // asserts on Position for later
    }
}
With that we can now write a compiling and failing test.
//...

import (
    "testing"
)

//...

func TestNextTokenOnSymbols(t *testing.T) {
    runChecks(t, New(), []Check{
        {TokenUnderscore, "_", Position{}},
        {TokenEqual, "=", Position{}},
        {TokenColon, ",", Position{}},
        {TokenSemicolon, ";", Position{}},
        {TokenDot, ".", Position{}},
        {TokenLeftBrace, "{", Position{}},
        {TokenRightBrace, "}", Position{}},
        {TokenLeftSquare, "[", Position{}},
        {TokenRightSquare, "]", Position{}},
        {TokenLeftParen, "(", Position{}},
        {TokenRightParen, ")", Position{}},
        {TokenLeftAngle, "<", Position{}},
        {TokenRightAngle, ">", Position{}},
        {EOF, "", Position{}},
    })
}
and if we test:
$ go test ./...
--- FAIL: TestNextTokenOnSymbols (0.00s)
lexer_test.go:19: tests[0] - tokentype wrong. expected="_", got="Illegal"
FAIL
That was expected.
Improving Impl
In case you didn't notice, our implementation of the Lexer is useless. It is not storing any state and it does not even process text. Let's change that.
Let's first look at the state our Lexer will store.
// Impl is the implementation for the Lexer interface.
type Impl struct {
    src             string // the input text
    start           int    // the start of the current token
    startLine       int    // the line at which the current token starts
    startLineOffset int    // the offset of the starting line relative to the beginning of the file
    line            int    // the current file line being processed
    pos             int    // the reading position in the file
    atEOF           bool   // tells whether the Lexer is finished
    token           Token  // the token to return
}
That seems like a lot but don't worry, this is actually pretty intuitive once we start interacting with these properties.
Now, we can improve our New function a little bit.
//...

// New creates a new instance of the Lexer
func New(input string) Lexer {
    return &Impl{
        src:       input,
        line:      1, // lines are 1-indexed
        startLine: 1,
    }
}
We are finally storing the text and some indices to keep track of where we are in it. Note that, since New now takes an input, the earlier test needs to pass the string of symbols it expects (e.g. New("_=,;.{}[]()<>")), otherwise there is nothing to tokenize.
We can now start to think about emitting tokens in the NextToken function. But before actually emitting tokens, we need to think about the FSM.
As mentioned, we are going to create states and transitions between these states. The states will be represented as functions, and a transition is made by returning another state once the processing needed for the current state is done.
We define a state as follows:
//...

type stateFn func(*Impl) stateFn
You can notice that this is a function taking an Impl as a parameter and returning a stateFn. Think about it as a list of states linked together.
With that, we are now able to define the NextToken logic.
//...

// NextToken provides the following token in the input
func (l *Impl) NextToken() Token {
    state := lexProto

    for {
        state = state(l)

        if state == nil {
            return l.token
        }
    }
}
Here lexProto is the initial state, and we iterate until state becomes nil.
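To make the contract explicit, here is a tiny, purely illustrative state (not part of protein's code): returning another stateFn keeps the loop in NextToken going, while returning nil, as emit will do below, stops it and hands back the stored token.

// Illustrative only: a state that does no work and transitions straight
// back to lexProto, so the NextToken loop keeps running.
func lexNoop(l *Impl) stateFn {
    return lexProto
}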
lexProto
lexProto is the initial state of our FSM. It is a stateFn, which means it interacts with Impl and returns another stateFn.
For now, we can make this function really simple: it reads the next character and checks whether it is a symbol. If it is, we return a Token with the relevant TokenType; otherwise we return an Illegal Token.
func lexProto(l *Impl) stateFn {
    switch r := l.next(); {
    case r == '_':
        return l.emit(TokenUnderscore)
    case r == '=':
        return l.emit(TokenEqual)
    case r == ',':
        return l.emit(TokenColon)
    case r == ';':
        return l.emit(TokenSemicolon)
    case r == '.':
        return l.emit(TokenDot)
    case r == '{':
        return l.emit(TokenLeftBrace)
    case r == '}':
        return l.emit(TokenRightBrace)
    case r == '[':
        return l.emit(TokenLeftSquare)
    case r == ']':
        return l.emit(TokenRightSquare)
    case r == '(':
        return l.emit(TokenLeftParen)
    case r == ')':
        return l.emit(TokenRightParen)
    case r == '<':
        return l.emit(TokenLeftAngle)
    case r == '>':
        return l.emit(TokenRightAngle)
    }

    return l.emit(TokenIllegal)
}
next()
next is also pretty simple. It checks whether the reading position is still valid. If it isn't, it returns EOF; otherwise it decodes the next rune (UTF-8 character), updates pos, updates line if needed, and returns the rune it read.
//... (this needs "unicode/utf8" in the imports)

func (l *Impl) next() rune {
    if int(l.pos) >= len(l.src) {
        l.atEOF = true
        return rune(EOF)
    }

    r, w := utf8.DecodeRuneInString(l.src[l.pos:])
    l.pos += w

    if r == '\n' {
        l.line++
    }

    return r
}
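A quick aside on why we decode runes instead of indexing bytes directly: utf8.DecodeRuneInString returns both the rune and its width in bytes, so pos advances correctly even for multi-byte characters (which can show up in string literals and comments):

r, w := utf8.DecodeRuneInString("héllo") // r == 'h', w == 1
r, w = utf8.DecodeRuneInString("é")      // r == 'é', w == 2 ('é' is two bytes in UTF-8)
_, _ = r, w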
emit(TokenType)
emit is a function that looks at our current location in the input text, creates a token from it, and returns nil to stop the loop in NextToken.
//... (this needs "strings" in the imports)

func (l *Impl) emit(tt TokenType) stateFn {
    t := Token{
        Type:    tt,
        Literal: l.src[l.start:l.pos],
        Position: Position{
            Offset: l.start,
            Line:   l.startLine,
            Column: l.start - l.startLineOffset,
        },
    }

    l.start = l.pos
    l.startLine = l.line

    if tt == TokenSpace && strings.Contains(t.Literal, "\n") {
        l.startLineOffset = l.start
    }

    l.token = t
    return nil
}
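To check the bookkeeping, here is a hand-traced example (mine, not from the article) of what two successive NextToken calls produce for the input "=;":

l := New("=;")

first := l.NextToken()
// Token{Type: TokenEqual, Literal: "=", Position: Position{Offset: 0, Line: 1, Column: 0}}

second := l.NextToken()
// Token{Type: TokenSemicolon, Literal: ";", Position: Position{Offset: 1, Line: 1, Column: 1}}

_, _ = first, second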
Note: this is a lot of small operations. Take the time to go through the comments written in the Impl struct; it is not that hard.
Handling EOF
If we run our test now, we will get the following:
$ go test ./...
--- FAIL: TestNextTokenOnSymbols (0.00s)
lexer_test.go:20: tests[13] - tokentype wrong. expected="EOF", got="Illegal"
FAIL
This is basically saying: "we received an Illegal Token but we expected EOF".
This is pretty trivial to solve. Remember the atEOF property in Impl? Well, we just add a case for it in our switch statement.
//...

func lexProto(l *Impl) stateFn {
    switch r := l.next(); {
    case l.atEOF:
        return l.emit(EOF)
    //...
    }
}
and now:
$ go test ./...
ok github.com/Clement-Jean/protein/lexer 0.809s
Boom!
Conclusion
We have a Lexer that is able to go through a text and identify the symbols we defined as TokenTypes. Admittedly, this is not exceptional yet, but we are now ready for the next part, where we will skip whitespace and comments, and lex Identifiers, Numbers (Int and Float), and Strings.
If you like this kind of content, let me know in the comment section, and feel free to ask for help on similar projects, recommend the next post's subject, or simply send me your feedback.