Protein: Parser (Part 1)
In this article, we are finally going to start building the Parser. We are going to parse syntax, package, and import statements, and we are going to see how to represent our serializable AST. Hope you are ready for this, it's gonna be fun!
Boilerplate
As always, we need to think a little before actually writing the features themselves. The first thing we can do to get started is to write the Parser interface.
package parser

// Parser is protein's parser
type Parser interface {
	// Parse returns ???
	Parse() ???
}
This doesn't seem like a fancy interface, but we do have a problem. What should our parser return when it is finished? Well, it should return an AST, right? But how do we represent this AST? It turns out we have two good possibilities:
- We roll our own serializable AST where each object is a Protobuf Message.
- We use the descriptor.proto file which defines Messages for describing elements in a Protobuf file.
Both have pros and cons. If we go with the first one, we have more control over our serialization, which means we can optimize how some elements are serialized. However, it also means we are not compatible with the official format, and that's not good. As for using the official serialization, the pros and cons are simply reversed: we give up some of that control, but we stay compatible.
In the end, for the sake of compatibility, I will be sacrificing some performance. However, these optimizations would only save a few bytes, and compatibility with files serialized by protoc far outweighs them.
Depending on Protobuf
To use the descriptor, we are going to depend on Protobuf's Go library. To do that, we add the dependency:
$ go get google.golang.org/protobuf
This will let us access the descriptorpb package, which contains the FileDescriptorProto struct. If you look at the definition of that struct, you will see the following comment:
// Describes a complete .proto file.
type FileDescriptorProto struct
That's exactly what we are trying to do.
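To get a feel for the target representation, here is a small sketch of what a FileDescriptorProto describing a tiny .proto file could look like. The file name and values are made up for illustration; only the field names come from descriptorpb.

package main

import (
	"fmt"

	"google.golang.org/protobuf/proto"
	pb "google.golang.org/protobuf/types/descriptorpb"
)

func main() {
	// Hypothetical example: a descriptor for a small example.proto file
	// with a syntax, a package, and one public import.
	d := &pb.FileDescriptorProto{
		Name:             proto.String("example.proto"),
		Syntax:           proto.String("proto3"),
		Package:          proto.String("example.v1"),
		Dependency:       []string{"google/protobuf/empty.proto"},
		PublicDependency: []int32{0}, // index into Dependency
	}
	fmt.Println(d.GetSyntax(), d.GetPackage(), d.GetDependency())
}

Dependency, PublicDependency, and WeakDependency are the fields we will populate later in this article when parsing imports.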
Back to the interface
With that dependency on Protobuf, we can now finish our interface:
package parser

import pb "google.golang.org/protobuf/types/descriptorpb"

// Parser is protein's parser
type Parser interface {
	// Parse returns the representation of a file in Protobuf Descriptor
	Parse() pb.FileDescriptorProto
}
Implementation
Let's now implement the interface. By now you know the drill: we are going to create a minimal implementation so that our first test fails. What we need is an Impl struct, and we need to implement Parser by writing the Parse function.

For now, the Parse function will simply return an empty FileDescriptorProto.
package parser

import (
	"github.com/Clement-Jean/protein/lexer"

	pb "google.golang.org/protobuf/types/descriptorpb"
)

// Impl is the implementation for the Parser interface.
type Impl struct {
	l lexer.Lexer
}

// New creates a new instance of the Parser
func New(l lexer.Lexer) Parser {
	p := &Impl{l: l}
	return p
}

// Parse populates a FileDescriptorProto
func (p *Impl) Parse() pb.FileDescriptorProto {
	d := pb.FileDescriptorProto{}
	return d
}
First test
As our first test, we are going to write a test for a syntax statement. This test will take advantage of the fact that Lexer is an interface by creating a FakeLexer. This fake lexer will simply iterate over an array of tokens and return them one by one.
package parser

import (
	"github.com/Clement-Jean/protein/lexer"
)

type FakeLexer struct {
	i      int
	tokens []lexer.Token
}

func (l *FakeLexer) NextToken() lexer.Token {
	if l.i >= len(l.tokens) {
		return lexer.Token{Type: lexer.EOF, Position: lexer.Position{}}
	}

	token := l.tokens[l.i]
	l.i++
	return token
}
This means that each time we run a test in the parser, we are not running the lexer code; we simply focus on the feature at hand. So, if we encounter a bug, it is in the parser, not anywhere else.
With that, we can write our first test for the statement syntax = "proto3";
package parser

import (
	"testing"

	"github.com/Clement-Jean/protein/lexer"
)

func TestParseSyntaxProto3(t *testing.T) {
	tokens := []lexer.Token{
		{Type: lexer.TokenIdentifier, Literal: "syntax"},
		{Type: lexer.TokenEqual, Literal: "="},
		{Type: lexer.TokenStr, Literal: "\"proto3\""},
		{Type: lexer.TokenSemicolon, Literal: ";"},
	}
	l := &FakeLexer{tokens: tokens}
	p := New(l)
	d := p.Parse()

	expected := "proto3"
	if syntax := d.GetSyntax(); syntax != expected {
		t.Fatalf("syntax wrong. expected='%s', got='%s'", expected, syntax)
	}
}
Obviously:
$ go test ./...
--- FAIL: TestParseSyntaxProto3 (0.00s)
parser_syntax_test.go:22: syntax wrong. expected='proto3', got=''
FAIL
Parsing
We should now improve the Parse function to consume the Lexer's tokens and do something with them. Here is the pseudocode:
for curToken.Type != EOF {
	if curToken.Type == Identifier {
		fn := parseFuncs[curToken.Literal] // find the function depending on keyword
		fn(&descriptor)                    // populate the descriptor
	}
	nextToken() // advance to the next token
}
You will notice that we need a curToken representing the current token being parsed. We will also need a peek token for parsing syntax and other statements. This is because each time we are going to make sure that the peek token is correct, and otherwise return an error. So Impl now has a curToken and a peekToken:
type Impl struct {
	l lexer.Lexer

	curToken  lexer.Token
	peekToken lexer.Token
}
Now, we need to populate these tokens before being able to use them. The first place to initialize them is in the New function.
func New(l lexer.Lexer) Parser {
	p := &Impl{l: l}

	// read two tokens so that curToken and peekToken are both populated
	p.nextToken()
	p.nextToken()
	return p
}
But nextToken is not Lexer.NextToken; it is a private function in the Parser implementation. It looks for the next non-space token.
// nextToken advances curToken and peekToken, skipping space tokens.
func (p *Impl) nextToken() {
	for p.curToken = p.peekToken; p.curToken.Type == lexer.TokenSpace; p.curToken = p.l.NextToken() {
	}
	for p.peekToken = p.l.NextToken(); p.peekToken.Type == lexer.TokenSpace; p.peekToken = p.l.NextToken() {
	}
}
With that we can start updating our Parse function.
func (p *Impl) Parse() pb.FileDescriptorProto {
	d := pb.FileDescriptorProto{}

	for p.curToken.Type != lexer.EOF {
		if p.curToken.Type == lexer.TokenIdentifier {
			// Do something with token
		}
		p.nextToken()
	}
	return d
}
Finally, we are going to register all the parsing functions that we will write in this article and the next ones. We are going to have a map from keywords to parsing functions: "syntax" maps to parseSyntax, and so on.
var parseFuncs = map[string]func(p *Impl, d *pb.FileDescriptorProto){
	"syntax": func(p *Impl, d *pb.FileDescriptorProto) {
		d.Syntax = p.parseSyntax()
	},
}
With this, we can finalize the Parse function by looking up the relevant function for curToken.Literal.
func (p *Impl) Parse() pb.FileDescriptorProto {
	d := pb.FileDescriptorProto{}

	for p.curToken.Type != lexer.EOF {
		if p.curToken.Type == lexer.TokenIdentifier {
			fn, ok := parseFuncs[p.curToken.Literal]

			if !ok { // keyword not found
				break
			}
			fn(p, &d)
		}
		p.nextToken()
	}
	return d
}
parseSyntax()
Before actually parsing a syntax statement, we need two helper functions: accept and acceptPeek. acceptPeek simply calls accept with peekToken.Type. accept takes a TokenType and checks if it exists in the following variadic arguments.
// accept returns true and advances to the next token
// if original is one of the expected token types,
// else it returns false
func (p *Impl) accept(original lexer.TokenType, expected ...lexer.TokenType) bool {
	if !slices.Contains(expected, original) {
		// TODO: add error
		return false
	}
	p.nextToken()
	return true
}

// acceptPeek returns true and advances the token
// if tt contains the peekToken.Type,
// else it returns false
func (p *Impl) acceptPeek(tt ...lexer.TokenType) bool {
	return p.accept(p.peekToken.Type, tt...)
}
And now we are ready for our parseSyntax function. We first check that we have an = after syntax. Then we check that we have a String; if that's the case, we take the value inside the quotes. And finally, we check that there is a semicolon at the end of the statement.
package parser

import (
	"github.com/Clement-Jean/protein/lexer"
)

func (p *Impl) parseSyntax() *string {
	if !p.acceptPeek(lexer.TokenEqual) {
		return nil
	}

	if !p.acceptPeek(lexer.TokenStr) {
		return nil
	}
	s := destringify(p.curToken.Literal)

	if !p.acceptPeek(lexer.TokenSemicolon) {
		return nil
	}
	return &s
}
The destringify function looks like the following:
package parser

import "strings"

func destringify(s string) string {
	return strings.TrimFunc(s, func(r rune) bool {
		return r == '\'' || r == '"'
	})
}
As mentioned, it extracts the value between the quotes.
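A tiny test (hypothetical, not part of the original suite) illustrates the behavior for both quote styles:

package parser

import "testing"

// Hypothetical test just to illustrate destringify's behavior.
func TestDestringify(t *testing.T) {
	if got := destringify("\"proto3\""); got != "proto3" {
		t.Fatalf("expected 'proto3', got '%s'", got)
	}
	if got := destringify("'my.proto'"); got != "my.proto" {
		t.Fatalf("expected 'my.proto', got '%s'", got)
	}
}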
With our parseSyntax finished, we can rerun the test and:
$ go test ./...
ok github.com/Clement-Jean/protein/parser 1.361s
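Since nextToken skips space tokens, we could also add a quick test (hypothetical, not part of the original suite) that feeds the same statement with explicit lexer.TokenSpace tokens in between and checks that the parser is unfazed:

package parser

import (
	"testing"

	"github.com/Clement-Jean/protein/lexer"
)

func TestParseSyntaxSkipsSpaces(t *testing.T) {
	// Hypothetical test: the same syntax statement with explicit space tokens,
	// to check that nextToken skips them.
	tokens := []lexer.Token{
		{Type: lexer.TokenIdentifier, Literal: "syntax"},
		{Type: lexer.TokenSpace, Literal: " "},
		{Type: lexer.TokenEqual, Literal: "="},
		{Type: lexer.TokenSpace, Literal: " "},
		{Type: lexer.TokenStr, Literal: "\"proto3\""},
		{Type: lexer.TokenSemicolon, Literal: ";"},
	}
	l := &FakeLexer{tokens: tokens}
	p := New(l)
	d := p.Parse()

	if syntax := d.GetSyntax(); syntax != "proto3" {
		t.Fatalf("syntax wrong. expected='proto3', got='%s'", syntax)
	}
}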
parseImport()
parseImport is really similar to parseSyntax. However, with imports, we are introduced to optional keywords. An import can be written in 3 ways:
import "my.proto";
import public "my.proto";
import weak "my.proto";
Let's write the tests:
package parser

import (
	"slices"
	"testing"

	"github.com/Clement-Jean/protein/lexer"
)

func TestParseImport(t *testing.T) {
	tokens := []lexer.Token{
		{Type: lexer.TokenIdentifier, Literal: "import"},
		{Type: lexer.TokenStr, Literal: "\"google/protobuf/empty.proto\""},
		{Type: lexer.TokenSemicolon, Literal: ";"},
	}
	l := &FakeLexer{tokens: tokens}
	p := New(l)
	d := p.Parse()

	expected := []string{"google/protobuf/empty.proto"}
	public := []int32{}
	weak := []int32{}

	if imp := d.GetDependency(); slices.Compare(imp, expected) != 0 {
		t.Fatalf("import wrong. expected='%v', got='%v'", expected, imp)
	}

	if p := d.GetPublicDependency(); slices.Compare(p, public) != 0 {
		t.Fatalf("public import wrong. expected='%v', got='%v'", public, p)
	}

	if w := d.GetWeakDependency(); slices.Compare(w, weak) != 0 {
		t.Fatalf("weak import wrong. expected='%v', got='%v'", weak, w)
	}
}

func TestParsePublicImport(t *testing.T) {
	tokens := []lexer.Token{
		{Type: lexer.TokenIdentifier, Literal: "import"},
		{Type: lexer.TokenIdentifier, Literal: "public"},
		{Type: lexer.TokenStr, Literal: "\"google/protobuf/empty.proto\""},
		{Type: lexer.TokenSemicolon, Literal: ";"},
	}
	l := &FakeLexer{tokens: tokens}
	p := New(l)
	d := p.Parse()

	expected := []string{"google/protobuf/empty.proto"}
	public := []int32{0}
	weak := []int32{}

	if imp := d.GetDependency(); slices.Compare(imp, expected) != 0 {
		t.Fatalf("import wrong. expected='%v', got='%v'", expected, imp)
	}

	if p := d.GetPublicDependency(); slices.Compare(p, public) != 0 {
		t.Fatalf("public import wrong. expected='%v', got='%v'", public, p)
	}

	if w := d.GetWeakDependency(); slices.Compare(w, weak) != 0 {
		t.Fatalf("weak import wrong. expected='%v', got='%v'", weak, w)
	}
}

func TestParseWeakImport(t *testing.T) {
	tokens := []lexer.Token{
		{Type: lexer.TokenIdentifier, Literal: "import"},
		{Type: lexer.TokenIdentifier, Literal: "weak"},
		{Type: lexer.TokenStr, Literal: "\"google/protobuf/empty.proto\""},
		{Type: lexer.TokenSemicolon, Literal: ";"},
	}
	l := &FakeLexer{tokens: tokens}
	p := New(l)
	d := p.Parse()

	expected := []string{"google/protobuf/empty.proto"}
	public := []int32{}
	weak := []int32{0}

	if imp := d.GetDependency(); slices.Compare(imp, expected) != 0 {
		t.Fatalf("import wrong. expected='%v', got='%v'", expected, imp)
	}

	if p := d.GetPublicDependency(); slices.Compare(p, public) != 0 {
		t.Fatalf("public import wrong. expected='%v', got='%v'", public, p)
	}

	if w := d.GetWeakDependency(); slices.Compare(w, weak) != 0 {
		t.Fatalf("weak import wrong. expected='%v', got='%v'", weak, w)
	}
}
Obviously:
$ go test ./...
--- FAIL: TestParseImport (0.00s)
parser_import_test.go:25: import wrong. expected='[google/protobuf/empty.proto]', got='[]'
--- FAIL: TestParsePublicImport (0.00s)
parser_import_test.go:52: import wrong. expected='[google/protobuf/empty.proto]', got='[]'
--- FAIL: TestParseWeakImport (0.00s)
parser_import_test.go:79: import wrong. expected='[google/protobuf/empty.proto]', got='[]'
FAIL
Even though the 2nd and 3rd forms are rarely used, we still need to support them. To do so, we are going to create an enum called DependencyType with the variants None, Public, and Weak. After that, we check whether we have an identifier and, depending on its Literal, return the corresponding type.
package parser

import (
	"github.com/Clement-Jean/protein/lexer"
)

type DependencyType int

const (
	None DependencyType = iota
	Public
	Weak
)

func (p *Impl) parseImport() (string, DependencyType) {
	if !p.acceptPeek(lexer.TokenStr, lexer.TokenIdentifier) {
		return "", None
	}

	depType := None
	if p.curToken.Type == lexer.TokenIdentifier {
		switch p.curToken.Literal {
		case "public":
			depType = Public
		case "weak":
			depType = Weak
		default:
			return "", None
		}

		if !p.acceptPeek(lexer.TokenStr) {
			// TODO: add error
			return "", None
		}
	}
	s := destringify(p.curToken.Literal)

	if !p.acceptPeek(lexer.TokenSemicolon) {
		return "", None
	}
	return s, depType
}
And the last thing we need to do is register it in parseFuncs.
var parseFuncs = map[string]func(p *Impl, d *pb.FileDescriptorProto){
	"syntax": func(p *Impl, d *pb.FileDescriptorProto) {
		d.Syntax = p.parseSyntax()
	},
	"import": func(p *Impl, d *pb.FileDescriptorProto) {
		dep, t := p.parseImport()

		if len(dep) != 0 {
			i := int32(len(d.Dependency))
			d.Dependency = append(d.Dependency, dep)

			switch t {
			case Public:
				d.PublicDependency = append(d.PublicDependency, i)
			case Weak:
				d.WeakDependency = append(d.WeakDependency, i)
			}
		}
	},
}
We basically append the dependency and, if it is a public or weak dependency, we add its index to PublicDependency or WeakDependency respectively.
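To make the index bookkeeping concrete, here is a hypothetical extra test (not part of the tests above) with two imports, where the public one lands at index 1:

package parser

import (
	"slices"
	"testing"

	"github.com/Clement-Jean/protein/lexer"
)

func TestParseMultipleImports(t *testing.T) {
	tokens := []lexer.Token{
		{Type: lexer.TokenIdentifier, Literal: "import"},
		{Type: lexer.TokenStr, Literal: "\"a.proto\""},
		{Type: lexer.TokenSemicolon, Literal: ";"},
		{Type: lexer.TokenIdentifier, Literal: "import"},
		{Type: lexer.TokenIdentifier, Literal: "public"},
		{Type: lexer.TokenStr, Literal: "\"b.proto\""},
		{Type: lexer.TokenSemicolon, Literal: ";"},
	}
	l := &FakeLexer{tokens: tokens}
	p := New(l)
	d := p.Parse()

	expected := []string{"a.proto", "b.proto"}
	public := []int32{1} // "b.proto" is the second dependency

	if imp := d.GetDependency(); slices.Compare(imp, expected) != 0 {
		t.Fatalf("import wrong. expected='%v', got='%v'", expected, imp)
	}
	if pub := d.GetPublicDependency(); slices.Compare(pub, public) != 0 {
		t.Fatalf("public import wrong. expected='%v', got='%v'", public, pub)
	}
}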
We rerun our tests:
$ go test ./...
ok github.com/Clement-Jean/protein/parser 0.450s
parsePackage()
Once again this function is pretty similar. The main difference is that we are going to look for identifiers and fully qualified names (identifiers separated by dots).
Let's write some tests.
package parser

import (
	"testing"

	"github.com/Clement-Jean/protein/lexer"
)

func TestParsePackage(t *testing.T) {
	tokens := []lexer.Token{
		{Type: lexer.TokenIdentifier, Literal: "package", Position: lexer.Position{}},
		{Type: lexer.TokenIdentifier, Literal: "google", Position: lexer.Position{}},
		{Type: lexer.TokenSemicolon, Literal: ";", Position: lexer.Position{}},
	}
	l := &FakeLexer{tokens: tokens}
	p := New(l)
	d := p.Parse()

	expected := "google"
	if pkg := d.GetPackage(); pkg != expected {
		t.Fatalf("package wrong. expected='%s', got='%s'", expected, pkg)
	}
}

func TestParsePackageFullIdentifier(t *testing.T) {
	tokens := []lexer.Token{
		{Type: lexer.TokenIdentifier, Literal: "package", Position: lexer.Position{}},
		{Type: lexer.TokenIdentifier, Literal: "google", Position: lexer.Position{}},
		{Type: lexer.TokenDot, Literal: ".", Position: lexer.Position{}},
		{Type: lexer.TokenIdentifier, Literal: "protobuf", Position: lexer.Position{}},
		{Type: lexer.TokenSemicolon, Literal: ";", Position: lexer.Position{}},
	}
	l := &FakeLexer{tokens: tokens}
	p := New(l)
	d := p.Parse()

	expected := "google.protobuf"
	if pkg := d.GetPackage(); pkg != expected {
		t.Fatalf("package wrong. expected='%s', got='%s'", expected, pkg)
	}
}
Let's fail our tests:
$ go test ./...
--- FAIL: TestParsePackage (0.00s)
parser_package_test.go:21: package wrong. expected='google', got=''
--- FAIL: TestParsePackageFullIdentifier (0.00s)
parser_package_test.go:39: package wrong. expected='google.protobuf', got=''
FAIL
And now we can implement the parsePackage function. We check that we have at least one identifier, and then, if we have a dot, we make sure that another identifier follows. Finally, we look for the semicolon.
package parser

import (
	"strings"

	"github.com/Clement-Jean/protein/lexer"
)

func (p *Impl) parsePackage() *string {
	if !p.acceptPeek(lexer.TokenIdentifier) {
		return nil
	}

	var parts []string
	for p.curToken.Type == lexer.TokenIdentifier {
		parts = append(parts, p.curToken.Literal)

		if p.peekToken.Type != lexer.TokenDot {
			break
		}
		p.nextToken()

		if !p.acceptPeek(lexer.TokenIdentifier) {
			return nil
		}
	}

	s := strings.Join(parts, ".")
	if !p.acceptPeek(lexer.TokenSemicolon) {
		return nil
	}
	return &s
}
The last thing to do is to register this function in our parseFuncs.
var parseFuncs = map[string]func(p *Impl, d *pb.FileDescriptorProto){
	"syntax": func(p *Impl, d *pb.FileDescriptorProto) {
		d.Syntax = p.parseSyntax()
	},
	"package": func(p *Impl, d *pb.FileDescriptorProto) {
		d.Package = p.parsePackage()
	},
	"import": func(p *Impl, d *pb.FileDescriptorProto) {
		dep, t := p.parseImport()

		if len(dep) != 0 {
			i := int32(len(d.Dependency))
			d.Dependency = append(d.Dependency, dep)

			switch t {
			case Public:
				d.PublicDependency = append(d.PublicDependency, i)
			case Weak:
				d.WeakDependency = append(d.WeakDependency, i)
			}
		}
	},
}
and we rerun our tests.
$ go test ./...
ok github.com/Clement-Jean/protein/parser 0.847s
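As an optional sanity check, a hypothetical end-to-end test (not part of the original suite) could feed a syntax, a package, and an import statement in one go and verify the whole descriptor:

package parser

import (
	"slices"
	"testing"

	"github.com/Clement-Jean/protein/lexer"
)

func TestParseFileHeader(t *testing.T) {
	// Hypothetical test combining the three statements we can parse so far.
	tokens := []lexer.Token{
		{Type: lexer.TokenIdentifier, Literal: "syntax"},
		{Type: lexer.TokenEqual, Literal: "="},
		{Type: lexer.TokenStr, Literal: "\"proto3\""},
		{Type: lexer.TokenSemicolon, Literal: ";"},
		{Type: lexer.TokenIdentifier, Literal: "package"},
		{Type: lexer.TokenIdentifier, Literal: "google"},
		{Type: lexer.TokenDot, Literal: "."},
		{Type: lexer.TokenIdentifier, Literal: "protobuf"},
		{Type: lexer.TokenSemicolon, Literal: ";"},
		{Type: lexer.TokenIdentifier, Literal: "import"},
		{Type: lexer.TokenStr, Literal: "\"google/protobuf/empty.proto\""},
		{Type: lexer.TokenSemicolon, Literal: ";"},
	}
	l := &FakeLexer{tokens: tokens}
	p := New(l)
	d := p.Parse()

	if syntax := d.GetSyntax(); syntax != "proto3" {
		t.Fatalf("syntax wrong. expected='proto3', got='%s'", syntax)
	}
	if pkg := d.GetPackage(); pkg != "google.protobuf" {
		t.Fatalf("package wrong. expected='google.protobuf', got='%s'", pkg)
	}
	expected := []string{"google/protobuf/empty.proto"}
	if imp := d.GetDependency(); slices.Compare(imp, expected) != 0 {
		t.Fatalf("import wrong. expected='%v', got='%v'", expected, imp)
	}
}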
Conclusion
In this article, we created our first three parsing functions. We parsed syntax, package, and import statements. We are now ready to move on to more complicated statements.
If you like this kind of content, let me know in the comment section, and feel free to ask for help on similar projects, recommend the next post's subject, or simply send me your feedback.