An introduction to Protocol Buffers 3


May 9, 2021


A little bit of context first:

I’ve spent most of my professional life building RESTful APIs. Recently I had to build a gRPC service, and learning Protobuf for the first time seemed like a daunting task. It turns out that Protocol Buffers are very easy to learn, feature-rich, and can speed up development significantly.

If you are looking into Protobuf for the first time, my goal in this post is to give you an initial overview of the Protocol Buffers 3 syntax, structure, data types, and best practices.

What are Protocol Buffers (Protobuf)?

Protocol Buffers (Protobuf) is a free and open-source cross-platform library used to serialize structured data. Google created it for internal use in 2001 and made it public in 2008.

Google defines Protobuf as:

Google’s language-neutral, platform-neutral, extensible mechanism for serializing structured data - think XML, but smaller, faster, and simpler. You define how you want your data to be structured once, then you can use special generated source code to easily write and read your structured data to and from a variety of data streams and using a variety of languages.

What problem does it solve, and why does it matter?

Like JSON and XML, Protobuf is a way to interchange data. Its schema language is easier to read than most other formats. Because messages are serialized into a compact binary format, it is also typically faster and more strictly structured. It is a popular choice for RPC systems.

A few advantages of using Protobuf:

Typically much faster than XML and JSON, largely due to smaller message sizes.
Data is fully typed using a .proto schema.
Code can easily be auto-generated ¹.
The schema supports comments/documentation.
Language-neutral. Works across many languages.
Tools like buf can detect breaking changes, lint, and easily manage dependencies.

A Protocol Buffers schema is a set of message definitions. Messages can be serialized to binary and parsed back into messages using the code that the protoc compiler generates for various languages.
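To make the size difference concrete, here is a minimal sketch in plain Python (not the official protobuf library) that hand-encodes a small message and compares it with the equivalent JSON. It assumes a hypothetical message with a string field tagged 1 and two int32 fields tagged 2 and 3. The function names are mine, and the encoder only handles non-negative integers.

```python
import json


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_key(field_number: int, wire_type: int) -> bytes:
    """A field key is the tag number shifted left 3 bits, plus the wire type."""
    return encode_varint((field_number << 3) | wire_type)


def encode_search_request(query: str, page_number: int, result_per_page: int) -> bytes:
    """Hand-rolled encoder for the hypothetical three-field message."""
    out = bytearray()
    if query:  # fields holding their default value are not written
        data = query.encode("utf-8")
        out += encode_key(1, 2) + encode_varint(len(data)) + data  # 2 = length-delimited
    if page_number:
        out += encode_key(2, 0) + encode_varint(page_number)  # 0 = varint
    if result_per_page:
        out += encode_key(3, 0) + encode_varint(result_per_page)
    return bytes(out)


binary = encode_search_request("flat", 2, 25)
as_json = json.dumps(
    {"query": "flat", "page_number": 2, "result_per_page": 25}
).encode("utf-8")
print(len(binary), "bytes as binary vs", len(as_json), "bytes as JSON")
```

The binary form carries only tag numbers and values, while JSON repeats every field name in every message, which is where most of the savings come from.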

If you are building systems (microservices) that need to communicate with each other and would like a simple, fast, easy-to-read, language-agnostic way to build your data exchange layers and contracts, you should give Protobuf a try.

Syntax anatomy

Google - Language Guide (proto3)

To better understand how Protobuf is structured, let us break down a simple example file, search.proto:

syntax = "proto3"; // Protobuf syntax version.

// Package name (think of `namespacing` for the schema.)
package example;

// Option annotation (more about this later) 
option go_package = "github.com/victorpierredev/protoanatomy";

message SearchRequest { // message + messageName.
  string query = 1; // field scalar type + field name + field tag number.
  int32 page_number = 2;
  int32 result_per_page = 3;
  repeated string parameters = 4; // the `repeated` keyword represents an array.
  
  enum SearchCategory {
    // When no value is provided, the first value is the default assigned.
    SEARCH_CATEGORY_UNKNOWN = 0; // the first enum value must always be zero.
    SEARCH_CATEGORY_RENT = 1; // it is good practice to prefix enum values to avoid collisions.
    SEARCH_CATEGORY_BUY = 2;
    SEARCH_CATEGORY_LEASE = 3;
  }
  
  SearchCategory search_category = 5; // Enum as the type + field name + tag num.
}

    
  

ℹ️

Protobuf 3 drops required fields, and every field is optional by default. This is a change from Protobuf 2, where each field had to be marked as optional or required. This was done in order to preserve better compatibility semantics. (Newer protoc releases accept the optional keyword in proto3 again, but only to track whether a field was explicitly set.)

Option

Your .proto file can be annotated with different options. Different options will affect how your code handles the protocol, depending on the context. For example, if you are using Go you can use option go_package to specify the Go package and avoid having it default to the same name as the .proto package. Same for Java with option java_package and so on.

Other options, such as optimize_for, are not language-specific (in this case, it affects the code generators for Java, C++, and others) and take SPEED, CODE_SIZE, or LITE_RUNTIME as possible values.

Also, according to Google’s documentation:

Some options are file-level options, meaning they should be written at the top-level scope, not inside any message, enum, or service definition. Some options are message-level options, meaning they should be written inside message definitions. Some options are field-level options, meaning they should be written inside field definitions. Options can also be written on enum types, enum values, oneof fields, service types, and service methods; however, no useful options currently exist for any of these.

See the complete list of available options. It is also possible to create your own options, but most people will never need to.

Message

The data structure definition in Protobuf is called a message.

Tag number

This can be confusing at first, as it may look like a value assignment. Tag numbers are a very important part of the message: they identify your fields in the message binary format (during serialization and deserialization) and should not be changed once the message type is in use. They can range from 1 to 536,870,911 (2^29 - 1). A few things to keep in mind:

Tags from 1 to 15 use 1 byte when the message is encoded and should be used for frequently used fields.
From 16 to 2047, tag numbers will use 2 bytes.
Numbers between 19000 and 19999 cannot be used, as they are reserved.
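The byte counts above follow from how a field key is encoded: the tag number is shifted left by three bits to make room for the wire type, and the result is written as a varint that carries seven bits per byte. Here is a small sketch in plain Python; key_size is a hypothetical helper of mine, not part of any protobuf library.

```python
def key_size(field_number: int) -> int:
    """Return how many bytes the key (tag number + wire type) of a field takes."""
    if not 1 <= field_number <= 2**29 - 1:
        raise ValueError("tag numbers must be between 1 and 2^29 - 1")
    if 19000 <= field_number <= 19999:
        raise ValueError("tag numbers 19000 to 19999 are reserved")
    key = field_number << 3  # the 3 low bits hold the wire type
    size = 1
    while key >= 0x80:  # each varint byte carries 7 bits of payload
        key >>= 7
        size += 1
    return size


for tag in (1, 15, 16, 2047, 2048):
    print(tag, key_size(tag))
```

Tags 1 to 15 fit in a single byte because four bits of tag plus three bits of wire type add up to exactly the seven payload bits of one varint byte.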

Scalar type value

This is the type of the field, similar to types in many typed languages. See the list of supported scalar types.

Field name

Field names are written in snake_case. Another convention is that if the field name contains a number, the number should appear after the letter instead of after the underscore, e.g., use address1 instead of address_1.

Following the suggested naming convention gives you some benefits when generating code.

For repeated fields, because the type represents an array, the field name should be pluralized (e.g. repeated string middle_names).
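One place where the naming convention pays off is in derived names, such as the lowerCamelCase keys used by the proto3 JSON mapping. The sketch below is my rough approximation of that conversion in plain Python. The function name is hypothetical, and real code generators may differ in the details.

```python
def to_lower_camel(field_name: str) -> str:
    """Drop underscores and capitalize the character that follows each one."""
    out = []
    capitalize_next = False
    for ch in field_name:
        if ch == "_":
            capitalize_next = True
        elif capitalize_next:
            out.append(ch.upper())
            capitalize_next = False
        else:
            out.append(ch)
    return "".join(out)


print(to_lower_camel("result_per_page"))
print(to_lower_camel("address_1"), to_lower_camel("address1"))
```

Under this approximation, address_1 and address1 map to the same derived name, which is one reason to keep the number attached to the letter.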

Default values

When a field is not set, it is assigned a default (empty) value, for example:

bool will default to false
string will default to ''
numeric types (int32, int64, float, and so on) will default to 0
repeated will default to []
enums will default to the first value
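A practical consequence is that a reader cannot tell an unset field apart from one that was explicitly set to its default. The snippet below models this in plain Python with a dictionary. It does not use the protobuf library, and the field names and the read_message helper are hypothetical.

```python
import copy

# Hypothetical table of defaults for the fields of a small message.
DEFAULTS = {
    "query": "",           # string
    "page_number": 0,      # int32
    "is_premium": False,   # bool
    "parameters": [],      # repeated string
    "search_category": 0,  # enum: the first value, which is 0
}


def read_message(received: dict) -> dict:
    """Return every field, filling in defaults for those absent from `received`."""
    message = copy.deepcopy(DEFAULTS)
    message.update(received)
    return message


unset = read_message({})
explicit_zero = read_message({"page_number": 0})
print(unset == explicit_zero)  # prints True
```

If the difference between "not provided" and "zero" matters to your application, it has to be modelled explicitly in the schema.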

There is also no way to enforce constraint validation in protobuf. For example, let us say we have a field to get the age of a person:

// age must be between 0 and 130
int32 age = 1;

ℹ️

Although the comment above helps us understand what the constraint for the field age should be (between 0 and 130), there is no way to validate/enforce this in protobuf 3 itself and we must rely on the code that uses it to validate the value.
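In application code, that check could look like the following minimal Python sketch. validate_age is a hypothetical helper that you would call on the value read from the decoded message. Nothing in it is generated by protoc.

```python
def validate_age(age: int) -> int:
    """Enforce the constraint that the .proto comment can only document."""
    if not 0 <= age <= 130:
        raise ValueError(f"age must be between 0 and 130, got {age}")
    return age


print(validate_age(42))
```

Calling validate_age(131) raises a ValueError, so the invalid value is rejected at the boundary of your service instead of travelling further into it.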

Message definition

👁‍🗨 See: Google - Defining a message

In each .proto file, you can define one or more messages.

In a real application, there is a good chance that you will want to define multiple messages in the same .proto file. Amongst the many benefits of this approach, each message becomes an available type that can be used within another message. For example, let us imagine an appointment booking application, with the following appointments.proto file:

syntax="proto3";

message Booking {
  string query = 1;
  int32 page_number = 2;
  int32 result_per_page = 3;
  
  // Note that `Date` is now an available type
  Date schedule = 4;
}

// Booking date that we want to use above.
message Date {
 int32 year = 1;
 int32 month = 2;
 int32 day = 3;
}

    
  

Nested Types

It is also possible to define types within types (nesting), and you can do this with as many levels as needed. Expanding on the example above:

syntax="proto3";

message Booking {
  string query = 1;
  int32 page_number = 2;
  int32 result_per_page = 3;
  
  // Note that `Date` is now an available type
  Date schedule = 4;
  
  // Note that we are defining a message within a message. Outside of
  // `Booking`, it must be referenced as `Booking.Status`.
  message Status {
    // Note that this message is a new block, so there is no collision.

    // Outside of `Status`, this enum is referenced as `Status.PaymentStatus`.
    enum PaymentStatus {
      PAYMENT_STATUS_UNKNOWN = 0;
      PAYMENT_STATUS_PENDING = 1;
      PAYMENT_STATUS_PAID = 2;
      PAYMENT_STATUS_REJECTED = 3;
    }

    // The tag number is also relative to the `Status` message.
    bool confirmed = 1;
    bool cancelled = 2;
    PaymentStatus payment_status = 3;
  }
}

// Booking date that we want to use above.
message Date {
 int32 year = 1;
 int32 month = 2;
 int32 day = 3;
}

    
  

Importing types

Another way to use message types defined in other files is to import them. Note that in this example, we are also declaring a package ². Building on the booking example again, imagine our files are spread across different directories:

.
├── types
│   └── date.proto
└── v2
    ├── appointments.proto
    └── patients.proto

    
  

date.proto

syntax = "proto3";

// The concept of packages here is very similar to `packages` in Go.
// It is equivalent to `namespaces` in other languages. They are *optional*.
package types;

message Date {
  int32 year = 1;
  int32 month = 2;
  int32 day = 3;
}

    
  

appointments.proto

syntax = "proto3";

package api.v2;

// Note that the import must include the directory path of the file.
import "types/date.proto";

message Booking {
  string query = 1;
  int32 page_number = 2;
  int32 result_per_page = 3;

  // Because we import date.proto, the type is available.
  // Note that because we've defined a package name, we must use it here.
  // otherwise we would declare it as `Date schedule = 4;`
  types.Date schedule = 4;
}

    
  

¹ I’ll write about Protobuf code generation in a later post.
² You can learn more about packages and how they affect your programming language of choice in Google’s package documentation.
