By Ugorji Nwoke   Wed, 17 Dec 2014 12:00:00 -0700

How go-codec achieves its stellar performance

View articles in the go-codec series, source at http://github.com/ugorji/go

They say premature optimization is the root of all evil. I say some layers of the stack MUST be optimal. The layer that does marshalling of data MUST be optimal.

The go-codec library supports either code generation or runtime reflection for its encoding and decoding. This article focuses mostly on runtime reflection, since that mode is easiest to compare against other libraries.

Let’s look at some benchmark results

It is hard to do a good comparison, as there are only a few feature-rich libraries for each of these formats. In this sense, a feature-rich library must be able to encode/decode to/from any supported value, e.g. map[string]string, struct{…}, interface{}.

Surprisingly, only a few libraries fit that bill. go-codec provides the Msgpack and Json implementations benchmarked below, while Std_Json is provided by the standard library and Bson is provided by mongodb/Gustavo. Msgpack is more efficient than Bson, but Bson is the closest peer I could find. We can get a sense of the performance by comparing Json to Std_Json, and Msgpack to Bson.

We start by showing some benchmark results using runtime reflection (with support for interfaces), which compare binary formats (go-codec msgpack vs bson) and text formats (go-codec json vs encoding/json).

Raw Data:

ENCODE
Benchmark__Std_Json___Encode	    5000	    124477 ns/op	   16313 B/op	     207 allocs/op
Benchmark__Json_______Encode	   10000	    108092 ns/op	    9267 B/op	      70 allocs/op
Benchmark__Bson_______Encode	    5000	    195656 ns/op	   34224 B/op	     852 allocs/op
Benchmark__Msgpack____Encode	   10000	     82517 ns/op	    9219 B/op	      70 allocs/op

DECODE
Benchmark__Std_Json___Decode	    2000	    359771 ns/op	   17992 B/op	     629 allocs/op
Benchmark__Json_______Decode	    3000	    205337 ns/op	   15344 B/op	     455 allocs/op
Benchmark__Bson_______Decode	    3000	    217165 ns/op	   19992 B/op	    1246 allocs/op
Benchmark__Msgpack____Decode	    5000	    130962 ns/op	   12880 B/op	     370 allocs/op

Let’s organize the results in a table form, using go-codec’s implementations as a baseline.

                        Baseline            Time    Memory   Allocations
Stdlib Json (encode)    Json (encode)       1.2 X   1.8 X    3.0 X
Stdlib Json (decode)    Json (decode)       1.7 X   1.1 X    1.4 X
Bson (encode)           Msgpack (encode)    2.1 X   3.8 X    12.0 X
Bson (decode)           Msgpack (decode)    1.6 X   1.5 X    3.4 X

The numbers are clear: go-codec's implementations are consistently faster, use less memory, and allocate far less.

But how do we get such performance, when only depending on information available at runtime to encode/decode?

Minimize use of expensive constructs

For example, the following constructs are more expensive than expected:

  • defer/recover
  • method values: e.g. var r io.Reader; fn := r.Read

Within the library, we use panics to signal unrecoverable errors. This way, the internals of the library do not have to keep checking error values.

We noticed that recovering at any point caused a clear increase in encode/decode time. We now only recover at the top-level Encode/Decode function.
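
In outline, the pattern looks like this. This is a minimal sketch, assuming a Decoder type with an internal decode method and the fmt package imported; the names errorf and decode are illustrative, not go-codec's actual internals:

func (d *Decoder) errorf(format string, args ...interface{}) {
	panic(fmt.Errorf(format, args...)) // internals signal errors by panicking
}

// Decode is the single recovery point: it converts a panic from the
// internals back into an ordinary error for the caller.
func (d *Decoder) Decode(v interface{}) (err error) {
	defer func() {
		if x := recover(); x != nil {
			if e, ok := x.(error); ok {
				err = e
			} else {
				err = fmt.Errorf("%v", x)
			}
		}
	}()
	d.decode(v) // internal code paths never return error values
	return
}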

Understand comparison, especially when using maps

Equality/Comparison is different for different types of values:

  • for strings, compare the lengths, and then the byte contents
  • for structs, compare fields

Equality is really only cheap for numeric types.

To use a value as a map key, it must support comparison, and that comparison cost can have a real impact on lookup performance.

For efficient encoding/decoding, we need to map each reflect.Type to its encode/decode function. This mapping takes care of cases like the following:

  • A type that is an encoding.BinaryMarshaler, so we need to call its MarshalBinary() method.
  • A type whose underlying type is a known primitive (e.g. int, string, etc), so we need to call the function for encoding/decoding that primitive.
  • A type which has been registered as an extension type, and so we need to call the extension functions defined for it.

The typical way to do this is with a map[reflect.Type]func.

However, equality as defined for interfaces is quite expensive, so looking up values in the map as we walk a value (e.g. a struct with 16 fields) adds a lot to the time.

We ended up using a unique integer to represent each reflect.Type, and it became easy to look up functions using that integer, as opposed to doing reflect.Type comparisons.

var i int
rt := reflect.TypeOf(i)
rtid := reflect.ValueOf(rt).Pointer()
// rtid uniquely identifies the type: every reflect.Type value
// for a given type yields the same Pointer() result.

Judicious use of maps

We initially used maps extensively for the following:

  • keep mapping of integer id (representing reflect.Type) to encode/decode function
  • keep mapping of integer id (representing reflect.Type) to computed information about that type e.g.
    • does it implement encoding.BinaryMarshaler,
    • what names are to be used for each Field in the struct

We noticed that maps increase the number of allocations and memory used. It seems that go’s map implementation makes heavy use of the heap under the hood (for buckets, etc).

We moved to using linear or binary search across the board.
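
A sketch of the binary-search variant (simplified; the entry type and function names are illustrative): keep a slice of (type id, function) pairs sorted by id, and use sort.Search to find the function.

import (
	"reflect"
	"sort"
)

type codecFn func(rv reflect.Value)

type codecFnEntry struct {
	rtid uintptr // unique id for a reflect.Type, obtained as shown earlier
	fn   codecFn
}

// fns is kept sorted by rtid, so lookup is a simple binary search:
// no map buckets, no hashing, no extra heap traffic.
func find(fns []codecFnEntry, rtid uintptr) codecFn {
	i := sort.Search(len(fns), func(i int) bool { return fns[i].rtid >= rtid })
	if i < len(fns) && fns[i].rtid == rtid {
		return fns[i].fn
	}
	return nil
}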

Fast, non-reflect path for common map and slice types

Many structs in practice have slices or maps of interface{}, integers, floats, booleans and strings.

It is expensive to iterate through a map or slice using reflection.

It is much faster to assert the value to its concrete type and then iterate normally (e.g. using range), or access the elements directly.

To support this, we generate fast-path code for all combinations of maps and slices e.g. map[string]int, []interface{}, []bool, map[uint16]bool, etc.

When we see that the type or underlying type matches one of these, we directly call the fast-path encode or decode function.
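
For example, a fast-path encode function for map[string]int could look roughly like this. This is illustrative only, assuming an Encoder with primitive write methods (encodeMapLen, encodeString and encodeInt are assumed names, not go-codec's actual API):

func (e *Encoder) fastpathEncMapStringInt(v map[string]int) {
	e.encodeMapLen(len(v))
	for k, n := range v { // a plain range loop: no reflect.Value iteration
		e.encodeString(k)
		e.encodeInt(int64(n))
	}
}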

Using type-switch

A type-switch on the interface{} value allows us to bypass reflection for many common cases. We can detect the concrete type for common primitives (numbers, bool, string, []byte), and for the slices and maps for which we already generated concrete implementations.

We use a type-switch to figure out which type matches. The type-switch implementation makes this a binary search over constants, which lets us jump to the concrete implementation of the encode/decode function quickly.
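
In outline, the dispatch looks something like this (a sketch under the same assumed method names as above, not go-codec's actual code):

func (e *Encoder) encode(v interface{}) {
	switch t := v.(type) {
	case nil:
		e.encodeNil()
	case bool:
		e.encodeBool(t)
	case int:
		e.encodeInt(int64(t))
	case string:
		e.encodeString(t)
	case []byte:
		e.encodeBytes(t)
	case map[string]int:
		e.fastpathEncMapStringInt(t) // generated fast path, as sketched earlier
	// ... cases for the other generated map/slice combinations ...
	default:
		e.encodeValue(reflect.ValueOf(v)) // fall back to reflection
	}
}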

Be judicious in heap allocation use

[]byte creation and sharing

To reduce heap allocations, we carefully embed most needed values inside the primary value. We also embed a scratch buffer which is shared during the encoding and decoding phases. This reduces the number of allocations and number of bytes allocated, while helping ensure that data that is used together is aligned together in memory.

For example, codec.Decoder looks like:

type Decoder struct {
	r  decReader // this interface value ends up being either &rb or &ri below.
	rb bytesDecReader
	ri ioDecReader
	b  [scratchByteArrayLen]byte // array used for transient []byte
                                 // e.g. reading floats, bigEndian encoding numbers,
                                 // reading a short string, etc.
}

Interface conversion

You should only use a pointer when absolutely necessary. Pointer use introduces dereferencing costs, allocations, and heap interaction that could otherwise be avoided.

You should also understand where heap allocation may occur.

As of Go 1.4, the data word of an interface is always a pointer, so converting a non-pointer value into an interface{} now causes an allocation. Consequently, the innocuous code below allocates (as noted in the comments).

var i interface{} = 5 // allocation happens here, boxing 5 into the interface
m := make(map[interface{}]interface{})
m[2] = true // at least 2 allocations happen here, converting 2 and true to interface{}

We are now more careful in use of interfaces and pointers.

Heap Allocation and Pointer dereference

We saw performance improvements by removing use of pointers.

It is always a fine line to walk when the size of the object can grow. Instead of using pointers across the board, we now use values to an extent. An example is our decFnInfo struct. A snippet of the code is below:

type decFnInfoX struct {
	// ... less used fields
}

type decFnInfo struct {
	dd decDriver // most used field. Make accessible without dereference
	*decFnInfoX // less-used fields are accessible behind a pointer dereference
}

func (x decFnInfo) Method1(...) {
    // ...
}

In the snippet above, we use a value receiver, which is only 3 words in size, comparable to a slice value: the decDriver interface field is 2 words and the embedded pointer is 1. This way, there is no dereference needed in the typical use. Only when the less-used fields are accessed do we incur the dereference cost.

Zero-copy mode

When encoding into or decoding from a []byte, we do not just wrap it in an io.Reader/io.Writer. Instead, we work with the []byte directly, and only create copies of the data when the data leaves the control of the library. For example, if a user wants to decode a []byte from the stream, only then do we make a copy of it to return.
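
A minimal sketch of the idea (the fields and helper names here are illustrative, mirroring the bytesDecReader mentioned above): sub-slice the input while decoding, and copy only at the library boundary.

type bytesDecReader struct {
	b []byte // the caller's input buffer
	c int    // read cursor
}

// readn returns a zero-copy view into the input buffer.
func (r *bytesDecReader) readn(n int) []byte {
	v := r.b[r.c : r.c+n]
	r.c += n
	return v
}

// decodeBytes hands data back to the user, so it must own its memory:
// this is the only point where a copy is made.
func decodeBytes(r *bytesDecReader, n int) []byte {
	out := make([]byte, n)
	copy(out, r.readn(n))
	return out
}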

Assist the inliner

Inlining helps eliminate function call overhead.

Currently, the gc compiler does very primitive inlining, and will not inline the following types of functions:

  • closures
  • functions with more than 40 simple operations
  • functions which contain any of the following:
    for loop, select, switch, go, defer, type declaration, const declaration
  • functions containing any of the following, which are only inlined when a higher inlining level is requested via -gcflags -l=N with N > 1 (e.g. -l=5; see src/cmd/gc/inl.c for more info):
    function/method/interface calls, panic, recover
    variadic functions

It is surprising that switch and panic are on that list. A switch statement gives great performance in some scenarios compared to if-else, along with better readability, and panics are used within libraries to propagate errors without having to check error values at each point.

However, we were careful to ensure that switch statements and panics are not used in small leaf methods which are called often. This showed clear performance improvements.
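
For example, a hot leaf method can be kept free of those constructs so it stays within the inliner's budget (an illustrative sketch, not go-codec's actual code):

type encWriter struct {
	buf []byte
}

// writen1 contains no switch, no panic, and no call the compiler
// cannot handle, so gc can inline it at every call site.
func (w *encWriter) writen1(b byte) {
	w.buf = append(w.buf, b)
}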

We currently use interfaces to handle decoding/encoding to/from a []byte or an io.Reader/Writer. Interface calls are indirect calls and CANNOT be inlined. We simulated forcing some inlining and saw a 10% performance improvement. Unfortunately, we cannot rely on that, as gc by default will not inline methods which make other calls.

(Optional) use of unsafe

This package understands the unsafe build tag, which enables optimizations with unsafe semantics:

  • When decoding into a struct, you need to read each field name as a string so you can find the struct field it maps to. Using unsafe bypasses the allocation and copying overhead of the []byte-to-string conversion.

Users can pass the -tags=unsafe option during install to take advantage of this performance improvement. It is optional.
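
The core of that optimization is a conversion like the following, a sketch of the technique compiled only when the unsafe tag is set. It relies on a []byte header and a string header sharing the same pointer/length layout:

// +build unsafe

import "unsafe"

// stringView returns a string that shares b's backing array,
// skipping the allocation and copy of a normal []byte -> string
// conversion. b must not be modified while the string is in use.
func stringView(b []byte) string {
	return *(*string)(unsafe.Pointer(&b))
}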

Judicious use of sync.Pool

Most encoding formats (msgpack, binc, cbor, etc) prefix the length of a container before encoding the contents of the container.

To support omitEmpty, we must gather all the fields of a struct, figure out which ones should be omitted, and encode the count of the remaining fields before encoding their contents. This meant creating transient slices just to collect the fields to be encoded, slices which only lived for the duration of the kStruct() function.

This is a perfect case for sync.Pool. Introducing sync.Pool to share these transient slices reduced bytes allocated by 58% and reduced encoding time by 11%.
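
In outline (simplified; the field, type and method names are illustrative, not go-codec's actual internals):

import (
	"reflect"
	"sync"
)

type encStructField struct {
	name string
	rv   reflect.Value
}

var encStructPool = sync.Pool{
	New: func() interface{} {
		v := make([]encStructField, 0, 8)
		return &v // pool a pointer, to avoid boxing the slice header on Put
	},
}

func (e *Encoder) kStruct(rv reflect.Value) {
	p := encStructPool.Get().(*[]encStructField)
	kvs := (*p)[:0]
	// ... append only the fields that survive omitEmpty to kvs,
	// write len(kvs), then encode each collected field ...
	*p = kvs
	encStructPool.Put(p) // return the slice for reuse by later calls
}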

**Before introducing sync.Pool**
Benchmark__Json_______Encode	   10000	     66098 ns/op	    8544 B/op	      53 allocs/op

**After introducing sync.Pool**
Benchmark__Json_______Encode	   10000	     58566 ns/op	    3570 B/op	      45 allocs/op