By Ugorji Nwoke   Wed, 17 Dec 2014 12:00:00 -0700

How go-codec achieves its stellar performance

View articles in the go-codec series, source at http://github.com/ugorji/go

They say premature optimization is the root of all evil. I say some layers of the stack MUST be optimal. The layer that does marshalling of data MUST be optimal.

The go-codec library supports either code generation or runtime reflection for its encoding and decoding. This article focuses mostly on runtime reflection, since that mode is easiest to compare against other libraries.

Let’s look at some benchmark results

It is hard to do a good comparison, as there are only a few feature-rich libraries for each of these formats. In this sense, a feature-rich library must be able to encode/decode to/from any supported value, e.g. map[string]string, struct{…}, interface{}.

Surprisingly, only a few libraries fit that bill. go-codec provides the Msgpack and Json implementations benchmarked below, while Std_Json is provided by the standard library and Bson is provided by mongodb/Gustavo. Msgpack is more efficient than Bson, but Bson is the closest peer I could find. We can get a sense of the performance by comparing Json to Std_Json, and Msgpack to Bson.

We start by showing some benchmark results using runtime reflection (with support for interfaces), which compare binary formats (go-codec msgpack vs bson) and text formats (go-codec json vs encoding/json).

Raw Data:

ENCODE
Benchmark__Std_Json___Encode	    5000	    124477 ns/op	   16313 B/op	     207 allocs/op
Benchmark__Json_______Encode	   10000	    108092 ns/op	    9267 B/op	      70 allocs/op
Benchmark__Bson_______Encode	    5000	    195656 ns/op	   34224 B/op	     852 allocs/op
Benchmark__Msgpack____Encode	   10000	     82517 ns/op	    9219 B/op	      70 allocs/op

DECODE
Benchmark__Std_Json___Decode	    2000	    359771 ns/op	   17992 B/op	     629 allocs/op
Benchmark__Json_______Decode	    3000	    205337 ns/op	   15344 B/op	     455 allocs/op
Benchmark__Bson_______Decode	    3000	    217165 ns/op	   19992 B/op	    1246 allocs/op
Benchmark__Msgpack____Decode	    5000	    130962 ns/op	   12880 B/op	     370 allocs/op

Let’s organize the results in a table form, using go-codec’s implementations as a baseline.

                        Baseline            Time    Memory   Allocations
Stdlib Json (encode)    Json (encode)       1.2 X   1.8 X    3.0 X
Stdlib Json (decode)    Json (decode)       1.7 X   1.1 X    1.4 X
Bson (encode)           Msgpack (encode)    2.1 X   3.8 X    12.0 X
Bson (decode)           Msgpack (decode)    1.6 X   1.5 X    3.4 X

The numbers are clear: go-codec's implementations are consistently faster, use less memory, and allocate far less.

But how do we get such performance, when only depending on information available at runtime to encode/decode?

Minimize use of expensive constructs

For example, the following constructs are more expensive than expected:

  • defer/recover
  • method values: e.g. var r io.Reader; fn := r.Read

Within the library, we use panics to signal unrecoverable errors. This way, the internals of the library do not have to keep checking error values.

We noticed that recovering at any point caused a clear increase in encode/decode time. We now only recover at the top-level Encode/Decode function.
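
In outline, the pattern looks like this. This is a minimal sketch, assuming a Decoder type with an internal decode method and the fmt package imported; the names errorf and decode are illustrative, not go-codec's actual internals:

func (d *Decoder) errorf(format string, args ...interface{}) {
	panic(fmt.Errorf(format, args...)) // internals signal errors by panicking
}

// Decode is the single recovery point: it converts a panic from the
// internals back into an ordinary error for the caller.
func (d *Decoder) Decode(v interface{}) (err error) {
	defer func() {
		if x := recover(); x != nil {
			if e, ok := x.(error); ok {
				err = e
			} else {
				err = fmt.Errorf("%v", x)
			}
		}
	}()
	d.decode(v) // internal code paths never return error values
	return
}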

Understand comparison, especially when using maps

Equality/Comparison is different for different types of values:

  • for strings, compare the lengths, and then the byte contents
  • for structs, compare fields

Equality is really only cheap for numeric types.

To use a value as a map key, it must support comparison, and that comparison cost can have a real impact on lookup performance.

For efficient encoding/decoding, we need to map each reflect.Type to its encode/decode function. This mapping takes care of cases like the following:

  • A type that is an encoding.BinaryMarshaler, so we need to call its MarshalBinary() method.
  • A type whose underlying type is a known primitive (e.g. int, string, etc), so we need to call the function for encoding/decoding that primitive.
  • A type which has been registered as an extension type, and so we need to call the extension functions defined for it.

The typical way to do this is with a map[reflect.Type]func.

However, equality as defined for interfaces is quite expensive, so looking up values in the map as we walk a value (e.g. a struct with 16 fields) adds a lot to the time.

We ended up using a unique integer to represent each reflect.Type, and it became easy to look up functions using that integer, as opposed to doing reflect.Type comparisons.

var i int
rt := reflect.TypeOf(i)
rtid := reflect.ValueOf(rt).Pointer()
// rtid uniquely identifies the type: every reflect.Type value
// for a given type yields the same Pointer() result.

Judicious use of maps

We initially used maps extensively for the following:

  • keep mapping of integer id (representing reflect.Type) to encode/decode function
  • keep mapping of integer id (representing reflect.Type) to computed information about that type e.g.
    • does it implement encoding.BinaryMarshaler,
    • what names are to be used for each Field in the struct

We noticed that maps increase the number of allocations and memory used. It seems that go’s map implementation makes heavy use of the heap under the hood (for buckets, etc).

We moved to using linear or binary search across the board.
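
A sketch of the binary-search variant (simplified; the entry type and function names are illustrative): keep a slice of (type id, function) pairs sorted by id, and use sort.Search to find the function.

import (
	"reflect"
	"sort"
)

type codecFn func(rv reflect.Value)

type codecFnEntry struct {
	rtid uintptr // unique id for a reflect.Type, obtained as shown earlier
	fn   codecFn
}

// fns is kept sorted by rtid, so lookup is a simple binary search:
// no map buckets, no hashing, no extra heap traffic.
func find(fns []codecFnEntry, rtid uintptr) codecFn {
	i := sort.Search(len(fns), func(i int) bool { return fns[i].rtid >= rtid })
	if i < len(fns) && fns[i].rtid == rtid {
		return fns[i].fn
	}
	return nil
}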

Fast, non-reflect path for common map and slice types

Many structs in practice have slices or maps of interface{}, integers, floats, booleans and strings.

It is expensive to iterate through a map or slice using reflection.

It is much faster to assert the value to its concrete type and then iterate normally (e.g. using range), or access the elements directly.

To support this, we generate fast-path code for all combinations of maps and slices e.g. map[string]int, []interface{}, []bool, map[uint16]bool, etc.

When we see that the type or underlying type matches one of these, we directly call the fast-path encode or decode function.
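
For example, a fast-path encode function for map[string]int could look roughly like this. This is illustrative only, assuming an Encoder with primitive write methods (encodeMapLen, encodeString and encodeInt are assumed names, not go-codec's actual API):

func (e *Encoder) fastpathEncMapStringInt(v map[string]int) {
	e.encodeMapLen(len(v))
	for k, n := range v { // a plain range loop: no reflect.Value iteration
		e.encodeString(k)
		e.encodeInt(int64(n))
	}
}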

Using type-switch

A type-switch on the interface{} value allows us to bypass reflection for many common cases. We can detect the concrete type for common primitives (numbers, bool, string, []byte), and for the slices and maps for which we already generated concrete implementations.

We use a type-switch to figure out which type matches. The type-switch implementation makes this a binary search over constants, which lets us jump to the concrete implementation of the encode/decode function quickly.
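
In outline, the dispatch looks something like this (a sketch under the same assumed method names as above, not go-codec's actual code):

func (e *Encoder) encode(v interface{}) {
	switch t := v.(type) {
	case nil:
		e.encodeNil()
	case bool:
		e.encodeBool(t)
	case int:
		e.encodeInt(int64(t))
	case string:
		e.encodeString(t)
	case []byte:
		e.encodeBytes(t)
	case map[string]int:
		e.fastpathEncMapStringInt(t) // generated fast path, as sketched earlier
	// ... cases for the other generated map/slice combinations ...
	default:
		e.encodeValue(reflect.ValueOf(v)) // fall back to reflection
	}
}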

Be judicious in heap allocation use

[]byte creation and sharing

To reduce heap allocations, we carefully embed most needed values inside the primary value. We also embed a scratch buffer which is shared during the encoding and decoding phases. This reduces the number of allocations and number of bytes allocated, while helping ensure that data that is used together is aligned together in memory.

For example, codec.Decoder looks like:

type Decoder struct {
	r  decReader // this interface value ends up being either &rb or &ri below.
	rb bytesDecReader
	ri ioDecReader
	b  [scratchByteArrayLen]byte // array used for transient []byte
                                 // e.g. reading floats, bigEndian encoding numbers,
                                 // reading a short string, etc.
}

Interface conversion

You should only use a pointer when absolutely necessary. Pointer use introduces dereferencing costs, allocations, and heap interaction that could otherwise be avoided.

You should also understand where heap allocation may occur.

As of Go 1.4, the data word of an interface is always a pointer, so converting a non-pointer value into an interface{} now causes an allocation. Consequently, the innocuous code below allocates (as noted in the comments).

var i interface{} = 5 // allocation happens here, boxing 5 into the interface
m := make(map[interface{}]interface{})
m[2] = true // at least 2 allocations happen here, converting 2 and true to interface{}

We are now more careful in use of interfaces and pointers.

Heap Allocation and Pointer dereference

We saw performance improvements by removing use of pointers.

It is always a fine line to walk when the size of the object can grow. Instead of using pointers across the board, we now use values to an extent. An example is our decFnInfo struct. A snippet of the code is below:

type decFnInfoX struct {
	// ... less used fields
}

type decFnInfo struct {
	dd decDriver // most used field. Make accessible without dereference
	*decFnInfoX // less-used fields are accessible behind a pointer dereference
}

func (x decFnInfo) Method1(...) {
    // ...
}

In the snippet above, we use a value receiver, which is only 3 words in size, comparable to a slice value: the decDriver interface field is 2 words and the embedded pointer is 1. This way, there is no dereference needed in the typical use. Only when the less-used fields are accessed do we incur the dereference cost.

Zero-copy mode

When encoding into or decoding from a []byte, we do not just wrap it in an io.Reader/io.Writer. Instead, we work with the []byte directly, and only create copies of the data when the data leaves the control of the library. For example, if a user wants to decode a []byte from the stream, only then do we make a copy of it to return.
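
A minimal sketch of the idea (the fields and helper names here are illustrative, mirroring the bytesDecReader mentioned above): sub-slice the input while decoding, and copy only at the library boundary.

type bytesDecReader struct {
	b []byte // the caller's input buffer
	c int    // read cursor
}

// readn returns a zero-copy view into the input buffer.
func (r *bytesDecReader) readn(n int) []byte {
	v := r.b[r.c : r.c+n]
	r.c += n
	return v
}

// decodeBytes hands data back to the user, so it must own its memory:
// this is the only point where a copy is made.
func decodeBytes(r *bytesDecReader, n int) []byte {
	out := make([]byte, n)
	copy(out, r.readn(n))
	return out
}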

Assist the inliner

Inlining helps eliminate function call overhead.

Currently, the gc compiler does very primitive inlining, and will not inline the following types of functions:

  • closures
  • functions with more than 40 simple operations
  • functions which contain any of the following:
    for loop, select, switch, go, defer, type declaration, const declaration
  • functions containing any of the following, which are only inlined when a higher inlining level is requested via -gcflags -l=N with N > 1 (e.g. -l=5; see src/cmd/gc/inl.c for more info):
    function/method/interface calls, panic, recover
    variadic functions

It is surprising that switch and panic are on that list. A switch statement gives great performance in some scenarios compared to if-else, along with better readability, and panics are used within libraries to propagate errors without having to check error values at each point.

However, we were careful to ensure that switch statements and panics are not used in small leaf methods which are called often. This showed clear performance improvements.
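
For example, a hot leaf method can be kept free of those constructs so it stays within the inliner's budget (an illustrative sketch, not go-codec's actual code):

type encWriter struct {
	buf []byte
}

// writen1 contains no switch, no panic, and no call the compiler
// cannot handle, so gc can inline it at every call site.
func (w *encWriter) writen1(b byte) {
	w.buf = append(w.buf, b)
}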

We currently use interfaces to handle decoding/encoding to/from a []byte or an io.Reader/Writer. Interface calls are indirect calls and CANNOT be inlined. We simulated forcing some inlining and saw a 10% performance improvement. Unfortunately, we cannot rely on that, as gc by default will not inline methods which make other calls.

(Optional) use of unsafe

This package understands the unsafe build tag, which enables optimizations with unsafe semantics:

  • When decoding into a struct, you need to read each field name as a string so you can find the struct field it maps to. Using unsafe bypasses the allocation and copying overhead of the []byte-to-string conversion.

Users can pass the -tags=unsafe option during install to take advantage of this performance improvement. It is optional.
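
The core of that optimization is a conversion like the following, a sketch of the technique compiled only when the unsafe tag is set. It relies on a []byte header and a string header sharing the same pointer/length layout:

// +build unsafe

import "unsafe"

// stringView returns a string that shares b's backing array,
// skipping the allocation and copy of a normal []byte -> string
// conversion. b must not be modified while the string is in use.
func stringView(b []byte) string {
	return *(*string)(unsafe.Pointer(&b))
}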

Judicious use of sync.Pool

Most encoding formats (msgpack, binc, cbor, etc) prefix the length of a container before encoding the contents of the container.

To support omitEmpty, we must gather all the fields of a struct, figure out which ones should be omitted, and encode the count of the remaining fields before encoding their contents. This meant creating transient slices just to collect the fields to be encoded, slices which only lived for the duration of the kStruct() function.

This is a perfect case for sync.Pool. Introducing sync.Pool to share these transient slices reduced bytes allocated by 58% and reduced encoding time by 11%.
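
In outline (simplified; the field, type and method names are illustrative, not go-codec's actual internals):

import (
	"reflect"
	"sync"
)

type encStructField struct {
	name string
	rv   reflect.Value
}

var encStructPool = sync.Pool{
	New: func() interface{} {
		v := make([]encStructField, 0, 8)
		return &v // pool a pointer, to avoid boxing the slice header on Put
	},
}

func (e *Encoder) kStruct(rv reflect.Value) {
	p := encStructPool.Get().(*[]encStructField)
	kvs := (*p)[:0]
	// ... append only the fields that survive omitEmpty to kvs,
	// write len(kvs), then encode each collected field ...
	*p = kvs
	encStructPool.Put(p) // return the slice for reuse by later calls
}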

**Before introducing sync.Pool**
Benchmark__Json_______Encode	   10000	     66098 ns/op	    8544 B/op	      53 allocs/op

**After introducing sync.Pool**
Benchmark__Json_______Encode	   10000	     58566 ns/op	    3570 B/op	      45 allocs/op