View other articles in the go-codec series; source at http://github.com/ugorji/go
They say premature optimization is the root of all evil. I say some layers of the stack MUST be optimal. The layer that does marshalling of data MUST be optimal.
The go-codec library supports either code generation or runtime reflection for its encoding and decoding. This article focuses mostly on the runtime-reflection mode, which is the easiest to compare against other libraries.
It is hard to do a good comparison, as there are only a few feature-rich libraries for each of these formats. In this sense, a feature-rich library must be able to encode/decode to/from any supported value, e.g. map[string]string, struct{…}, interface{}.
Surprisingly, only a few libraries fit that bill. go-codec provides the Msgpack and Json implementations below, while _StdJson is provided by the standard library and Bson is provided by mongodb/Gustavo. Msgpack is more efficient than bson, but bson is the closest peer I could find. We can get a sense of the performance by comparing json to std_json, and msgpack to bson.
We start by showing some benchmark results using runtime reflection (with support for interfaces), which compare binary formats (go-codec msgpack vs bson) and text formats (go-codec json vs encoding/json).
Raw Data:

```
ENCODE
Benchmark__Json_______Encode-8      6051    188551 ns/op     3256 B/op     44 allocs/op
Benchmark__Std_Json___Encode-8      5514    218973 ns/op    74474 B/op    444 allocs/op
Benchmark__Msgpack____Encode-8     14095     84318 ns/op     3192 B/op     44 allocs/op
Benchmark__Bson_______Encode-8      4936    239069 ns/op   222828 B/op    364 allocs/op

DECODE
Benchmark__Json_______Decode-8      3105    390624 ns/op    89300 B/op   1041 allocs/op
Benchmark__Std_Json___Decode-8      1365    855218 ns/op   138558 B/op   3032 allocs/op
Benchmark__Msgpack____Decode-8      5866    203320 ns/op    67387 B/op    913 allocs/op
Benchmark__Bson_______Decode-8      2582    467415 ns/op   183853 B/op   4085 allocs/op
```
Let’s organize the results in a table, using go-codec’s implementations as the baseline; each cell shows the other library’s cost as a multiple of the baseline’s.

Library | Baseline | Time | Memory | Allocations |
---|---|---|---|---|
Stdlib Json (encode) | Json (encode) | 1.16 X | 22.87 X | 10.1 X |
Stdlib Json (decode) | Json (decode) | 2.18 X | 1.55 X | 2.9 X |
Bson (encode) | Msgpack (encode) | 2.83 X | 69.8 X | 8.2 X |
Bson (decode) | Msgpack (decode) | 2.30 X | 2.7 X | 4.47 X |
The numbers are clear.
But how do we get such performance while depending only on information available at runtime to encode/decode?
For example, the following construct (a method value) is more expensive than expected, because it allocates a closure capturing the receiver:

```go
var r io.Reader
fn := r.Read // method value: allocates to capture r
```
Within the library, we use panics to signal unrecoverable errors, so the internals do not have to keep checking error values on every call. We noticed that recovering at any intermediate point caused a clear increase in encode/decode time, so we now recover only in the top-level Encode/Decode functions.
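A minimal sketch of the pattern (hypothetical names, not go-codec's actual internals):

```go
import "fmt"

type decoder struct{ /* ... */ }

// decode and everything it calls panic on unrecoverable errors, so no
// error values need to be checked and threaded through the hot path.
func (d *decoder) decode(v interface{}) {
	if v == nil {
		panic(fmt.Errorf("codec: cannot decode into nil"))
	}
	// ... walk v and decode into it ...
}

// Decode is the single recovery point: a panic from the internals is
// converted back into an ordinary error for the caller.
func (d *decoder) Decode(v interface{}) (err error) {
	defer func() {
		if r := recover(); r != nil {
			if e, ok := r.(error); ok {
				err = e
			} else {
				err = fmt.Errorf("codec: %v", r)
			}
		}
	}()
	d.decode(v)
	return
}
```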
Equality/comparison costs differ across types, and equality is really only cheap for numeric types. To use a value as a map key, the type must support comparison, and that comparison can have a real impact on performance.
For efficient encoding/decoding, we need to map each reflect.Type to its encode/decode function. The typical way to do this is with a map[reflect.Type]func. However, equality as defined for interfaces is quite expensive, so looking up values in that map as we walk a value (e.g. a struct with 16 fields) adds a lot to the time. We ended up using a unique integer to represent each reflect.Type, making it easy to look up functions by that integer instead of doing reflect.Type comparisons.
```go
var i int
rt := reflect.TypeOf(i)
rtid := reflect.ValueOf(rt).Pointer()
// Every call above with the same reflect.Type returns the same uintptr,
// giving each type a cheap, unique integer identity.
```
We initially used maps extensively for these internal lookup tables. We noticed that maps increase the number of allocations and the memory used; Go's map implementation makes heavy use of the heap under the hood (for buckets, etc.). We moved to using linear or binary search across the board, as in the sketch below.
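Putting the two ideas together, here is a sketch (hypothetical names, not the library's actual code) of a function registry keyed by the type's integer id and searched with sort.Search instead of a map:

```go
import (
	"reflect"
	"sort"
)

type codecFn func(rv reflect.Value)

// rtidFn pairs a type's unique integer id with its codec function.
type rtidFn struct {
	rtid uintptr
	fn   codecFn
}

// fnRegistry keeps entries sorted by rtid, so a lookup is a binary
// search over integers rather than a map access keyed by reflect.Type.
type fnRegistry struct{ fns []rtidFn }

func rtid(rt reflect.Type) uintptr { return reflect.ValueOf(rt).Pointer() }

func (r *fnRegistry) get(rt reflect.Type) (codecFn, bool) {
	id := rtid(rt)
	i := sort.Search(len(r.fns), func(j int) bool { return r.fns[j].rtid >= id })
	if i < len(r.fns) && r.fns[i].rtid == id {
		return r.fns[i].fn, true
	}
	return nil, false
}

func (r *fnRegistry) put(rt reflect.Type, fn codecFn) {
	id := rtid(rt)
	i := sort.Search(len(r.fns), func(j int) bool { return r.fns[j].rtid >= id })
	if i < len(r.fns) && r.fns[i].rtid == id {
		r.fns[i].fn = fn // replace an existing entry
		return
	}
	r.fns = append(r.fns, rtidFn{})     // grow by one
	copy(r.fns[i+1:], r.fns[i:])        // shift the tail right
	r.fns[i] = rtidFn{rtid: id, fn: fn} // insert in sorted position
}
```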
Many structs in practice contain slices or maps of interface{}, integers, floats, booleans, and strings. It is expensive to iterate through a map or slice using reflection; it is much faster to cast to the concrete type and then iterate normally (e.g. using range) or access values directly.
To support this, we generate fast-path code for all combinations of maps and slices e.g. map[string]int, []interface{}, []bool, map[uint16]bool, etc.
When we see that the type or underlying type matches one of these, we directly call the fast-path encode or decode function.
A type-switch on the interface{} allows us to bypass reflection for many common cases. We can detect the concrete type for common primitives (numbers, bool, string, []byte), and for the slices and maps for which we already generated concrete implementations. The compiler implements the type-switch as a binary search over constants, which lets us jump to the concrete encode/decode implementation quickly.
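A sketch of that dispatch (the fast-path helper names are hypothetical):

```go
import "reflect"

type encoder struct{ /* ... */ }

// Stubbed fast paths; in the library, generated code provides the
// concrete implementations for each supported slice/map type.
func (e *encoder) encBool(b bool)                   { /* ... */ }
func (e *encoder) encInt(i int64)                   { /* ... */ }
func (e *encoder) encString(s string)               { /* ... */ }
func (e *encoder) encBytes(b []byte)                { /* ... */ }
func (e *encoder) encSliceIntf(s []interface{})     { /* ... */ }
func (e *encoder) encMapStringInt(m map[string]int) { /* ... */ }
func (e *encoder) encValue(rv reflect.Value)        { /* slow reflection path */ }

// encode dispatches on the concrete type; common types never touch
// reflection, and everything else falls through to the default case.
func (e *encoder) encode(v interface{}) {
	switch t := v.(type) {
	case bool:
		e.encBool(t)
	case int:
		e.encInt(int64(t))
	case string:
		e.encString(t)
	case []byte:
		e.encBytes(t)
	case []interface{}:
		e.encSliceIntf(t)
	case map[string]int:
		e.encMapStringInt(t)
	default:
		e.encValue(reflect.ValueOf(v))
	}
}
```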
To reduce heap allocations, we carefully embed most of the values the codec needs inside the primary value, including a scratch buffer that is shared during the encoding and decoding phases. This reduces both the number of allocations and the bytes allocated, while helping ensure that data used together is laid out together in memory.
For example, codec.Decoder looks like:
```go
type Decoder struct {
	r  decReader // this interface value ends up being either &rb or &ri below
	rb bytesDecReader
	ri ioDecReader
	b  [scratchByteArrayLen]byte // array used for transient []byte,
	// e.g. reading floats, big-endian encoded numbers, short strings, etc.
}
```
You should only use a pointer if absolutely necessary: pointer use introduces dereferencing costs and allocation/heap interaction that could otherwise be avoided. You should also understand where heap allocation may occur. Since Go 1.4, interfaces only ever contain pointers, so converting a non-pointer value into an interface{} allocates. Consequently, the innocuous-looking code below allocates (as noted in the comments).
```go
var i interface{} = 5 // allocation: 5 is boxed so the interface can hold a pointer
m := make(map[interface{}]interface{})
m[2] = true // at least 2 allocations here, converting 2 and true to interface{}
```
We are now more careful in our use of interfaces and pointers.
We also judiciously apply some well-tested performance techniques.
When encoding into or decoding from a []byte, we do not just wrap it in an io.Reader/io.Writer. Instead, we work with the []byte directly, and only copy data when it leaves the control of the library. For example, only when a user asks to decode a []byte out of the stream do we make a copy and return it, as sketched below.
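A sketch of that zero-copy handling (hypothetical names):

```go
// bytesDecReader reads directly from the caller's input buffer.
type bytesDecReader struct {
	buf []byte
	pos int
}

// readn returns the next n bytes as a subslice of the input: no copy.
func (r *bytesDecReader) readn(n int) []byte {
	b := r.buf[r.pos : r.pos+n]
	r.pos += n
	return b
}

// decodeUserBytes copies only because the result is handed to the user
// and must not alias the library's input buffer.
func decodeUserBytes(r *bytesDecReader, n int) []byte {
	in := r.readn(n)
	out := make([]byte, n)
	copy(out, in)
	return out
}
```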
Inlining helps eliminate function call overhead. Currently, the gc compiler does very primitive inlining, and will not inline many functions that are not very simple. Within the code, we are aware of these limits and try to help the inliner so that most hot functions are inlineable. For example, switch statements and for loops prevent a function from being inlined, so in the code we may use if statements and goto instead to keep a function inlineable. Also, a function that calls other functions generally cannot be inlined if it does any more than a simple assignment or conditional.
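As an illustration of that trick (a generic example, not lifted from the library), a small loop can be rewritten with if and goto so the function stays simple enough to inline:

```go
// skipSpaces advances i past ASCII spaces in b. It uses goto instead of
// a for loop, since a for loop would disqualify it from inlining under
// the gc compiler's current inliner.
func skipSpaces(b []byte, i int) int {
LOOP:
	if i < len(b) && b[i] == ' ' {
		i++
		goto LOOP
	}
	return i
}
```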
Go will do a bounds-check on each slice or array access unless it can prove that the access is within bounds.
We judiciously assist the compiler to eliminate as many bounds checks as possible.
See https://docs.google.com/document/d/1vdAEAjYdzjnPA9WDOQ1e4e05cYVMpqSxJYZT33Cqw2g and https://www.ardanlabs.com/blog/2018/04/bounds-check-elimination-in-go.html
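One common form of this assistance (a generic illustration) is hoisting a single bounds check so the compiler can prove the later accesses are in range:

```go
// readUint32LE decodes a little-endian uint32. The explicit check on
// b[3] lets the compiler drop the bounds checks on the accesses below.
func readUint32LE(b []byte) uint32 {
	_ = b[3] // one bounds check up front; panics early if len(b) < 4
	return uint32(b[0]) | uint32(b[1])<<8 | uint32(b[2])<<16 | uint32(b[3])<<24
}
```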
This package understands the unsafe build tag, which enables unsafe semantics: with it, we bypass the allocation and copying overhead of []byte-to-string conversions. Users can pass the -tags=safe option during install to get the default safe semantics of the Go runtime and the reflection package. By default, unsafe is used to give the performance improvements.
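The zero-copy conversion looks roughly like this (a sketch for the Go runtime of that era; it relies on a string header being a prefix of a slice header, and the result is only safe if the backing array is never mutated afterwards):

```go
import "unsafe"

// bytesToString reinterprets b as a string without allocating or copying.
// Safe only while b's backing array is not mutated.
func bytesToString(b []byte) string {
	return *(*string)(unsafe.Pointer(&b))
}
```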
Most encoding formats (msgpack, binc, cbor, etc) prefix the length of a container before encoding the contents of the container.
To support omitEmpty, we must gather all the fields of a struct, figure out which ones should be omitted, and write the count of the remaining fields before encoding them. In code, this meant creating transient slices just to collect the fields to be encoded; these slices lived only for the duration of the kStruct() function.
This is a perfect case for sync.Pool. Introducing sync.Pool to share these transient slices reduced bytes allocated by 58% and reduced encoding time by 11%.
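A minimal sketch of that usage (field and type names are hypothetical):

```go
import "sync"

// encodedField is a (name, value) pair collected before the struct's
// length prefix is written.
type encodedField struct {
	name  string
	value interface{}
}

// fieldsPool recycles the transient slices that live only for the
// duration of a single struct encode.
var fieldsPool = sync.Pool{
	New: func() interface{} {
		s := make([]encodedField, 0, 8)
		return &s
	},
}

func encodeStruct(all []encodedField, isEmpty func(encodedField) bool,
	write func(fields []encodedField)) {
	sp := fieldsPool.Get().(*[]encodedField)
	kept := (*sp)[:0]
	for _, f := range all {
		if !isEmpty(f) {
			kept = append(kept, f) // keep only fields that survive omitEmpty
		}
	}
	write(kept) // writes len(kept) as the container length, then the fields
	*sp = kept[:0]
	fieldsPool.Put(sp) // return the (possibly grown) slice for reuse
}
```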