By Ugorji Nwoke   Sun, 21 Dec 2014 15:00:00 -0700   /blog   technology go-codec

go-codec: Primer and How To Guide

View articles in the go-codec series, source at http://github.com/ugorji/go

Below, we will walk you through using the go-codec library for your serialization needs.

go-codec is a high performance and feature rich library that provides idiomatic encoding and decoding support for the following formats:

  1. msgpack: https://github.com/msgpack/msgpack
  2. binc: http://github.com/ugorji/binc
  3. cbor: http://cbor.io http://tools.ietf.org/html/rfc7049
  4. simple:
  5. json: http://json.org http://tools.ietf.org/html/rfc7159

The go-codec library supports runtime introspection and/or code generation.

Documentation / Source code

Source code: http://github.com/ugorji/go

godoc documentation: http://godoc.org/github.com/ugorji/go/codec

Installation

go get -u github.com/ugorji/go/codec github.com/ugorji/go/codec/codecgen

The command above installs 2 things:

  • codecgen: a command-line tool for code generation support
  • go/codec: a runtime library for encoding/decoding via runtime introspection of named types

This package understands the unsafe tag, to allow using unsafe semantics:

  • When decoding into a struct, you need to read the field name as a string so you can find the struct field it is mapped to. Using unsafe will bypass the allocation and copying overhead of []byte->string conversion.

To use it, you must pass the unsafe tag during install:

go get -tags=unsafe -u github.com/ugorji/go/codec github.com/ugorji/go/codec/codecgen
OR
go install -tags=unsafe github.com/ugorji/go/codec github.com/ugorji/go/codec/codecgen

Unfortunately, many users depend on the import path “github.com/ugorji/go/codec”, so I cannot change it to a preferred “github.com/ugorji/go-codec”.

Using go-codec library

As go-codec supports multiple formats, a user will need to configure a Handle. The handle tells what format to use.

The handles supported are:

  1. MsgpackHandle
  2. BincHandle
  3. CborHandle
  4. SimpleHandle
  5. JsonHandle

In the rest of the article, I will mostly use JsonHandle. It is straightforward to use a different Handle.

Let us first show how to use it without much configuration.

We will leverage these go values in the rest of the article:

type A struct {
    I int
    S string
}
type B float64

var v1 A
var v2 *A = &v1
var v3 int = 9
var v4 bool = false
var v5 interface{} = v3
var v6 interface{} = nil
var v7 B
var v8 *B = &v7

How to use Handle, Encoder, Decoder

The Handle is SAFE for concurrent READ, but NOT SAFE for concurrent modification. This means that you should configure the Handle completely before use, and then pass it to each Encoder or Decoder that you create.

The Encoder and Decoder are NOT SAFE for concurrent use. However, they support a Reset method to allow them to be re-used. Reuse helps leverage the state maintained e.g. mapping of type ids to dedicated functions, byte buffers used, etc.

The general usage model for go-codec is:

  • Create and initialize a Handle completely. Do not modify it once it is passed to an Encoder/Decoder.
  • Within your code, for each thread of execution, create an Encoder or Decoder. Re-use it by calling the Reset method.

Encoding

Any go value can be encoded. In the example above, any of v1 through v8 can be encoded.

A user may want to encode a value as a []byte. Doing this will use direct updates to a []byte, and bypass interface calls and overhead of io.Writer interface. It strives for a zero-copy model.

var b []byte = make([]byte, 0, 64)
var h codec.Handle = new(codec.JsonHandle)
var enc *codec.Encoder = codec.NewEncoderBytes(&b, h)
var err error = enc.Encode(v1) //any of v1 ... v8
// b now contains the encoded value. 

A value can also be serialized into an io.Writer.

var w io.Writer = new(bytes.Buffer)
var h codec.Handle = new(codec.JsonHandle)
var enc *codec.Encoder = codec.NewEncoder(w, h)
var err error = enc.Encode(v1) //any of v1 ... v8

We recommend that the user provides a buffered writer for performance. For example, one could wrap an io.Writer in a bufio.Writer. go-codec will never buffer during reading or writing; buffering if necessary is the job of the caller.

var w io.Writer // io.Writer got from somewhere
var bw = bufio.NewWriter(w)
var h codec.Handle = new(codec.JsonHandle)
var enc *codec.Encoder = codec.NewEncoder(bw, h)
var err error = enc.Encode(v1) //any of v1 ... v8
bw.Flush()

Decoding

To decode, pass a pointer to a value. go-codec will then decode into that value.

A pointer must be passed, so that we can decode into the value behind the pointer.

A user may want to decode directly from a []byte. This is fastest, bypasses interface calls and other overhead of io.Reader, and strives for zero-copy mode while reading.

var b []byte
// ... assume b contains the bytes to decode from
var h codec.Handle = new(codec.JsonHandle)
var dec *codec.Decoder = codec.NewDecoderBytes(b, h)
var err error = dec.Decode(v2) //v2 or v8, or a pointer to v1, v3, v4, v5, v6, v7

A user may also decode from an io.Reader.

var r io.Reader
// ... assume r contains the data to decode from
var h codec.Handle = new(codec.JsonHandle)
var dec *codec.Decoder = codec.NewDecoder(r, h)
var err error = dec.Decode(v2) //v2 or v8, or a pointer to v1, v3, v4, v5, v6, v7

We recommend that the user provides a buffered reader for performance. go-codec will NOT buffer internally, because users may want to combine encoded data with other data, and need go-codec to read only the bytes it requires.

var r io.Reader
// ... assume r contains the data to decode from
var br = bufio.NewReader(r)
var h codec.Handle = new(codec.JsonHandle)
var dec *codec.Decoder = codec.NewDecoder(br, h)
var err error = dec.Decode(v2) //v2 or v8, or a pointer to v1, v3, v4, v5, v6, v7

Decode will update the value passed.

  • If the value is a struct, it will update the fields that exist in the stream
  • If the value is a slice, it will update the slice to have a length equal to the length of items in the stream.
  • If the value is a map, it will update the entries in the map
  • If the value is a Func, it will be skipped (ignored)
  • Otherwise, the value is a primitive (number, bool, string) and is updated as such.

The key thing to note is that we update during a decode.

Consequently, if you have the following:

var m = map[string]*A{"1": &A{I:1, S:"one"}, "2": &A{I:2, S:"two"} }
fmt.Printf("before: %v\n", m)
var b = []byte(`{"1": {"I":111}, "3": {"I": 333} }`)
var err error = codec.NewDecoderBytes(b, new(codec.JsonHandle)).Decode(&m)
fmt.Printf(" after: %v\n", m)
for k, v := range m {
    fmt.Printf("\t%v: %v\n", k, v)
}

Running that code should output:

before: map[2:0xc20801f100 1:0xc20801f0e0]
 after: map[1:0xc20801f0e0 2:0xc20801f100 3:0xc20801f220]
	1: &{111 one}
	2: &{2 two}
	3: &{333 }

You will notice the following:

  • We update the values in the map.
    You see that in the addresses of the values for mapping “1” staying the same. Some other libraries will empty the map and create a zero value, update that, and override the value in the map.
  • We do not clear the map before inserting values.

When decoding into a map, we DO NOT delete map keys which do not exist in the stream. This gives symmetry, as we only update tables (maps, structs) and never truncate.

To have a decoded value mirror exactly what was in the encoded stream, you should decode into a zero’ed value e.g. empty map, empty slice, new(XYZ) where XYZ is a struct, etc.
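
For contrast, the sketch below feeds the same input to the standard library's encoding/json, used here only as an example of a decoder that replaces map values instead of updating them. It keeps the existing map entries, but the value for key "1" is a freshly allocated A, so the previous S is lost. decodeWithStdlib is a hypothetical helper name.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type A struct {
	I int
	S string
}

// decodeWithStdlib decodes the article's sample stream into m using encoding/json.
func decodeWithStdlib(m map[string]*A) error {
	return json.Unmarshal([]byte(`{"1": {"I":111}, "3": {"I": 333}}`), &m)
}

func main() {
	one := &A{I: 1, S: "one"}
	m := map[string]*A{"1": one, "2": {I: 2, S: "two"}}
	if err := decodeWithStdlib(m); err != nil {
		panic(err)
	}
	// The pointer stored under "1" is no longer the original one.
	fmt.Println(m["1"] == one, *m["1"], *m["2"], *m["3"])
}
```

If you rely on values being updated in place, this is the behaviour to watch for when moving code between libraries.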

Decoding without knowing what is in the stream

What if you don’t know the structure of your data beforehand?

We use the excellent support for interfaces in go.

Every go value can be converted to an interface{}. A type switch or type assertion can be used to retrieve the value back from the interface{}.

A nil interface{} is an interface{} without a value inside. When a pointer to this is passed into Decode(), go-codec will decode a value based on the structure of the stream as it is parsed.

  • maps are decoded as map[interface{}]interface{}.
    Users can configure the default map type to use. See section on configuration
  • sequences are decoded as []interface{}.
    Users can configure the default slice type to use. See section on configuration
  • primitives are decoded as int64, uint64, float64, bool, string, []byte or nil.

Sample code to decode is below:

var b = []byte(`{"1": {"I":111}, "3": {"I": 333} }`)
var m interface{}
var err error = codec.NewDecoderBytes(b, new(codec.JsonHandle)).Decode(&m)
fmt.Printf("decoded type : %T\n", m)
fmt.Printf("decoded value: %v\n", m)

Output:

decoded type : map[interface {}]interface {}
decoded value: map[1:map[I:111] 3:map[I:333]]
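
Once decoded, a value of unknown shape can be inspected with a type switch over the types listed above. The sketch below is one way to do that; describe is a hypothetical helper, and the input is built by hand to mirror the decoded sample (the use of uint64 for the integers is an assumption made for illustration).

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// describe renders a dynamically decoded value using a type switch.
func describe(v interface{}) string {
	switch x := v.(type) {
	case nil:
		return "nil"
	case map[interface{}]interface{}:
		parts := make([]string, 0, len(x))
		for k, val := range x {
			parts = append(parts, fmt.Sprintf("%v=%s", k, describe(val)))
		}
		sort.Strings(parts) // map iteration order is random; sort for stable output
		return "map{" + strings.Join(parts, ", ") + "}"
	case []interface{}:
		parts := make([]string, len(x))
		for i, val := range x {
			parts[i] = describe(val)
		}
		return "seq[" + strings.Join(parts, ", ") + "]"
	case int64, uint64, float64, bool:
		return fmt.Sprintf("%T(%v)", x, x)
	case string:
		return fmt.Sprintf("string(%q)", x)
	case []byte:
		return fmt.Sprintf("bytes(%d)", len(x))
	default:
		return fmt.Sprintf("unexpected %T", x)
	}
}

func main() {
	// Hand-built value with the same shape as the decoded sample above.
	m := map[interface{}]interface{}{
		"1": map[interface{}]interface{}{"I": uint64(111)},
		"3": map[interface{}]interface{}{"I": uint64(333)},
	}
	fmt.Println(describe(m))
}
```

If you configure a different MapType or SliceType on the Handle, the corresponding cases in the switch need to change to match.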

Handling decoding when stream keys/indexes do not map to struct fields

go-codec will never silently skip data in the stream without allowing the user to decide how to handle it.

There are a few scenarios where there is ambiguity:

  • When decoding into a struct from a map in the stream, some keys in the stream may not map to any fields in the struct. For example, the map in the stream has a key FieldN but there is no corresponding field in the struct.
  • When decoding into a struct from an array in the stream, some indexes in the stream may exist beyond the number of fields in the struct. For example, the array in the stream has 8 elements, but there are only 5 fields in the struct.

For both of these, go-codec allows the user to configure whether an error should be returned or the data silently skipped. See ErrorIfNoField configuration below.

Configuration

go-codec supports configuration at 2 levels:

  1. source code annotation (using struct tags)
  2. runtime configuration, on the specified Handle

Source configuration (for encoding)

go-codec allows you to configure how a struct is encoded using struct tags on its fields.

The “codec” key in the struct field’s tag value is the key name, followed by an optional comma and options. Note that the “json” key is used in the absence of the “codec” key.

To set an option on all fields (e.g. omitempty on all fields), you can create a field called _struct, and set flags on it.

Struct values “usually” encode as maps. Each exported struct field is encoded unless:

  • the field’s tag is “-”, OR
  • the field is empty and its tag specifies the “omitempty” option.

Note that omitempty is ignored when encoding struct values as arrays, as an entry must be encoded for each field, to maintain its position.

When encoding as a map, the first string in the tag (before the comma) is the map key string to use when encoding.

However, struct values may encode as arrays. This happens when the tag on the _struct field sets the “toarray” option

Values with types that implement codec.MapBySlice are encoded as stream maps.

The empty values (for omitempty option) are

  • bool: false
  • number (float, or signed or unsigned integer): 0
  • string: “”
  • container (slice, array, map): length = 0
  • pointer, interface: nil

Note that omitempty does not apply to structs, as there is no efficient way to test that a struct is equal to its zero value.

Anonymous fields are encoded inline unless:

  • the struct tag specifies a replacement name (first value)
  • the field is of an interface type

When encoding a struct, all unexported fields are skipped. Exported fields can be:

  • renamed
  • skipped
  • omitted if empty (equal to its zero value)

The code snippet below illustrates configuration of a struct.

type Anon struct {
    S string
}
// NOTE: 'json:' can be used as struct tag key, in place of 'codec:' below.
type My struct {
    _struct struct{} `codec:",omitempty"`   //set omitempty for every field
    Field1 string    `codec:"-"`            //skip this field
    Field2 int       `codec:"myName"`       //Use key "myName" in encode stream
    Field3 int32     `codec:",omitempty"`   //use key "Field3". Omit if empty.
    Field4 bool      `codec:"f4,omitempty"` //use key "f4". Omit if empty.
    field5 bool                             // unexported, so skipped
    Anon                                    // anonymous field, S is inlined.
                                            // stream will contain S as if a regular field.
}

// NOTE: 'json:' can be used as struct tag key, in place of 'codec:' below.
type My2 struct {
    _struct bool     `codec:",omitempty,toarray"`   //set omitempty for every field
                                                    //and encode struct as an array
    *Anon `codec:""` // anonymous field, with no struct name, so inline it.
                     // stream will contain "S"
    // OR (use one form or the other; a struct cannot declare *Anon twice):
    // *Anon `codec:"abc"` // anonymous field, with specified struct name, so DO NOT inline it.
    //                     // stream will contain "abc.S"
}

Runtime Configuration

Every Handle has a set of basic options:

General options include:

  • TypeInfos *TypeInfos:
    Use this to configure a unique set of struct tag keys for your application. This is a niche option, used mostly by folks migrating from a different toolkit, where they may have maintained similar struct tags but using a different name.

EncodeOptions take precedence over values defined using struct tags.

  • StructToArray bool: default false:
    If true, encode a struct as an array, else encode struct as a map.
  • Canonical bool: default false:
    If true, encode each value as the exact same sequence of bytes. This is most applicable to maps, where iteration is typically random.

DecodeOptions configure what happens during a decode:

  • MapType reflect.Type: default map[interface{}]interface{}:
    Use this type when decoding from a map in the stream into a nil interface{}
  • SliceType reflect.Type: default []interface{}:
    Use this type when decoding from an array in the stream into a nil interface{}
  • ErrorIfNoField bool: default false:
    If true, return an error if a map in the stream has a key which does not map to any field; else read and discard the key and value in the stream and proceed to the next.
  • ErrorIfNoArrayExpand bool: default false:
    If true, return an error when decoding into an array (e.g. [10]int) and the array in the stream has more elements than the capacity of the go array passed.
  • SignedInteger bool: default false:
    If true, decode an integer into a nil interface{} as an int64 (even if the stream says it is an unsigned integer).
  • InterfaceReset bool: default false:
    Controls how we decode into an interface. If true, we will just decode into a new interface{} value, ignoring what was in it before. If false, we will decode into the value pointed to by that interface.
  • MapValueReset bool: default false:
    Controls how we decode into a map value. If true, we will just decode into a new value, ignoring what was in the map before. If false, we will retrieve the previous mapping if it is a reference, struct or array, and then decode into that value if it exists. This allows us to “append to” the previous state.
  • InternString bool: default false:
    Controls whether we maintain a pool of strings which are keys (struct fields or keys of maps which are strings). During decoding, this MAY reduce allocations and improve response times if the number of keys is finite.

As an example, you can configure your handle as below:

var jh codec.JsonHandle
jh.MapType = reflect.TypeOf(map[string]int(nil))
jh.SliceType = reflect.TypeOf([]string(nil))

// for encoding (pass a pointer to the configured handle)
var w io.Writer
var enc *codec.Encoder = codec.NewEncoder(w, &jh)

// for decoding
var r io.Reader
var dec *codec.Decoder = codec.NewDecoder(r, &jh)

Format specific Runtime Configuration

Some formats support extra configuration options.

Messagepack

Up until 2013, Messagepack had a single type: raw, which was used for raw bytes. Different libraries interpreted it either as a binary array of bytes, or a unicode-style string. For languages which supported different binary vs string types (e.g. java, go, python, etc), this presented a problem.

In 2013, the spec was updated: raw was renamed to Str, and a new Bin type was introduced to represent binary data.

However, libraries want to maintain compatibility with the choices they made previously in interpreting raw. The go-codec library previously treated raw as []byte by default, with an option to treat it as string.

Furthermore, formal extension support was added in the updated spec. Consequently, legacy applications do not understand the ext or the Bin messagepack type.

The options below are set up so that the Messagepack handle is compatible with the legacy spec by default.

  • RawToString bool: default false:
    If true, decode raw bytes in the messagepack stream into a nil interface{} as a string, else decode as []byte.
  • WriteExt bool: default false:
    If true, encode configured extensions with extension tags, else encode as raw bytes and do not support messagepack str8 type. See MsgpackHandle documentation for more information.

If a user wants to be compatible with the updated spec, they only need to set up their MsgpackHandle as below:

var h codec.MsgpackHandle
h.WriteExt = true

See the legacy and new/updated messagepack specs for more information.

Encoding Anonymous Fields

Anonymous fields are encoded inline unless:

  • the struct tag specifies a replacement name (first value)
  • the field is of an interface type

To encode as a separate regular field, specify a name in the struct tag (first value in the struct tag).

go-codec has support for Anonymous fields which are pointers or non-pointers.

Encoding and Decoding named types

When encoding named types, we follow the following sequence:

  • if value implements codec.Selfer, call its CodecEncodeSelf method
  • If an extension is configured for the type, call the extension encode function
  • if value implements encoding.BinaryMarshaler and format is a binary format, call its MarshalBinary method
  • if value implements encoding.TextMarshaler and format is a text format, call its MarshalText method
  • else encode based on underlying Kind.
    if underlying kind is struct, iterate through its exported fields.

When decoding named types, we follow a similar sequence:

  • if value implements codec.Selfer, call its CodecDecodeSelf method
  • If an extension is configured for the type, call the extension decode function
  • if value implements encoding.BinaryUnmarshaler and format is a binary format, call its UnmarshalBinary method
  • if value implements encoding.TextUnmarshaler and format is a text format, call its UnmarshalText method
  • else decode based on underlying Kind.
    if underlying kind is struct, iterate through its exported fields.

From this, we see that users have a few options for controlling how values are encoded or decoded into:

  • implement the codec.Selfer interface
  • implement the encoding.(Text|Binary)(M|Unm)arshaler interface
  • configure an extension

The most robust solution is an extension. This is detailed below.

Using extensions

Of the options listed above (codec.Selfer, the encoding.(Text|Binary)(M|Unm)arshaler interfaces, or an extension), using the extension support of the format is the most robust.

An extension plays nicely with decoding into a nil interface{}. This is because we can see the tag in the stream, and find out what type is mapped to that tag, and decode into a new instance of it.

To use an extension, create a type implementing codec.Ext, then call the SetExt function of your handle. To illustrate, the snippet below creates an extension that encodes/decodes a time.Time to/from a 64-bit integer. It then sets it on a CborHandle, which is passed into the NewDecoder or NewEncoder functions.

type TimeExt struct{}

func (x TimeExt) WriteExt(interface{}) []byte { panic("unsupported") }
func (x TimeExt) ReadExt(interface{}, []byte) { panic("unsupported") }
func (x TimeExt) ConvertExt(v interface{}) interface{} {
	switch v2 := v.(type) {
	case time.Time:
		return v2.UTC().UnixNano()
	case *time.Time:
		return v2.UTC().UnixNano()
	default:
		panic(fmt.Sprintf("unsupported format for time conversion: expecting time.Time; got %T", v))
	}
}
func (x TimeExt) UpdateExt(dest interface{}, v interface{}) {
	tt := dest.(*time.Time)
	switch v2 := v.(type) {
	case int64:
		*tt = time.Unix(0, v2).UTC()
	case uint64:
		*tt = time.Unix(0, int64(v2)).UTC()
	default:
		panic(fmt.Sprintf("unsupported format for time conversion: expecting int64/uint64; got %T", v))
	}
}

func main() {
	timeTyp := reflect.TypeOf(time.Time{})
	var timeExt TimeExt
	h := new(codec.CborHandle)
	h.SetExt(timeTyp, 1, timeExt)
	// now use h as your handle.
	// time.Time will now be encoded as a uint64, and decoded from a uint64 or int64
}

Encoding a slice type as a map

Some users may represent a table with key-value pairs as a slice with an even length.

type X []interface{}
func (_ X) MapBySlice() { }
v := X{"key1", 1, "key2", 2, "key3", 3}

When we encode v above, it will be encoded in the stream as a map.

This is the best way to enforce a specific order in a map, as iteration of a go map has no defined order.

Some users will convert a map into a MapBySlice implementation, then encode that to force a specific order in the stream.

Encode/Decode from/to chan type (for iterative streaming support)

go-codec will encode a chan as an array in the stream. It will also decode an array in the stream into a chan.

This addresses a common request: decoding a very large array without first loading all the values into a slice in memory. This will save memory and CPU time in a BIG WAY.

Sample usecase:

// You have a large number of values encoded in json format as below:
// [ 
//   { object },
//   { object },
//   (trillions of entries)
// ]
//
// For best performance, you will decode the elements into a channel and
// concurrently process them one by one.
//
var h codec.JsonHandle // or CborHandle, or MsgpackHandle, etc
if _, ok := r.(io.ByteScanner); !ok { 
    r = bufio.NewReader(r) 
} // use a buffered reader for efficiency
ch := make(chan Adresse, 128) // channel to decode into
finish := make(chan struct{}) // unbuffered channel for signaling goroutine finish
go func() {
    for e := range ch { // process till all values are received and channel is closed
        // process e
    }
    finish <- struct{}{} // signal that processing has finished
}()
var dec *codec.Decoder = codec.NewDecoder(r, &h)
var err error = dec.Decode(&ch)
close(ch)
<- finish // wait for goroutine to finish processing channel

Note: When encoding, the length is first retrieved, and only those elements up to the length are encoded from the channel. Any values added to the channel thereafter are not encoded.

Smart decoding of structs and slices (from map or array in stream)

We have already seen how structs can be encoded as a map or an array:

  • StructToArray option on the Handle
  • toarray option on the _struct field tag

Also, we see above that a Slice type can be encoded as a map:

  • if it implements the MapBySlice interface

Consequently,

  1. A slice can be decoded from either an array or a map in the stream
  2. A struct can be decoded from either an array or a map in the stream

Canonical encoding of values

Canonical representation means that encoding a value will always result in the same sequence of bytes. This applies ONLY to maps, which iterate (via range call) in random order.

codec will attempt to sort based on the natural ordering of the keys (numerically or lexicographically). However, if there is no natural ordering, then the keys will be encoded out of band to []byte, and the []byte sorted instead.

There is a slight performance hit if Canonical flag is on, as we MAY have to encode the keys out-of-band, and then sort them, before encoding the whole map.

This is configured using the Canonical flag on the Handle.

** Canonical flag is ignored by codecgen (code generation). **

Handling NIL on decode

go-codec will treat NIL in a stream as the zero value.

It will then set the value being decoded into to its zero value.

This means that NIL can come in as the value of an int, and we decode that int as 0.

This is important for use cases such as the one below:

  • Encode a *int.
    This could encode as NIL, or as an integer value
  • When decoding, you need to decode it into an integer value.
    decoding may see a NIL, and decode it as 0. decoding may see a 5, and decode it as 5.

Drop-in replacement for encoding/json. ‘json:’ key in struct tag supported

encoding/json uses the json: key in the struct tag value to configure how the struct is encoded. go-codec will use the json key as a fallback, if the codec key is unavailable in the struct tag value.

This allows go-codec to be used as a drop-in replacement for encoding/json without having to make changes to the structs.

RPC

go-codec also provides RPC support that integrates with the net/rpc package.

Please read the net/rpc package documentation to see how to use it. Fundamentally, the net/rpc package requires a ServerCodec and ClientCodec.

go-codec provides these implementations.

go-codec also supports messagepack’s custom RPC communication model.
To use, replace codec.GoRpc with codec.MsgpackSpecRpc in snippets below.

RPC Server would look like this:

    //RPC Server
    go func() {
        for {
            conn, err := listener.Accept()
            if err != nil {
                return // stop accepting on error; handle as appropriate for your server
            }
            rpcCodec := codec.GoRpc.ServerCodec(conn, h) // OR codec.MsgpackSpecRpc...
            rpc.ServeCodec(rpcCodec)
        }
    }()

RPC Clients would look like this:

    //RPC Communication (client side)
    conn, err = net.Dial("tcp", "localhost:5555")
    rpcCodec := codec.GoRpc.ClientCodec(conn, h) // OR codec.MsgpackSpecRpc...
    client := rpc.NewClientWithCodec(rpcCodec)

Code Generation

go-codec can be used via its runtime introspection or its code generation support.

The code generation support works by creating codec.Selfer implementation methods.

It integrates seamlessly with everything written in this article.

© Ugorji Nwoke