Go App Engine datastore.Load/Save uses goroutines and channels to iterate over datastore entity properties, causing overhead.
Background
With GAE 1.6.0, support for Indexed Properties, Hooks, etc. was introduced with a nice, elegant design: a PropertyLoadSaver interface that uses channels (as an iterator).
I noticed that, after updating my code to use PropertyList, some of my application requests started taking about double the time they took before. Previously, with datastore.Map, my requests took roughly the same amount of time.
On digging further, I found the following in the implementation:
appengine/datastore/load.go
func loadEntity(dst interface{}, src *pb.EntityProto) ...
c := make(chan Property, 32)
errc := make(chan os.Error, 1)
go protoToProperties(c, errc, src)
appengine/datastore/save.go
func saveEntity(defaultAppID string, key *Key, src interface{}) ...
c := make(chan Property, 32)
donec := make(chan struct{})
go func() { ... }
That is, for each entity (analogous to each row in a table), we create and use:
1 goroutine and 2 channels.
The deprecated datastore.Map retrieval bypasses this Channel/Goroutine dance, which is why my response time did not change until I switched to datastore.PropertyList.
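To make the cost concrete: with the channel-based design, even the simplest consumer, a PropertyList that only appends to a slice, is fed through a channel by a dedicated goroutine. The loading side looks roughly like this (a sketch, not the exact SDK source):

func (l *PropertyList) Load(c <-chan Property) os.Error {
    // Drain the channel that a producer goroutine fills from the proto.
    for p := range c {
        *l = append(*l, p)
    }
    return nil
}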
Concerns:
Can we do without the goroutines/channels, especially in the API? Keeping them out of the API leaves room for different implementations.
Alternative solution using iterators
An alternative, equally elegant solution would just use iterators:
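One possible shape for such an API is sketched below. The names PropertyIterator, Next and Done are placeholders of my own, and os.Error matches the pre-Go-1 SDK code quoted above.

// A sketch of an iterator-style API. Next returns the next property,
// or a sentinel error (say, Done) once the iteration is exhausted.
type PropertyIterator interface {
    Next() (Property, os.Error)
}

// PropertyLoadSaver restated in terms of iterators instead of channels.
type PropertyLoadSaver interface {
    Load(PropertyIterator) os.Error
    Save() (PropertyIterator, os.Error)
}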
For implementations of PropertyIterator:
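A slice-backed implementation, for example, needs no goroutines or channels at all (again a sketch; the type name is mine):

// propertyListIterator walks an in-memory PropertyList.
type propertyListIterator struct {
    list PropertyList
    i    int
}

func (it *propertyListIterator) Next() (Property, os.Error) {
    if it.i >= len(it.list) {
        return Property{}, Done // hypothetical end-of-iteration sentinel
    }
    p := it.list[it.i]
    it.i++
    return p, nil
}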
Since the Go runtime on App Engine is still experimental, making a contained API change should be OK.
But RPC dominates the overhead per request. Why focus on goroutine/channel use?
Definitely, the RPC time will dominate the overhead of one goroutine and two channels. However, we’re talking about potentially hundreds or thousands of goroutines per request (equal to the number of “rows” returned by, or sent to, the API call). E.g. for a GET that returns 100 entities, that’s 100 goroutines and 200 channels created to service that one API call. And these goroutines/channels have nothing to do with concurrency: we’re only using them as iterators.
Also, within our own applications we still optimize our code (and especially our exported APIs), even though we know that RPC overhead will overshadow it.
Main Concern: Implementation bleeds into the API
My main concern is that this bleeds into the API. With iterators, you can use channels and a goroutine in the implementation, and change that later, without application users having to know about it.
The alternative proposed above shows how this can be done using iterators. It’s trivial to implement (in Go code) and you gain what you want without restricting the implementation.
At the same time, the API is not tied to an implementation, so you can implement it with goroutines/channels or with a list. User code that passes a PropertyLoadSaver can use whatever is most applicable/optimized for its use case. For example, in my user code I can pass a PropertyList into each call and not incur the goroutine/channel overhead.
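And an implementation that prefers a producer goroutine can still hide the channels behind the same Next method (another sketch; chanIterator and Done are placeholder names):

// chanIterator adapts a producer goroutine feeding two channels to the
// iterator interface; callers never see the channels.
type chanIterator struct {
    c    <-chan Property
    errc <-chan os.Error
}

func (it *chanIterator) Next() (Property, os.Error) {
    select {
    case p, ok := <-it.c:
        if !ok {
            return Property{}, Done // producer closed the channel
        }
        return p, nil
    case err := <-it.errc:
        return Property{}, err
    }
}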
Have others solved similar problems using goroutines/channels? Where?
It seems that the use of goroutines/channels as iterators is not done in other similar places:
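For instance, the query API in the same datastore package iterates with a Next method and the datastore.Done sentinel rather than a channel, roughly as follows (Greeting is a placeholder entity type and c an appengine.Context):

// Iterate over query results; no goroutines or channels appear in the API.
q := datastore.NewQuery("Greeting").Limit(100)
for t := q.Run(c); ; {
    var g Greeting
    _, err := t.Next(&g)
    if err == datastore.Done {
        break // no more results
    }
    if err != nil {
        return err
    }
    // use g ...
}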
What is the performance overhead (load on CPU, RAM) with this? Does it scale?
When I first looked into this, I ran some rudimentary tests to find the maximum number of goroutines I could create on my machine and how many resources they consumed.
The result: on a 2.0GHz core, I could start a maximum of 5e5 (500,000) goroutines that basically did nothing (beyond that, I got errors). RAM usage was 2.0GB.
An App Engine instance is a 600MHz single core with a 128MB memory limit. That’s roughly 1/3 of the CPU and 1/16 of the memory. (Even my Nexus One has way more resources than that.)
In short: 2.0GHz and 2GB of RAM topped out at 500,000 goroutines. I wonder how many a 600MHz, 128MB App Engine instance would accommodate.
I’d suspect that a few thousand goroutines on such a tiny “computer” (600MHz, 128MB) would tax the system, yet it’s really easy to get into such a situation with the current design. If most of the time is spent on RPC (I/O) and CPU load is low, Go can easily support a large number of concurrent requests. 50 concurrent requests, each retrieving 200 entities, means 10,000 goroutines (plus 20,000 channels) alive at the same time just to serve API requests, imposed by the SDK runtime (i.e. not by application code that we can control or tune). In this scenario, the runtime imposes an overhead that does not seem necessary.
If we expect that most people will pass a PropertyList to calls to GetXXX or PutXXX, then the goroutine/channel is completely redundant.
Also, remember that each goroutine allocates an initial stack of 4KB, so every goroutine carries a memory-allocation cost that becomes non-trivial under load: in the scenario above, 10,000 goroutines at 4KB each is roughly 40MB of stack alone, against a 128MB instance limit.
The rudimentary Go code used to run this test is available at:
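A minimal sketch of that kind of test (not the original code, and written against the current runtime package) might look like the following; the goroutine count is a parameter to adjust until the program fails:

package main

import (
    "fmt"
    "runtime"
)

func main() {
    const n = 500000 // number of idle goroutines to start
    block := make(chan struct{})
    for i := 0; i < n; i++ {
        go func() { <-block }() // each goroutine just parks on a channel
    }
    var m runtime.MemStats
    runtime.ReadMemStats(&m)
    fmt.Printf("goroutines: %d, memory obtained from the OS: %d MB\n",
        runtime.NumGoroutine(), m.Sys>>20)
    close(block) // release the goroutines and exit
}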