Go App Engine datastore.Load/Save uses goroutines and channels to iterate over datastore entity properties, causing overhead.
Background
With GAE 1.6.0, support for Indexed Properties, Hooks, etc. was introduced with a nice, elegant design: a PropertyLoadSaver interface that uses channels (as an iterator).
I noticed that, after updating my code to use PropertyList, some of my application requests started taking about double the time they took before. Previously, with datastore.Map, my requests took roughly the same amount of time.
On digging further, I found the following in the implementation:
appengine/datastore/load.go
func loadEntity(dst interface{}, src *pb.EntityProto) ...
c := make(chan Property, 32)
errc := make(chan os.Error, 1)
go protoToProperties(c, errc, src)
appengine/datastore/save.go
func saveEntity(defaultAppID string, key *Key, src interface{}) ...
c := make(chan Property, 32)
donec := make(chan struct{})
go func() { ... }
That is, for each entity (analogous to each row in a table), we create and use:
1 goroutine and 2 channels.
The deprecated datastore.Map retrieval bypasses this Channel/Goroutine dance, which is why my response time did not change until I switched to datastore.PropertyList.
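To make the cost concrete: with the channel-based design, even the simplest consumer, a PropertyList that only appends to a slice, is fed through a channel by a dedicated goroutine. The loading side looks roughly like this (a sketch, not the exact SDK source):

func (l *PropertyList) Load(c <-chan Property) os.Error {
    // Drain the channel that a producer goroutine fills from the proto.
    for p := range c {
        *l = append(*l, p)
    }
    return nil
}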
Concerns:
Can we do without the goroutines/channels, especially in the API? Keeping them out of the API leaves room for different implementations.
Alternative solution using iterators
An alternative, equally elegant solution would just use iterators:
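One possible shape for such an API is sketched below. The names PropertyIterator, Next and Done are placeholders of my own, and os.Error matches the pre-Go-1 SDK code quoted above.

// A sketch of an iterator-style API. Next returns the next property,
// or a sentinel error (say, Done) once the iteration is exhausted.
type PropertyIterator interface {
    Next() (Property, os.Error)
}

// PropertyLoadSaver restated in terms of iterators instead of channels.
type PropertyLoadSaver interface {
    Load(PropertyIterator) os.Error
    Save() (PropertyIterator, os.Error)
}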
For implementations of PropertyIterator:
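A slice-backed implementation, for example, needs no goroutines or channels at all (again a sketch; the type name is mine):

// propertyListIterator walks an in-memory PropertyList.
type propertyListIterator struct {
    list PropertyList
    i    int
}

func (it *propertyListIterator) Next() (Property, os.Error) {
    if it.i >= len(it.list) {
        return Property{}, Done // hypothetical end-of-iteration sentinel
    }
    p := it.list[it.i]
    it.i++
    return p, nil
}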
Since the Go runtime on App Engine is still experimental, making a contained API change should be OK.
But RPC dominates the overhead per request. Why focus on goroutine/channel use?
Definitely, the RPC time will dominate the overhead of one goroutine and two channels. However, we’re talking about potentially hundreds or thousands of goroutines per request (equal to the number of “rows” returned by, or sent to, the API call). E.g. for a GET that returns 100 entities, that’s 100 goroutines and 200 channels created to service that one API call. And these goroutines/channels have nothing to do with concurrency: we’re only using them as iterators.
Also, within our own applications we still optimize our code (and especially our exported APIs), even though we know that RPC overhead will overshadow it.
Main Concern: Implementation bleeds into the API
My main concern is that this bleeds into the API. With iterators, you can use channels and a goroutine in the implementation, and change that later, without application users having to know about it.
The alternative proposed above shows how this can be done using iterators. It’s trivial to implement (in Go code) and you gain what you want without restricting the implementation.
At the same time, the API is not tied to an implementation, so you can implement it with goroutines/channels or with a list. User code that passes a PropertyLoadSaver can use whatever is most applicable/optimized for its use case. For example, in my user code I can pass a PropertyList into each call and not incur the goroutine/channel overhead.
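And an implementation that prefers a producer goroutine can still hide the channels behind the same Next method (another sketch; chanIterator and Done are placeholder names):

// chanIterator adapts a producer goroutine feeding two channels to the
// iterator interface; callers never see the channels.
type chanIterator struct {
    c    <-chan Property
    errc <-chan os.Error
}

func (it *chanIterator) Next() (Property, os.Error) {
    select {
    case p, ok := <-it.c:
        if !ok {
            return Property{}, Done // producer closed the channel
        }
        return p, nil
    case err := <-it.errc:
        return Property{}, err
    }
}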
Have others solved similar problems using goroutines/channels? Where?
It seems that the use of goroutines/channels as iterators is not done in other similar places:
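For instance, the query API in the same datastore package iterates with a Next method and the datastore.Done sentinel rather than a channel, roughly as follows (Greeting is a placeholder entity type and c an appengine.Context):

// Iterate over query results; no goroutines or channels appear in the API.
q := datastore.NewQuery("Greeting").Limit(100)
for t := q.Run(c); ; {
    var g Greeting
    _, err := t.Next(&g)
    if err == datastore.Done {
        break // no more results
    }
    if err != nil {
        return err
    }
    // use g ...
}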
What is the performance overhead (load on CPU, RAM) with this? Does it scale?
When I first looked into this, I ran some rudimentary tests to find the maximum number of goroutines I could create on my machine and how many resources they consumed.
The result: on a 2.0GHz core, I could start a maximum of 5e5 (500,000) goroutines that basically did nothing (beyond that, I got errors). RAM usage was 2.0GB.
An App Engine instance is a 600MHz single core with a 128MB memory limit. That’s roughly 1/3 of the CPU and 1/16 of the memory. (Even my Nexus One has way more resources than that.)
In short: 2.0GHz and 2GB of RAM topped out at 500,000 goroutines. I wonder how many a 600MHz, 128MB App Engine instance would accommodate.
I’d suspect that a few thousand goroutines on such a tiny “computer” (600MHz, 128MB) would tax the system, yet it’s really easy to get into such a situation with the current design. If most of the time is spent on RPC (I/O) and CPU load is low, Go can easily support a large number of concurrent requests. 50 concurrent requests, each retrieving 200 entities, means 10,000 goroutines (plus 20,000 channels) alive at the same time just to serve API requests, imposed by the SDK runtime (i.e. not by application code that we can control or tune). In this scenario, the runtime imposes an overhead that does not seem necessary.
If we expect that most people will pass a PropertyList to calls to GetXXX or PutXXX, then the goroutine/channel is completely redundant.
Also, remember that each goroutine allocates an initial stack of 4KB, so every goroutine carries a memory-allocation cost that becomes non-trivial under load: in the scenario above, 10,000 goroutines at 4KB each is roughly 40MB of stack alone, against a 128MB instance limit.
The rudimentary Go code used to run this test is available at:
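A minimal sketch of that kind of test (not the original code, and written against the current runtime package) might look like the following; the goroutine count is a parameter to adjust until the program fails:

package main

import (
    "fmt"
    "runtime"
)

func main() {
    const n = 500000 // number of idle goroutines to start
    block := make(chan struct{})
    for i := 0; i < n; i++ {
        go func() { <-block }() // each goroutine just parks on a channel
    }
    var m runtime.MemStats
    runtime.ReadMemStats(&m)
    fmt.Printf("goroutines: %d, memory obtained from the OS: %d MB\n",
        runtime.NumGoroutine(), m.Sys>>20)
    close(block) // release the goroutines and exit
}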