
Background: An Architecture for Millions of Things

(Jul 13)

Note: I'm working on a mind-blowing, super secret update. I don't know how long it will take so I will bridge the time in between with a series of more abstract posts, covering some of my current thoughts!

Also, soon I will be moving countries again, can you guess where I'm going?

In this blog post I will once again talk about the fundamental technology I'm currently developing for Citybound. I want to make sure that Citybound will be about millions of things interacting, not just thousands.

This requires not just quantitative effort of the kind "this will need to be optimized by X%"; it actually requires a qualitatively different architecture from the very beginning. Let me tell you how I found it and what it's all about.

This post is a long one (~10 min read), but it will teach you important things about computers using some nice metaphors and a little cynicism.

1. Laying out Things in Memory

Disclaimer: if you already know everything here, feel free to read this section faster. If you feel like nothing makes sense, feel free to skip to section 2 of this blogpost, which is much cooler. No one will judge you for it.

If you think about it, simulating as many things in Citybound as possible just means taking as little time as possible to simulate each individual thing. The time required depends on:

  1. how much extra work needs to be done because of stupid algorithms
  2. how much extra work needs to be done because of very generic solutions applied to very specialized problems
  3. the harsh physical reality of how today's computers work

Before even starting to write stupid algorithms, I decided to at least get 2. and 3. under control and focused on the central problem of computing today:

RAM alone is about 400 times too slow at feeding the CPU with data. The ugly kludge (that became normal) is to introduce layers of caches, which are successively tinier but faster memory units that give you a tunnel vision copy of a part of the RAM at an acceptable access speed.

As you can imagine, the challenge of still seeing important information when you have tunnel vision is to find a very good strategy for choosing where you move your center of attention next, ideally based on what you're currently seeing or what you just saw a couple moments ago.

This is in fact exactly what a modern CPU does: ahead of time, it shifts its center of attention to the parts of memory that your program probably wants to look at next.

If the CPU predicts this place correctly, your program will be able to get the information it needs very quickly and everything is good.

If the CPU mispredicts (this is called a cache-miss), it now has to fix its mistake and slooooowly bring the data that your program actually wanted into the cache/center of attention - all while your program is already waiting and can't do anything useful.

The problem is that the CPU's center-of-attention prediction rules have to be very simple, or else the CPU would spend too much time deciding where to look next instead of doing anything useful. Another problem is that all the very different programs running on your computer can have very different memory access patterns, and even if you tried to be very clever, it would be hard to come up with better general prediction rules than the ones we have.

This is the first example of a lowest-common-denominator general solution to a problem, where almost everybody trades off lost potential for not having to implement something in a specialized way themselves.

In the case of CPUs we are basically forced to accept and work around the tradeoff; in software we have more luck, as you will see further down.

To understand what we need to work with, let's have a look at the very few CPU center-of-attention prediction rules:

  1. memory you accessed a moment ago will probably be needed again very soon (so recently used data is kept in the cache)
  2. if you accessed some location, you will probably need its close neighbors next (so memory is always pulled into the cache in small contiguous blocks, so-called cache lines)
  3. if you have been walking over memory in constant steps, you will probably continue doing exactly that (so the CPU prefetches the next steps ahead of time)

So if you want to write a really fast program, it all comes down to following these clichés that the CPU has about programs.
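To make this concrete, here is a tiny Rust sketch (numbers and names are just for illustration; exact timings will vary wildly between machines and compiler settings) that does the same work twice: once in the fixed-step way the CPU expects, and once in a way it can't predict:

```rust
use std::time::Instant;

fn main() {
    const N: usize = 10_000_000;
    let data: Vec<u64> = (0..N as u64).collect();

    // Sequential walk: exactly the fixed-step pattern the prefetcher expects.
    let start = Instant::now();
    let mut sum = 0u64;
    for &x in &data {
        sum = sum.wrapping_add(x);
    }
    println!("sequential: {:?} (sum {})", start.elapsed(), sum);

    // Scattered walk: the same N reads, but each one lands somewhere the
    // CPU can't predict, causing a cache miss almost every time.
    // (The stride is coprime to N, so every index is visited exactly once.)
    let start = Instant::now();
    let mut sum = 0u64;
    let mut i = 0usize;
    for _ in 0..N {
        sum = sum.wrapping_add(data[i]);
        i = (i + 7_368_787) % N;
    }
    println!("scattered:  {:?} (sum {})", start.elapsed(), sum);
}
```

Both loops compute the same sum; only the order of the reads differs, and that alone is typically worth a large factor in runtime.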

Now, if you think about it, this challenge is not even that much about your program, but mostly about how you lay out your data in memory, since that will determine the access patterns your program has to do.

In general, it seems like a good idea to keep related data (for example the properties of a building, or the list of all buildings) as close together as possible, in a contiguous chunk of memory, without gaps and laid out in fixed step sizes if you want to walk over that data.

Additionally, you really want to make sure that you store your data as compactly as feasible, not because of RAM limitations (there is plenty), but because you want to cram as much useful data into the tiny center of attention at once as possible, so you have to do less of this very expensive moving of attention. But don't try to be too clever! Your data can only be as compact as you can still read it directly without hassle - else you will just spend time trying to decipher overly compressed data instead.
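As a rough illustration of compactness, compare a naively written record with a deliberately packed one. The "car" fields here are purely hypothetical; I'm using Rust with a fixed field order so the padding is visible:

```rust
use std::mem::size_of;

// A hypothetical car record, written down naively: mixed field sizes
// force padding holes, and the number types are larger than needed.
#[repr(C)] // fixed field order, like C
#[allow(dead_code)]
struct WastefulCar {
    position: f64,    // 8 bytes
    lane: u8,         // 1 byte + 7 bytes padding
    velocity: f64,    // 8 bytes
    is_braking: bool, // 1 byte + 7 bytes padding
}

// The same information, packed deliberately: smaller number types where
// the value range allows it, and no padding holes between fields.
#[repr(C)]
#[allow(dead_code)]
struct CompactCar {
    position: f32, // 4 bytes (precise enough within one road stretch)
    velocity: f32, // 4 bytes
    lane: u8,
    is_braking: bool,
}

fn main() {
    // Fewer bytes per car means more cars fit into the center of attention.
    println!("wasteful: {} bytes", size_of::<WastefulCar>()); // 32
    println!("compact:  {} bytes", size_of::<CompactCar>());  // 12
}
```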

Side note: this is just one example of the many almost Zen-like conflicts you encounter in high-performance computing. Often, all you can really do is trade off; only rarely can you find a solution that transcends the conflict and improves both constraining metrics at the same time.

So far, this is actually very widely understood and common practice in video games, computer graphics and scientific simulation whenever you need to simulate millions of something very fast.

The more you read about the subject, however, the more you will notice in how many cases the things you have millions of are exactly identical to each other in one very important way: the size of the data representing each thing.

Particles are a good example: you can simulate a fluid with millions of particles that each consist of exactly a position and a mass, not more, not less. You can draw impressive game graphics consisting of millions of triangles, each consisting of exactly three points and some fixed set of properties, not more, not less. You can simulate a huge flock of birds, each having only a direction and a speed, not more, not less.
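In code, such a fixed-size world is almost embarrassingly simple, which is exactly why it is so fast. A sketch (the field choices are just for illustration):

```rust
// Every particle has exactly the same fixed memory footprint...
#[derive(Clone, Copy)]
struct Particle {
    position: [f32; 2],
    velocity: [f32; 2],
}

fn main() {
    // ...so a million of them fit into one contiguous, gap-free block...
    let mut particles = vec![
        Particle { position: [0.0, 0.0], velocity: [1.0, 0.5] };
        1_000_000
    ];

    // ...and updating them is a fixed-step walk straight through memory,
    // which matches all three CPU clichés at once.
    let dt = 1.0 / 60.0;
    for p in &mut particles {
        p.position[0] += p.velocity[0] * dt;
        p.position[1] += p.velocity[1] * dt;
    }
    println!("first particle now at {:?}", particles[0].position);
}
```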

You get the idea. But what am I hinting at, what's missing? And why is it relevant for Citybound?

To say it abstractly: it's already impressive that we can simulate millions of very similar things, but everything that is interesting about Citybound will come from the fact that it will simulate millions of unique individuals, having some similarities, but interacting in amazingly complicated, lifelike ways exactly because of their differences.

Like a corner shop might sell 10 different kinds of goods, but a supermarket might sell 100 different kinds of goods. They're still both shops. Even worse: the same supermarket at one point might serve 100 customers, later 15 and on a special occasion more than 300! How to fit that into any reasonable fixed amount of memory except by wasting a lot of space most of the time to accommodate the worst case?

The common practice solution to this problem is to separate out the differently-sized or growing/shrinking parts (let's call them "dynamic" parts) of a thing from the constant-size part: You simply ask the memory allocator of your system to give you a custom-tailored region somewhere in memory for each dynamic part and you just store a (constant-size) reference to that region in your constant-size part. If the dynamic part outgrows its tailored region, you simply copy it to a bigger tailored region, give the small region back to the system and update your reference accordingly.
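In Rust terms, this common-practice pattern looks roughly like the sketch below (the shop fields are hypothetical). A Vec is exactly such a constant-size reference: a pointer plus length and capacity, while its contents live somewhere on the heap:

```rust
// The constant-size part of a shop...
struct Shop {
    position: (f32, f32),
    // ...holds constant-size *references* to dynamic parts that live
    // "somewhere" on the heap. A Vec is always the same size itself,
    // no matter how much data it points to.
    goods: Vec<GoodAmount>,
    current_customers: Vec<CustomerId>,
}

#[allow(dead_code)]
struct GoodAmount { good_kind: u16, amount: u32 }
#[allow(dead_code)]
struct CustomerId(u32);

fn main() {
    let mut shop = Shop {
        position: (12.0, 34.0),
        goods: Vec::new(),
        current_customers: Vec::new(),
    };

    // When a Vec outgrows its tailored region, it transparently does the
    // copy-to-a-bigger-region dance described above.
    shop.goods.push(GoodAmount { good_kind: 0, amount: 100 });
    shop.current_customers.push(CustomerId(7));

    println!(
        "shop at {:?}: {} kinds of goods, {} customers",
        shop.position,
        shop.goods.len(),
        shop.current_customers.len()
    );
}
```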

That all sounds very neat and works great for most programs; the problem is exactly the "somewhere in memory" part. Memory allocators are very keen on keeping the overall memory at least somewhat clean and organized, to be able to fulfill the requests of, once again, all kinds of programs, without wasting a lot of space.

The memory allocator, similarly to the CPU cache-predictor, is forced to be incredibly dumb in order to be fast enough. So it can make basically no helpful assumptions whatsoever about when your program will need how much memory.

This is the second example of a lowest-common-denominator general solution to a problem, but this time, we'll be able to do something about it!

The result is that the dynamic parts of things (whose constant-size parts you laid out so neatly in succession) actually end up all over the place in memory, out of order, or even worse: the dynamic parts of a single thing end up in very different places. (The commonly used nickname for this free-for-all, system-managed area of memory is "heap". Seems appropriate, now that I think about it.)

If you now remember our CPU access-prediction clichés, you will realize that our all-over-the-place dynamic data flies completely in the face of the expectations of the CPU, leading to lots and lots of painfully slow cache misses.

But even to do this bad job (which it is maybe forced to do), the allocator needs huge amounts of bookkeeping, which makes asking it for memory regions even more painfully slow.

That is why high-performance programmers usually avoid asking the allocator for memory like the plague and try to organize stuff themselves, in just one or a couple of huge regions.

If you do memory management yourself in this way, and you want something more complicated than collections of constant-size things, it looks like you have to invest a lot of work, only to recreate the bad general-case solution of memory management yourself.

That doesn't have to be true, however. By looking very carefully at what I need, I came up with a custom memory-management system, specialized for Citybound, that fits the CPU's clichés well, requires almost no bookkeeping and is very straightforward to implement.

Here is the simple gist of it: all things of one kind live together in a few big contiguous regions that I manage myself. The dynamic parts of things are kept densely packed in pools of fixed-size slots, grouped into a handful of size classes, and are referred to by simple indices instead of pointers; a dynamic part that outgrows its slot simply moves up to the next size class. Everything stays compact and walkable in fixed steps, with almost no bookkeeping.
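To make that less abstract, here is a minimal sketch of what one such size-class pool could look like. To be clear: this is my illustration of the idea, with hypothetical names, not the exact system as it will end up in Citybound:

```rust
// One size-class pool. Every slot in a pool has the same fixed size,
// so all slots live densely packed in one big contiguous region, and a
// plain index replaces a pointer as the constant-size reference.
struct Pool {
    slot_size: usize,
    data: Vec<u8>,          // one big contiguous region
    free_slots: Vec<usize>, // indices of slots that were given back
}

impl Pool {
    fn new(slot_size: usize) -> Pool {
        Pool { slot_size, data: Vec::new(), free_slots: Vec::new() }
    }

    // Hand out a slot: reuse a freed one if there is one, otherwise
    // grow the region at its end. Either way: almost no bookkeeping.
    fn allocate(&mut self) -> usize {
        if let Some(index) = self.free_slots.pop() {
            index
        } else {
            let index = self.data.len() / self.slot_size;
            let new_len = self.data.len() + self.slot_size;
            self.data.resize(new_len, 0);
            index
        }
    }

    fn free(&mut self, index: usize) {
        self.free_slots.push(index);
    }

    fn slot(&mut self, index: usize) -> &mut [u8] {
        let start = index * self.slot_size;
        &mut self.data[start..start + self.slot_size]
    }
}

fn main() {
    // Pools in doubling size classes: a dynamic part that outgrows its
    // slot is copied into a slot of the next pool up.
    let mut pools: Vec<Pool> =
        [64usize, 128, 256].iter().map(|&s| Pool::new(s)).collect();

    let index = pools[0].allocate();
    pools[0].slot(index)[0] = 42;
    pools[0].free(index); // this slot will be reused by the next allocate
}
```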

And the best thing is that thinking about this led me to the following discovery...

2. Actors + Message Passing

The way I was thinking about updating the game state so far was pretty much in terms of loops over lists of game things that update each one, plus extra loops to handle interactions between things of type A and things of type B. In theory, every thing needed to be able to see every other thing, something that can get very hairy once you do parallel processing. My last post presented a kind of brute-force solution to this that still had its limits and drawbacks.

Overall, I was using a top-down approach to implementing the complexity of a city, hopelessly complicated, hard to reason about, hard to change. The motivation was performance. Deep down, I knew of a better way to do it, but it took me a while to see that it was actually feasible to use it in a fast way.

This other way of doing things is called Actor-oriented programming (it should actually be called Object-oriented programming, but this term was stolen and changed its meaning).

An Actor is like a tiny computer. It has all of its state hidden inside itself and can communicate only by sending messages to other Actors.

To be completely precise, all an actor can do is react to an arriving message: by changing its internal state, by sending out messages itself, by spawning new actors and by killing itself.
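In code, the whole contract is surprisingly small. Here is a minimal sketch (hypothetical types, not Citybound's actual API; outgoing messages are queued as (recipient id, message) pairs for someone else to deliver):

```rust
// Everything an actor can do happens inside its reaction to a message.
enum Message { Ping /* , ... */ }

enum Reaction {
    Continue,              // keep living (internal state may have changed)
    Spawn(Box<dyn Actor>), // bring a new actor into the world
    Die,                   // remove this actor from the world
}

trait Actor {
    // The single entry point: react to one incoming message by mutating
    // hidden internal state, queueing outgoing messages, spawning or dying.
    fn react(&mut self, message: Message, outbox: &mut Vec<(usize, Message)>) -> Reaction;
}

// The smallest possible actor: it only ever updates its own state.
struct Counter { count: u32 }

impl Actor for Counter {
    fn react(&mut self, _message: Message, _outbox: &mut Vec<(usize, Message)>) -> Reaction {
        self.count += 1;
        Reaction::Continue
    }
}

fn main() {
    let mut counter = Counter { count: 0 };
    let mut outbox = Vec::new();
    counter.react(Message::Ping, &mut outbox);
    println!("count: {}", counter.count);
}
```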

The best example for a programming language that builds upon Actors + Message Passing is Erlang, used with great success for three decades for all kinds of super-fault-tolerant and hugely-scalable systems. I had some great fun with it in the past!

To give an example from Citybound: a stretch of road will be an Actor. It knows its length and geometry and contains all the cars that are currently on it. It knows the Actors of the following and previous stretches of road that it is connected to. When it receives a "time passed!"-message, it updates its cars by moving them along itself according to physical and psychological models. Should it happen that a car goes over the end of the stretch of road, it simply sends the car (inside a "new car arrived!"-message) to the following stretch of road, which in turn will react by incorporating that car into its list of cars and now taking further care of it.
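Here is that example as a standalone, simplified sketch. All names and the "physics" are placeholders, and messages go through a simple local queue here; the real system would route them between cores or machines:

```rust
#[derive(Clone, Copy)]
struct Car { position: f32, speed: f32 }

enum Message {
    TimePassed { dt: f32 },
    NewCarArrived { car: Car },
}

struct RoadStretch {
    length: f32,
    cars: Vec<Car>,
    next_stretch: usize, // id of the Actor this stretch is connected to
}

impl RoadStretch {
    // React to one message; outgoing messages are queued in `outbox`
    // as (recipient id, message) pairs.
    fn react(&mut self, message: Message, outbox: &mut Vec<(usize, Message)>) {
        match message {
            Message::TimePassed { dt } => {
                let mut staying = Vec::new();
                for mut car in self.cars.drain(..) {
                    car.position += car.speed * dt; // (a real model is fancier)
                    if car.position > self.length {
                        // The car went over the end: hand it to whatever
                        // comes next and stop caring about it.
                        car.position -= self.length;
                        outbox.push((self.next_stretch, Message::NewCarArrived { car }));
                    } else {
                        staying.push(car);
                    }
                }
                self.cars = staying;
            }
            Message::NewCarArrived { car } => self.cars.push(car),
        }
    }
}

fn main() {
    let mut stretches = vec![
        RoadStretch {
            length: 100.0,
            cars: vec![Car { position: 99.0, speed: 10.0 }],
            next_stretch: 1,
        },
        RoadStretch { length: 100.0, cars: Vec::new(), next_stretch: 0 },
    ];

    // One tick: stretch 0 updates its cars and sends one over the end...
    let mut outbox = Vec::new();
    stretches[0].react(Message::TimePassed { dt: 1.0 }, &mut outbox);

    // ...and then the queued messages get delivered.
    let pending: Vec<_> = outbox.drain(..).collect();
    for (recipient, message) in pending {
        stretches[recipient].react(message, &mut outbox);
    }
    println!("cars now on stretch 1: {}", stretches[1].cars.len());
}
```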

Now let's look at some magical properties of this approach:

  1. It makes no difference for the Actor where the recipient Actor of its message is located. It might be running on another core of your CPU. As long as someone delivers the message there, eventually, everything will work fine. The recipient actor might even be running on a different computer, on the other side of the earth. Yes, Actors + Message Passing is the profound pattern that will almost trivially allow distributed simulation of a city, with the "seam-lines" of computer boundaries between arbitrary things in the simulation.
  2. The Actor doesn't really care how exactly the recipient Actor of its messages works, or what it is. For example, the "following stretch of road" might not actually be a stretch of road at all, but a garage instead. Or a ferry. All that matters is that it can accept a "new car arrived!"-message and continue with its own behavior from there. Yes again, Actors + Message Passing is the profound pattern that allows this huge amount of flexibility and open-endedness, while still giving some guarantees. This alone makes some interactions between different kinds of city systems that felt impossibly complicated to me suddenly trivial to implement. Things like: how does a citizen go to a car, drive to a shop, do a transaction between his household and the shop's company…? Or: how does a pedestrian on a sidewalk behave completely differently from a pedestrian in a park? It's simple if you choose the right subdivision into Actors and Messages (see the sketch after this list). This is also what will make modding very obvious and easy.
  3. If there's a bug in an Actor, it can just crash without bringing the rest of the simulation down. Some other part of the simulation probably knows how to restart the local situation. And: the bug should be pretty easy to find and fix, since each Actor doesn't really do that much complicated stuff.
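To illustrate the second property, here is a tiny sketch (hypothetical types again) of how a road stretch and a garage can both accept the same "new car arrived!"-message while reacting completely differently:

```rust
struct Car;

// The sender only knows that "whatever comes next accepts arriving cars".
// Whether that is another stretch of road, a garage or a ferry is
// invisible to it.
trait AcceptsCars {
    fn car_arrived(&mut self, car: Car);
}

struct RoadStretch { cars: Vec<Car> }
struct Garage { parked: Vec<Car> }

impl AcceptsCars for RoadStretch {
    fn car_arrived(&mut self, car: Car) {
        self.cars.push(car); // keep simulating it as a driving car
    }
}

impl AcceptsCars for Garage {
    fn car_arrived(&mut self, car: Car) {
        self.parked.push(car); // park it instead
    }
}

fn main() {
    // The "following stretch of road" can be anything that reacts to
    // the message; the sender never needs to know which kind it is.
    let mut recipients: Vec<Box<dyn AcceptsCars>> = vec![
        Box::new(RoadStretch { cars: Vec::new() }),
        Box::new(Garage { parked: Vec::new() }),
    ];
    for recipient in &mut recipients {
        recipient.car_arrived(Car);
    }
    println!("every recipient accepted a car in its own way");
}
```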

I already hinted at the reason why my subconscious didn't let me see that opportunity earlier: based on my Erlang experience, I had assumed that Actors + Message Passing were strictly linked to high memory overhead and bookkeeping effort (from a high-performance computing standpoint).

When I combined Actors with my new ideas about memory management from the first part of this blog post, I realized, however, that yes, I could really have my cake and eat it too! I even discovered some further magical surprises!

This is big. Citybound is not only better off because of this discovery, it might actually only be possible because of it.

Let me know what you think!

→ Discussion on /r/Citybound