Protocol for broadcasting to a lot of users quickly

12 comments, last by JohannesDeml 3 years, 5 months ago

Hello there,

I want to create a realtime multiplayer game that supports a lot of concurrent players. The requirements are probably quite similar to Curiosity - What's inside the cube?. Taps from a player should be broadcast to all other players. The messages will contain only a small data chunk, but the server should support handling a lot of taps and should also be able to broadcast them quickly.

I'm quite experienced in developing single player games, but not so much in developing multiplayer games, so I'm struggling with choosing the right technology for the following requirements:

  • Fast broadcasting of small messages
  • Supporting a lot of concurrent players
  • Sub-channels for broadcasting groups
  • Support iOS and Android clients built with Unity
  • Server running on Linux

I'm wondering about the following:

  • Is TCP fast enough or will I need to build on UDP?
  • RPC seems like one way to go (maybe gRPC?), but when broadcasting a tap I won't need the clients to respond to the server with any value. Can I disable that somehow, or is the overhead of sending Empty responses negligible (or does it not happen at all)?
  • If RPCs have too much overhead, I guess I will need to build on sockets directly. How much work is that, or are there any other solutions you can recommend?
  • What do I need to consider when choosing the programming language for the server application? The languages I'm thinking about are Java, Go, and C#.
  • Is there anything I need to consider when choosing a VPS / Root Server?
  • Can you recommend any papers/articles on the topic I should read?

I found the following technologies that might be interesting:

Thanks for your input!


  • Is TCP fast enough or will I need to build on UDP?

TCP is neither "faster" nor "slower" than UDP.

TCP has different characteristics than UDP: it requires significant amounts of buffering and per-client session state on the server, where UDP lets you get away with less. TCP also has the "head of line" blocking problem when a packet gets dropped in transit.

RPC seems like one way to go

Most RPC systems are too high overhead, because they track too much state.

RPC systems also solve a separate problem, which is “how do I structure the payload inside a packet so the other end can make sense of it,” but that's not really the challenge you have here.

How much overhead is building up directly on sockets

What do you mean by “overhead?”

You write more code when going to the underlying sockets layer, but you end up having more control, and generally can end up with lower server load and less packet overhead, assuming you do at least as competent a job with the implementation as your higher-layer library would have done.

Note that almost all "high profile" networked games have people on them who have significant experience and skill in this area. The approach of "I'll use some existing library/technology, and then figure out how to make it good enough" that works so well in web development and enterprise IT is not at all how high-end games are built. And if your game tries to scale up past the "usual envelope" on some particular scale (such as "number of people in a single simulation"), then you're trying to build a high-end game.

What do I need to consider when choosing the programming language

If you want to scale to very high numbers of players, then you need a language with low overhead and a good I/O model. If you want latencies to be low, you want a language without garbage collection. Unfortunately, that rules out all three of the languages you're considering. Go is especially bad, because its JSON and Protobuf and other similar packetizing solutions are quite slow, and its GC is much less mature than that of Java and C#. Java has traditionally had a poor asynchronous I/O model, meaning that C# is probably the least bad of the three, but if you want to push the envelope, know that previous games in this genre have been written in C or C++. (I guess Rust would be a modern, non-garbage-collected option with a good io_uring support library, if you can't stand the C/C++ combo.)

Is there anything I need to consider when choosing a VPS

Measure the latency of the scheduler over time – the hypervisor may have significant random jitter. This is especially true on smaller instances, and cheaper hosts.

Measure the actual throughput peering with the networks you're interested in. Some cheap VPSes I've used had "free 3 TB bandwidth," but the achievable throughput per connection out to the US west coast from the east coast was about 50 kB/second.

Make sure the networking infrastructure actually supports the kind of load you want to make. Some providers focus on web traffic, and do really poorly for UDP and real-time traffic.
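One cheap way to get a feel for that scheduler jitter (a sketch of my own, not from the thread): sleep for a fixed short interval in a loop and record the worst oversleep. On a quiet machine it stays small; on an oversold hypervisor you will see multi-millisecond spikes.

```cpp
#include <chrono>
#include <thread>

// Sleep for `intervalMs` repeatedly and return the worst observed
// oversleep in microseconds. Large spikes suggest hypervisor/scheduler
// jitter that will also delay your packet send loop.
long long worst_oversleep_us(int iterations, int intervalMs) {
    using namespace std::chrono;
    long long worst = 0;
    for (int i = 0; i < iterations; ++i) {
        auto before = steady_clock::now();
        std::this_thread::sleep_for(milliseconds(intervalMs));
        long long slept =
            duration_cast<microseconds>(steady_clock::now() - before).count();
        long long over = slept - intervalMs * 1000LL;
        if (over > worst) worst = over;
    }
    return worst;
}
```

Run it for a few minutes at different times of day; the worst-case number matters more than the average for a fixed-tick server.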

Can you recommend any papers/articles

I mean, it's all right there, in the man pages for UDP sockets …

… the trick being: trying to decipher what the terse, accurate language about kernel and implementation behavior actually means when scaled up to a large system!

If I built a system where "every player" needed to see some kind of state that's derived from "every other player," then I would build it on UDP. I would start with a single socket. I would set the input and output buffers of that socket really large (10 MB or more), and I would use two threads, one for reading and one for writing, with blocking I/O. I would use a non-blocking primitive of some sort between the threads – you have to discard data at some point if you back up, and your choice is "getting an asynchronous failure when writing to the kernel" (non-blocking mode) or "under control of your program" (blocking I/O, internal non-blocking queues).

The threads would simply receive all the incoming packets with their source addresses, and send all the outgoing state packets to all the known addresses, respectively. Because you will likely get many incoming packets in the time it takes to cycle through the output address list, you need to aggregate updates. This is also necessary to avoid N-squared growth in the number of players – if every player sends an input, and every other player needs to see that specific input, you have an irredeemable N-squared problem.

So, it might look something like this:

int udpSocket;

// Reader thread: drain incoming packets and apply them to the game state.
void reader() {
  char buf[MAX_PACKET_SIZE];
  sockaddr_storage addr;
  while (true) {
    socklen_t addrsize = sizeof(addr);
    int sizeRecv = recvfrom(udpSocket, buf, MAX_PACKET_SIZE, 0, (sockaddr *)&addr, &addrsize);
    handle_error(sizeRecv, "recvfrom");
    int playerId = maybe_insert_address(&addr, addrsize);
    update_state_based_on_packet(buf, sizeRecv, playerId);
  }
}

// Writer thread: broadcast a state snapshot to every known address,
// at most once per MINIMUM_TIME_BETWEEN_PACKETS.
void writer() {
  char buf[MAX_PACKET_SIZE];
  while (true) {
    auto timeThen = my_clock();
    int sizeSend = snapshot_game_state_into(buf);
    for (int i = 0; i != MAX_PLAYER_COUNT; ++i) {
      if (gPlayers[i].active) {
        int w = sendto(udpSocket, buf, sizeSend, 0, (const sockaddr *)&gPlayers[i].address, gPlayers[i].addressSize);
        handle_error(w, "sendto");
      }
    }
    auto timeDelta = my_clock() - timeThen;
    if (timeDelta < MINIMUM_TIME_BETWEEN_PACKETS) {
      usleep(to_microseconds(MINIMUM_TIME_BETWEEN_PACKETS - timeDelta));
    }
  }
}

Is this syntactically correct code that will compile first try? No :-)

If you find that your CPU runs flat out on the two threads you have, and you're still not saturating your network interface (could happen with a 10 Gbps or higher network interface, and/or if your implementation is less than efficient) then you can perhaps open multiple sockets, and run one of these pairs of loops per socket. You'd have to figure out how to make your different clients send to the different ports of those different sockets somehow. You'd still end up being bound on the “update state based on packet” global state update; depending on how complex that function is, that may be easy to scale across cores, or not.

In general, for very high throughput servers, locking is the enemy. You should be able to implement "maybe insert address" and "gPlayers" in a non-blocking manner, especially if it's OK that some player gets one additional identical packet through a cycle. That means you don't need locking primitives across management of the gPlayers array; just make sure to keep the "active" flag properly updated (and memory sequenced, if you're on ARM or Itanium or some such where that matters).
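A minimal sketch of that lock-free publication idea (names borrowed from the snippet above; the atomic details are my own interpretation): fill in the player slot first, then flip `active` with release ordering, so a reader that observes the flag with acquire ordering is guaranteed to also see the address fields.

```cpp
#include <atomic>
#include <sys/socket.h>

constexpr int MAX_PLAYER_COUNT = 10000;

struct Player {
    sockaddr_storage address;       // written before 'active' is set
    socklen_t addressSize = 0;
    std::atomic<bool> active{false};
};

Player gPlayers[MAX_PLAYER_COUNT];

// Publish a new player slot without locks: fill the data first, then
// set 'active' with release ordering. A writer-thread reader that sees
// active == true (acquire) also sees a fully written address.
void publish_player(int slot, const sockaddr_storage& addr, socklen_t len) {
    gPlayers[slot].address = addr;
    gPlayers[slot].addressSize = len;
    gPlayers[slot].active.store(true, std::memory_order_release);
}

bool is_active(int slot) {
    return gPlayers[slot].active.load(std::memory_order_acquire);
}
```

On x86 the release/acquire pair costs essentially nothing; on ARM it compiles to the store/load barriers the post alludes to.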

Now, maybe I'm making some bad assumptions. I'm assuming the “simulation” (or “game state update”) is simple – like a “counter” – such that “who affected the counter at what step” generally doesn't matter. I also assume that “a lot of players” is a goal of 10,000 players or more, all affecting a single state. If “a lot” means a hundred players, then efficiency doesn't matter much, unless you're building some kind of fancy physical simulation, similar to a modern FPS game.

Good luck with the game!

enum Bool { True, False, FileNotFound };

@hplus0603 What do you think of Erlang as a network platform? I've only worked with its descendant Elixir but it seems to have extremely powerful capabilities. Is it usable or being used for actual game servers to your knowledge?

At a previous place of work, I improved networking performance by 100-1000x, and resource consumption by 20x, by rewriting some PHP/memcache/Perl thing in Erlang. It's pretty good, for the right use case.

Erlang is garbage collected, but each "process" has its own local heap. This means that, as long as you structure things as lots of small processes, GC overhead isn't that bad. But a single socket needs to be owned by a single process, so you can't structure this particular application as an Erlang application that scales well without doing some unnatural acts.

Erlang CPU overhead is also kind of high, because even with the HiPE VM, the generated code is not that great. It doesn't have the type annotations of Haskell or Rust, and it doesn't have the large team of very dedicated VM engineers that JavaScript does.

The "upgrade in place" support that Erlang has is nice if you really cannot have any downtime. It is possible to write, and test, upgrade and downgrade in place in a running Erlang cluster. You really can get to nominal 100% uptime. However, the additional engineering burden of developing and testing the upgrade and downgrade paths is SIGNIFICANT. You only want to pay that cost if you really, honestly, can't live with 30 seconds of downtime when switching over server versions. And even so, you may be better off writing pairwise-compatible versions and doing a blue/green deployment switch-over. That still needs testing, though – constant uptime is a feature, and as such, it has a cost.

If this application was a “multi-user chat system” then Erlang could be swell. But, given that there's one, central, piece of state, Erlang will likely limit you based on how quickly you can marshal messages into, and out of, the single process that just updates that state. It may be that, with sufficiently simple data, you could actually make that process run faster than your network card can sustain, and dedicate a core for just that process. If so, you could split the users on some number of sockets, and use an Erlang process for each, and probably scale alright across CPU cores, but I would expect the overall “constant factor” to be higher than that of a C/C++/Rust server, and I would expect the Erlang garbage collector to still get in the way, albeit only affecting, say, one twentieth of your users at one time, if you have twenty sockets.


@hplus0603 Wow, thanks a lot for the extensive answer, this really straightened out some of my assumptions. I thought I would need to think a lot about load balancing and juggling a lot of open sockets.

This is still a lot to process for me, but here are some of my thoughts:

I will use a TCP socket for the initial connection to the server, so players can register and log in, and checks for compatible versions and so on can be made. From there they get a session token and the UDP port. The players then communicate via UDP, sending their actions and the session token (and maybe a tick value). The server has a read thread that gets all the player input, validates it, and puts it in a buffer. The server has a tick rate at which it sends out updates of the players' input actions. Each tick it takes the read buffer and sends out all changes made in that tick to the players for which they are relevant. At login (and maybe every x ticks) the server also sends a complete status quo of the playfield.
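One possible wire layout for that UDP action message (a sketch; the field widths are my own assumptions, not a spec): a fixed header carrying the session token, the client's tick, and a small action payload, packed so a single tap stays well under one MTU.

```cpp
#include <cstdint>

// Hypothetical client->server action packet. All fields packed tightly
// so one tap is 16 bytes on the wire (plus UDP/IP headers).
#pragma pack(push, 1)
struct ActionPacket {
    uint64_t sessionToken;  // handed out over the TCP login channel
    uint32_t tick;          // client's view of the current tick
    uint16_t actionType;    // e.g. 0 = tap
    uint16_t payload;       // action-specific data, e.g. a cell index
};
#pragma pack(pop)

static_assert(sizeof(ActionPacket) == 16, "keep the header tightly packed");
```

With a layout like this, validation in the read thread is just a bounds check and a session-token lookup before the action goes into the tick buffer.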

When the client gets an update from the server in which its changes are not reflected, it re-sends the action to make sure it was not lost along the way (packet loss).

Additionally, since everything happens in memory, there should be a data persistence task every n ticks. The data will be written to a database, and snapshots of the database are created as backups every m ticks.

Does this sound like a good basic starting point, or is sending diffs over UDP not a good idea due to packet loss? Also, should I think about compressing the packets? I will try to keep the data as condensed as possible, but maybe DEFLATE or something similar would make sense? I have to note that the packets sent will differ between players (imagine that there are local and global rooms, so not every input is relevant for every player).

Thank you for all the input so far!


That sounds like a good starting point!

Sending diffs from last-acknowledged state is a reasonable way of structuring packets. If the server heard from the client that it acknowledged time T, it can include any deltas from T+1 and onwards in each outgoing packet. This means that a single dropped packet or two won't be so bad, because the following updates will still contain the data.
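That acknowledgement scheme could be sketched like this (hypothetical names, my own structure): keep recent per-tick deltas in a ring buffer, and on each send include everything after the client's last acknowledged tick, so a dropped packet is healed by the next one.

```cpp
#include <cstdint>
#include <vector>

constexpr uint32_t RING = 64;  // how many ticks of history we keep

struct DeltaHistory {
    std::vector<uint8_t> deltas[RING];
    uint32_t newestTick = 0;

    void push(uint32_t tick, std::vector<uint8_t> delta) {
        deltas[tick % RING] = std::move(delta);
        newestTick = tick;
    }

    // Gather deltas in (ackTick, newestTick] for one outgoing packet.
    // If a client falls more than RING ticks behind, a real server
    // would send a full snapshot instead (not shown).
    std::vector<uint8_t> collect_since(uint32_t ackTick) const {
        std::vector<uint8_t> out;
        for (uint32_t t = ackTick + 1; t <= newestTick; ++t) {
            const auto& d = deltas[t % RING];
            out.insert(out.end(), d.begin(), d.end());
        }
        return out;
    }
};
```

The client just echoes the newest tick it has applied in each of its own packets; the server never needs per-packet reliability.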

Deflate might help a little bit, but game state often has significant entropy (coordinates that change, bit masks for inputs and entity states, etc.), and each individual packet must be its own stream, because otherwise you lose sync when a single packet is lost. Generally, if you find that deflate gives you big gains, it's likely that you can do something smarter in your network packet structure to get those gains in how you structure the packet instead, and probably make those gains bigger than the general-purpose deflate algorithm would give you.

Some systems use a “known dictionary” for the deflate algorithm, but, again, you're likely to be able to exploit knowledge you have about your network packets better than a general-purpose adaptive compressor could.
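As a toy illustration of exploiting packet knowledge instead of deflating (my own example, assuming a 1024x1024 grid and a 6-bit color): three fields packed by hand fit in 26 bits, something no general-purpose compressor can match without knowing the field widths.

```cpp
#include <cstdint>

// Pack a tap event by hand: 10 bits x, 10 bits y, 6 bits color.
uint32_t pack_tap(uint32_t x, uint32_t y, uint32_t color) {
    return (x & 0x3FF) | ((y & 0x3FF) << 10) | ((color & 0x3F) << 20);
}

void unpack_tap(uint32_t p, uint32_t& x, uint32_t& y, uint32_t& color) {
    x = p & 0x3FF;
    y = (p >> 10) & 0x3FF;
    color = (p >> 20) & 0x3F;
}
```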


Thanks a lot for your additional input! I tried designing some of the message structures, and you're right, the information can be densely packed, so deflate won't be necessary (at least for the fast UDP channel where speed matters).

I went ahead and looked for already existing solutions out there and found the following two which should be good starting points:

https://github.com/chronoxor/CppServer

https://github.com/chronoxor/NetCoreServer

I also evaluated them with Unity as a client and got a working solution:

https://github.com/JohannesDeml/NetCoreServerUnityClient

Since networking is not my expertise, I decided to stick with the language I'm good in, which is C#. Even though the thought of building the best possible solution is quite tempting, I think I would get stuck trying.

One thing I'm still not quite sure about is testing the actual performance. What is the best setup: localhost, a local network, or an internet server? For comparing possible throughput I found iPerf3: https://github.com/esnet/iperf


Load testing varies entirely by what your game really does.

If you can generate just a known structure and blast it at the process, then localhost might be enough. (Assuming you're confident your network card and stack are up to the task already.)

Otherwise, a fleet of stripped-down game clients running on remote VMs is the "best way" of doing it, because it is a much better measurement of the overall system. The drawback is that it's much harder to set up and costs more to run.


Thanks again for all your input hplus!

I went ahead and made a small benchmark for different libraries that can be used with Unity and with .Net Core:
https://github.com/JohannesDeml/NetCoreNetworkBenchmark

PingPong Benchmark with 1 million messages and one parallel message per client

The benchmark runs on localhost and ping-pongs messages between the clients and the server, either for a given time or for a given target number of messages the clients receive.
To generate stable statistics, I use BenchmarkDotNet to run the predefined tests multiple times.

I'm quite surprised how badly NetCoreServer performs compared to the ENet C# wrapper.
Another thing I did not expect was the difference in performance on Linux compared to Windows. Linux seems to handle the threads a lot better than Windows.

I will do some other tests with garbage generation and different runtimes (.NET 5, maybe mono?).

I would also like to test maximum CCU, but I'm not sure what that test would look like. Any suggestions?
Otherwise, I would love to get feedback on the benchmark. Is there anything I should change to get more valuable insights?

Cheers
Johannes


Thanks for posting the benchmark!

The Linux kernel is engineered for lower overhead on networking. It's not quite as hard-core as FreeBSD, but very close. Meanwhile, the Windows kernel has a higher focus on "enterprise IT," in the way it was understood back in the on-premise physical-machine-closets sense. Both are good kernels, but this is an area where Linux has seen more focused attention than NT. Meanwhile, the user-to-kernel API is generally considered better on the NT side, where asynchronous (as opposed to non-blocking) I/O was built into the system from the beginning, leading to an API like GetQueuedCompletionStatus(), which Linux has slowly moved towards with the successive refinements of various poll- and event-based APIs. What we can tell from your benchmark is that the quality of the "localhost" network implementation matters more than the cost of userspace-kernelspace transitions.

If you use a UDP socket, maximum CCU might not depend on your networking very much, because it's still all going through that one same socket. The quality of your "IP address to existing user/session" hash table implementation is going to matter more, as will the amount of work you need to do per user, especially the amount of work in the network-receive thread, which suffers from Amdahl's Law.
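A sketch of that address-to-session lookup (my own simplification: IPv4 only, keyed on the raw (IP, port) bytes via std::unordered_map; a real server would use a fixed-size key and an open-addressing table to avoid the per-lookup allocation):

```cpp
#include <string>
#include <unordered_map>
#include <arpa/inet.h>
#include <netinet/in.h>

// Map the sender's (IP, port) to a stable player/session id,
// assigning a new id the first time an address is seen.
std::unordered_map<std::string, int> gSessions;
int gNextPlayerId = 0;

int maybe_insert_address(const sockaddr_in& addr) {
    std::string key(reinterpret_cast<const char*>(&addr.sin_addr),
                    sizeof(addr.sin_addr));
    key.append(reinterpret_cast<const char*>(&addr.sin_port),
               sizeof(addr.sin_port));
    auto it = gSessions.find(key);
    if (it != gSessions.end()) return it->second;
    int id = gNextPlayerId++;
    gSessions.emplace(std::move(key), id);
    return id;
}
```

This function sits on the receive path for every packet, which is why its constant factor dominates once the kernel side stops being the bottleneck.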

Regarding what else to measure, using real network interfaces and hardware would be the next step. Localhost is not a typical network interface. Also, on most current systems, you should be able to sustain line rate on a 1 Gbps network interface, so the question is more "how big do the packets need to be for each player, and how often do you send them?"
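Back-of-the-envelope for that last question (my own arithmetic, assuming ~28 bytes of IPv4 + UDP header overhead per packet): the link's bit rate divided by per-player bits per second gives a hard ceiling on player count.

```cpp
// How many players fit on a link, given a payload size and a tick
// rate? Each packet carries payloadBytes plus ~28 bytes of UDP/IPv4
// header overhead (an assumption; IPv6 and tunnels cost more).
long long max_players(long long linkBitsPerSec, int payloadBytes,
                      int packetsPerSec) {
    long long bitsPerPlayer =
        (long long)(payloadBytes + 28) * 8 * packetsPerSec;
    return linkBitsPerSec / bitsPerPlayer;
}
```

For example, 100-byte payloads at 20 packets/second on a 1 Gbps link cap out below 50,000 players, before counting any CPU limits.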


This topic is closed to new replies.
