
Simulate your datacenters' latency with Linux

But Why?

Once upon a time, roughly 15 years ago, a good old service was put into production for a brand new use case. All was good and fine and everybody was happy: the team, the few thousand end users and the one client.

As the platform started receiving more traffic and onboarding more clients, things went south a few times (Black Fridays, sales, Christmas, etc.) and it became obvious that the platform was simply not cut out for this kind of load. Long story short, we got ourselves a nice big refactoring. But among the strong requirements lay a silent killer (not to say a PITA): the service was to run actively on two sites – both sites had to be kept online and in sync at any given time.

This is of course impossible1, but hey, that does not mean we should not try :)
So we tried: we carved big slices of data types out of our monolithic database and put each of them in the appropriate data container. In short:

  • referential data (data that seldom changes) was to be put in read-only databases provisioned by the same files on both sites;
  • cold data was to be put into write-only MongoDB replicated clusters;
  • hot data: IMDG on a single cluster that spans both datacenters.

*[IMDG]: In Memory Data Grid
*[ACS]: Access Control Server, one of the functions described in the 3DS protocol
*[3DS]: authentication protocol

We chose Hazelcast, formed a cluster across our two sites and, like kids clutching their candy sticks, we cuddled a savage, starving bear, mistakenly thinking it was a soft teddy bear.

Cut to the chase already

Some years later, and after many major outages, we decided to analyse very carefully how Hazelcast behaves in this kind of setting.

The best possible simulation would have been to have a set of machines in both datacenters and to provoke some undesired conditions: flaky network, disconnections, latency, etc. For practical reasons, this is not what I chose to do. To keep it all nice and tidy, I decided to run all the Hazelcast nodes on my machine and to simulate various network latencies and failures using some nice Linux tools.

The purpose of this article is to show how to simulate a network with two distant datacenters. I am not going to describe what I learned about Hazelcast by doing this simulation. That is stashed for a later date :)

So here is what I want to do:

[Figure: the three simulated networks]

Three clearly separated networks, with a delay of roughly 5ms one way (10ms for a round trip) between two of them:

  • 127.1.2.1 and 127.1.2.2: my first and second nodes on datacenter 1
  • 127.2.2.1 and 127.2.2.2: my first and second nodes on datacenter 2
  • 127.0.0.1 is kept for the application that will use the cluster.

Hereafter, I present each challenge in the order I encountered it. The resulting code to run the simulation can be found here.

Let us begin!

Several distinct networks

Create a bunch of local IPs

This one is the easiest:

$ ip addr add 127.1.0.0/16 dev lo
$ ip addr add 127.2.0.0/16 dev lo

This links the given IP range2 to lo, the loopback interface.

On some distros, this is not necessary at all: any 127._._._ address is implicitly tied to the loopback interface.
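
For the record, here is the reverse operation, for when the simulation is over (a small sketch, in case you do not want to simply reboot):

$ ip addr del 127.1.0.0/16 dev lo
$ ip addr del 127.2.0.0/16 dev lo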

So basically, we can now send packets from any address in 127.1._._ to any address in 127.2._._:

$ ping -c 5 -I 127.1.2.3 127.2.3.4
PING 127.2.3.4 (127.2.3.4) from 127.1.2.3 : 56(84) bytes of data.
64 bytes from 127.2.3.4: icmp_seq=1 ttl=64 time=0.061 ms
64 bytes from 127.2.3.4: icmp_seq=2 ttl=64 time=0.177 ms
64 bytes from 127.2.3.4: icmp_seq=3 ttl=64 time=0.096 ms
64 bytes from 127.2.3.4: icmp_seq=4 ttl=64 time=0.098 ms
64 bytes from 127.2.3.4: icmp_seq=5 ttl=64 time=0.053 ms

--- 127.2.3.4 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4091ms
rtt min/avg/max/mdev = 0.053/0.097/0.177/0.043 ms

ping -I <source> <dest> sends an ICMP packet towards <dest> that appears to come from <source>. Handy!

*[ICMP]: Internet Control Message Protocol, similar to - and using - IP. ICMP is primarily used to test some aspects of the connectivity.

Make it look real

Neat, we have our addresses!

One small detail still needs fixing, though… These addresses are backed by the loopback interface, which is a completely virtual, completely local network device: it is tied to nothing physical, and it is a this-machine-only network. Other than that, it looks and feels like a real NIC in most respects; it complies with the TCP/IP protocols, for instance.

In some other respects, though, it takes shortcuts. The main one for us is the MTU: the loopback interface uses very big Ethernet frames, and for good reasons. The MTU is, roughly speaking, the amount of data the interface will stuff into one Ethernet frame:

  • Too small, and the headers-to-data ratio is suboptimal.
  • Too big, and the chances of losing a frame – and hence the cost of having to retransmit it – increase a lot.

We obviously have to strike a balance here. But on a loopback interface, the balance is tipped: the chances of losing a frame are close to 0, so we can reap all the benefits of big frames:

$ ip link list | grep mtu
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
3: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default 

Here my MTU is 65k, MUCH more than the standard 1500 of a physical NIC, as seen on the other interfaces.

So let's correct that:

$ ip link set dev lo mtu 1500
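
To double-check that the change took (and to restore things once the simulation is over), the following should do; note that 65536 is the usual default for lo, so adjust if yours differed:

$ ip link show dev lo | grep mtu    # should now report mtu 1500
$ ip link set dev lo mtu 65536      # back to the loopback default when done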

Our IPs do look real enough for our purpose. Now let us draw some boundaries and segregate our newly created networks.

Segregate networks

We will now set up a number of iptables rules to enforce our network segregation. If you are already familiar with iptables, you can shamelessly skip the next section.

A word on iptables

The purpose of iptables3 is to filter the traffic going in and out of your machine. It does so by applying a set of rules to each packet. Each rule defines the kind of packet it is interested in, along with an action, which can be many things. In its simplest form, a rule can ACCEPT or DROP a packet.

I really wish I could leave it at that. Unfortunately, I must give a tad more insight into how iptables works for the sake of this article, so please bear with me a little bit more…
This is the simplest view of iptables that serves our purpose:

  • Packets that come and go on your machine all go through some so-called «tables». Tables are built in and we cannot create new ones4.
    Exactly which ones a packet traverses is for the kernel to decide and depends on the direction of the packet.
    Here, we use filter, which is always visited; this is where firewalls usually place their rules.
  • A table is a set of chains. filter has two built-in chains that interest us: INPUT for packets getting into our machine and OUTPUT for those leaving it. Both are empty by default.
  • A chain is a set of rules bundled together: its purpose is to group rules by intent.
    We can create as many chains as we want, but a packet will visit such a custom chain only if we send it there from a built-in chain.
  • Each rule can DROP or ACCEPT a packet, send it to another chain, or exit its chain (the so-called RETURN action).
  • In a chain, rules are traversed in sequence.
    For instance, if a rule drops a packet, the subsequent rules are skipped for that packet.
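
By the way, you can see all of this on your own machine: iptables can list every chain and rule it knows about, along with packet counters, which is handy to follow along with the rest of this section:

# Show the filter table: every chain and rule, with packet counters
# and rule positions (-n avoids slow reverse DNS lookups)
$ iptables -t filter -L -v -n --line-numbers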

Back to our example

So as I said, we will work in the filter table. It is perfectly suited to censoring packets the way we want to. Here is a nice picture of what we want:

[Figure: the ISOLATION and MULTISITE chains in the filter table]

Remember, these IPs are all on our machine. So each packet visits filter twice: as it gets out of lo from its source IP (through OUTPUT) and as it gets back into lo towards its destination IP (through INPUT).

The ISOLATION chain is visited first. It is empty for now, but we will dynamically add rules to it to simulate all kinds of network failures (see simulate failure).

The chain MULTISITE enforces the network segregation we want to implement, that is:

  • Two nodes in the same datacenter can communicate freely.
  • The app network can speak only to DC1; DC1 and DC2 can speak freely with each other.
  • Everything else is dropped, most notably the apps <-> DC2 communications.

The code to implement that:

## ISOLATION will host failures.
iptables -t filter -N ISOLATION
# Send anything from and to 127._._._ to ISOLATION from the builtin chains
# (we are being a bit cautious here... in INPUT we could match only
# on the destination and in OUTPUT only on the source,
# but hey, "belt and suspenders" as we say in France :) )
iptables -t filter -A INPUT  -s 127.0.0.0/8 -d 127.0.0.0/8 -j ISOLATION
iptables -t filter -A OUTPUT -s 127.0.0.0/8 -d 127.0.0.0/8 -j ISOLATION
# ISOLATION is left blank for now.

## MULTISITE creates the expected topography
iptables -t filter -N MULTISITE
# Send anything from and to 127._._._ to MULTISITE from the builtin chains
iptables -t filter -A INPUT  -s 127.0.0.0/8 -d 127.0.0.0/8 -j MULTISITE
iptables -t filter -A OUTPUT -s 127.0.0.0/8 -d 127.0.0.0/8 -j MULTISITE
# Allow applications to speak with one another.
iptables -t filter -A MULTISITE -s 127.0.0.0/16 -d 127.0.0.0/16 -j RETURN
# Apps can find DC1 :              127.0._._   <-> 127.1._._
iptables -t filter -A MULTISITE -s 127.0.0.0/16 -d 127.1.0.0/16 -j RETURN
iptables -t filter -A MULTISITE -s 127.1.0.0/16 -d 127.0.0.0/16 -j RETURN
# Likewise on DC 1  :              127.1._._   <-> 127.1._._
iptables -t filter -A MULTISITE -s 127.1.0.0/16 -d 127.1.0.0/16 -j RETURN
# Likewise on DC 2  :              127.2._._   <-> 127.2._._
iptables -t filter -A MULTISITE -s 127.2.0.0/16 -d 127.2.0.0/16 -j RETURN
# Allow DC1 <-> DC2 :              127.1._._   <-> 127.2._._
iptables -t filter -A MULTISITE -s 127.1.0.0/16 -d 127.2.0.0/16 -j RETURN
iptables -t filter -A MULTISITE -s 127.2.0.0/16 -d 127.1.0.0/16 -j RETURN
# Drop anything that did not match so far
# Especially, apps cannot speak to DC2
iptables -t filter -A MULTISITE -j DROP
  • iptables -t filter -N XYZ
    Creates a New rule chain named XYZ in the table filter (this table is the default, but I state it nevertheless for the sake of clarity).
  • iptables -t filter -A XYZ -s <ip_source> -d <ip_dest> -j <action>
    Appends a new rule to XYZ. This new rule:
    1. matches any packet:
      • sent (-s) from <ip_source> (in CIDR2 notation)
      • AND destined (-d) to <ip_dest>
    2. applies the <action> to it, which can be to:
      • DROP or ACCEPT the packet; this decision is final, no other rule will be evaluated for this packet;
      • stop evaluating the current chain (this is RETURN) and go back to the calling chain (INPUT or OUTPUT in our example) for further rule matching;
      • jump to another chain, by simply giving its name as the action (hence the -j).

Let us try it out:

$ ping -c 1 -I 127.0.2.3 127.2.3.4
PING 127.2.3.4 (127.2.3.4) from 127.0.2.3 : 56(84) bytes of data.
ping: sendmsg: Operation not permitted

--- 127.2.3.4 ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms
$ ping -c 1 -I 127.1.2.3 127.1.3.4
PING 127.1.3.4 (127.1.3.4) from 127.1.2.3 : 56(84) bytes of data.
64 bytes from 127.1.3.4: icmp_seq=1 ttl=64 time=0.063 ms

--- 127.1.3.4 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.063/0.063/0.063/0.000 ms

Seems well enforced to me!5
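
One aside before moving on: a teardown sketch is handy when experimenting. Assuming nothing else on the machine references these chains, something like this brings us back to a clean slate:

# Remove the jumps from the builtin chains first...
iptables -t filter -D INPUT  -s 127.0.0.0/8 -d 127.0.0.0/8 -j ISOLATION
iptables -t filter -D OUTPUT -s 127.0.0.0/8 -d 127.0.0.0/8 -j ISOLATION
iptables -t filter -D INPUT  -s 127.0.0.0/8 -d 127.0.0.0/8 -j MULTISITE
iptables -t filter -D OUTPUT -s 127.0.0.0/8 -d 127.0.0.0/8 -j MULTISITE
# ... then flush (-F) and delete (-X) the custom chains
iptables -t filter -F ISOLATION
iptables -t filter -X ISOLATION
iptables -t filter -F MULTISITE
iptables -t filter -X MULTISITE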

Add delay

OK, we have our networks and we segregated them. Now let us put some distance between our datacenters:

  • packets must take at least 5 ms to go from DC1 to DC2 (and the same delay on the way back, of course)
  • the bandwidth between DC1 and DC2 is capped at 20 Mbps
  • the bandwidth between DC1 and the apps is capped at 2 Gbps

This is exactly what Linux Traffic Control is made to achieve…

*[Mbps]: Megabits per second, 125 kB/s give or take
*[Gbps]: Gigabits per second, 125 MB/s give or take

If you are familiar with both tc and htb, you can skip to Back to our example again.

Meet tc and htb

This one is hands down my favorite piece :) …
tc stands for Traffic Control; it is both super powerful and quite pedantic in its configuration. The main purpose of this wonderful tool is to enforce QoS rules by prioritizing packets just before they get on the wire6.

The simplest example is one class only (more on that shortly), with a pfifo «queue discipline»: this discipline deals with packets as they come, First In First Out.

Another popular one is pfifo_fast, which will honor the DiffServ bits of the packets, and use a fifo for packets with the same priority.

Now, remember the output when we changed the MTU? Here it is again, in case you do not have a photographic memory or are too lazy to scroll back:

$ ip link list | grep mtu
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
3: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default 

Right after the MTU, there is a “qdisc” indication7. This stands for queue discipline and this is exactly what we are describing here.

*[DiffServ]: Differentiated Services, indicates if a given packet needs priority or throughput. For instance, VoIP is flagged as priority and skips some queues on the network. But if your machine queues them, it is worthless…

It can become really hairy with some convoluted disciplines, and by leveraging all the abstractions readily available in tc:

  • «classes» help in grouping packets together for a specific purpose.
  • «filters» can – not unlike iptables – match packets and enqueue them into classes (well, flows, but I’m trying to keep that simple).
  • «handlers» further alter the processing of packets.

For our purpose, we will use a specific discipline called Hierarchical Token Bucket, or htb for short. htb is quite popular, and it makes good use of classes. It is a nice beast, really. And the site’s CSS lets you know this is really geek stuff.

Simply put, htb is a “simple” way of distributing bandwidth amongst different usages:

[Figure: QoS with HTB]
  • SMTP and VoIP are each assigned 1 Mbps.
  • Everything under “Noise” has to share 1 Mbps, nothing more.
  • The process repeats under “Noise”.

htb also embeds a concept of priority and bandwidth sharing, so that excess bandwidth can be redistributed in a controlled way. Priority also helps when bandwidth is “overbooked”: higher-priority buckets (lower numbers) are honored first.

This is a very, very simple example of what tc and htb can do. More here.
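
To make the picture above a bit more concrete, here is a sketch of what that hierarchy could look like in tc terms. Everything in it is illustrative; the interface (eth0), the rates and the port numbers are assumptions, not taken from my setup:

# Root htb qdisc; unclassified traffic falls into class 1:30 ("Noise")
tc qdisc add dev eth0 root handle 1: htb default 30
# One parent class holding the whole link
tc class add dev eth0 parent 1:  classid 1:1  htb rate 3mbit
# ... and three children: SMTP, VoIP, and everything else
tc class add dev eth0 parent 1:1 classid 1:10 htb rate 1mbit
tc class add dev eth0 parent 1:1 classid 1:20 htb rate 1mbit
tc class add dev eth0 parent 1:1 classid 1:30 htb rate 1mbit
# Steer SMTP (port 25) and VoIP (say, SIP on port 5060) into their classes
tc filter add dev eth0 protocol ip parent 1:0 prio 1 u32 match ip dport 25 0xffff flowid 1:10
tc filter add dev eth0 protocol ip parent 1:0 prio 1 u32 match ip dport 5060 0xffff flowid 1:20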

Back to our example again

This is what we want:

[Figure: bounding bandwidth with htb]

And these are the tc calls to get it:

# Boilerplate: Setting up lo with an htb queue discipline
tc qdisc del dev lo root
tc qdisc add dev lo root handle 1: htb
# Class 1:1 (limited to 2 Gbps) will get traffic inside DC1 and between the apps and DC1
tc class add dev lo parent 1: classid 1:1 htb rate 2gbit
# Class 1:2 (limited to 2 Gbps) will get traffic inside DC2
tc class add dev lo parent 1: classid 1:2 htb rate 2gbit
# Class 1:3 (limited to 20 Mbps) is for traffic between DC1 and DC2
tc class add dev lo parent 1: classid 1:3 htb rate 20mbit
#  DC1 <-> DC1 goes to class 1:1
tc filter add dev lo protocol ip parent 1:0 prio 1 u32 match ip dst 127.1.0.0/16 match ip src 127.1.0.0/16 flowid 1:1
# apps <-> DC1 goes to class 1:1 too
tc filter add dev lo protocol ip parent 1:0 prio 1 u32 match ip dst 127.1.0.0/16 match ip src 127.0.0.0/16 flowid 1:1
tc filter add dev lo protocol ip parent 1:0 prio 1 u32 match ip dst 127.0.0.0/16 match ip src 127.1.0.0/16 flowid 1:1
#  DC2 <-> DC2 goes to class 1:2
tc filter add dev lo protocol ip parent 1:0 prio 1 u32 match ip dst 127.2.0.0/16 match ip src 127.2.0.0/16 flowid 1:2
#  DC1 <-> DC2 goes to class 1:3
tc filter add dev lo protocol ip parent 1:0 prio 1 u32 match ip dst 127.1.0.0/16 match ip src 127.2.0.0/16 flowid 1:3
tc filter add dev lo protocol ip parent 1:0 prio 1 u32 match ip dst 127.2.0.0/16 match ip src 127.1.0.0/16 flowid 1:3
# Little handler to add delay (5ms per packet, so 10ms for a round trip) and 5% packet loss
tc qdisc add dev lo parent 1:3 handle 30: netem delay 5ms 1ms 5% loss random 5% # Add a delay on flow 1:3, the inter-site link.

The comments should be sufficient to explain what is going on. I won’t dig too much into the syntax as it would lead us too far.

The last line, though, is interesting: it places a «netem» handler on class 1:3, the one between DC1 and DC2.
netem stands for Network Emulation and, again, this little devil deserves a full post for what it can offer. Let us settle for this: we use netem to delay any packet reaching the class this handler is pinned to.

  • delay 5ms 1ms 5% delays a packet by 5ms ± 1ms, with a uniform distribution. The 5% is a correlation: a value is not chosen completely at random, but depends a tiny bit on the previous one. This avoids an overly noisy delay, which would not be very realistic in practice.
  • loss random 5% drops 5% of the packets at random.
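
As an aside, tc can dump what was actually installed, along with live counters; this is a good way to make sure the filters really steer packets into the classes we think they do:

$ tc -s qdisc show dev lo
$ tc -s class show dev lo
$ tc filter show dev lo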

All this accounts for a rather sh1tty link between both data centers, right?8 Let us check that…

From DC1 to DC2:

$ ping -q -c 1000 -A -I 127.1.2.3 127.2.3.4
PING 127.2.3.4 (127.2.3.4) from 127.1.2.3 : 56(84) bytes of data.

--- 127.2.3.4 ping statistics ---
1000 packets transmitted, 910 received, 9% packet loss, time 12850ms
rtt min/avg/max/mdev = 8.512/12.020/129.759/6.550 ms, pipe 2, ipg/ewma 12.862/10.847 ms

Packets are delayed on their way out and again on their way back, so the round trip is roughly what we expect: 10ms ± 2ms, and the loss is ~10% (5% on each leg).

Within DC2:

$ ping -q -c 1000 -A -I 127.2.2.3 127.2.3.4
PING 127.2.3.4 (127.2.3.4) from 127.2.2.3 : 56(84) bytes of data.

--- 127.2.3.4 ping statistics ---
1000 packets transmitted, 1000 received, 0% packet loss, time 17ms
rtt min/avg/max/mdev = 0.005/0.007/0.313/0.014 ms, ipg/ewma 0.017/0.006 ms

Pings take no time, there is no loss. Slick!
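
If you also want to verify the bandwidth caps, a quick test with iperf3 (not part of the setup above, just a suggestion) can bind to our fake addresses:

# Server side, bound to a DC2 address
$ iperf3 -s -B 127.2.3.4
# Client side, bound to a DC1 address: throughput is bounded by the 20 Mbps cap,
# and further degraded by the delay and loss we injected
$ iperf3 -c 127.2.3.4 -B 127.1.2.3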

All in all

Let us take a short moment to sum up what we have done so far:

  • we created a bunch of IPs on the loopback interface
  • we gave this interface a decent MTU for our simulation
  • we forbade some connections between these IPs to simulate several segregated networks
  • we created the ISOLATION chain, a placeholder for new firewall rules to simulate incidents
  • we added some delay and mayhem between one datacenter and the other

Simulate Failure

At long last, here comes the fun! Let us spread some chaos here.

# Everything coming from DC1 towards DC2
iptables -t filter -I ISOLATION 1 -s 127.1.0.0/16 -d 127.2.0.0/16 -j DROP
# Everything coming from DC2 towards DC1
iptables -t filter -I ISOLATION 1 -s 127.2.0.0/16 -d 127.1.0.0/16 -j DROP

Cold-blooded murder of any packet trying to get from one datacenter to the other… Rude.

The construct iptables ... -I <CHAIN> 1 ... inserts a rule at position 1 in the chain. This ensures that these rules sit at the top of the chain.

You can revert that with:

iptables -t filter -D ISOLATION -s 127.1.0.0/16 -d 127.2.0.0/16 -j DROP
iptables -t filter -D ISOLATION -s 127.2.0.0/16 -d 127.1.0.0/16 -j DROP

The construct iptables ... -D <CHAIN> ... deletes a rule. Since rules have no names, you have to repeat every option that was used to create it…
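
Alternatively, you can list the chain with rule positions and delete by number; a small sketch, just double-check the position before pulling the trigger:

# Show the ISOLATION rules with their positions
iptables -t filter -L ISOLATION -n --line-numbers
# Delete whatever rule sits at position 1
iptables -t filter -D ISOLATION 1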

Isolating a node

Same logic. You just need to close the door both ways. This means two rules:

# Everything coming from that IP
iptables -t filter -I ISOLATION 1 -s 127.1.2.3 -j DROP
# Everything going to that IP
iptables -t filter -I ISOLATION 1 -d 127.1.2.3 -j DROP

And this is how to revert all this:

iptables -t filter -D ISOLATION -s 127.1.2.3 -j DROP
iptables -t filter -D ISOLATION -d 127.1.2.3 -j DROP

Strange failures

Building on the above, I’m pretty sure you can come up with your own particular situations:

  • packets killed in one direction only (e.g. the result of a misconfiguration, like a bad firewall rule)
  • a flapping link, using watch or sleep to repeatedly kill and restore the connection (see the sketch after this list)
  • a very distant DC, using a 500ms delay (this is the “distance” between Sydney and NY for instance)

… and so on and so forth
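
For the flapping link, for instance, a crude loop along these lines should do; the timings are arbitrary, and you stop it with Ctrl-C:

# Cut DC1 <-> DC2 for 3 seconds, restore it for 10 seconds, repeat
while true; do
  iptables -t filter -I ISOLATION 1 -s 127.1.0.0/16 -d 127.2.0.0/16 -j DROP
  iptables -t filter -I ISOLATION 1 -s 127.2.0.0/16 -d 127.1.0.0/16 -j DROP
  sleep 3
  iptables -t filter -D ISOLATION -s 127.1.0.0/16 -d 127.2.0.0/16 -j DROP
  iptables -t filter -D ISOLATION -s 127.2.0.0/16 -d 127.1.0.0/16 -j DROP
  sleep 10
done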

Be wary, though, not to create unrealistic situations, like 0ms one way and 100ms the other way between two sites.

Conclusion

There are some tools out there to do all that easily9. As far as I can tell, they all use tc with netem, as this is «the simplest way» to induce a controlled amount of delay in the network (short of a dumb proxy over a bad VPN).
But once you have spent some time with tc, iptables and ip are not that frightful.

So yes, I could have used any of these tools, or spun up a set of VMs in two separate zones of a public cloud.
And yes, you can call this Stockholm syndrome, but nothing is as fun as using programs available on almost every Linux distribution :)

We have covered some distance here.
The main objective was to simulate, simply and in a lightweight manner, three networks, one of which sits at some distance from the others.

To do all that:

  • we dug a bit into the loopback interface
  • we dealt with iptables to drop some packets
  • we used tc, htb and netem to stochastically delay and drop some of our packets.

Lastly, we saw briefly how to induce some failures in this topology. Remember, I did all this to gain insight into how some key properties of Hazelcast behave.
There were some interesting conclusions, but this is a story for another article!

(Most of the code is available here.)


*[stochastically]: “at random” but sounds smarter, so…
*[NIC]: Network Interface Card, the hardware part we plug the Ethernet cable into.
*[MTU]: Maximum Transmission Unit
*[QoS]: Quality of Service. In networking: prioritizing certain traffic, enforcing a minimum bandwidth, etc…


  1. The Byzantine generals problem is a well-known and well-described impossibility result. ↩︎

  2. Remember that an IPv4 address has the form <8 bit number>.<8 bit number>.<8 bit number>.<8 bit number>. Now a.b.c.d/X is a network composed of all the addresses starting with the first X bits of the mentioned IP. So 127.0.0.0/8 is all the addresses from 127.0.0.0 to 127.255.255.255. See CIDR notation for much more, and this simple calculator to get a feel for it. ↩︎ ↩︎

  3. A complete description of iptables is well beyond this post. Refer to this comprehensive article if you need more of iptables. Who knows, it could save your life someday! ↩︎

  4. yes, we can create new tables, and yes it is a very bad idea. Distros might do that in some very special cases, but we mere mortals should avoid doing that. ↩︎

  5. We use ping here, which sends ICMP packets, and ICMP is quite different from TCP. But since it sits on top of IP and we did not rely on ports – a TCP concept – in our iptables rules, it is proof enough for us. ↩︎

  6. A so-called «discipline» can be put on incoming or outgoing packets, but here we are only interested in outgoing packets, and as far as I know, use cases for incoming packets are quite uncommon. ↩︎

  7. The noqueue discipline means “send the packet if the hardware is ready, or else drop it” (The packet, not the hardware). This is usually the default for virtual devices, like docker0 and lo here.
    The fq_codel is an intelligent default and strikes a good balance most of the time…
    On the other hand, I am on a VM so enp0s3 is also virtual, but my guest OS has no clue, and the queue here is mostly useless anyway… I’ll fix that sometime :) ↩︎

  8. Well, we did not unleash all hell upon that link, and there is a lot more «netem» can do to our delicate packets, like corruption, reordering, or duplication.
    Although very tempting, bear in mind that TCP has your back in most of these scenarios, so adding them does not change the behaviour of your link that much. Even packet dropping is somewhat unnecessary here, as the link will either work or not, without any major quirks.
    Now if you use UDP, you should use «netem» at some point to strengthen your tests, since duplicated / reordered packets will be challenging for your application. ↩︎

  9. These two are very interesting and use the same ideas I develop here:

    • Saboteur , a small program inducing failure and latency on the network. I used it a while back to test Hystrix with some friends .
      Unfortunately, both Saboteur and Hystrix seem stalled for some years now.
    • Pumba , a Docker image to induce failure and delay on running containers.
      This one seems pretty active as of 2020.
     ↩︎