Fauna – Where’s the Drop?

Posted on June 11th, 2012

In order to understand this fauna you will first need to know a little bit about Boundary’s streaming system.  Customers connect their meters up to a set of hosts that we refer to internally as the collectors.  The collectors terminate the connections from the meters and expose them internally as an array of pubsub endpoints, one per customer.

We track the health of many of our components in the streaming system by comparing relative traffic levels of input and output.  For instance, if we are receiving 100 Mbps of ingress from our customers, collector egress should be somewhere in the vicinity of 440 Mbps.  In other words, egress ought to run at about a 4.4x multiple of ingress. So imagine our surprise when we saw this.

This graph tells us that the collectors are running at a 16x multiplier.  What was most interesting about this finding was what happened when we restarted a collector: the ratio would drop back to normal.  This, therefore, was something that built up over a long period of time due to some sort of leak in the code.  Our prime suspect was the pubsub system in the collectors; in particular, the theory was that pubsub handlers weren’t getting cleaned up when the receiving side went away.  After some spelunking in the code we found what we believed was the issue.

@@ -38,6 +38,7 @@ drain(Pid) ->
 %% gen_server callbacks
 init([OrgMatch, RemotePid, Version]) ->
+ process_flag(trap_exit, true),
  timer:send_interval(1000, self(), export),
  {ok, #state{pid=RemotePid,org=OrgMatch,version=Version}}.

@@ -57,7 +58,9 @@ handle_info({packet, OrgId, _MeterId, Packet}, State=#state{pid=Pid,version=1,or
 send_packets(1, Pid, OrgId, [Packet]),
 {noreply, State};
 handle_info({packet, OrgId, MeterId, Packet}, State=#state{version=Version}) ->
- {noreply, apply_packet(translate_packet(Version, MeterId, OrgId, Packet), State)}.
+ {noreply, apply_packet(translate_packet(Version, MeterId, OrgId, Packet), State)};
+handle_info({'EXIT',_,_}, State) ->
+ {stop, normal, State}.

The issue pointed out in this diff is a subtlety in the way that Erlang handles link breakages.  A broken link is delivered as an exit signal, an out-of-band message handled by the Erlang runtime system.  When the exit reason is “normal”, the runtime will not kill the receiving process, so the subscription quietly lives on.  The fix, therefore, is to tell the runtime that we want to see every exit signal as an ordinary message.  That’s what the line “process_flag(trap_exit, true)” does.  The rest is just message handling to explicitly stop the process when a link breaks.  With the code fixed, we proceeded to do a deploy.  The new behavior would be loaded up via a hot code upgrade; however, in order to clean out the leaked subscriptions we were also going to do a rolling restart of the collectors.

The deploy started fine, but then we started to see a huge egress spike on collector 6.  After a bit of investigation we found out that this was due to the load balancing of connections across the collectors.  As collectors were restarted, connections got shifted onto other ones.  It was so pronounced on collector 6 because it was still sending so much data over the leaked subscriptions.  We proceeded with the rolling restart of the collectors, finally clearing out all the leaked subscriptions.

What are the lessons here?  As usual, the most interesting bugs only show up after months of production use.  The other takeaway is that traffic ratios for services are a powerful way for us to express and reason about aggregate behavior in heterogeneous distributed applications.  Keep an eye out for some new product capabilities based around this concept in the near future.
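A ratio check like the one that caught this leak is only a few lines. Here is a minimal sketch (the function names, tolerance, and window are illustrative, not our production monitoring code):

```python
def egress_ratio(ingress_mbps, egress_mbps):
    """Return the egress/ingress multiplier, or None when there is no ingress."""
    if ingress_mbps == 0:
        return None
    return egress_mbps / ingress_mbps

def ratio_healthy(ingress_mbps, egress_mbps, expected=4.4, tolerance=0.5):
    """Flag a service whose output multiplier drifts too far from the
    expected fan-out -- a leak shows up as a slowly climbing ratio."""
    ratio = egress_ratio(ingress_mbps, egress_mbps)
    return ratio is not None and abs(ratio - expected) <= tolerance
```

With the numbers from the graph above, 100 in and 440 out is healthy, while 100 in and 1600 out (the 16x case) trips the check.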


May Tech Talk Videos

Posted on May 29th, 2012


A big thanks to our May tech talk speakers Jeff Hodges and Dietrich Featherston.  This meetup had the highest turn out so far, so thanks to everyone who showed up and made it a great evening.  Without further ado, here are the videos.


Visualizing Network Flow Data

Posted on May 14th, 2012

A soon-to-be-released Boundary feature will display a network view of real-time flow between applications. We’re experimenting with ways to visualize this data, and using D3 we’ve produced SVG visualizations of transmitted and received data based on research by Danny Holten and Jarke J. van Wijk:

Tapered Edges

Try dragging the circles around the window! View source on JSFiddle

Curved & Tapered Edges

View source on JSFiddle

A few details about these visualizations:

  • The thick end of the flow line is the source. Holten and van Wijk demonstrate that users can recognize connections faster with tapered edges than with arrows (and hopefully you did too).
  • Color and thickness represent the strength of the connection. The strongest connection in the group serves as the max value, and all other connections are scaled relative to it.

In their research Holten and van Wijk compare the performance of directed edge representations in user studies to see which is the clearest:

Directed Edge Comparison

Tapered edges performed best, but the test displays connections in only one direction. At Boundary we need to show connections between nodes in both directions, as well as the strength of the connection.

Using tapered edges, both straight and curved, we think we’ve found a couple of ways to represent bidirectional edge connections that are both intuitive and beautiful. Straight edges seem to be a bit clearer, but the curved edges look gorgeous. What do you think?


Know a Delay: Nagle’s Algorithm and You.

Posted on May 2nd, 2012

In summary: set TCP_NODELAY on TCP sockets when you’re sending many small, latency-critical messages.

TCP stacks are complicated beasts with all kinds of emergent behavior, grafted on through years of evolving hardware, transports, and workloads. Sometimes small settings can make huge improvements for certain kinds of socket patterns. Today I’d like to talk about Nagle’s algorithm, controlled by a socket flag called TCP_NODELAY.

Programs that send many small messages over a TCP socket can cause problems: since each packet carries a 20-byte IP header and a 20-byte TCP header, the overhead can quickly dominate the link. To save bandwidth, John Nagle proposed a method to aggregate multiple TCP writes into a single packet. In essence, Nagle’s algorithm accumulates outbound data until either:

1. The maximum segment size is reached, or
2. An ACK is received.
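The send decision can be sketched roughly like this (a simplification: real stacks track bytes in flight and per-connection MSS, not booleans):

```python
MSS = 1460  # a typical TCP maximum segment size, in bytes

def nagle_should_send(buffered_bytes, data_in_flight):
    """Nagle's rule, roughly: transmit when a full segment is ready,
    or when nothing is outstanding (i.e. the last data was ACKed);
    otherwise keep accumulating small writes."""
    if buffered_bytes >= MSS:
        return True   # condition 1: maximum segment size reached
    if not data_in_flight:
        return True   # condition 2: previous data has been ACKed
    return False      # small write with unacked data: hold it back
```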

For small flows, this keeps the packet ratio at roughly 1:1–one packet sent for each ACK received. Transmission latency is capped by the round-trip time, and protocol overhead is significantly reduced. It can impose an unnecessary latency cost on small, chatty writes, but for typical workloads it Does The Right Thing (TM).

Well, mostly. There’s another congestion control measure called TCP Delayed Acknowledgement, which allows the receiver of TCP data segments to hold off on responding with an ACK for up to 500 ms, or until every second packet–again, to reduce protocol overhead. This means that some types of TCP flows (in particular write-write-read patterns) can leave Nagle’s algorithm and Delayed Acknowledgement in a stalemate: the sender refuses to send the second small packet until an ACK arrives, but the receiver won’t ACK until a second packet arrives (or its delay timer fires).
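The stalemate condition itself is simple enough to write down (an illustrative model, not kernel code):

```python
def stalemate(sender_has_unacked_data, packets_since_last_ack):
    """Each side is waiting on the other: Nagle holds the second small
    write until an ACK arrives; delayed ACK holds the ACK until a second
    packet arrives. Progress resumes only when a timer fires."""
    nagle_waits = sender_has_unacked_data       # won't send small write #2
    delayed_ack_waits = packets_since_last_ack < 2  # won't ACK yet
    return nagle_waits and delayed_ack_waits
```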

The results can be catastrophic.

Riemann‘s TCP protocol consists of sending a small (perhaps 30 bytes) Protocol Buffer event, then waiting (synchronously) for a small confirmation message. The simple Java client I wrote goes like this:
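In essence, each event is a couple of small writes followed by a blocking read for the acknowledgement–exactly the write-write-read pattern described above. A minimal Python sketch of the same exchange (the length-prefixed framing and function name are illustrative, not the actual client):

```python
import socket
import struct

def send_event(sock, payload):
    """Send one length-prefixed message, then block until the reply arrives.
    Two small writes followed by a read: the pattern that stalls under
    Nagle + delayed ACK."""
    sock.sendall(struct.pack(">I", len(payload)))  # small write #1: length prefix
    sock.sendall(payload)                          # small write #2: the event body
    (reply_len,) = struct.unpack(">I", sock.recv(4))  # block on the confirmation
    return sock.recv(reply_len)
```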

The benchmarks were so slow that I gave up waiting, and went home for the night. In the morning, I found:

Nagle's Algorithm - Latency

This kind of latency pattern is, in a way, *relieving*. An even 40ms for every operation suggests a buffer or timeout somewhere; a spiky pattern would suggest a CPU or IO-bound process. VisualVM confirmed the benchmark was spending almost all its time in the Netty server’s select(), which told me the client was failing to send packets the server wanted. Checking the packet times in Wireshark confirmed the latency issue. I disabled Nagle’s algorithm with socket.setTcpNoDelay(true); and boom:

Nodelay - Latency

That 40 ms latency disappeared instantly. Steady-state throughput rose from 25 events/sec (about what one 40 ms round trip per event allows: 1/0.040 ≈ 25) to 4500 events/sec. Because of this protocol’s write pattern and its need for synchronous, low-latency small messages, disabling Nagle’s algorithm was the correct choice: dramatically improved latency and throughput at a moderate cost in bandwidth overhead.

Even if you aren’t writing your own network client, some applications allow you to configure their use of Nagle’s algorithm. Riak, for instance, offers disable_http_nagle, which can significantly improve latency for certain data volumes and network conditions. Moral of the story: TCP is tunable. It pays to know your socket flags.
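In Python, for instance, the flag is one socket option away–the equivalent of the Java setTcpNoDelay call above:

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Disable Nagle's algorithm: small writes go out immediately
# instead of being coalesced while an ACK is outstanding.
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
```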


Hungry Hungry Kobayashi, Part 2

Posted on April 25th, 2012


In my last blog post I talked a little about debugging a problem with our distributed analytics database by looking at the network. For all of the gory details of the resolution, and why distributed systems are hard, read on. If not, the tl;dr is:

When debugging distributed systems behavior, looking at what the network is doing helps us begin our troubleshooting closer to the root cause.

Recall that we were seeing cleanup tasks of expired data in riak taking progressively longer and longer until the cluster became largely unresponsive to any other work. The next step was to take a closer look at these cleanup tasks and look at both the number of keys being cleaned up and the individual latencies for deletion. The number of keys being scheduled for deletion each cleanup interval was much higher than expected, and the latency of each delete operation was also growing steadily. Further investigation showed that manually querying riak’s built-in $key 2i via the REST API returned results for ranges of keys that had been previously deleted. Worse still, the number of keys in this range was wildly inconsistent on every invocation.

We first suspected this to be an issue with secondary indexes not respecting recent deletes and made some configuration changes to our cluster to remove tombstones immediately upon key deletion. This didn’t help. Digging deeper, we found the issue to be far upstream from riak itself, in the streaming system responsible for data cubing on relevant dimensions. As units of work moved around the cluster (from handoff during a deploy or load correction, for example), scheduled flushes of mutable state to riak were not being shut down. And because these leaked schedulers were no longer receiving any new data, they stagnated, repeatedly flushing the same stale state to riak, which was then immediately scheduled for deletion. The additional write and cleanup load this scheduler leak placed on riak is what eventually caused the cluster to become unresponsive.
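A leak of this shape is easy to write: a periodic flusher whose cancel path is forgotten when its unit of work moves away. A minimal sketch of the fixed pattern (names and the timer mechanism are illustrative, not our actual streaming code):

```python
import threading

class Flusher:
    """Periodically flushes buffered state. stop() must be called when the
    unit of work is handed off; otherwise the timer keeps firing forever,
    flushing stale state -- exactly the leak described above."""

    def __init__(self, interval, flush_fn):
        self.interval = interval
        self.flush_fn = flush_fn
        self.stopped = False
        self._schedule()

    def _schedule(self):
        self.timer = threading.Timer(self.interval, self._tick)
        self.timer.daemon = True
        self.timer.start()

    def _tick(self):
        self.flush_fn()
        if not self.stopped:
            self._schedule()

    def stop(self):
        """The step that was missing: cancel the pending flush on handoff."""
        self.stopped = True
        self.timer.cancel()
```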

What initially looked like a riak 2i problem turned out to be a scheduler leak far upstream from riak in our cluster of streaming nodes.

Once this issue was corrected, we began to see riak behavior improve dramatically. Our 99th percentile put latencies have dropped from 200-225 ms to under 10 ms–much closer to the performance we expected on SSDs.

Separate Incidents

You might also recall that, while cleanup tasks were executing every 10 minutes, every sixth task (once per hour) produced a much more noticeable effect on the network. While fixing the scheduler leak in the streaming system presented us with a much happier riak, we still didn’t understand the variance in cleanup task behavior.

For context, let’s look again at the 1-second and 1-minute resolution views of the network load generated by these cleanup tasks.

This was much simpler to address and was caused by a bug in per-query clock advancement in kobayashi itself. After resolving this issue and looking again at the network, we see behavior that conforms to our mental model of how these cleanup tasks work: roughly congruent patterns that repeat every 10 minutes.

In Summary

This series of blog posts was motivated by observing network behavior that clashed with our mental model of distributed system behavior. We hypothesized that there was a link between this mismatch and degraded performance of the system over time. By starting with the network, we were able to tackle these problems closer to their root cause rather than at the trailing indicators–which in this case would have simply been poor riak performance and eventual rejected writes.

