Streaming Failure Post-Mortem 7/31

Posted on August 1st, 2012

From approximately 8:00 AM PDT to 2:00 PM PDT, Boundary experienced intermittent downtime in the streaming system, resulting in an inability to accept data from customer meters and in blank dashboards. The high level overview is that a sharp spike in ingress customer data, combined with a suboptimal load balancing strategy, resulted in a set of cascading failures that proved extremely difficult to recover from.

At around 8:00 AM PDT we noticed a sharp uptick in customer traffic, approximately 60 Mbps added all at once. We use global server load balancing (GSLB) as the primary load balancing / disaster recovery facility for meters (customer-installed agents) connecting to our collectors, the main point of ingestion for customer data. Unfortunately, when a large number of meters comes up all at once, DNS caching on the customer’s network ensures that all of those meters connect to the same collector machine. In this particular case, that one collector was already running near its capacity, and the extra load was enough to knock it offline.

The resulting spillover load when combined with the uneven load balancing resulted in a thundering herd situation. Collectors would start, operate for a few minutes, then buckle under the load.

Meanwhile, the instability and occasional unresponsiveness of the collectors was causing problems for Phloem, our Kafka-powered messaging tier. The Phloems were doing blocking connection attempts in their main data consumption threads. When the collectors became unresponsive, these connection attempts starved out the data consumption thread pool. Data waiting to be processed backed up in the Phloem instances, causing OOM exceptions and necessitating killing the instances without a clean shutdown.

On restart, the Phloems brought up Kafka in recovery mode, which means a scan of all files on disk. Because our particular use of Kafka involves many topics, this expensive recovery operation caused restarts to take 20 to 60 minutes. At no time were all Phloems down, and thanks to Ordasity, load shifted as needed. However, due to collector connection timeouts in the work-claiming cycle, load shifted much more slowly than normal, also causing occasional downtime for some clients.

Specific Fixes

We implemented some band-aids to get the system back under control immediately. A set of HAProxy instances was deployed in front of the collectors, providing a more even balance of load across the cluster. We also cleaned up a number of obsolete topics in our Kafka instances, helping with the restart time. In the Phloems, we moved the connection code out of the main data consumption path and tuned timeouts and retry rates to improve robustness while the collectors recovered.
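To illustrate the Phloem change, here is a minimal sketch in Python of keeping blocking connection attempts out of the consumption path. The names (`connect_to_collector`, `ready_connections`) are hypothetical stand-ins, not Phloem's actual code: a dedicated connector thread owns all blocking connects and retries, while consumers only ever take already-established connections off a queue.

```python
import queue
import threading
import time

ready_connections = queue.Queue()

def connect_to_collector(addr, timeout=2.0):
    # Stand-in for a blocking socket connect with a bounded timeout.
    time.sleep(0.01)
    return ("conn", addr)

def connector(addrs):
    # A dedicated thread owns all blocking connects, so an unresponsive
    # collector can stall this thread without starving the consumer pool.
    for addr in addrs:
        ready_connections.put(connect_to_collector(addr))

t = threading.Thread(target=connector, args=(["collector-1", "collector-2"],))
t.start()
t.join()

# Consumers only ever take already-established connections off the queue.
conns = []
while not ready_connections.empty():
    conns.append(ready_connections.get_nowait())
```

With this split, a slow collector delays new connections but never blocks the threads draining already-buffered data.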

Longer term fixes include implementing more intelligence around connection logic in the meter. The meter should do some form of back-off when the collectors do not respond, to prevent the thundering herd issues seen during collector recovery. The meters should also be able to intelligently assign themselves to a collector, so that a single customer’s traffic does not all end up on a single collector. This may take the form of consistent hashing against the DNS records for our collectors.
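A rough sketch of both ideas, assuming illustrative collector hostnames and meter IDs (this is not the meter's actual logic):

```python
import hashlib
import random

# Hypothetical collector records, as might be returned by a DNS lookup.
COLLECTORS = ["collector-1.example.com",
              "collector-2.example.com",
              "collector-3.example.com"]

def pick_collector(meter_id, collectors):
    # Consistent hashing: each meter deterministically picks a collector
    # from the full record set, so one customer's meters spread across the
    # cluster instead of all landing on a single cached DNS answer.
    ring = sorted((hashlib.md5(c.encode()).hexdigest(), c) for c in collectors)
    point = hashlib.md5(meter_id.encode()).hexdigest()
    for h, c in ring:
        if point <= h:
            return c
    return ring[0][1]  # wrap around the ring

def backoff_delays(base=1.0, cap=60.0, attempts=6):
    # Exponential back-off with jitter: failed meters retry progressively
    # more slowly, avoiding a thundering herd while collectors recover.
    return [random.uniform(0, min(cap, base * 2 ** i)) for i in range(attempts)]
```

The jitter matters as much as the exponential growth: without it, meters that failed together retry together.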

Internally, in the interaction between the collectors and Phloem we need to completely separate data plane from control plane. Separating the control plane and data plane is a standard practice for many high throughput systems and it is an architecture that we plan on moving towards for our internal message flow.


Boundary Tech Talks – June 26th

Posted on July 2nd, 2012

Another evening of tech talks, and another record turnout, despite all of the conference activity going on in the Bay Area last week.   A big thanks goes out to our speakers, both of whom gave excellent talks.

Jamie Allen talks about CPU caches.

Joe Damato takes us on a deep dive to find An Bug.


Fauna – Where’s the Drop?

Posted on June 11th, 2012

In order to understand this fauna you will first need to know a little bit about Boundary’s streaming system.  Customers connect their meters up to a set of hosts that we refer to internally as the collectors.  The collectors terminate the connections from the meters and expose them internally as an array of pubsub endpoints, one per customer.

We track the health of many of our components in the streaming system by comparing relative traffic levels of input and output.  For instance, if we are receiving 100 Mbps of ingress from our customers, the collector egress should be somewhere in the vicinity of 440 Mbps.  In other words, the ratio of egress to ingress ought to be a 4.4x multiplier.  So imagine our surprise when we saw this.

This graph tells us that the collectors are running at a 16x multiplier.  What was most interesting about this finding was what happened when we restarted a collector.  We found that the ratio would drop back to normal.  Therefore, this is something that builds up over a long period of time due to some sort of leak in the code.  Our prime suspect for finding the leak was the pubsub system in the collectors.  In particular, the theory was that pubsub handlers weren’t getting cleaned up when the receiving side went away.  After some spelunking in the code we found what we believed was the issue.

@@ -38,6 +38,7 @@ drain(Pid) ->
 %% gen_server callbacks
 init([OrgMatch, RemotePid, Version]) ->
+ process_flag(trap_exit, true),
  timer:send_interval(1000, self(), export),
  {ok, #state{pid=RemotePid,org=OrgMatch,version=Version}}.

@@ -57,7 +58,9 @@ handle_info({packet, OrgId, _MeterId, Packet}, State=#state{pid=Pid,version=1,or
 send_packets(1, Pid, OrgId, [Packet]),
 {noreply, State};
 handle_info({packet, OrgId, MeterId, Packet}, State=#state{version=Version}) ->
- {noreply, apply_packet(translate_packet(Version, MeterId, OrgId, Packet), State)}.
+ {noreply, apply_packet(translate_packet(Version, MeterId, OrgId, Packet), State)};
+handle_info({'EXIT',_,_}, State) ->
+ {stop, normal, State}.

The issue pointed out in this diff is a subtlety in the way that Erlang handles link breakages.  Link breakages get handled via an exit signal, which is an out-of-band message handled by the Erlang runtime system.  When the exit reason is “normal”, the runtime system will not kill the receiving process.  The fix, therefore, is to tell the runtime to deliver all exit signals to us as regular messages.  That’s what the line “process_flag(trap_exit, true)” does.  The rest is just message handling to explicitly stop the process when a link break happens.  With the code fixed, we proceeded to do a deploy.  The new behavior would be loaded via a hot code upgrade; however, in order to clean out the leaked subscriptions, we planned a rolling restart of the collectors.

The deploy started fine, but then we started to see a huge egress spike on collector 6.  After a bit of investigation we found out that this was due to the load balancing of connections across the collectors.  As collectors were restarted, connections got shifted onto other ones.  It was so pronounced on collector 6 because it was still sending so much data over the leaked subscriptions.  We proceeded with the rolling restart of the collectors, finally clearing out all the leaked subscriptions.

What are the lessons here?  As usual, the most interesting bugs only show up after months of production use.  The other takeaway is that traffic ratios for services are a powerful way for us to express and reason about aggregate behavior in heterogeneous distributed applications.  Keep an eye out for some new product capabilities based around this concept in the near future.
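The traffic-ratio check described above can be sketched in a few lines. The 4.4x multiplier comes from this post; the tolerance band is an assumed illustrative value, not Boundary's actual alert threshold:

```python
EXPECTED_RATIO = 4.4  # expected collector egress/ingress multiplier
TOLERANCE = 0.25      # illustrative: allow +/-25% drift before alerting

def ratio_ok(ingress_mbps, egress_mbps,
             expected=EXPECTED_RATIO, tolerance=TOLERANCE):
    # Healthy collectors keep egress near expected * ingress; a drifting
    # ratio (like the 16x seen above) signals a leak or a stall.
    if ingress_mbps <= 0:
        return False  # no ingress at all is itself an alert condition
    ratio = egress_mbps / ingress_mbps
    return abs(ratio - expected) <= expected * tolerance

print(ratio_ok(100, 440))   # → True: healthy 4.4x
print(ratio_ok(100, 1600))  # → False: 16x, like the leaking collectors
```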


May Tech Talk Videos

Posted on May 29th, 2012


A big thanks to our May tech talk speakers Jeff Hodges and Dietrich Featherston.  This meetup had the highest turnout so far, so thanks to everyone who showed up and made it a great evening.  Without further ado, here are the videos.


Fauna – Zookeeper Trumpets

Posted on April 13th, 2012

We use Zookeeper a lot at Boundary.  It has become the glue for assigning work to our streaming system, managing the lifecycle of customer workloads and giving us a touchpoint to configure and operate our distributed services.  It would not be hyperbole to say that we rely on the uptime and proper function of our Zookeeper cluster.

Since we are so reliant on Zookeeper, we have set up a number of flow profiles and alerts on our Zookeeper traffic.  The flow profiles provide custom visualizations of flows from our Zookeeper cluster to the various services that consume from it, and the alerts let us know when that traffic goes out of bounds.  When we set up these alerts, we assumed that Zookeeper traffic should not go above a few megabits per second.

So imagine our surprise when we got an alert for a 600 Mbps spike in Zookeeper client traffic.  When we looked at the graph, we saw something that looked like the graph at the top.  Extremely curious behavior.  Especially interesting is what appears to be a very smooth, inversely proportional relationship between traffic levels and time relative to the event which ultimately generated this phenomenon.

Drilling down a bit more, we found that this particular traffic was coming from our collector tier.  Our collectors use zookeeper to monitor when new customers come online.  After instrumenting the collector code using folsom we found that they were doing 6 ls watch operations against the Zookeeper cluster per second.

Here’s the diff that wound up fixing the issue:

 handle_info({new_org, _}, State=#state{orgs=Set}) ->
     case zk_client:watch_orgs(self()) of
 	disabled ->
 	    {noreply, State};
 	{ok, NewOrgs} ->
-	    zk_client:watch_orgs(self()),
 	    process_new_orgs(NewOrgs, Set),
 	    {noreply, State}

For those of you who do not know Erlang, this code was registering 2 watch subscriptions every time a watch event fired.  So as time went on, each collector’s outstanding watches doubled with every event — an exponential progression.  The smooth, inversely proportional curve is actually the Erlang message queue draining as the backlog of watch events is processed.
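To see how fast this runs away, here is a tiny model of the doubling (the `per_fire` and `initial` values are just illustrative):

```python
def watches_after(events, per_fire=2, initial=1):
    # Each fired watch re-registers per_fire new watches, so the number of
    # outstanding watches multiplies by per_fire on every event.
    watches = initial
    for _ in range(events):
        watches = watches * per_fire
    return watches

print(watches_after(10))  # → 1024 outstanding watches after only ten events
```

At that rate, a handful of new-customer events is enough to turn a quiet Zookeeper client into hundreds of megabits of watch traffic.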

So what’s the takeaway here?  Code that may appear to work correctly, may in fact be broken, and some issues only show up after days/weeks/months in production.  If we weren’t looking at our network with Boundary, then our first notification of this problem would have been the overloading and eventual failure of our Zookeeper cluster.  How many smoldering bugs are sitting in your environment, waiting to become 5 alarm fires?
