From approximately 8:00 AM PDT to 2:00 PM PDT, Boundary experienced intermittent downtime in the streaming system, leaving us unable to accept data from customer meters and resulting in blank dashboards. At a high level, a sharp spike in ingress customer data, combined with a suboptimal load balancing strategy, set off a series of cascading failures that proved extremely difficult to recover from.
At around 8:00 AM PDT we noticed a sharp uptick in customer traffic, approximately 60 Mbps added all at once. We use global server load balancing (GSLB) as the primary load balancing / disaster recovery facility for meters (customer-installed agents) connecting to our collectors, the main point of ingestion for customer data. Unfortunately, when a large number of meters come up at once, DNS caching on the customer's network ensures that all of those meters connect to the same collector machine. In this particular case, that one collector was already running near capacity, and the extra load was enough to knock it offline.
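To illustrate the failure mode, the sketch below (the hostname and code are illustrative, not our actual meter implementation) shows how a client that simply takes the first address from a cached DNS answer concentrates an entire network's meters on one collector:

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

public class MeterConnect {
    public static void main(String[] args) throws UnknownHostException {
        // Resolve every A record the GSLB hands back for the collector pool.
        InetAddress[] collectors = InetAddress.getAllByName("collectors.example.com");

        // A naive client simply takes the first entry. Behind a shared caching
        // resolver, every meter on the customer's network sees the same answer
        // and makes the same choice, concentrating the whole burst on one box.
        InetAddress chosen = collectors[0];
        System.out.println("All meters on this network would connect to: " + chosen);
    }
}
```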
The spillover load, combined with the uneven load balancing, produced a thundering herd: collectors would start, operate for a few minutes, then buckle under the load.
Meanwhile, the instability and occasional unresponsiveness of the collectors was causing Phloem, our Kafka-powered messaging tier, to have problems of its own. The Phloems were making blocking connection attempts in their main data consumption threads. When the collectors became unresponsive, these connection attempts starved the data consumption thread pool. Data waiting to be processed backed up in the Phloem instances, causing OOM exceptions and forcing us to kill the instances without a clean shutdown.
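The shape of the problem was roughly the following (names and pool sizes are illustrative, not the actual Phloem code): the same fixed pool that drains incoming data also performs blocking connection attempts, so an unresponsive collector ties up consumer threads while unconsumed data piles up in memory.

```java
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class BlockingConsumerPool {
    private static final ExecutorService consumers = Executors.newFixedThreadPool(8);

    static void onDataAvailable(Runnable drainTask) {
        consumers.submit(drainTask); // normal path: drain data off the wire
    }

    static void onCollectorNeeded(String host, int port) {
        consumers.submit(() -> {
            try (Socket s = new Socket()) {
                // Blocking connect with no explicit timeout: if the collector
                // is unresponsive, this thread is lost to the pool for minutes
                // while data waiting to be consumed accumulates in memory.
                s.connect(new InetSocketAddress(host, port));
            } catch (Exception e) {
                // retry logic omitted
            }
        });
    }
}
```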
On restart, the Phloems brought Kafka up in recovery mode, which means a scan of all files on disk. Because of our particular use of Kafka with many topics, this expensive recovery operation pushed restart times to 20 to 60 minutes. At no time were all Phloems down, and thanks to Ordasity, load shifted as needed. However, collector connection timeouts in the work claiming cycle made load shift much more slowly than normal, also causing occasional downtime for some clients.
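A back-of-the-envelope way to see why those restarts were so slow: the recovery scan touches every log segment on disk, so its cost grows with the number of topics and segments rather than with the volume of recent data. The sketch below (the data directory path is hypothetical) just counts the segments such a scan would have to visit, which also makes it clear why pruning obsolete topics helps.

```java
import java.io.File;

public class SegmentCount {
    public static void main(String[] args) {
        File logDir = new File("/var/kafka/logs"); // hypothetical Kafka data dir
        long segments = 0;
        File[] topicDirs = logDir.listFiles(File::isDirectory);
        if (topicDirs == null) topicDirs = new File[0];
        for (File topicDir : topicDirs) {
            // Each topic directory holds one or more log segment files,
            // every one of which must be visited during recovery.
            File[] logs = topicDir.listFiles((dir, name) -> name.endsWith(".log"));
            segments += (logs == null) ? 0 : logs.length;
        }
        System.out.println("Log segments a recovery scan must touch: " + segments);
    }
}
```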
We implemented some band-aids to get the system back under control immediately. A set of HAProxy instances was deployed in front of the collectors, providing a more even balance of load across the cluster. We also cleaned up a number of obsolete topics in our Kafka instances, which helped with the restart time. To make the Phloems more robust while the collectors recovered, we moved the connection code out of the main data consumption path and tuned timeouts and retry rates.
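Reworking the earlier sketch to match that change (again illustrative; the pool sizes and the five-second timeout are placeholders, not our tuned settings), connection attempts get their own small pool and a bounded connect timeout, so a dead collector can no longer starve the threads that drain data:

```java
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class SeparatedConnectionPool {
    private static final ExecutorService consumers  = Executors.newFixedThreadPool(8);
    private static final ExecutorService connectors = Executors.newFixedThreadPool(2);

    private static final int CONNECT_TIMEOUT_MS = 5_000; // placeholder value

    static void onDataAvailable(Runnable drainTask) {
        consumers.submit(drainTask); // the data path keeps moving regardless
    }

    static void onCollectorNeeded(String host, int port) {
        connectors.submit(() -> {
            try (Socket s = new Socket()) {
                // Bounded connect on a dedicated pool: an unresponsive
                // collector costs at most a few seconds of one connector
                // thread, never a consumer thread.
                s.connect(new InetSocketAddress(host, port), CONNECT_TIMEOUT_MS);
            } catch (Exception e) {
                // back off and retry at a capped rate (omitted)
            }
        });
    }
}
```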
Longer-term fixes include implementing more intelligence around connection logic in the meter. The meter should back off when the collectors do not respond, to prevent the thundering herd issues seen when collectors recover. The meters should also be able to assign themselves intelligently to a collector so that a single customer's traffic does not all end up on one collector. This may take the form of consistent hashing against the DNS records for our collectors.
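A sketch of what that meter logic could look like (all names are hypothetical, and the real meter may implement this differently): exponential back-off with jitter on failed connects, plus a consistent-hash choice over the collector A records so that meters sharing a DNS cache still spread across collectors.

```java
import java.net.InetAddress;
import java.util.SortedMap;
import java.util.TreeMap;

public class MeterPlacement {

    // Back-off: double the wait after each failure, cap it, and add jitter so
    // thousands of meters do not reconnect in lock step when collectors return.
    static long backoffMillis(int failures) {
        long base = Math.min(60_000, 1_000L << Math.min(failures, 6));
        return base / 2 + (long) (Math.random() * (base / 2));
    }

    // Consistent hashing: place each collector address on a ring; a meter picks
    // the first collector at or after the hash of its own ID, so different
    // meters land on different collectors even when they see the same records.
    static InetAddress pickCollector(String meterId, InetAddress[] collectors) {
        SortedMap<Integer, InetAddress> ring = new TreeMap<>();
        for (InetAddress c : collectors) {
            ring.put(c.getHostAddress().hashCode(), c);
        }
        SortedMap<Integer, InetAddress> tail = ring.tailMap(meterId.hashCode());
        return tail.isEmpty() ? ring.get(ring.firstKey()) : tail.get(tail.firstKey());
    }
}
```

In this sketch a meter would call pickCollector with its own ID and the addresses returned by DNS, and sleep for backoffMillis(n) after its nth consecutive failed connection attempt.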
Internally, in the interaction between the collectors and Phloem, we need to completely separate the data plane from the control plane. Separating the control plane and data plane is standard practice in many high-throughput systems, and it is an architecture we plan to move toward for our internal message flow.
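A minimal sketch of that separation, assuming a simple queue-per-plane design (illustrative only, not the planned implementation): control traffic such as connection management and work claiming runs on its own queue and thread pool, so a stall on either side cannot back up the other.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

public class SplitPlanes {
    // Control messages (connection management, work claiming, health checks)
    // and data payloads flow through separate queues and separate pools.
    private final BlockingQueue<Runnable> controlWork = new LinkedBlockingQueue<>();
    private final BlockingQueue<byte[]> dataWork = new LinkedBlockingQueue<>(100_000);

    private final ExecutorService controlPool = Executors.newFixedThreadPool(2);
    private final ExecutorService dataPool = Executors.newFixedThreadPool(8);

    void start() {
        controlPool.submit(() -> {
            try {
                while (true) controlWork.take().run();   // control plane loop
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        dataPool.submit(() -> {
            try {
                while (true) process(dataWork.take());   // data plane loop
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
    }

    private void process(byte[] payload) {
        // forward the payload toward the messaging tier (omitted)
    }
}
```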