From approximately 12:50 AM to 6:00 AM PDT on July 31, Boundary experienced intermittent downtime of our streaming and historical systems stemming from riak cluster instability. The tl;dr: a configuration change intended to give our riak nodes more breathing room on open file limits instead caused processes to be killed intermittently for exceeding their ulimit. This partial failure cascaded into our streaming system and exposed a riak bug in which a cluster can become fragmented.
We use eleveldb as the storage backend in the riak cluster behind Kobayashi (our system for collecting and analyzing historical metrics). It is common for us to adjust the configuration of this cluster to better suit our workload. At around 12:30 AM PDT we rolled out a configuration change to the eleveldb backend, upping its open file limit. The change was rolled out one node at a time on our 9-node cluster. Immediately following this change we noticed 99th percentile get and put latencies jump a little, but we have found this to be common following rolling restarts of the cluster. Soon thereafter we saw the first sign of trouble: the riak-java-client within Kobayashi (our application layer over riak for historical analytics) began intermittently losing connections to riak on localhost. At this time, latencies began peaking at 60 seconds for both gets and puts, the maximum allowed by riak_kv_put_fsm and riak_kv_get_fsm.
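For context, the eleveldb open file limit is set in riak's app.config. A change like ours looks roughly like this; the paths and value shown here are illustrative, not the ones we deployed:

```erlang
%% app.config (excerpt): eleveldb backend settings.
%% max_open_files is budgeted per vnode, so the effective
%% descriptor demand is multiplied by the number of vnodes
%% resident on the node.
{eleveldb, [
    {data_root, "/var/lib/riak/leveldb"},
    {max_open_files, 150}   %% illustrative value
]}
```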
To make matters worse, around the same time we saw indications that the riak cluster instability had cascaded, causing partial failures in the streaming system. A bit of back-story on how our streaming and historical systems interoperate is helpful in understanding this failure mode. Kobayashi persists historical network metrics in riak in blocks of 10 seconds, 10 minutes, and 10 hours for second-resolution, minute-resolution, and hour-resolution metrics, respectively. Upon taking on a particular organization as a unit of work, the streaming system loads state (the last block written for each query) so that it can begin writing into the last partially complete block. Since the streaming system was unable to load this state from Kobayashi/riak during normal handoff operations, some orgs failed to start up and process data. The result of this failure is a dashboard that fails to show streaming metrics. Furthermore, writes to Kobayashi from the streaming system during this time were timing out, causing bounded queues to back up and begin rejecting work. The end result of this additional failure mode is the loss of some historical data writes for many customers, even when the dashboard appeared fully functional.
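To make the block structure concrete, here is a minimal sketch of how a timestamp maps onto the block that contains it. The function and names are ours for illustration, not Kobayashi's actual code:

```python
# Hypothetical sketch of Kobayashi-style block alignment.
# Metrics are persisted in fixed-size blocks: 10 s, 10 min, and 10 h
# for second-, minute-, and hour-resolution data respectively.

BLOCK_SECONDS = {
    "second": 10,          # 10-second blocks
    "minute": 10 * 60,     # 10-minute blocks
    "hour":   10 * 3600,   # 10-hour blocks
}

def block_start(ts, resolution):
    """Return the start timestamp of the block containing ts."""
    size = BLOCK_SECONDS[resolution]
    return ts - (ts % size)

# On startup or handoff, the streaming system must locate the last
# (possibly partial) block so new writes append into it rather than
# starting a fresh block. If that state cannot be loaded, the org
# cannot safely begin processing.
```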
In the meantime, we began digging deeper into the overall health of the riak cluster and anything we could do to ameliorate the pain. We noticed two things immediately:
- There were a large number of handoffs/fallbacks pending, as reported by `riak-admin transfers`.
- Cluster members did not agree on vnode ownership as indicated by `riak-admin ringready`.
As for the first item, we have seen pending handoff operations coincide with similar latency issues in the past; unfortunately, the only solution has been to wait. However, we were troubled by the lack of a discernible downward trend in pending handoff operations over time. This, combined with the split-brain that had occurred (the second item above), led us to question whether we should revert the configuration change that preceded this failure, or whether the cluster instability was simply a snowball effect initiated by the rolling restart.
Further investigation into the riak logs showed that the parent Erlang VM process was periodically being killed due to ulimit issues. The configuration change giving eleveldb more open files permitted the Erlang VM as a whole to attempt to open more file descriptors than allowed by `ulimit -n`. This was made worse by vnode multiplexing and handoff-induced vnode movement around the cluster. Reverting the eleveldb change allowed the cluster to eventually stabilize.
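The failure class here is worth spelling out: the file descriptor budget is a single per-process cap shared by everything inside that OS process, every vnode's leveldb instance included. A quick way to see the limit the kernel enforces (Python here purely for illustration; riak itself is an Erlang VM subject to the same limit):

```python
import resource

# RLIMIT_NOFILE is the per-process cap on open file descriptors,
# the same number `ulimit -n` reports. Once a process hits the soft
# limit, further open()/socket() calls fail, which can crash or
# wedge the process.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("soft fd limit:", soft)
print("hard fd limit:", hard)
```

Because eleveldb's `max_open_files` is budgeted per vnode, total descriptor demand on a node scales with the number of resident vnodes, and handoff temporarily raises that count.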
However, riak was still fragmented: one node of the nine had formed a ring of its own. This turned out to be caused by a race condition in riak itself in which it is possible for a cluster member to leave and form a ring of one. While the window of time in which this bug may be exposed is exceptionally small, it is a condition that arises on node startup. Because nodes were dying and coming back online fairly regularly, we had a greater probability of hitting that race condition, and we did. The folks at Basho were great and helped us troubleshoot and rejoin the orphaned node to the rest of the cluster.
There are lessons to be learned here, and these experiences point to tooling that would have helped us recover from this problem more quickly, or prevented it entirely.
Our primary method of diagnosing this problem was grepping through log files and using riak administration tools to get instantaneous snapshots of cluster state. A helpful tool for determining whether handoff activity is contributing to cluster instability is `riak-admin transfers`. While this is useful, it doesn't give the whole picture; we still have to dig through log files for clues. Worse still, this particular tool blocks the very handoff operations being monitored, so it must be used sparingly.
With a large number of machines and applications in play, it can be difficult to find important, low-frequency log events in the midst of unimportant, high-frequency ones. Tools that help consolidate and visualize logs emitted by distributed systems would potentially be of great help in these situations and we will continue looking into them.
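As a toy sketch of the triage we ended up doing by hand: collapse each log line to a template and count frequencies, so that rare messages surface above the high-frequency noise. The log lines below are invented for illustration, and this is our own sketch, not a tool we actually ran:

```python
import re
from collections import Counter

def template(line):
    """Collapse digits so lines differing only in values group together."""
    return re.sub(r"\d+", "N", line.strip())

def rarest(lines, n=3):
    """Return the n least-frequent line templates."""
    counts = Counter(template(l) for l in lines)
    # most_common() sorts descending; slice off the tail, reversed.
    return [t for t, _ in counts.most_common()[: -n - 1 : -1]]

logs = [
    "get fsm timeout after 60000 ms",
    "get fsm timeout after 60000 ms",
    "get fsm timeout after 60000 ms",
    "handoff of partition 42 pending",
    "too many open file descriptors",   # the needle in the haystack
]
print(rarest(logs, n=2))
```

Real log-consolidation tools do far more (multi-host collection, time correlation, visualization), but even this crude frequency inversion would have surfaced the low-frequency ulimit messages faster than grep did.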
Furthermore, we will take greater care when rolling out configuration changes like this in the future. While we did take our time during the rolling update, we now have a better idea of how long to wait between riak restarts and how to evaluate whether a change was successful.
The good news is that most data was still captured during this time and only a subset of customers were affected by blank dashboards. But events like this remind us of the importance of resilient, operations-friendly engineering and help us to build better systems. While the source of this incident was a seemingly minor configuration change, the results had unexpected, cascading impact. We are forced to reason through these failure scenarios and this makes us better-equipped to accommodate partial failures in the future. Our mental model of how our infrastructure fits together, and how it breaks, is enhanced. We will continue using this experience to the benefit of everyone who uses Boundary.