
A deep dive to find a nasty bug.

Posted on April 4th, 2012

Intro

This post describes a very hairy bug I discovered in some versions of the Linux kernel, as well as in some parts of libpcap.

A curious observation

A customer reported that their Boundary graphs on certain Debian Lenny machines with bonded NICs in active-backup mode were showing no ingress traffic. We replicated their setup on some hardware we have at the office and I began investigating.

First, I began pinging the machine from my laptop. Then, on the machine itself, I used tcpdump to sniff ICMP packets coming in on the bond interface:

% sudo tcpdump -i bond0 dst 172.16.209.136 and proto 1
12:57:26.275660 IP 172.16.209.1 > 172.16.209.136: ICMP echo request, id 62831, seq 54, length 64
12:57:27.275731 IP 172.16.209.1 > 172.16.209.136: ICMP echo request, id 62831, seq 55, length 64
^C
2 packets captured
2 packets received by filter
0 packets dropped by kernel

Everything looked fine. Time to try sniffing eth0, the active physical NIC on the bond:

% sudo tcpdump -i eth0 dst 172.16.209.136 and proto 1
^C
0 packets captured
2 packets received by filter
0 packets dropped by kernel

So, sniffing bond0 was revealing ICMP packets coming in, but sniffing the active physical NIC on the bond was showing nothing. This was why our graphs were showing no ingress traffic – the meter wasn’t getting any packets!

But, why?

Device agnosticism

To debug this problem, I began by examining the device-agnostic layer of the network stack to track down the code that hands incoming packets off to pcap. The device drivers call up to the device-agnostic layer and hand it a blob of bytes pulled off the network by calling netif_receive_skb.

Take a look at this code (shortened for brevity) from netif_receive_skb in net/core/dev.c:

int netif_receive_skb(struct sk_buff *skb)
{
  /* ... */

  orig_dev = skb_bond(skb);

  if (!orig_dev)
    return NET_RX_DROP;

  /* ...  */

The function skb_bond determines if an skb came in from a device that is part of a bond. If so, the function ensures that the device the skb came in on is the active device on the bond. This check is performed to protect against delivering duplicates to the higher protocol layers for certain bonding configurations. If the skb passes those checks, its dev pointer is overwritten with a pointer to the bond device and a pointer to the original device is returned.

Conceptually, you can imagine the following approximately equivalent pseudo-code:

starting state:
  orig_dev = NULL
  skb->dev = "eth0"

orig_dev = skb->dev
if skb->dev is part of a bond:
  if skb->dev is the active device on the bond:
    skb->dev = bond

end state:
  orig_dev = "eth0"
  skb->dev = bond

OK, so the skb is getting adjusted to appear as if it was received on a bond device instead of a physical interface.

If we continue down the netif_receive_skb function we find the code which hands the skb over to pcap:

  list_for_each_entry_rcu(ptype, &ptype_all, list) {
    if (!ptype->dev || ptype->dev == skb->dev) {
      if (pt_prev)
        ret = deliver_skb(skb, pt_prev, orig_dev);

      pt_prev = ptype;
    }
  }

This code iterates over ptype_all, the list containing pcap entries, and determines whether the device structure on each pcap entry matches the device the skb came in on.

The device check in that loop is interesting:

    if (!ptype->dev || ptype->dev == skb->dev)

If you are attempting to sniff packets on eth0, but eth0 is part of a bond, this check will fail because your skb->dev has been overwritten to point to the bond device's dev structure.

This must be why the meter, tcpdump, et al don’t see incoming packets when sniffing physical devices that are part of a bond!

I can simply change that if statement to:

    if (!ptype->dev || ptype->dev == skb->dev || ptype->dev == orig_dev) {

And then skbs with overwritten dev pointers will still get handed over to pcap because it will also check orig_dev.

Let’s test this fix.

A curious observation, round 2

So, I built and installed the modified kernel (this is a useful guide, by the way), started pinging the machine, and tried to sniff incoming packets on the physical device:

% sudo tcpdump -i eth0 dst 172.16.209.136 and proto 1
^C
0 packets captured
2 packets received by filter
0 packets dropped by kernel

Wat.

Why am I still not seeing incoming packets even after making the change above?

libpcap

Let’s briefly examine how libpcap interfaces with the AF_PACKET address family in the kernel.

AF_PACKET is implemented as a separate address family in the kernel and the code for this can be found in net/packet/af_packet.c. libpcap creates a socket by calling the socket system call with the first argument set to AF_PACKET. libpcap then binds that socket to the device it wants to sniff packets from using the bind system call.

It may now pull packets out of the kernel in one of two ways:

  • The “old way.” One call to recvfrom on the file descriptor per packet. This is the only method available on older kernels.
  • The “new way.” A call to poll which wakes libpcap when a new set of packets is available to be read from a memory region shared between the kernel and libpcap. This method is much more efficient (far fewer system calls) than the “old way” and is supported on most recent kernels, including the kernel on our Debian Lenny machine.

It turns out that even though the kernel shipping with Debian Lenny has an AF_PACKET implementation that supports the “new way” of sharing packets between the kernel and userland, the version of libpcap that ships with Debian Lenny does not. This means that tcpdump (which relies on libpcap) is pulling packets out of the kernel one at a time.

Newer versions of libpcap default to using the “new way” of gathering packets from the kernel. Since the Lenny kernel supports that, I tried building a newer libpcap and linking tcpdump to it. When I test this on my modified Lenny kernel, I see packets flowing on the RX path when I sniff a physical device that is part of a bond. If I hack the new libpcap to use the “old way” of gathering packets, I see no packets flowing on the RX path.

That means there is a bug on the “old way” path, either in the kernel's AF_PACKET implementation or in multiple versions of libpcap.

The if statement

After many hours of reading code and head scratching, I tracked down a single if statement in libpcap’s “old way” code path for pulling packets out of the kernel.

From pcap_read_packet in pcap-linux.c:

if (handle->md.ifindex != -1 &&
    from.sll_ifindex != handle->md.ifindex)
  return 0;

This if statement ensures that the packets pulled out of the kernel have the same index in the network device array as the interface the user told libpcap to monitor. If the indices don't match, pcap_read_packet returns without calling the callback supplied to libpcap.

This piece of code was added to protect against a race condition in some kernels where AF_PACKET would begin queuing packets up for all devices on the system in between the calls to socket and bind, i.e. after the socket is created but before it has been bound to a specific device.

HOWEVER, this check fails for packets that arrived on physical devices which are part of a bond device.

The user asked libpcap to monitor the physical device, but the kernel is overwriting the dev structure for incoming packets with a pointer to the bond device in netif_receive_skb as we saw above. The index of the bond device does not match the index of the physical device.

This if statement is why, even with a fixed kernel, incoming packets never make it to a monitoring application like tcpdump or the Boundary flow meter.

This check does not exist in the “new way” of pulling packets out of the kernel, because kernels which support the new mmap method don't have the race condition that this code is protecting against. This is why linking tcpdump against a newer libpcap (on a fixed kernel) allows you to see incoming packets on physical devices that are part of a bond.

This check still exists today in current versions of libpcap.

Conclusion

It is a miracle computers work.


Day one GA review for Boundary

Posted on March 28th, 2012

The day started with some concern. We thought we had tested everything, but some things slipped through the cracks. In our case there was a corner case in our registration process which meant that a small percentage of people were delayed by a few hours in receiving their activation emails. The “bug” was found (it was actually due to our automation systems), corrected, and the emails released to those who had not received them.

There were a couple of other minor issues. For instance, some folks asked us about pricing, and (even though we are waiting a short while before fully publishing pricing on the website) there was a pricing FAQ page that didn't quite make it into the final web push (this has also been updated now).

But, apart from these minor items, the GA launch day was a good one. By the end of the day there were over 100 more people enjoying Boundary than there were just 24 hours earlier, and we had another company commit to being a paying customer. I'm sure we will make the marketing announcement as soon as we have that company's permission.

From a product perspective all went really well. The Twitter comments, which can easily be viewed by searching @boundary, were very positive in terms of speed of setup (“Took 3 minutes to setup a @boundary meter. 2 minutes and 30 seconds of that was the ec2 instance starting up”), the visibility of the data that Boundary is providing (“…I’m watching network traffic in realtime. Ladies and gentlemen, your jetpack”), and the super cool user interface (“Goddamn the @boundary UI is cool.”).

We received some positive press coverage as well. A selection of the articles:

http://www.talkincloud.com/boundary-launches-saas-based-monitoring-for-big-data-and-cloud-applications/

http://www.enterpriseappstoday.com/data-management/meeting-challenge-of-monitoring-big-data-in-the-cloud.html

http://siliconangle.com/blog/2012/03/27/devops-dossier-boundary/

Our infrastructure and datacenter were great, and even though someone at our provider decided to pull a cable on one of our production servers in the middle of the day, customers didn't notice a thing. That's why you write resilient software, because people do dumb things.

Another real positive for me was the level of direct interaction between our engineering team and new customers, particularly on the IRC channel (#Boundary on Freenode). There was constant discussion all day long, with our engineers helping users directly with questions and comments.

So, what now? Well, the “spike” of interest that you get from launch is very short-lived, and real life typically resumes as soon as day 2. Today we're all business: we need to gather the sales and marketing folks together to ensure that we've got our plans together for the next few weeks; the engineering team will need to get together this week to start researching and assessing the next waves of capabilities that we're going to add to the solution; the customer support teams need to be sure that they are following up with all our users and making sure their Boundary experience is an exemplary one; and of course, we need to take some time to enjoy the achievement so far, in which beer will play a major role.

Thanks to everyone for their support so far – now we’ve got some work to do.


Boundary service moves to General Availability

Posted on March 27th, 2012

After huge amounts of effort from everyone involved I am pleased to announce that the Boundary solution for monitoring Big Data application architectures is now Generally Available.

Over the last few months, we have had approximately 60 customers beta testing Boundary, and much of their feedback has found its way into the GA product or is scheduled for future delivery. Our documentation has been updated, lots of videos have been added, the support forums are up and running, huge amounts of capability have been added, and the service is now hosted at one of the largest SAS 70 Type II compliant, Tier III data centers in North America.

So what does it mean for us to be GA? Well the most noticeable difference is that customers can come directly to our web site and get access to a 14 day free trial. There are no sales people to talk to, no webinars to sit through, just give us your details and you’re free to get started. We will be in contact with you during your trial period by whatever means you prefer (chat, phone, email etc) to offer assistance and get your feedback but you should also feel free to reach out to us with comments, questions or suggestions for future capabilities.

At this point, you cannot buy our product via the web and we don’t yet publish our pricing online, but both of these are temporary measures while we get feedback from our initial set of customers. In the future we plan to allow our customers to sign up, browse pricing plans, buy and get support all online.

Today marks a watershed day for the entire IT monitoring business. We believe that this is the first time that any product has attempted to operate by collecting “all the data all the time” (instead of being restricted by old-fashioned data sampling techniques) and to process that data in real time to give customers second-by-second updates. The potential for this architecture and the technology that we have built is enormous; one of our greatest challenges is sure to be deciding which areas of future capability not to build. Already we have more ideas than we have time to build them!

So, it remains for me to say a huge thank you to everyone on the Boundary team. Even though I've only been a part of this company for a few months, I have been incredibly impressed by our team and how much we can achieve in such a short space of time; I look forward to sharing the future with you.

And biggest thanks of all of course are reserved for our beta testers and our first set of paying customers. Boundary exists for you and we encourage you to keep giving us feedback, keep telling us your likes and dislikes, keep sharing with us your challenges and of course, how you solve those challenges with Boundary.

I realize that we're still a small company and many of the incumbents will arrogantly try to dismiss us, but we know what the future holds, and with your help we will get there.

#monitoringthatdoesnotsuck


Customer email received…check it out

Posted on March 23rd, 2012

Received this today from a Boundary customer. Very cool.

We already feel strongly about our return of investment on boundary

Thank You :)   I’ve wanted a product like Boundary for years

FWIW this will save my client likely 15k a month or more…just to be completely clear on how much value you give…

I really can’t say enough thank yous
