Recently, on one of our servers, we saw that the 10G Solarflare NIC was dropping packets, and we could detect the same drops within our application. We had a client application listening to video streams sent over the network as multicast traffic. Everything was fine until we increased the number of channels the client application subscribed to.
We went around troubleshooting the problem in the following fashion:
- Checked that the application buffers were not the bottleneck when reading packets.
- Checked that the NIC rx buffers were set to their maximum.
- Ascertained that traffic on the network was within what the LAN could handle.
- Loaded the Onload drivers with different pre-defined profiles (such as latency).
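The first check in the list can be approximated from inside the application. Here is a minimal Python sketch, assuming a plain UDP socket on the kernel path: it asks for a larger receive buffer and reports what the kernel actually granted. The function name and the 8 MiB request are made up for illustration and are not taken from our application.

```python
import socket


def granted_rcvbuf(requested_bytes: int) -> int:
    """Request a receive buffer size on a UDP socket and return what the kernel granted."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # Ask for the buffer size; the kernel may silently cap the request.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested_bytes)
        # Read back the size that is actually in effect.
        return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    finally:
        sock.close()


if __name__ == "__main__":
    requested = 8 * 1024 * 1024  # hypothetical 8 MiB request
    print(f"requested={requested} granted={granted_rcvbuf(requested)}")
```

On Linux the request is capped by net.core.rmem_max and the value read back is doubled to cover kernel bookkeeping, so a granted value far below the request usually means the sysctl limit needs raising.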
Nothing seemed to help, so we reached out to the support team at Solarflare, who do an amazing job. We provided them with an onload_stackdump taken while the application was running.
And let me tell you, it was huge (~ xx lines).
After analyzing it, they told us that the application read buffers were not sufficient to handle the incoming stream of data packets, and suggested increasing the rx ring buffer size. I told them that we had already done that via ethtool -G <> rx 4096 and that it had not worked.
The big surprise was that while applications are running in kernel bypass mode, those settings do not come into the picture. Instead, they asked us to set rx_packets to 35999, the maximum (of which around 75% is used for rx packet buffers and the remaining 25% for tx packets).
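To make the 75/25 split concrete, here is a small Python sketch of the arithmetic. The function name is hypothetical, and the exact rounding Onload applies internally is an assumption; the point is only the rough size of each share.

```python
def split_packet_budget(max_packets: int, rx_share: float = 0.75) -> tuple:
    """Split a total packet buffer budget into (rx, tx) shares."""
    rx = int(max_packets * rx_share)  # roughly 75% goes to receive buffers
    tx = max_packets - rx             # the remainder is left for transmit
    return rx, tx


rx, tx = split_packet_budget(35999)
print(f"rx={rx} tx={tx}")  # rx=26999 tx=9000
```

So out of 35999 packet buffers, roughly 27000 are available for receive and 9000 for transmit.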
That did not solve the problem either. Next we tried changing the ring buffer size from the default 512 to 4096 (though they claimed the default of 512 had been in use for 4+ years now).
I did it anyway and, yeah, I have seen absolutely no packet drops since then.
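For anyone wanting to try the same change, here is a rough Python sketch of launching an application under Onload with a larger receive ring. It assumes that the ring buffer size in question corresponds to Onload's EF_RXQ_SIZE environment variable; please verify that against the Onload user guide for your version. The function name and the ./video_client binary are placeholders, not our real setup.

```python
import os


def onload_launch(app_argv, rxq_size=4096):
    """Build the command line and environment for running an app under Onload."""
    env = dict(os.environ)
    # Assumption: EF_RXQ_SIZE is the knob behind the 512 -> 4096 change.
    env["EF_RXQ_SIZE"] = str(rxq_size)
    # Prefix the application with the onload wrapper script.
    return ["onload", *app_argv], env


cmd, env = onload_launch(["./video_client"])  # hypothetical binary name
print(" ".join(cmd), "with EF_RXQ_SIZE =", env["EF_RXQ_SIZE"])
# To actually start it: subprocess.run(cmd, env=env, check=True)
```

Setting the variable per process keeps the change scoped to the one application that needs it, instead of touching the NIC-wide ethtool settings that kernel bypass ignores anyway.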