The Page and the Log
A long time ago, in a time before documentation and annotations, someone added a monitor of nf_conntrack to our standard server configuration profile. I was paged for this recently, with a clear-as-mud message of
WARNING: check_conntrack-someservername. Dropping into the server and looking at
/var/log/messages revealed hundreds if not thousands of these messages:
kernel: nf_conntrack: table full, dropping packet.
I had no idea what this
nf_conntrack was, nor exactly what it meant or what could cause it. It was a blind spot to me. It was a blind spot to most of my peers, too. Here's what I learned.
nf prefix of nf_conntrack refers to the Netfilter project. This is the suite of tools that includes iptables in Linux. More specifically, nf_conntrack is part of the conntrack-tools subsystem. It is a management system for in-kernel connection tracking and statistics collection. I'll refer to it as conntrack from now on.
What is nf_conntrack's function and behavior?
Conntrack does exactly as it sounds: it tracks connections over their lifetime. It stores the state of every connection, usually in a file at
/proc/net/nf_conntrack. By default the tool allows up to 65536 connections in its table. When this limit is reached, new connections are dropped until there is room.
Conntrack also has a bit of memory about closed connections, and tracks them for a short time after they close. Conntrack uses a default timeout value of 120 seconds for tracked connections. The connection tracking table won't just hold established connections, but it will hold passively closed connections too. This means that it will track a connection for two full minutes after closing before it falls off.
Why the monitor and the page?
Of course, a server suddenly dropping traffic is a Bad Thing. The log line above indicated that the server started displaying this behavior. Had I been unlucky, my ssh connection to the server may have also been dropped. Someone was probably bitten by this in the past and added this check as an actionable alert.
What was opening so many connections?
The application server that encountered this problem was a fairly usual setup. I can't get into too much detail, but it was a Ruby On Rails application running behind a forking webserver. It used a Redis-backed Sidekiq instance with a few workers and handled a few thousand jobs a minute at peak.
It's important to note that this alert came when traffic and jobs spiked from an external source. The traffic and queue of jobs had increased by roughly 4x what it normally sees for a very short period of time. The app had run at this level several times before without this alert triggering, but each other time there was a slow ramp-up. I was also able to examine the connections being tracked, and saw an unusual number of connections to Redis during all times, so the oddness was occurrring during normal operations.
With the knowledge that the influx is what triggered the alert, I had a starting point. I narrowed it down to something either in the forking of the webserver or when new job processes were started. With quite a bit of testing, which I can't share here, I replicated the applications setup with a connection pool to Redis. The background processes forked and used threads for concurrency, and I discovered that the connection pool was not thread-safe and was therefore not limiting the connections to Redis.
I was able to replicate the excessive number of connections. The fix was to replace the connection pool with something that was thread-safe, as changing the concurrency model was far and away a bigger lift with a lot of other changes required. I was able to learn or refresh in my memory a few other concepts that I'd like to cover.
Mapping Conntrack Connections to PIDs
Reading the conntrack file can be difficult. This answer on stackoverflow was by far the easiest explanation of the entries that I found. Using the connection's local port, I was able to map the tracked connection from local processes to a remote Redis instance.
As I was experimenting with threads and the connection pool, I needed to know how many threads were started and how many connections were open from the local box to the Redis server. I accomplished this with
netstat -penta 2> /dev/null | grep 6379 -c, which gave the total number of connections. I could see the number of processes and its threads with something like
ps umax -L, and my testing script also kept track of this. I was able to compare the number of connections I was expecting to have with the number of actual connections as long as the test completed within conntrack's timeout.
One confusing aspect was the
SPID column in ps. I believe it stands for sub-PID, but I can really only find references to SQL Server's "server process ID" save for an old document. Even the man page for ps only says
-T Show threads, possibly with SPID column
without explanation of what "SPID" means.
What is TIME_WAIT?
The conntrack file listed a huge number of connections in the
TIME_WAIT status on the affected server. There are implications and affects to this, but it more or less indicated that something wasn't pooling connections quite correctly. This serverfault question and answer summarized it nicely, and the answer leads to a good explanation of TIME_WAIT that's mostly focused on Windows.
This stackoverflow question and answer explains what happens in Linux pretty well. This particular answer about tying up a port in Linux may be the reason the connection count balloons so quickly. I will have to follow up on this in a future post.