My team recently ran into some issues with a new instance of a tool, and I want to go over what we found. I won't name the tool here, since it doesn't really change what was learned. It could happen to anything that speaks over the network.
The service was a complex third-party server with many components. Neither I nor anyone on my team was an expert in all parts of it. It was distributed and ran on VMs in Azure.
We deployed this tool after testing and validation by building a new instance in our production environment. This went smoothly. The next day? The next day didn't start smoothly.
We jumped online in the morning to continue setting up the tool. It took something like thirty minutes to log into it from its web interface. That wasn't a good sign. There was no load on the large machines, no performance warnings, plenty of free memory, and nothing going on with disks or backups or Azure. After the hiccup, the tool was working, but it certainly wasn't snappy. We couldn't pinpoint a reason for the slowdowns, so we continued digging and engaged the product's support.
Over a few days, we came to understand that the service and the machines it ran on never improved. The service crawled at times without reason, but almost always first thing in the morning. Some bash commands took an extra few seconds to fire at seemingly random times. We struggled to find the source of these issues, and the support team we called in was stumped.
Through a day of trial and error and exploration, we discovered two patterns.
- Any sudo'd command could be slow, but the slowness was generally on the first use after a period of inactivity. Commands could then be fast for a short time after that first, slow invocation.
- The service and sudo commands were slow first thing in the morning for the first user, always.
We did some digging with this information, and the results blew my mind. It was an issue with DNS. DNS resolution for the hostnames of these machines was really slow, especially on the first lookup in the morning. DNS lookups with Azure's resolvers can be extremely slow without obvious reason. Since this was a tiny amount of traffic that spent almost all of its time waiting, often for 10 to 20 seconds, we couldn't see it in our metrics. Most services don't report DNS resolution times, and this tool was no exception. But why were the service and sudo'd commands so impacted?
First we'll cover the service. It stored a short name for all of the machines within its cluster. Each time it communicated with one of its many internal services, it usually had to look that name up in a service discovery mechanism. Guess what happened when DNS was slow? The whole service suffered.
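Since most services don't report DNS resolution times, one way to surface them is to time the lookups directly. The following is a minimal sketch, not the method we used at the time; the function name `time_lookups` and its parameters are made up for illustration. It relies on Python's standard `socket.getaddrinfo`, which goes through the system resolver, so any caching you see comes from the operating system rather than from Python.

```python
import socket
import time


def time_lookups(hostname, attempts=3):
    """Return a list of (seconds, resolved) tuples, one per lookup attempt.

    Hypothetical helper for illustration: it times name resolution through
    the system resolver and records failures as well as successes.
    """
    results = []
    for _ in range(attempts):
        start = time.perf_counter()
        try:
            socket.getaddrinfo(hostname, None)
            resolved = True
        except socket.gaierror:
            # A failed lookup can still be slow, so its duration is kept too.
            resolved = False
        results.append((time.perf_counter() - start, resolved))
    return results
```

Calling `time_lookups(socket.gethostname())` on an affected machine and comparing the first duration with the later ones should show whether the initial lookup is the slow one.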
On to sudo. The sudoers file can contain rules based on the hostname of the machine. I suppose this is for managing a huge number of machines with the same file, but I don't know anyone who has ever used that feature. These rules are controlled by the Host_Alias directives. Surprisingly, even when those directives are not in use, sudo finds the hostname of the machine it runs on via DNS. That made sudo slow whenever the DNS cache was empty, i.e. after the cached record's Time To Live (TTL) had expired.
We added entries to the /etc/hosts file that point the machine names at localhost, and it was like flipping a switch. There was an immediate improvement in the tool and in all sudo commands, no matter the usage or time of day. Looking back, we had to change the hostname of these machines as part of the provisioning process, and that simple change was the root cause of this issue.
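To catch this during provisioning, a small check can confirm that every machine name has an entry in the hosts file. This is a rough sketch under assumptions: the helper names and the sample host names (`tool-node-1`, `tool-node-2`) are hypothetical, and the parser only handles the common "address name [alias...]" layout with `#` comments.

```python
def hosts_entries(hosts_text):
    """Map each name found in hosts-file text to the address it is listed under."""
    entries = {}
    for line in hosts_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        address, *names = line.split()
        for name in names:
            # Keep the first address seen for a name.
            entries.setdefault(name, address)
    return entries


def missing_names(hosts_text, names):
    """Return the names that have no entry in the given hosts-file text."""
    entries = hosts_entries(hosts_text)
    return [name for name in names if name not in entries]


# Hypothetical sample content; on a real machine, read /etc/hosts instead.
sample = """\
127.0.0.1 localhost
127.0.0.1 tool-node-1  # added during provisioning
"""
print(missing_names(sample, ["tool-node-1", "tool-node-2"]))
```

Running the check after any hostname change would flag a machine whose new name still depends on DNS.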
On a related note, use Fully Qualified Domain Names. They give you much better control over name resolution and help you avoid naming conflicts.