Why Your Server Monitoring (Still) Sucks

Five observations about why your server monitoring still stinks, by a
monitoring specialist-turned-consultant.

Early in my career, I was responsible for managing a large fleet of
printers across a sprawling campus. We’re talking several hundred networked
printers. It often required a 10- or 15-minute walk to get to
some of those printers physically, and many were used only sporadically. I
didn’t
always know what was happening until I arrived, so it was anyone’s
guess as to the problem. Simple paper jam? Driver issue? Printer currently
on fire? I found out only after the long walk. Making this even more
frustrating for everyone was that, thanks to the infrequent use of some of
them, a printer with a problem might go unnoticed for weeks, making itself
known only when someone tried to print with it.

Finally, it occurred to me: wouldn’t it be nice if I knew about the problem
and the cause before someone called me? I found my first monitoring tool
that day, and I was absolutely hooked.

Since then, I’ve helped numerous people overhaul their monitoring
systems. In doing so, I noticed the same challenges repeat themselves regularly. If
you’re responsible for managing the systems at your organization, read
on; I have much advice to dispense.

So, without further ado, here are my top five reasons why your monitoring
is crap and what you can do about it.

1. You’re Using Antiquated Tools

By far, the most common reason for monitoring being screwed up is a
reliance on antiquated tools. You know that’s your issue when you spend
more time working around the warts of your monitoring tools than using them,
or when you’ve got a bunch of custom code papering over some major missing
functionality. The bottom line is that you spend more time trying to
fix the almost-working tools than just getting on with your job.

The problem with using antiquated tools and methodologies is that
you’re just making it harder for yourself. I suppose it’s certainly
possible to dig a hole with a rusty spoon, but wouldn’t you prefer to use a
shovel?

Great tools are invisible. They make you more effective, and the job is
easier to accomplish. When you have great tools, you don’t even notice
them.

Maybe you don’t describe your monitoring tools as “easy to use”
or “invisible”. The words you might opt to use would make my editor
break out a red pen.

This checklist can help you determine if you’re screwing yourself.

  • Are you using Nagios or a Nagios derivative to monitor
    elastic/ephemeral infrastructure?
  • Is there a manual step in your deployment process for a human to “Add
    $thing to monitoring”?
  • How many post-mortems contained an action item such as, “We
    weren’t monitoring $thing”?
  • Do you have a cron job that tails a log file and sends an email via
    sendmail?
  • Do you have a syslog server to which all your systems forward their
    logs…never to be seen again?
  • Do you collect system metrics only every five minutes (or even less
    often)?

If you answered yes to any of those, you are relying on bad, old-school
tooling. My condolences.

The good news is your situation isn’t permanent. With a little work, you
can fix it.

If you’re ready to change, that is.

It is somewhat amusing (or depressing?) that we in Ops so readily replace
entire stacks, redesign deployments over a week, replace configuration
management tools and introduce modern technologies, such as Docker and
serverless—all without any significant vetting period.

Yet, changing a monitoring platform is verboten. What gives?

I think the answer lies in the reality of the state of monitoring at many
companies. Things are pretty bad. They’re messy, inconsistent in
configuration, lack a coherent strategy, have inadequate automation…but
it’s all built on the tools we know. We know their failure modes; we know
their warts.

For example, the industry has spent years and a staggering number of
development hours bolting things onto Nagios to make it more palatable
(such as
nagios-herald, NagiosQL, OMD), instead of asking, “Are we throwing
good money after bad?”

The answer is yes. Yes we are.

Not to pick on Nagios—okay, yes, I’m going to pick on Nagios. Every change
to the Nagios config, such as adding or removing a host, requires a config
reload. In an infrastructure relying on ephemeral systems, such as
containers, the entire infrastructure may turn over every few minutes. If
you have two dozen containers churning every 15 minutes, it’s possible that
Nagios is reloading its config more than once a minute. That’s insane.
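
For a rough sense of the arithmetic behind that claim, here’s a
back-of-the-envelope sketch. The container count and lifetime are the figures
above; the assumption that every removal and addition triggers its own reload
is mine.

    # Back-of-the-envelope math for Nagios config churn (assumed figures).
    containers = 24              # "two dozen containers"
    lifetime_minutes = 15        # each container replaced every 15 minutes
    changes_per_replacement = 2  # one host removed + one host added per turnover

    config_changes_per_minute = containers / lifetime_minutes * changes_per_replacement
    print(config_changes_per_minute)  # 3.2 -- well over one reload a minute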

And what about your metrics? The old way to decide whether something was broken
was to check the current value of a check output against a threshold. That
clearly results in some false alarms, so we added the ability to fire
an alert only if N consecutive checks violated the threshold. That has
a pretty glaring problem too: with N at a typical three to five and data
arriving every minute, you may not know about a problem until three to five
minutes after it started. If you’re getting your data every five minutes,
it’s even worse.
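
To make that delay concrete, here’s a minimal Python sketch of the
consecutive-breach logic described above. It’s a generic illustration rather
than any particular tool’s implementation, and the threshold, N and CPU
readings are made up.

    # "Alert only after N consecutive threshold violations", sketched
    # generically to show why detection lags behind the actual problem.

    class ConsecutiveThresholdAlert:
        def __init__(self, threshold: float, required_breaches: int):
            self.threshold = threshold                   # e.g. 90.0 (% CPU)
            self.required_breaches = required_breaches   # N bad checks in a row
            self._streak = 0

        def observe(self, value: float) -> bool:
            """Return True when this sample should fire an alert."""
            self._streak = self._streak + 1 if value > self.threshold else 0
            return self._streak >= self.required_breaches

    # One sample per minute, N = 3: the problem starts at minute 1, but the
    # alert can't fire before minute 3. With five-minute checks, the same
    # logic delays detection by ten minutes or more.
    alerter = ConsecutiveThresholdAlert(threshold=90.0, required_breaches=3)
    cpu_samples = [42.0, 95.0, 97.0, 99.0]  # hypothetical readings, one per minute
    for minute, cpu in enumerate(cpu_samples):
        if alerter.observe(cpu):
            print(f"alert fired at minute {minute}")  # -> alert fired at minute 3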

And while I’m on my soapbox, let’s talk about automation. I remember back
when I was responsible for a dozen servers. It was a big day when I spun up
server #13. These sorts of things happened only every few months. Adding my
new server to my monitoring tools was, of course, on my checklist, and it
certainly took more than a few minutes to do.

But the world of tech isn’t like that anymore. Just this morning, a
client’s infrastructure spun up a dozen new instances and spun down
half of them an hour later. I knew it happened only after the fact. The
monitoring systems knew about the events within seconds, and they adjusted
accordingly.

The tech world has changed dramatically in the past five years. Our beloved
tools of choice haven’t quite kept pace. Monitoring must be 100% automated,
both in registering new instances and services, and in de-registering them
all when they go away. Gone are the days when you can deal with a 5 (or
15!) minute delay in knowing something went wrong; many of the top
companies know within seconds that something isn’t right.

Continuing to rely on methodologies and tools from the old days, no matter
how much you enjoy them and know their travails, is holding you back from
giant leaps forward in your monitoring.

The bad old days of trying to pick between three equally terrible
monitoring tools are long over. You owe it to yourself and your company to
at least consider modern tooling—whether it’s SaaS or self-hosted
solutions.

2. You’re Chasing “the New Hotness”

At the other end of the spectrum is an affinity for new-and-exciting tools.
Companies like Netflix and Facebook publish some really cool stuff, sure.
But that doesn’t necessarily mean you should be using it.

Here’s the problem: you are (probably) not Facebook, Netflix, Google or
any of the other huge tech companies everyone looks up to. Cargo culting
never made anything better.

Adopting someone else’s tools or strategy because they’re successful with
them misses the crucial reasons why it works for them.

The tools don’t make an organization successful. The organization is
successful because of how its members think. Its approaches, beliefs,
people and strategy led the organization to create those tools. Its
success stems from something much deeper than, “We wrote our own monitoring
platform.”

To approach the same sort of success the industry titans are having, you
have to go deeper. What do they know that you don’t? What are they
doing, thinking, saying, believing that you aren’t?

Having been on the inside of many of those companies, I’ll let you in on
the secret: they’re good at the fundamentals. Really good. Mind-blowingly
good.

At first glance, this seems unrelated, but allow me to quote John Gall,
famed systems theorist:

A complex system that works is invariably found to have evolved
from a simple system that worked. A complex system designed from scratch
never works and cannot be patched up to make it work. You have to start
over, beginning with a working simple system.

Dr. Gall quite astutely points out the futility of adopting other people’s
tools wholesale. Those tools evolved from simple systems to suit the needs
of that organization and culture. Dropping such a complex system into
another organization or culture may not yield favorable results, simply
because you’re attempting to shortcut the hard work of evolving a simple
system.

So, you want the same success as the veritable titans of industry? The
answer is straightforward: start simple. Improve over time. Be patient.

3. You’re Unnecessarily Afraid of “Vendor Lock-in”

If there’s one argument I wish would die, it’s the one where people opine
about wanting to “avoid vendor lock-in”. That argument is utter hogwash.

What is “vendor lock-in”, anyway? It’s the notion that if you were to go
all-in on a particular vendor’s product, it would become prohibitively
difficult or expensive to change. Keurig’s K-cups are a famous example of
vendor lock-in. They can be used only with a Keurig coffee machine, and
a Keurig coffee machine accepts only the proprietary Keurig K-cups. By
buying a Keurig, you’re locked into the Keurig ecosystem.

Thus, if I were worried about being locked in to the Keurig ecosystem, I’d
just avoid buying a Keurig machine. Easy.

If I’m worried about vendor lock-in with, say, my server infrastructure,
what do I do? Roll out both Dell and HP servers together? That seems like a
really dumb idea. It makes my job way more difficult. I’d have to build to
the lowest common denominator of each product and ignore any
product-specific features, including the innovations that make a product
appealing. This ostensibly would allow me to avoid being locked in to one
vendor and keep any switching costs low, but it also means I’ve got a
solution that only half works and is a nightmare to manage at any sort of
scale. (Have you ever tried to build tools to manage and automate both
iDRAC and IPMI? You really don’t want to.)

In particular, you don’t get to take advantage of a product’s
unique features. By trying to avoid vendor lock-in, you end up with a
“solution” that ignores any advanced functionality.

When it comes to monitoring products, this is even worse. Composability and
interoperability are core tenets of most products available to you. The
state of monitoring solutions today favors a high degree of interoperability
and open APIs. Yes, a single vendor may have all of your data, but it’s
often trivial to move that same data to another vendor without a major loss
of functionality.

One particular problem with this whole vendor lock-in argument is that it’s
often used as an excuse to not buy SaaS or commercial, proprietary
applications. The perception is that by using only self-hosted, open-source
products, you gain more freedom.

That assumption is wrong. You haven’t gained more freedom or avoided vendor
lock-in at all. You’ve traded one vendor for another.

By opting to do it all yourself (usually poorly), you effectively become
your own vendor—a less experienced, more overworked vendor. The chances
you would design, build, maintain and improve a monitoring platform
better—on top of your regular duties—than a monitoring vendor? They round to
zero. Is tool-building really the business you want to be in?

In addition, switching costs from in-house solutions are astronomically
higher than from one commercial solution to another, because of the
interoperability that commercial vendors have these days. Can the same be
said of your in-house solution?

4. You’re Monitoring the Wrong Stuff

Many years ago, at one of my first jobs, I checked out a database server
and noticed it had high CPU utilization. I figured I would let my boss
know.

“Who complained about it?”, my boss asked.

“Well, no one”, I replied.

My boss’ response has stuck with me. It taught me a valuable lesson:
“If it’s not impacting anyone, is there really a problem?”

My lesson is this: data without context isn’t useful. In monitoring, a
metric matters only in the context of users. If low free memory is a
condition you notice but it’s not impacting users, it’s not worth
firing an alert.

In all my years of operations and system administration, I’ve not once seen
an OS metric directly indicate active user impact. A metric sometimes
can be an indirect indicator, but I’ve never seen it directly indicate an
issue.

Which brings me to the next point. With all of these metrics and logs from
the infrastructure, why is your monitoring not better off? The reason is
that Ops can solve only half the problem. Nginx workers, Tomcat garbage
collection and Redis key evictions are all important metrics for
understanding infrastructure performance, but none of them helps you
understand the software your business runs. The biggest value
of monitoring comes from instrumenting the applications on which your users
rely.
(Unless, of course, your business provides infrastructure as a
service—then, by all means, carry on.)

Nowhere is this more clear than in a SaaS company, so let’s consider
that as an example.

Let’s say you have an application that is a standard three-tier web app:
nginx on the front end, Rails application servers and PostgreSQL on the
back end. Every action on the site hits the PostgreSQL database.

You have all the standard data: access and error logs, nginx metrics, Rails
logs, Postgres metrics. All of that is great.

You know what’s even better? Knowing how long it takes for a user to log in.
Or how many logins occur per minute. Or even better: how many login
failures occur per minute.

The reason this information is so valuable is that it tells you about the
user experience directly. If login failures rose during the past five
minutes, you know you have a problem on your hands.

But, you can’t see this sort of information from the infrastructure
perspective alone. If I were to pay attention only to the
nginx/Rails/Postgres performance, I would miss this incident entirely. I
would miss something like a recent code deployment that changed some
login-related code, which caused logins to fail.

To solve this, become closer friends with your engineering team. Help them
identify useful instrumentation points in the code and implement more
metrics and logging. I’m a big fan of the statsd protocol for this sort of
thing; nearly every monitoring vendor supports it (or offers its own
implementation of it).
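
If you want a concrete picture of what that instrumentation can look like,
here’s a minimal, dependency-free sketch that emits login metrics over the
plain statsd wire protocol ("name:value|c" for counters, "name:value|ms" for
timers) via UDP. The metric names, host, port and the authenticate() stub are
illustrative assumptions; in practice, you’d more likely use your vendor’s
statsd client library.

    # Sketch of statsd-style instrumentation for logins. Assumes a
    # statsd-compatible daemon or agent listening locally on UDP 8125.
    import socket
    import time

    STATSD_ADDR = ("127.0.0.1", 8125)
    _sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def statsd_send(metric: str) -> None:
        # statsd is fire-and-forget UDP; a lost packet is just a lost sample.
        _sock.sendto(metric.encode("ascii"), STATSD_ADDR)

    def authenticate(user: str, password: str) -> bool:
        # Placeholder for your real login check.
        return user == "alice" and password == "hunter2"

    def login(user: str, password: str) -> bool:
        start = time.monotonic()
        ok = authenticate(user, password)
        duration_ms = (time.monotonic() - start) * 1000

        statsd_send("app.login.attempts:1|c")                    # counter
        statsd_send(f"app.login.duration:{duration_ms:.1f}|ms")  # timer
        if not ok:
            statsd_send("app.login.failures:1|c")                # the metric to alert on
        return ok

    login("alice", "wrong-password")  # increments app.login.failures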

5. You Are the Only One Who Cares

If you’re the only one who cares about monitoring, then system performance
and useful metrics will never meaningfully improve. You can’t do this alone.
You can’t even do this if only your team cares. I can’t begin to count how
many times I’ve seen Ops teams put in the effort to make improvements, only
to realize no one outside the team paid attention or thought it mattered.

Improving monitoring requires company-wide buy-in. Everyone from the
receptionist to the CEO has to believe in the value of what you’re doing.
Everyone in the company knows the business needs to make a profit.
Similarly, it requires a company-wide understanding that improving
monitoring improves the bottom line and protects the company’s profit.

Ask yourself: why do you care about monitoring?

Is it because it helps you catch and resolve incidents faster? Why is that
important to you?

Why should that be important to your manager? To your manager’s
manager? Why should the CEO care?

You need to answer those questions. When you do so, you can start making
compelling business arguments for the investments required (including in
the best new tools).

Need a starting point? Here are a few reasons why the business might care
about improving monitoring:

  • The business can manage and mitigate the risk of incidents and
    failures.
  • The business can spot areas for performance improvements, leading to a
    better customer experience and increased revenue.
  • The business can resolve incidents faster (often before they become
    critical), leading to more user goodwill and enhanced reputation.
  • The business avoids incidents going from bad to worse, which protects
    against loss of revenue and potential SLA penalty payments.
  • The business better controls infrastructure costs through capacity
    planning and forecasting, leading to improved profits and lower
    expenses.

I recommend having a candid conversation with your team on why they care
about monitoring. Be sure to involve management as well. Once you’ve had
those conversations, repeat them again with your engineering team. And your
product management team. And marketing. And sales. And customer support.

Monitoring impacts the entire company, and often in different ways. By the
time you find yourself in a conversation with executives to request an
investment in monitoring, you will be able to speak their language.

I hope you found at least a few ideas here to improve your monitoring.
Becoming world-class at this is a long, hard, expensive road, but the good
news is that you don’t need to be among the best to see massive benefits. A
few straightforward changes, added over time, can radically improve your
company’s monitoring. Go forth and fix it.

To recap:

  1. Use better tools. Replace them as better tools become available.
  2. But, don’t fixate on the tools. The tools are there to help you solve
    a problem—they aren’t the end goal.
  3. Don’t worry about vendor lock-in. Pick products you like and go all-in
    on them.
  4. Be careful about what you collect and on what you issue alerts. The
    best data tells you about things that have a direct user impact.
  5. Learn why your company cares about monitoring and express it in
    business outcomes. Only then can you really get the investment you
    want.

Good luck, and happy monitoring.
