{"id":2238,"date":"2018-11-01T16:13:47","date_gmt":"2018-11-01T16:13:47","guid":{"rendered":"https:\/\/www.appservgrid.com\/paw92\/?p=2238"},"modified":"2018-11-02T09:59:03","modified_gmt":"2018-11-02T09:59:03","slug":"why-your-server-monitoring-still-sucks","status":"publish","type":"post","link":"https:\/\/www.appservgrid.com\/paw92\/index.php\/2018\/11\/01\/why-your-server-monitoring-still-sucks\/","title":{"rendered":"Why Your Server Monitoring (Still) Sucks"},"content":{"rendered":"<p><em>Five observations about why your your server monitoring still<br \/>\nstinks by a monitoring specialist-turned-consultant.<\/em><\/p>\n<p>Early in my career, I was responsible for managing a large fleet of<br \/>\nprinters across a large campus. We&#8217;re talking several hundred networked<br \/>\nprinters. It often required a 10- or 15-minute walk to get to<br \/>\nsome of those printers physically, and many were used only sporadically. I<br \/>\ndidn&#8217;t<br \/>\nalways know what was happening until I arrived, so it was anyone&#8217;s<br \/>\nguess as to the problem. Simple paper jam? Driver issue? Printer currently<br \/>\non fire? I found out only after the long walk. Making this even more<br \/>\nfrustrating for everyone was that, thanks to the infrequent use of some of<br \/>\nthem, a printer with a problem might go unnoticed for weeks, making itself<br \/>\nknown only when someone tried to print with it.<\/p>\n<p>Finally, it occurred to me: wouldn&#8217;t it be nice if I knew about the problem<br \/>\nand the cause <em>before<\/em> someone called me? I found my first monitoring tool<br \/>\nthat day, and I was absolutely hooked.<\/p>\n<p>Since then, I&#8217;ve helped numerous people overhaul their monitoring<br \/>\nsystems. In doing so, I noticed the same challenges repeat themselves regularly. If<br \/>\nyou&#8217;re responsible for managing the systems at your organization, read<br \/>\non; I have much advice to dispense.<\/p>\n<p>So, without further ado, here are my top five reasons why your monitoring<br \/>\nis crap and what you can do about it.<\/p>\n<h3>1. You&#8217;re Using Antiquated Tools<\/h3>\n<p>By far, the most common reason for monitoring being screwed up is a<br \/>\nreliance on antiquated tools. You know that&#8217;s your issue when you spend<br \/>\nmore time working around the warts of your monitoring tools or when<br \/>\nyou&#8217;ve got a bunch of custom code to get around some major missing<br \/>\nfunctionality. But the bottom line is that you spend more time trying to<br \/>\nfix the almost-working tools than just getting on with your job.<\/p>\n<p>The problem with using antiquated tools and methodologies is that<br \/>\nyou&#8217;re just making it harder for yourself. I suppose it&#8217;s certainly<br \/>\npossible to dig a hole with a rusty spoon, but wouldn&#8217;t you prefer to use a<br \/>\nshovel?<\/p>\n<p>Great tools are invisible. They make you more effective, and the job is<br \/>\neasier to accomplish. When you have great tools, you don&#8217;t even notice<br \/>\nthem.<\/p>\n<p>Maybe you don&#8217;t describe your monitoring tools as &#8220;easy to use&#8221;<br \/>\nor &#8220;invisible&#8221;. 
This checklist can help you determine whether you're screwing yourself:

- Are you using Nagios or a Nagios derivative to monitor elastic/ephemeral infrastructure?
- Is there a manual step in your deployment process where a human has to "add $thing to monitoring"?
- How many post-mortems contained an action item such as, "We weren't monitoring $thing"?
- Do you have a cron job that tails a log file and sends an email via sendmail?
- Do you have a syslog server to which all your systems forward their logs...never to be seen again?
- Do you collect system metrics only every five minutes (or even less often)?

If you answered yes to any of those, you are relying on bad, old-school tooling. My condolences.

The good news is your situation isn't permanent. With a little work, you can fix it.

If you're ready to change, that is.

It is somewhat amusing (or depressing?) that we in Ops so readily replace entire stacks, redesign deployments in a week, swap configuration management tools and introduce modern technologies such as Docker and serverless, all without any significant vetting period.

Yet changing a monitoring platform is *verboten*. What gives?

I think the answer lies in the reality of the state of monitoring at many companies. Things are pretty bad. They're messy, inconsistent in configuration, lacking a coherent strategy, short on automation...but it's all built on the tools we know. We know their failure modes; we know their warts.

For example, the industry has spent years and a staggering number of development hours bolting things onto Nagios to make it more palatable (nagios-herald, NagiosQL, OMD and so on) instead of asking, "Are we throwing good money after bad?"

The answer is yes. Yes, we are.

Not to pick on Nagios...okay, yes, I'm going to pick on Nagios. Every change to the Nagios config, such as adding or removing a host, requires a config reload. In an infrastructure built on ephemeral systems, such as containers, the entire fleet may turn over every few minutes. If you have two dozen containers churning every 15 minutes, it's possible that Nagios is reloading its config more than once a minute. That's insane.

And what about your metrics? The old way to decide whether something was broken was to compare the current value of a check's output against a threshold. That clearly results in some false alarms, so we added the ability to fire an alert only if N consecutive checks violated the threshold. That has a pretty glaring problem too: if you get your data every minute, you may not know about a problem until three to five minutes after it started. If you're getting your data every five minutes, it's even worse.

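To make that delay concrete, here's a minimal sketch (my illustration, not any particular tool's implementation) of the "N consecutive violations" logic; the one-minute sample interval, the threshold and N=3 are assumed values.

```python
# Sketch of "alert only after N consecutive threshold violations".
# Samples arrive once per minute; threshold and N are illustrative assumptions.

CONSECUTIVE_VIOLATIONS = 3    # require three bad checks in a row
THRESHOLD = 90.0              # e.g., percent CPU

def should_alert(samples, threshold=THRESHOLD, n=CONSECUTIVE_VIOLATIONS):
    """Return True once the most recent n samples all exceed the threshold."""
    return len(samples) >= n and all(v > threshold for v in samples[-n:])

# A fault that starts just after a sample is taken waits almost a full minute
# for its first bad data point, then needs two more before anything fires, so
# detection trails the fault by several minutes. With five-minute samples, the
# same logic means waiting 15 minutes or more.
samples = []
for minute, cpu in enumerate([55, 95, 97, 99, 98]):
    samples.append(cpu)
    if should_alert(samples):
        print(f"alert fired at minute {minute}")  # fires at minute 3
        break
```

You can shrink N or the interval, but then the false alarms come back; the model itself is the limitation.
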
And while I'm on my soapbox, let's talk about automation. I remember back when I was responsible for a dozen servers. It was a big day when I spun up server #13; these sorts of things happened only every few months. Adding my new server to my monitoring tools was, of course, on my checklist, and it certainly took more than a few minutes to do.

But the world of tech isn't like that anymore. Just this morning, a client's infrastructure spun up a dozen new instances and spun down half of them an hour later. I knew it happened only after the fact. The monitoring systems knew about the events within seconds, and they adjusted accordingly.

The tech world has changed dramatically in the past five years, and our beloved tools of choice haven't quite kept pace. Monitoring must be 100% automated, both in registering new instances and services and in de-registering them when they go away. Gone are the days when you could tolerate a 5- (or 15!) minute delay in learning that something went wrong; many of the top companies know within seconds that something isn't right.

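As a rough sketch of what fully automated registration can look like (my example, not something from the article): each instance adds itself to a service catalog at boot and removes itself at shutdown, and the monitoring system discovers its targets from that catalog rather than from a hand-edited config. The sketch below assumes a local Consul agent and made-up service names; monitoring systems such as Prometheus can discover scrape targets from Consul, so nothing needs a reload when instances churn.

```python
# Sketch: an instance registers itself for monitoring at boot and deregisters
# at shutdown. Consul's agent API is used purely as an illustration; the
# service name, port and health-check URL are made-up values.
import json
import urllib.request

CONSUL = "http://127.0.0.1:8500"

def register(service_id: str, name: str, port: int) -> None:
    payload = {
        "ID": service_id,
        "Name": name,
        "Port": port,
        "Check": {
            "HTTP": f"http://127.0.0.1:{port}/health",
            "Interval": "10s",
            # If the instance dies without cleaning up, drop it automatically.
            "DeregisterCriticalServiceAfter": "1m",
        },
    }
    req = urllib.request.Request(
        f"{CONSUL}/v1/agent/service/register",
        data=json.dumps(payload).encode(),
        method="PUT",
    )
    urllib.request.urlopen(req)

def deregister(service_id: str) -> None:
    urllib.request.urlopen(
        urllib.request.Request(
            f"{CONSUL}/v1/agent/service/deregister/{service_id}", method="PUT"
        )
    )

# Wire these into the instance's startup and shutdown hooks, e.g.:
# register("web-0a1b2c", "web", 8080) on boot, deregister("web-0a1b2c") on exit.
```

The exact mechanism matters far less than the property: no human in the loop, and registration and de-registration happen within seconds.
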
Continuing to rely on methodologies and tools from the old days, no matter how much you enjoy them and know their travails, is holding you back from giant leaps forward in your monitoring.

The bad old days of trying to pick between three equally terrible monitoring tools are long over. You owe it to yourself and your company to at least consider modern tooling, whether SaaS or self-hosted.

### 2. You're Chasing "the New Hotness"

At the other end of the spectrum is an affinity for new-and-exciting tools. Companies like Netflix and Facebook publish some really cool stuff, sure. But that doesn't necessarily mean you should be using it.

Here's the problem: you are (probably) not Facebook, Netflix, Google or any of the other huge tech companies everyone looks up to. [Cargo culting](https://www.scientificamerican.com/article/1959-cargo-cults-melanesia) never made anything better.

Adopting someone else's tools or strategy because they're successful with them misses the crucial reasons *why* those tools work for them.

The tools don't make an organization successful. The organization is successful because of how its members think. Its approaches, beliefs, people and strategy led it to create those tools. Its success stems from something much deeper than "we wrote our own monitoring platform".

To approach the same sort of success the industry titans are having, you have to go deeper. What do they know that you don't? What are they doing, thinking, saying and believing that you aren't?

Having been on the inside of many of those companies, I'll let you in on the secret: they're good at the fundamentals. Really good. Mind-blowingly good.

At first glance, this seems unrelated, but allow me to quote John Gall, famed systems theorist:

> A complex system that works is invariably found to have evolved from a simple system that worked. A complex system designed from scratch never works and cannot be patched up to make it work. You have to start over, beginning with a working simple system.

Dr. Gall quite astutely points out the futility of adopting other people's tools wholesale. Those tools evolved from simple systems to suit the needs of that organization and culture. Dropping such a complex system into another organization or culture may not yield favorable results, simply because you're attempting to shortcut the hard work of evolving a simple system.

So, you want the same success as the veritable titans of industry? The answer is straightforward: start simple. Improve over time. Be patient.

### 3. You're Unnecessarily Afraid of "Vendor Lock-in"

If there's one argument I wish would die, it's the one where people opine about wanting to "avoid vendor lock-in". That argument is utter hogwash.

What is "vendor lock-in", anyway? It's the notion that if you go all-in on a particular vendor's product, it becomes prohibitively difficult or expensive to change later. Keurig's K-cups are a famous example: they can be used only with a Keurig coffee machine, and a Keurig coffee machine accepts only the proprietary K-cups. By buying a Keurig, you're locked into the Keurig ecosystem.

Thus, if I were worried about being locked into the Keurig ecosystem, I'd simply avoid buying a Keurig machine. Easy.

If I'm worried about vendor lock-in with, say, my server infrastructure, what do I do? Roll out both Dell and HP servers together? That seems like a really dumb idea. It makes my job far more difficult: I'd have to build to the lowest common denominator of each product and ignore any product-specific features, including the innovations that make a product appealing. This ostensibly lets me avoid being locked into one vendor and keeps switching costs low, but it also means I've got a solution that only half works and is a nightmare to manage at any sort of scale. (Have you ever tried to build tools to manage and automate both iDRAC and IPMI? You really don't want to.)

In particular, you don't get to take advantage of a product's unique features. By trying to avoid vendor lock-in, you end up with a "solution" that ignores any advanced functionality.

When it comes to monitoring products, this is even worse. Composability and interoperability are core tenets of most products available to you. The state of monitoring solutions today favors a high degree of interoperability and open APIs. Yes, a single vendor may have all of your data, but it's often trivial to move that same data to another vendor without a major loss of functionality.

One particular problem with the vendor lock-in argument is that it's often used as an excuse not to buy SaaS or commercial, proprietary applications. The perception is that by using only self-hosted, open-source products, you gain more freedom.

That assumption is wrong. You haven't gained more freedom or avoided vendor lock-in at all. You've traded one vendor for another.

By opting to do it all yourself (usually poorly), you effectively become your own vendor: a less experienced, more overworked vendor. The chances that you would design, build, maintain and improve a monitoring platform better than a monitoring vendor, on top of your regular duties? They round to zero. Is tool-building really the business you want to be in?

In addition, switching costs from an in-house solution are astronomically higher than from one commercial solution to another, because of the interoperability commercial vendors offer these days. Can the same be said of your in-house solution?

### 4. You're Monitoring the Wrong Stuff

Many years ago, at one of my first jobs, I checked out a database server and noticed it had high CPU utilization. I figured I would let my boss know.

"Who complained about it?", my boss asked.

"Well, no one", I replied.

My boss's response has stuck with me, and it taught me a valuable lesson: "If it's not impacting anyone, is there really a problem?"

The lesson is this: data without context isn't useful. In monitoring, a metric matters only in the context of users. If low free memory is a condition you notice but it isn't impacting users, it's not worth firing an alert.

In all my years of operations and system administration, I've not once seen an OS metric directly indicate active user impact. A metric can sometimes be an *indirect* indicator, but I've never seen it *directly* indicate an issue.

Which brings me to the next point. With all of these metrics and logs from the infrastructure, why is your monitoring not better off? The reason is that Ops can solve only half the problem. nginx workers, Tomcat garbage collection and Redis key evictions are all important metrics for understanding infrastructure performance, but none of them helps you understand the software your business runs. The biggest value of monitoring comes from instrumenting the applications your users rely on. (Unless, of course, your business provides infrastructure as a service; in that case, by all means, carry on.)

Nowhere is this clearer than in a SaaS company, so let's consider that as an example.

Let's say you have a standard three-tier web app: nginx on the front end, Rails application servers and PostgreSQL on the back end. Every action on the site hits the PostgreSQL database.

You have all the standard data: access and error logs, nginx metrics, Rails logs, Postgres metrics. All of that is great.

You know what's even better? Knowing how long it takes for a user to log in. Or how many logins occur per minute. Or, even better, how many login failures occur per minute.

The reason this information is so valuable is that it *tells you about the user experience directly*. If login failures rose during the past five minutes, you know you have a problem on your hands.

But you can't see this sort of information from the infrastructure perspective alone. If I were paying attention only to nginx, Rails and Postgres performance, I would miss this incident entirely. I would miss something like a recent code deployment that changed some login-related code and caused logins to fail.

To solve this, become closer friends with your engineering team. Help them identify useful instrumentation points in the code and implement more metrics and logging. I'm a big fan of the statsd protocol for this sort of thing; nearly every monitoring vendor supports it (or their own implementation of it).

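The statsd wire format is simple enough to emit without a client library: plain-text datagrams such as `app.login.failure:1|c`, sent over UDP, conventionally to port 8125. Here's a small sketch of instrumenting the login path; the metric names, address and the stub login handler are all made up for illustration, and most client libraries wrap exactly this.

```python
# Minimal statsd emitter: counters ("|c") and timers ("|ms") over UDP.
# The metric names, address and login handler here are illustrative only.
import socket
import time

STATSD_ADDR = ("127.0.0.1", 8125)
_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def incr(metric: str, value: int = 1) -> None:
    _sock.sendto(f"{metric}:{value}|c".encode(), STATSD_ADDR)

def timing(metric: str, millis: float) -> None:
    _sock.sendto(f"{metric}:{int(millis)}|ms".encode(), STATSD_ADDR)

def check_credentials(username: str, password: str) -> bool:
    # Stand-in for your real authentication logic.
    return bool(username and password)

def handle_login(username: str, password: str) -> bool:
    start = time.monotonic()
    ok = check_credentials(username, password)
    timing("app.login.duration", (time.monotonic() - start) * 1000)
    incr("app.login.success" if ok else "app.login.failure")
    return ok
```

A graph of `app.login.failure` per minute tells you about the user experience directly, which is exactly the signal that nginx, Rails and Postgres metrics alone can't give you.
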
### 5. You Are the Only One Who Cares

If you're the only person who cares about monitoring, then system performance and useful metrics will never meaningfully improve. You can't do this alone. You can't even do this if only your team cares. I can't begin to count how many times I've seen Ops teams put in the effort to make improvements, only to realize that no one outside the team paid attention or thought it mattered.

Improving monitoring requires company-wide buy-in. Everyone from the receptionist to the CEO has to believe in the value of what you're doing. Everyone in the company understands that the business needs to make a profit; in the same way, everyone needs to understand that improving monitoring improves the bottom line and protects that profit.

Ask yourself: why do you care about monitoring?

Is it because it helps you catch and resolve incidents faster? Why is that important to you?

Why should that be important to your manager? To your manager's manager? Why should the CEO care?

You need to answer those questions. Once you do, you can start making compelling business arguments for the investments required (including in the best new tools).

Need a starting point? Here are a few reasons the business might care about improving monitoring:

- The business can manage and mitigate the risk of incidents and failures.
- The business can spot areas for performance improvements, leading to a better customer experience and increased revenue.
- The business can resolve incidents faster (often before they become critical), leading to more user goodwill and an enhanced reputation.
- The business avoids incidents going from bad to worse, which protects against loss of revenue and potential SLA penalty payments.
- The business better controls infrastructure costs through capacity planning and forecasting, leading to improved profits and lower expenses.

I recommend having a candid conversation with your team about why they care about monitoring. Be sure to involve management as well. Once you've had those conversations, repeat them with your engineering team. And your product management team. And marketing. And sales. And customer support.

Monitoring impacts the entire company, and often in different ways. By the time you find yourself in a conversation with executives to request an investment in monitoring, you will be able to speak their language. Go forth and fix your monitoring; I hope you've found at least a few ideas here to improve it.
Becoming world-class at monitoring is a long, hard, expensive road, but the good news is that you don't need to be among the best to see massive benefits. A few straightforward changes, added over time, can radically improve your company's monitoring.

To recap:

1. Use better tools. Replace them as better tools become available.
2. But don't fixate on the tools. The tools are there to help you solve a problem; they aren't the end goal.
3. Don't worry about vendor lock-in. Pick products you like and go all-in on them.
4. Be careful about what you collect and what you alert on. The best data tells you about things that have a direct user impact.
5. Learn why your company cares about monitoring and express it in business outcomes. Only then can you really get the investment you want.

Good luck, and happy monitoring.

[Source](https://www.linuxjournal.com/content/why-your-server-monitoring-still-sucks)