{"id":2202,"date":"2018-10-31T20:13:00","date_gmt":"2018-10-31T20:13:00","guid":{"rendered":"https:\/\/www.appservgrid.com\/paw92\/?p=2202"},"modified":"2018-11-02T10:22:48","modified_gmt":"2018-11-02T10:22:48","slug":"cloudwatch-is-of-the-devil-but-i-must-use-it","status":"publish","type":"post","link":"https:\/\/www.appservgrid.com\/paw92\/index.php\/2018\/10\/31\/cloudwatch-is-of-the-devil-but-i-must-use-it\/","title":{"rendered":"CloudWatch Is of the Devil, but I Must Use It"},"content":{"rendered":"<p><em>Let&#8217;s talk about Amazon CloudWatch.<\/em><\/p>\n<p>For those fortunate enough to not be stuck in the weeds of Amazon Web<br \/>\nServices (AWS), CloudWatch is, and I quote from the official<br \/>\n<a href=\"https:\/\/aws.amazon.com\/cloudwatch\">AWS description<\/a>, &#8220;a monitoring and<br \/>\nmanagement service built for developers, system operators, site reliability<br \/>\nengineers (SRE), and IT managers.&#8221; This is all well and good, except for the<br \/>\npart where there isn&#8217;t a single named constituency who enjoys working with<br \/>\nthe product. Allow me to dispense some monitoring heresy.<\/p>\n<p>Better, let me describe this in the context of the 14 <a href=\"https:\/\/www.amazon.jobs\/principles\">Amazon<br \/>\nLeadership Principles<\/a> that reportedly guide every decision Amazon makes.<br \/>\nWhen you take a hard look at CloudWatch&#8217;s complete failure across all<br \/>\n14 Leadership Principles, you wonder how this product ever made it out<br \/>\nthe door in its current state.<\/p>\n<h3>&#8220;Frugality&#8221;<\/h3>\n<p>I&#8217;ll start with billing. Normally left for the tail end of articles like<br \/>\nthis, the CloudWatch billing paradigm is so terrible, I&#8217;m leading with<br \/>\nit instead. You get billed per metric, per month. You get billed per<br \/>\nthousand metrics you request to view via the API. You get billed per<br \/>\ndashboard per month. You get billed per alarm per month. 
You get charged for<br \/>\nlogs based upon data volume ingested, data volume stored and &#8220;vended logs&#8221;<br \/>\nthat get published natively by AWS services on behalf of the customer. And,<br \/>\nyou get billed per custom event. All of this can be summed up best as<br \/>\n&#8220;nobody on the planet understands how your CloudWatch metrics and logs get<br \/>\nbilled&#8221;, and it leads to scenarios where monitoring vendors can inadvertently<br \/>\ncost you thousands of dollars by polling CloudWatch too frequently. When the<br \/>\nAWS charges are larger than what you&#8217;re paying your monitoring vendor, it&#8217;s<br \/>\nnot a wonderful feeling.<\/p>\n<h3>&#8220;Invent and Simplify&#8221;<\/h3>\n<p>CloudWatch Logs, CloudWatch Events, Custom Metrics, Vended Logs and Custom<br \/>\nDashboards all mean different things internally to CloudWatch from what you&#8217;d<br \/>\nexpect, compared to metrics solutions that actually make some fathomable<br \/>\nlevel of sense. There are, thus, multiple services that do very different<br \/>\nthings, all operating under the &#8220;CloudWatch&#8221; moniker. For example, it&#8217;s not<br \/>\nparticularly intuitive to most people that scheduling a Lambda function to<br \/>\ninvoke once an hour requires a custom CloudWatch Event. It feels overly<br \/>\ncomplicated, incredibly confusing, and very quickly, you find yourself in a<br \/>\nsituation where you&#8217;re having to build complex relationships to monitor<br \/>\nthings that are themselves far simpler.<\/p>\n<h3>&#8220;Think Big&#8221;<\/h3>\n<p>All business people, when asked what they want from a monitoring platform,<br \/>\nwill respond with something that resembles &#8220;a dashboard&#8221; or &#8220;a<br \/>\nsingle pane of glass view&#8221;. 
CloudWatch offers minutiae up the wazoo, but<br \/>\nit categorically offers no global view, no green\/yellow\/red status<br \/>\nindicator that gives you even a glimmer of the overall health of your site.<br \/>\nWant a graph of each core in your instance&#8217;s CPU for the past 30<br \/>\nseconds? Easy! Want to know if your entire company should be putting out the<br \/>\nburning fire that is the current production state of your website? Keep<br \/>\nlooking\u2014CloudWatch has nothing to offer you.<\/p>\n<h3>&#8220;Insist on the Highest Standards&#8221;<\/h3>\n<p>By its very nature, CloudWatch feels like small thinking. The entire<br \/>\nexperience, start to finish, smacks of &#8220;what&#8217;s the absolute least we<br \/>\ncould do and get away with it?&#8221; They built their MVP, and then just<br \/>\nsorta&#8230;stopped, frozen in amber. They created a set of building blocks,<br \/>\nexcept they didn&#8217;t solve the problem of &#8220;how do I monitor my AWS resources?&#8221;<br \/>\nInstead, it feels like the entire team phoned it in and let a large market<br \/>\nof monitoring vendors develop as a result. None of those vendors have the<br \/>\nlevel of access to the raw data that CloudWatch does; all of them have built<br \/>\nbetter products. You&#8217;d think the CloudWatch team would take a clue from<br \/>\nthe innovation that&#8217;s rapidly happening in this space, but that&#8217;d<br \/>\nrequire someone to Learn and Be Curious.<\/p>\n<h3>&#8220;Are Right, a Lot&#8221;<\/h3>\n<p>Recent data is &#8220;eventually consistent&#8221;, so you always get graphs like the<br \/>\none shown in Figure 1.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.linuxjournal.com\/sites\/default\/files\/styles\/max_650x650\/public\/u%5Buid%5D\/12565f1.png\" alt=\"CloudWatch Graph\" width=\"506\" height=\"590\" \/><\/p>\n<p><em>Figure 1. 
Example CloudWatch Graph<\/em><\/p>\n<p>Here in reality, that would be a terrifying thing to see on an <em>accurate<br \/>\ndashboard<\/em>\u2014something is obviously very wrong with your site! For better or<br \/>\nworse, the &#8220;accurate&#8221; description doesn&#8217;t apply to CloudWatch, and that&#8217;s<br \/>\njust how your graphs always look. &#8220;Your metrics will be eventually<br \/>\nconsistent&#8221; is very nearly the last thing you want to hear about your<br \/>\nmonitoring platform, second only to &#8220;what metrics?&#8221; This ties directly<br \/>\nto&#8230;<\/p>\n<h3>&#8220;Earn Trust&#8221;<\/h3>\n<p>Let me be very clear here\u2014the real issue isn&#8217;t the ingestion problem.<br \/>\nAbsolutely every vendor on the planet has the same issue\u2014you can&#8217;t<br \/>\ndisplay data you don&#8217;t have. Where CloudWatch drops the ball is in<br \/>\nexposing this behavior to the end user without explanation as to what&#8217;s<br \/>\ngoing on. Thus, until you grow accustomed to it, you have a heart-stopping<br \/>\nmoment of &#8220;what the hell just happened to the site&#8221; whenever you<br \/>\nglance at a dashboard. This conditions you to be entirely too calm when<br \/>\nlooking at sensible dashboards when a disaster just happened. If you trust<br \/>\nwhat the CloudWatch dashboards show you, you&#8217;re making a terrible<br \/>\nmistake.<\/p>\n<h3>&#8220;Dive Deep&#8221;<\/h3>\n<p>If you&#8217;re using Lambda or Fargate, you have no choice but to use CloudWatch<br \/>\nLogs, wherein searching for everything is absolutely terrible. 
If you&#8217;re<br \/>\nusing CloudWatch Logs to diagnose anything, congratulations: you&#8217;re<br \/>\ndiving so deep, you may drown before making it back to the surface.<br \/>\nFor example, if I have a Lambda function that throws an error, in order to<br \/>\ndiagnose the problem, I must:<\/p>\n<ul>\n<li>Discover that it encountered an error in the first place by looking at<br \/>\nthe invocation error CloudWatch dashboard. I also could set up a filter to<br \/>\nrun a continuous query on the logs and alert when something shows up, except<br \/>\nthat isn&#8217;t natively supported\u2014I need a third-party tool for that (such as<br \/>\nPagerDuty).<\/li>\n<li>Go diving into a variety of CloudWatch log groups and find the one named<br \/>\nafter the specific erroring function.<\/li>\n<li>Scroll manually through the many, many, many pages of log streams to find the<br \/>\nspecific invocation that threw an error.<\/li>\n<li>Realize that the JSON object that&#8217;s retained isn&#8217;t enough to troubleshoot<br \/>\nwith, cry in despair, and go write an article just like this one.<\/li>\n<li>Do some quick math and realize I&#8217;m paying an uncomfortable percentage of my<br \/>\nAWS bill for a service that&#8217;s only of somewhat marginal utility at best.<\/li>\n<\/ul>\n<h3>&#8220;Deliver Results&#8221;<\/h3>\n<p>All of your metrics, all of your logs\u2014they&#8217;re locked away inside<br \/>\nCloudWatch&#8217;s various components. You&#8217;re not going to find a<br \/>\n&#8220;page me when this threshold is exceeded&#8221; option in CloudWatch; your<br \/>\noptions are relegated to &#8220;design an alert delivery pipeline with baling<br \/>\nwire and SNS&#8221; or pay a non-AWS vendor for another monitoring product.<\/p>\n<h3>&#8220;Customer Obsession&#8221;<\/h3>\n<p>CloudWatch keeps all of your metrics. It keeps your logs. It lets you build<br \/>\ncustom dashboards to view your metrics all in one place. 
The building blocks<br \/>\nof a great service are already here\u2014it&#8217;s the expression of that utility<br \/>\nthat falls short, sometimes drastically. The fact that large monitoring<br \/>\nvendors are premier sponsors of AWS events would be laughable if CloudWatch<br \/>\never were to get its act together. You&#8217;d not need a third party to make<br \/>\nsense of a pure AWS environment, and many of them would starve to death as<br \/>\nthey grow too weak to interrupt your conversation to ask if they can scan<br \/>\nyour badge. Choosing to use CloudWatch vs. literally anything else is like<br \/>\nbuying a car. &#8220;Why yes, I would like to buy the Yugo instead of the Honda.<br \/>\nAfter all, it checks all the boxes of technically being a car, so it&#8217;s fine,<br \/>\nright?&#8221;<\/p>\n<h3>&#8220;Disagree and Commit&#8221;<\/h3>\n<p>It may very well be that the root cause of many of CloudWatch&#8217;s failings<br \/>\ncomes from the product engineers who built it misunderstanding this<br \/>\n(admittedly slippery!) Leadership Principle. It&#8217;s envisioned as<br \/>\npassionately expressing your reservations about a decision, but once<br \/>\nthe decision is made, committing to it fully.<br \/>\nUnfortunately, it appears that the engineering teams responsible for<br \/>\nCloudWatch decided to &#8220;Disagree in Commits&#8221; and inflict their<br \/>\narguments upon the world in the form of the product.<\/p>\n<h3>&#8220;Ownership&#8221;<\/h3>\n<p>If I were to go on the internet and post about how terrible virtually any<br \/>\nother AWS service was, people would rally to that service&#8217;s defense.<br \/>\nIt&#8217;s the internet; people will do that. 
But when these and many more<br \/>\nsimilar comments about CloudWatch appear, and nobody from AWS pipes in to<br \/>\nsay &#8220;wow, I&#8217;m sorry, why do you feel that way?&#8221;, it&#8217;s<br \/>\nabundantly clear that if any people on the CloudWatch team really care about<br \/>\nthe product, they&#8217;ve been locked in a malfunctioning bathroom stall for<br \/>\nthe better part of a decade. <a href=\"https:\/\/twitter.com\/shinzui\/status\/788939026996744192\">These<\/a> <a href=\"https:\/\/news.ycombinator.com\/item?id=12235003\">comments<\/a> go back at least that far, but<br \/>\n<a href=\"https:\/\/www.reddit.com\/r\/devops\/comments\/8n3fpz\/is_cloudwatch_logs_really_terrible_or_am_i_just\/dzsz8qg\">Amazon<\/a><br \/>\n<a href=\"https:\/\/www.reddit.com\/r\/devops\/comments\/4zhgtl\/how_do_you_feel_about_aws_cloudwatch_how_do_you\">is<\/a><br \/>\n<a href=\"https:\/\/twitter.com\/guisim\/status\/248394260704014336\">totally<\/a><br \/>\n<a href=\"https:\/\/twitter.com\/calebhailey\/status\/1032800895203864576\">on<\/a><br \/>\n<a href=\"https:\/\/news.ycombinator.com\/item?id=14604644\">it<\/a>, rocking<br \/>\nthe company&#8217;s &#8220;Bias for Action&#8221; principle.<\/p>\n<h3>&#8220;Hire and Develop the Best&#8221;<\/h3>\n<p>The people who build CloudWatch aren&#8217;t terrible at their jobs; I<br \/>\ngenuinely believe they don&#8217;t quite grasp how their product is perceived.<br \/>\nGiven that it&#8217;s poor form to write a rant like this and not offer<br \/>\nsuggestions for positive improvement, here are some product enhancements I&#8217;d<br \/>\nlike to see:<\/p>\n<ul>\n<li>Give me the option to rate-limit API calls at arbitrary levels rather than<br \/>\nbeing surprised at month end by a bill that&#8217;s approximately Zanzibar&#8217;s<br \/>\nGDP.<\/li>\n<li>&#8220;Here&#8217;s an error that your Lambda function threw, here&#8217;s the log output from<br \/>\nthat specific function&#8221; should be at most two clicks away\u2014not 
30.<\/li>\n<li>If your dog has a litter of 14 puppies, perhaps you don&#8217;t need to name<br \/>\nall of them subtle variations of the term &#8220;CloudWatch&#8221;. The proliferation of<br \/>\nservices and companies that all start with the word &#8220;Cloud&#8221; is the subject<br \/>\nof a completely separate rant.<\/li>\n<\/ul>\n<p>Please don&#8217;t misunderstand me. I use, enjoy and promote AWS services,<br \/>\nand I&#8217;m considered to be &#8220;an authentic voice&#8221; largely because in<br \/>\naddition to praising things that are wonderful, I&#8217;ll call out things<br \/>\nthat aren&#8217;t, as I&#8217;ve just done. I&#8217;ve built my career and<br \/>\nbusiness on working within that ecosystem. I find AWS employees to be<br \/>\nintelligent and well-intentioned, and most of their services quite good.<br \/>\nCloudWatch could get there with some work, but it&#8217;s got a number of very<br \/>\npainful usability issues that keep it from being good, let alone great.<\/p>\n<p><a href=\"https:\/\/www.linuxjournal.com\/content\/cloudwatch-devil-i-must-use-it\" target=\"_blank\" rel=\"noopener\">Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Let&#8217;s talk about Amazon CloudWatch. 
For those fortunate enough to not be stuck in the weeds of Amazon Web Services (AWS), CloudWatch is, and I quote from the official AWS description, &#8220;a monitoring and management service built for developers, system operators, site reliability engineers (SRE), and IT managers.&#8221; This is all well and good, except &hellip; <\/p>\n<p class=\"link-more\"><a href=\"https:\/\/www.appservgrid.com\/paw92\/index.php\/2018\/10\/31\/cloudwatch-is-of-the-devil-but-i-must-use-it\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;CloudWatch Is of the Devil, but I Must Use It&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-2202","post","type-post","status-publish","format-standard","hentry","category-linux"],"_links":{"self":[{"href":"https:\/\/www.appservgrid.com\/paw92\/index.php\/wp-json\/wp\/v2\/posts\/2202","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.appservgrid.com\/paw92\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.appservgrid.com\/paw92\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.appservgrid.com\/paw92\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.appservgrid.com\/paw92\/index.php\/wp-json\/wp\/v2\/comments?post=2202"}],"version-history":[{"count":1,"href":"https:\/\/www.appservgrid.com\/paw92\/index.php\/wp-json\/wp\/v2\/posts\/2202\/revisions"}],"predecessor-version":[{"id":2378,"href":"https:\/\/www.appservgrid.com\/paw92\/index.php\/wp-json\/wp\/v2\/posts\/2202\/revisions\/2378"}],"wp:attachment":[{"href":"https:\/\/www.appservgrid.com\/paw92\/index.php\/wp-json\/wp\/v2\/media?parent=2202"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.appservgrid.com\/paw92\/ind
ex.php\/wp-json\/wp\/v2\/categories?post=2202"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.appservgrid.com\/paw92\/index.php\/wp-json\/wp\/v2\/tags?post=2202"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}