{"id":10108,"date":"2019-02-24T02:16:56","date_gmt":"2019-02-24T02:16:56","guid":{"rendered":"https:\/\/www.appservgrid.com\/paw92\/?p=10108"},"modified":"2019-03-14T10:30:42","modified_gmt":"2019-03-14T10:30:42","slug":"infrastructure-monitoring-defense-against-surprise-downtime","status":"publish","type":"post","link":"https:\/\/www.appservgrid.com\/paw92\/index.php\/2019\/02\/24\/infrastructure-monitoring-defense-against-surprise-downtime\/","title":{"rendered":"Infrastructure monitoring: Defense against surprise downtime"},"content":{"rendered":"<div class=\"os-article__top\">\n<div class=\"os-article__top-inner\">\n<div class=\"panel-pane pane-entity-field pane-node-field-article-subhead\">\n<div class=\"field field-name-field-article-subhead field-type-text-long field-label-hidden\">\n<div class=\"field-items\">\n<h2>A strong monitoring and alert system based on open source tools prevents problems before they affect your infrastructure.<\/h2>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"panel-pane pane-os-content-article-byline\"><\/div>\n<\/div>\n<\/div>\n<div class=\"os-article__image\">\n<div class=\"panel-pane pane-entity-field pane-node-field-lead-image\">\n<div class=\"field field-name-field-lead-image field-type-image field-label-hidden\">\n<div class=\"field-items\">\n<div class=\"field-item even\"><img loading=\"lazy\" decoding=\"async\" class=\"image-full-size\" title=\"Analytics: Charts and Graphs\" src=\"https:\/\/opensource.com\/sites\/default\/files\/styles\/image-full-size\/public\/lead-images\/analytics-graphs-charts.png?itok=sersoqbV\" alt=\"Analytics: Charts and Graphs\" width=\"520\" height=\"292\" \/><\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"os-article__middle\">\n<div class=\"panel-pane pane-entity-field pane-node-body\">\n<div class=\"field field-name-body field-type-text-with-summary field-label-hidden\">\n<div class=\"field-items\">\n<div class=\"field-item even\">\n<p>Infrastructure monitoring is an integral part of 
infrastructure management. It is an IT manager&#8217;s first line of defense against surprise downtime. Severe issues can cause considerable downtime in live infrastructure, sometimes resulting in heavy financial and material losses.<\/p>\n<p>Monitoring collects time-series data from your infrastructure so it can be analyzed to predict upcoming issues with the infrastructure and its underlying components. This gives the IT manager or support staff time to prepare and apply a resolution before a problem occurs.<\/p>\n<p>A good monitoring system provides:<\/p>\n<ol>\n<li>Measurement of the infrastructure&#8217;s performance over time<\/li>\n<li>Node-level analysis and alerts<\/li>\n<li>Network-level analysis and alerts<\/li>\n<li>Downtime analysis and alerts<\/li>\n<li>Answers to the 5 W&#8217;s of incident management and root cause analysis (RCA):\n<ul>\n<li>What was the actual issue?<\/li>\n<li>When did it happen?<\/li>\n<li>Why did it happen?<\/li>\n<li>What was the downtime?<\/li>\n<li>What needs to be done to avoid it in the future?<\/li>\n<\/ul>\n<\/li>\n<\/ol>\n<h2 id=\"building-a-strong-monitoring-system\">Building a strong monitoring system<\/h2>\n<p>A number of tools are available for building a viable, strong monitoring system. The only decision to make is which to use; the answer depends on what you want to achieve with monitoring, as well as on the financial and business factors you must consider.<\/p>\n<p>While some monitoring tools are proprietary, many open source tools, either unmanaged or community-managed software, can do the job as well as, or even better than, closed source options.<\/p>\n<p>In this article, I will focus on open source tools and how to use them to create a strong monitoring architecture.<\/p>\n<h2 id=\"log-collection-and-analysis\">Log collection and analysis<\/h2>\n<p>To say &#8220;logs are helpful&#8221; would be an understatement. 
Logs not only help in debugging issues; they also provide a lot of information to help you predict an upcoming issue.\u00a0Logs are the first door to open when you encounter issues with software components.<\/p>\n<p>Both\u00a0<a href=\"https:\/\/www.fluentd.org\/\" target=\"_blank\" rel=\"noopener\">Fluentd<\/a>\u00a0and\u00a0<a href=\"https:\/\/www.elastic.co\/products\/logstash\" target=\"_blank\" rel=\"noopener\">Logstash<\/a>\u00a0can be used for log collection; the main reason I would choose Fluentd over Logstash is that it does not depend on a Java runtime; it is written in C and Ruby and is widely supported by container runtimes like Docker and orchestration tools like Kubernetes.<\/p>\n<p>Log analytics is the process of analyzing the log data you collect over time and producing real-time logging metrics.\u00a0<a href=\"https:\/\/www.elastic.co\/products\/elasticsearch\" target=\"_blank\" rel=\"noopener\">Elasticsearch<\/a>\u00a0is a powerful tool that can do just that.<\/p>\n<p>Finally, you need a tool that takes these logging metrics and lets you visualize log trends in charts and graphs that are easy to understand.\u00a0<a href=\"https:\/\/www.elastic.co\/products\/kibana\" target=\"_blank\" rel=\"noopener\">Kibana<\/a>\u00a0is my favorite option for that purpose.<\/p>\n<div class=\"media media-element-container media-default media-wysiwyg-align-center\">\n<div id=\"file-420766\" class=\"file file-image file-image-jpeg\">\n<h2 class=\"element-invisible\"><a href=\"https:\/\/opensource.com\/file\/420766\">infrastructure-monitoring_logging.jpeg<\/a><\/h2>\n<div class=\"content\"><img loading=\"lazy\" decoding=\"async\" class=\"media-element file-default\" title=\"Logging workflow\" src=\"https:\/\/opensource.com\/sites\/default\/files\/uploads\/infrastructure-monitoring_logging.jpeg\" alt=\"Logging workflow\" width=\"650\" height=\"217\" data-delta=\"2\" \/><\/p>\n<div class=\"field field-name-field-file-image-caption 
field-type-text-long field-label-hidden\">\n<div class=\"field-items\">\n<div class=\"field-item even\">\n<p class=\"rtecenter\"><sup>Logging workflow<\/sup><\/p>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<p>Because logs can hold sensitive information, here are a few security pointers to remember:<\/p>\n<ul>\n<li>Always transport logs over a secure connection.<\/li>\n<li>The logging\/monitoring infrastructure should be deployed inside a restricted subnet.<\/li>\n<li>Access to monitoring user interfaces (e.g., Kibana and\u00a0<a href=\"https:\/\/grafana.com\/\" target=\"_blank\" rel=\"noopener\">Grafana<\/a>) should be restricted to authenticated stakeholders.<\/li>\n<\/ul>\n<h2 id=\"node-level-metrics\">Node-level metrics<\/h2>\n<p><em>Not everything is logged!<\/em><\/p>\n<p>Yes, you heard that right: logging monitors a piece of software or a process, not every component in the infrastructure.<\/p>\n<p>Operating system disks, externally mounted data disks, Elastic Block Store volumes, CPU, I\/O, network packets, inbound and outbound connections, physical memory, virtual memory, buffer space, and queues are some of the major components that rarely appear in logs unless something fails.<\/p>\n<p>So, how can you collect this data?<\/p>\n<p><a href=\"https:\/\/prometheus.io\/\" target=\"_blank\" rel=\"noopener\">Prometheus<\/a>\u00a0is one answer. You just need to install the relevant exporters on the virtual machine nodes and configure Prometheus to collect time-series data from those otherwise unmonitored components. 
Grafana uses the data Prometheus collects to provide a live visual representation of your node&#8217;s current status.<\/p>\n<p>If you are looking for a simpler solution to collect time-series metrics, consider\u00a0<a href=\"https:\/\/www.elastic.co\/products\/beats\/metricbeat\" target=\"_blank\" rel=\"noopener\">Metricbeat<\/a>,\u00a0<a href=\"https:\/\/www.elastic.co\/\" target=\"_blank\" rel=\"noopener\">Elastic<\/a>&#8217;s open source tool, which can be used with Kibana to replace Prometheus and Grafana.<\/p>\n<h2 id=\"alerts-and-notifications\">Alerts and notifications<\/h2>\n<p>You can&#8217;t take advantage of monitoring without alerts and notifications. Unless stakeholders, wherever they are in the world, receive a notification about an issue, there&#8217;s no way they can analyze and fix the issue, prevent customers from being impacted, and avoid it in the future.<\/p>\n<p>Prometheus evaluates predefined alerting rules and hands firing alerts to its companion\u00a0<a href=\"https:\/\/prometheus.io\/docs\/alerting\/alertmanager\/\" target=\"_blank\" rel=\"noopener\">Alertmanager<\/a>, which sends the notifications; Grafana can also send alerts based on configured rules.\u00a0<a href=\"https:\/\/sensu.io\/\" target=\"_blank\" rel=\"noopener\">Sensu<\/a>\u00a0and\u00a0<a href=\"https:\/\/www.nagios.org\/\" target=\"_blank\" rel=\"noopener\">Nagios<\/a>\u00a0are other open source tools that offer alerting and monitoring services.<\/p>\n<p>The main complaint people have about open source alerting tools is that configuring them can be time-consuming and sometimes seems hard, but once they are set up, these tools can function as well as, or better than, proprietary alternatives.<\/p>\n<p>The biggest advantage of open source tools, though, is that we have control over their behavior.<\/p>\n<h2 id=\"monitoring-workflow-and-architecture\">Monitoring workflow and architecture<\/h2>\n<p>A good monitoring architecture is the backbone of a strong and stable monitoring system. 
It might look something like this diagram.<\/p>\n<div class=\"media media-element-container media-default\">\n<div id=\"file-423236\" class=\"file file-image file-image-png\">\n<h2 class=\"element-invisible\"><a href=\"https:\/\/opensource.com\/file\/423236\">image_2_architecture.png<\/a><\/h2>\n<div class=\"content\"><img loading=\"lazy\" decoding=\"async\" class=\"media-element file-default\" title=\"Devops monitoring architecture\" src=\"https:\/\/opensource.com\/sites\/default\/files\/uploads\/image_2_architecture.png\" alt=\"Devops monitoring architecture\" width=\"876\" height=\"1115\" data-delta=\"3\" \/><\/div>\n<\/div>\n<\/div>\n<p>In the end, you must choose a tool based on your needs and infrastructure. The open source tools discussed in this article are used by many organizations to monitor their infrastructure and keep its uptime high.<\/p>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<p><a href=\"http:\/\/lxer.com\/module\/newswire\/ext_link.php?rid=266218\" target=\"_blank\" rel=\"noopener\">Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>A strong monitoring and alert system based on open source tools prevents problems before they affect your infrastructure. Infrastructure monitoring is an integral part of infrastructure management. It is an IT manager&#8217;s first line of defense against surprise downtime. 
Severe issues can inject considerable downtime to live infrastructure, sometimes causing heavy loss of money and &hellip; <\/p>\n<p class=\"link-more\"><a href=\"https:\/\/www.appservgrid.com\/paw92\/index.php\/2019\/02\/24\/infrastructure-monitoring-defense-against-surprise-downtime\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Infrastructure monitoring: Defense against surprise downtime&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-10108","post","type-post","status-publish","format-standard","hentry","category-linux"],"_links":{"self":[{"href":"https:\/\/www.appservgrid.com\/paw92\/index.php\/wp-json\/wp\/v2\/posts\/10108","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.appservgrid.com\/paw92\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.appservgrid.com\/paw92\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.appservgrid.com\/paw92\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.appservgrid.com\/paw92\/index.php\/wp-json\/wp\/v2\/comments?post=10108"}],"version-history":[{"count":2,"href":"https:\/\/www.appservgrid.com\/paw92\/index.php\/wp-json\/wp\/v2\/posts\/10108\/revisions"}],"predecessor-version":[{"id":11518,"href":"https:\/\/www.appservgrid.com\/paw92\/index.php\/wp-json\/wp\/v2\/posts\/10108\/revisions\/11518"}],"wp:attachment":[{"href":"https:\/\/www.appservgrid.com\/paw92\/index.php\/wp-json\/wp\/v2\/media?parent=10108"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.appservgrid.com\/paw92\/index.php\/wp-json\/wp\/v2\/categories?post=10108"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.appservgrid.com\/paw92\/index.php\/wp-json\/wp\/v2\/tags?pos
t=10108"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}