{"id":1348,"date":"2019-02-19T13:51:48","date_gmt":"2019-02-19T13:51:48","guid":{"rendered":"https:\/\/www.appservgrid.com\/paw93\/?p=1348"},"modified":"2019-03-07T20:08:34","modified_gmt":"2019-03-07T20:08:34","slug":"docker-monitoring-container-monitoring","status":"publish","type":"post","link":"https:\/\/www.appservgrid.com\/paw93\/index.php\/2019\/02\/19\/docker-monitoring-container-monitoring\/","title":{"rendered":"Docker Monitoring | Container Monitoring"},"content":{"rendered":"<h5>A Detailed Overview of Rancher&#8217;s Architecture<\/h5>\n<p>This newly-updated, in-depth guidebook provides a detailed overview of the features and functionality of the new Rancher: an open-source enterprise Kubernetes platform.<\/p>\n<p><a href=\"http:\/\/info.rancher.com\/rancher2-technical-architecture\" target=\"blank\">Get the eBook<\/a><\/p>\n<p><em>Update (October 2017): Gord Sissons revisited this topic and compared<br \/>\nthe top 10 container-monitoring solutions for Rancher in a <a href=\"http:\/\/rancher.com\/comparing-10-container-monitoring-solutions-rancher\/\">recent blog<br \/>\npost<\/a>.<\/em><\/p>\n<p>*Update (October 2016): Our October online meetup demonstrated and<br \/>\ncompared Sysdig, Datadog, and Prometheus in one go. <a href=\"http:\/\/rancher.com\/event\/great-container-monitoring-bake-off-october-online-meetup\/\">Check<br \/>\nout<\/a><br \/>\nthe recording. *<\/p>\n<p>As Docker is used for larger deployments it becomes more important to<br \/>\nget visibility into the status and health of docker environments. In<br \/>\nthis article, I aim to go over some of the common tools used to monitor<br \/>\ncontainers. 
I will be evaluating these tools based on the following<br \/>\ncriteria:<\/p>\n<ol>\n<li>ease of deployment,<\/li>\n<li>level of detail of information presented,<\/li>\n<li>level of aggregation of information from entire deployment,<\/li>\n<li>ability to raise alerts from the data,<\/li>\n<li>ability to monitor non-docker resources, and<\/li>\n<li>cost.<\/li>\n<\/ol>\n<p>This list is by no means comprehensive; however, I have tried to highlight the most common tools as well as the tools that best fit our six evaluation criteria.<\/p>\n<h2>Docker Stats<\/h2>\n<p>All commands in this article have been specifically tested on<br \/>\na RancherOS instance running on Amazon Web Services EC2. However, all<br \/>\ntools presented today should be usable on any Docker deployment.<\/p>\n<p>The first tool I will talk about is Docker itself; you may not be<br \/>\naware that the docker client already provides a rudimentary command-line<br \/>\ntool to inspect containers\u2019 resource consumption. To look at the<br \/>\ncontainer stats run <em>docker stats<\/em> with the name(s) of the running<br \/>\ncontainer(s) for which you would like to see stats. This will present<br \/>\nthe CPU utilization for each container, the memory used and total memory<br \/>\navailable to the container. Note that if you have not limited memory for<br \/>\ncontainers this command will report the total memory of your host. This does<br \/>\nnot mean each of your containers has access to that much memory. 
In<br \/>\naddition, you will also be able to see total data sent and received over<br \/>\nthe network by the container.<\/p>\n<p>$ docker stats determined_shockley determined_wozniak prickly_hypatia<br \/>\nCONTAINER CPU % MEM USAGE\/LIMIT MEM % NET I\/O<br \/>\ndetermined_shockley 0.00% 884 KiB\/1.961 GiB 0.04% 648 B\/648 B<br \/>\ndetermined_wozniak 0.00% 1.723 MiB\/1.961 GiB 0.09% 1.266 KiB\/648 B<br \/>\nprickly_hypatia 0.00% 740 KiB\/1.961 GiB 0.04% 1.898 KiB\/648 B<\/p>\n<p>For a more detailed look at container stats you may also use the Docker<br \/>\nRemote API via netcat (see below). Send an HTTP GET request for<br \/>\n<em>\/containers\/[CONTAINER_NAME]\/stats<\/em> where CONTAINER_NAME is the name of<br \/>\nthe container for which you want to see stats. You can see an example of<br \/>\nthe complete response for a container stats request<br \/>\n<a href=\"https:\/\/gist.github.com\/usmanismail\/0c4922ffec4a0220d385\">here<\/a>. This<br \/>\nwill present more detail than the metrics shown above; for example, you will get<br \/>\ndetails of caches, swap space and other memory statistics. 
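The CPU percentage that docker stats prints is not a field in the raw response; it has to be derived from two consecutive CPU counter samples (the response conveniently embeds the previous sample as precpu_stats). A minimal sketch of that derivation, assuming the field layout of the stats API; the sample numbers below are invented:

```python
# Sketch: reproduce the CPU % figure that `docker stats` prints from two
# raw samples of the /containers/<name>/stats endpoint. Field names
# follow the Docker stats API; the numbers here are made up.

def cpu_percent(prev, cur):
    """CPU utilisation between two consecutive stats samples."""
    cpu_delta = (cur["cpu_stats"]["cpu_usage"]["total_usage"]
                 - prev["cpu_stats"]["cpu_usage"]["total_usage"])
    system_delta = (cur["cpu_stats"]["system_cpu_usage"]
                    - prev["cpu_stats"]["system_cpu_usage"])
    if system_delta <= 0:
        return 0.0
    ncpus = len(cur["cpu_stats"]["cpu_usage"]["percpu_usage"])
    return (cpu_delta / system_delta) * ncpus * 100.0

prev = {"cpu_stats": {"cpu_usage": {"total_usage": 100_000_000,
                                    "percpu_usage": [0, 0]},
                      "system_cpu_usage": 10_000_000_000}}
cur = {"cpu_stats": {"cpu_usage": {"total_usage": 150_000_000,
                                   "percpu_usage": [0, 0]},
                     "system_cpu_usage": 12_000_000_000}}

print(cpu_percent(prev, cur))  # 5.0 (percent, scaled by core count)
```

In a real script you would feed two successive responses (or one response's precpu_stats and cpu_stats) into this function rather than hand-built dictionaries.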
You may<br \/>\nwant to peruse the <a href=\"https:\/\/docs.docker.com\/articles\/runmetrics\/\">Run<br \/>\nMetrics<\/a> section of the<br \/>\nDocker documentation to get an idea of what the metrics mean.<\/p>\n<p>echo -e \"GET \/containers\/[CONTAINER_NAME]\/stats HTTP\/1.0\\r\\n\" | nc -U \/var\/run\/docker.sock<\/p>\n<p>Score Card:<\/p>\n<ol>\n<li>Ease of deployment: *****<\/li>\n<li>Level of detail: *****<\/li>\n<li>Level of aggregation: none<\/li>\n<li>Ability to raise alerts: none<\/li>\n<li>Ability to monitor non-docker resources: none<\/li>\n<li>Cost: Free<\/li>\n<\/ol>\n<h2>CAdvisor<\/h2>\n<p>The docker stats command and the remote API are useful for getting<br \/>\ninformation on the command line; however, if you would like to access<br \/>\nthe information in a graphical interface you will need a tool such as<br \/>\n<a href=\"https:\/\/github.com\/google\/cadvisor\">CAdvisor<\/a>. CAdvisor provides a<br \/>\nvisual representation of the data shown by the docker stats command<br \/>\nearlier. Run the docker command below and go to<br \/>\n<em>http:\/\/&lt;your-hostname&gt;:8080\/<\/em> in the browser of your choice to see<br \/>\nthe CAdvisor interface. You will be shown graphs for overall CPU usage,<br \/>\nmemory usage, network throughput and disk space utilization. You can<br \/>\nthen drill down into the usage statistics for a specific container by<br \/>\nclicking the <em>Docker Containers<\/em> link at the top of the page and then<br \/>\nselecting the container of your choice. 
In addition to these statistics,<br \/>\nCAdvisor also shows the limits, if any, that are placed on the container,<br \/>\nusing the Isolation section.<\/p>\n<p>docker run \\<br \/>\n--volume=\/:\/rootfs:ro \\<br \/>\n--volume=\/var\/run:\/var\/run:rw \\<br \/>\n--volume=\/sys:\/sys:ro \\<br \/>\n--volume=\/var\/lib\/docker\/:\/var\/lib\/docker:ro \\<br \/>\n--publish=8080:8080 \\<br \/>\n--detach=true \\<br \/>\n--name=cadvisor \\<br \/>\ngoogle\/cadvisor:latest<\/p>\n<p><a href=\"http:\/\/cdn.rancher.com\/wp-content\/uploads\/2015\/03\/13030215\/Screen-Shot-2015-03-19-at-11.50.29-PM.png\"><img decoding=\"async\" src=\"http:\/\/cdn.rancher.com\/wp-content\/uploads\/2015\/03\/13030215\/Screen-Shot-2015-03-19-at-11.50.29-PM.png\" alt=\"Screen Shot 2015-03-19 at 11.50.29\nPM\" \/><\/a><\/p>\n<p>CAdvisor is a useful tool that is trivially easy to set up; it saves us<br \/>\nfrom having to ssh into the server to look at resource consumption and<br \/>\nalso produces graphs for us. In addition, the pressure gauges provide a<br \/>\nquick overview of when your cluster needs additional resources.<br \/>\nFurthermore, unlike other options in this article, CAdvisor is free, as it<br \/>\nis open source, and it runs on hardware already provisioned for your<br \/>\ncluster; other than some processing resources there is no additional<br \/>\ncost of running CAdvisor. However, it has its limitations: it can only<br \/>\nmonitor one docker host, and hence if you have a multi-node deployment<br \/>\nyour stats will be disjoint and spread throughout your cluster. Note<br \/>\nthat you can use<br \/>\n<a href=\"https:\/\/github.com\/GoogleCloudPlatform\/heapster\">heapster<\/a> to monitor<br \/>\nmultiple nodes if you are running Kubernetes. The data in the charts is<br \/>\na moving window of one minute only and there is no way to look at longer<br \/>\nterm trends. There is no mechanism to kick off alerting if the resource<br \/>\nusage is at dangerous levels. 
If you currently do not have any<br \/>\nvisibility into the resource consumption of your docker node\/cluster<br \/>\nthen CAdvisor is a good first step into container monitoring; however, if<br \/>\nyou intend to run any critical tasks on your containers, a more robust<br \/>\ntool or approach is needed. Note that<br \/>\n<a href=\"http:\/\/rancher.com\/rancher-io\/\">Rancher<\/a> runs CAdvisor on<br \/>\neach connected host, and exposes a limited set of stats through the UI,<br \/>\nand all of the system stats through the API.<\/p>\n<p>Score Card (ignoring heapster because it is only supported on Kubernetes):<\/p>\n<ol>\n<li>Ease of deployment: *****<\/li>\n<li>Level of detail: **<\/li>\n<li>Level of aggregation: *<\/li>\n<li>Ability to raise alerts: none<\/li>\n<li>Ability to monitor non-docker resources: none<\/li>\n<li>Cost: Free<\/li>\n<\/ol>\n<h2>Scout<a href=\"http:\/\/cdn.rancher.com\/wp-content\/uploads\/2015\/03\/13030215\/Screen-Shot-2015-03-21-at-9.30.08-AM.png\"><img decoding=\"async\" src=\"http:\/\/cdn.rancher.com\/wp-content\/uploads\/2015\/03\/13030215\/Screen-Shot-2015-03-21-at-9.30.08-AM.png\" alt=\"Screen Shot 2015-03-21 at 9.30.08 AM\" \/><\/a><\/h2>\n<p>The next approach for docker monitoring is Scout, and it addresses<br \/>\nseveral of the limitations of CAdvisor. Scout is a hosted monitoring service<br \/>\nwhich can aggregate metrics from many hosts and containers and present<br \/>\nthe data over longer time-scales. It can also create alerts based on<br \/>\nthose metrics. The first step to getting Scout running is to sign up<br \/>\nfor a Scout account at <a href=\"https:\/\/scoutapp.com\/\">https:\/\/scoutapp.com\/<\/a>; the free trial account<br \/>\nshould be suitable for testing out integration. 
Once you have created<br \/>\nyour account and logged in, click on your account name in the top right<br \/>\ncorner and then <em>Account Basics<\/em>, and take note of your Account Key as<br \/>\nyou will need this to send metrics from our docker server.<\/p>\n<p><a href=\"http:\/\/cdn.rancher.com\/wp-content\/uploads\/2015\/03\/13030215\/accountid.png\"><img decoding=\"async\" src=\"http:\/\/cdn.rancher.com\/wp-content\/uploads\/2015\/03\/13030215\/accountid.png\" alt=\"accountid\" \/><\/a>Now<br \/>\non your host, create a file called <em>scoutd.yml<\/em> and copy the following<br \/>\ntext into the file, replacing the <em>account_key<\/em> with the key you<br \/>\ntook note of earlier. You can specify any values that make sense for the<br \/>\nhost, display_name, environment and roles properties. These will be<br \/>\nused to separate out the metrics when they are presented in the scout<br \/>\ndashboard. I am assuming an array of web-servers is run on docker so I<br \/>\nwill use the values shown below.<\/p>\n<p># account_key is the only required value<br \/>\naccount_key: YOUR_ACCOUNT_KEY<br \/>\nhostname: web01-host<br \/>\ndisplay_name: web01<br \/>\nenvironment: production<br \/>\nroles: web<\/p>\n<p>You can now bring up your scout agent with the scout configuration file<br \/>\nby using the docker-scout container.<\/p>\n<p>docker run -d --name scout-agent \\<br \/>\n-v \/proc:\/host\/proc:ro \\<br \/>\n-v \/etc\/mtab:\/host\/etc\/mtab:ro \\<br \/>\n-v \/var\/run\/docker.sock:\/host\/var\/run\/docker.sock:ro \\<br \/>\n-v `pwd`\/scoutd.yml:\/etc\/scout\/scoutd.yml \\<br \/>\n-v \/sys\/fs\/cgroup\/:\/host\/sys\/fs\/cgroup\/ \\<br \/>\n--net=host --privileged \\<br \/>\nscoutapp\/docker-scout<\/p>\n<p>Now go back to the Scout web view and you should see an entry for your<br \/>\nagent, which will be keyed by the display_name parameter (web01) that<br \/>\nyou specified in your scoutd.yml earlier.<\/p>\n<p><a 
href=\"http:\/\/cdn.rancher.com\/wp-content\/uploads\/2015\/03\/13030215\/Screen-Shot-2015-03-21-at-9.58.40-AM.png\"><img decoding=\"async\" src=\"http:\/\/cdn.rancher.com\/wp-content\/uploads\/2015\/03\/13030215\/Screen-Shot-2015-03-21-at-9.58.40-AM.png\" alt=\"Screen Shot 2015-03-21 at 9.58.40\nAM\" \/><\/a>If<br \/>\nyou click the display name it will show detailed metrics for the<br \/>\nhost. This includes the process count, CPU usage and memory utilization<br \/>\nfor everything running on your host. Note these are not limited to<br \/>\nprocesses running inside docker.<\/p>\n<p><a href=\"http:\/\/cdn.rancher.com\/wp-content\/uploads\/2015\/03\/13030215\/Screen-Shot-2015-03-21-at-10.00.47-AM.png\"><img decoding=\"async\" src=\"http:\/\/cdn.rancher.com\/wp-content\/uploads\/2015\/03\/13030215\/Screen-Shot-2015-03-21-at-10.00.47-AM.png\" alt=\"Screen Shot 2015-03-21 at 10.00.47\nAM\" \/><\/a><\/p>\n<p>To add docker monitoring to your servers, click the Roles tab and then<br \/>\nselect <em>All Servers<\/em>. Now click the <em>+ Plugin Template<\/em> button and then<br \/>\n<em>Docker Monitor<\/em> from the following screen to load the details view.<br \/>\nOnce you have the details view up, select Install Plugin to add the<br \/>\nplugin to your hosts. In the following screen, give a name to the plugin<br \/>\ninstallation and specify which containers you want to monitor. If you<br \/>\nleave the field blank the plugin will monitor all of the containers on<br \/>\nthe host. Click complete installation and after a minute or so you can<br \/>\ngo to [Server Name] &gt; Plugins to see details from the docker monitor<br \/>\nplugin. 
The plugin shows the CPU usage, memory usage, network throughput<br \/>\nand the number of containers for each host.<\/p>\n<p><a href=\"http:\/\/cdn.rancher.com\/wp-content\/uploads\/2015\/03\/13030215\/Screen-Shot-2015-03-20-at-10.11.06-PM.png\"><img decoding=\"async\" src=\"http:\/\/cdn.rancher.com\/wp-content\/uploads\/2015\/03\/13030215\/Screen-Shot-2015-03-20-at-10.11.06-PM.png\" alt=\"Screen Shot 2015-03-20 at 10.11.06\nPM\" \/><\/a><\/p>\n<p><a href=\"http:\/\/cdn.rancher.com\/wp-content\/uploads\/2015\/03\/13030215\/Screen-Shot-2015-03-20-at-10.11.39-PM.png\"><img decoding=\"async\" src=\"http:\/\/cdn.rancher.com\/wp-content\/uploads\/2015\/03\/13030215\/Screen-Shot-2015-03-20-at-10.11.39-PM.png\" alt=\"Screen Shot 2015-03-20 at 10.11.39\nPM\" \/><\/a>If<br \/>\nyou click on any of the graphs you can pull up a detailed view of the<br \/>\nmetrics, and this view allows you to see the trends in the metric values<br \/>\nacross a longer time span. This view also allows you to filter the<br \/>\nmetrics based on environment and server role. In addition, you can create<br \/>\n\u201cTriggers\u201d or alerts to send emails to you if metrics go above or<br \/>\nbelow a configured threshold. This allows you to set up automated alerts<br \/>\nto notify you if, for example, some of your containers die and the<br \/>\ncontainer count falls below a certain number. You can also set up alerts<br \/>\nfor average CPU utilization, so if, for example, your containers are<br \/>\nrunning hot you will get an alert and you can add more hosts<br \/>\nto your docker cluster. 
To create a trigger, select <a href=\"http:\/\/cdn.rancher.com\/wp-content\/uploads\/2015\/03\/13030215\/Screen-Shot-2015-03-22-at-6.30.25-PM.png\"><img decoding=\"async\" src=\"http:\/\/cdn.rancher.com\/wp-content\/uploads\/2015\/03\/13030215\/Screen-Shot-2015-03-22-at-6.30.25-PM.png\" alt=\"Screen Shot\n2015-03-22 at 6.30.25\nPM\" \/><\/a><em>Roles<\/em><br \/>\n&gt; <em>All Servers<\/em> from the top menu and then <em>docker monitor<\/em> from the<br \/>\nplugins section. Then select <em>triggers<\/em> from the <em>Plugin template<br \/>\nAdministration<\/em> menu on the right-hand side of the screen. You should<br \/>\nnow see an option to \u201c<em>Add a Trigger<\/em>\u201d, which will apply to the entire<br \/>\ndeployment. Below is an example of a trigger which will send out an<br \/>\nalert if the number of containers in the deployment falls below 3. The<br \/>\nalert was created for \u201cAll Servers\u201d; however, you could tag your hosts<br \/>\nwith different roles using the scoutd.yml created on the server. Using<br \/>\nthe roles you can apply triggers to a subset of the servers in your<br \/>\ndeployment. For example, you could set up an alert for when the number of<br \/>\ncontainers on your web nodes falls below a certain number. Even with the<br \/>\nrole-based triggers I still feel that Scout alerting could be better.<br \/>\nThis is because many docker deployments have heterogeneous containers on<br \/>\nthe same host. 
In such a scenario it would be impossible to set up<br \/>\ntriggers for specific types of containers, as roles are applied to all<br \/>\ncontainers on the host.<\/p>\n<p><a href=\"http:\/\/cdn.rancher.com\/wp-content\/uploads\/2015\/03\/13030215\/Screen-Shot-2015-03-22-at-6.33.12-PM.png\"><img decoding=\"async\" src=\"http:\/\/cdn.rancher.com\/wp-content\/uploads\/2015\/03\/13030215\/Screen-Shot-2015-03-22-at-6.33.12-PM.png\" alt=\"Screen Shot 2015-03-22 at 6.33.12\nPM\" \/><\/a><\/p>\n<p>Another advantage of using Scout over CAdvisor is that it has a <a href=\"https:\/\/scoutapp.com\/plugin_urls\">large<br \/>\nset of plugins<\/a> which can pull in<br \/>\nother data about your deployment in addition to docker information. This<br \/>\nallows Scout to be your one-stop monitoring system instead of having a<br \/>\ndifferent monitoring system for various resources in your system.<\/p>\n<p>One drawback of Scout is that it does not present detailed information<br \/>\nabout individual containers on each host like CAdvisor can. This is<br \/>\nproblematic if you are running heterogeneous containers on the same<br \/>\nserver. For example, if you want a trigger to alert you about issues in<br \/>\nyour web containers but not about your Jenkins containers, Scout will not<br \/>\nbe able to support that use case. Despite these drawbacks, Scout is a<br \/>\nsignificantly more useful tool for monitoring your docker deployments.<br \/>\nHowever, this does come at a cost: ten dollars per monitored host. 
The<br \/>\ncost could be a factor if you are running a large deployment with many<br \/>\nhosts.<\/p>\n<p>Score Card:<\/p>\n<ol>\n<li>Ease of deployment: ****<\/li>\n<li>Level of detail: **<\/li>\n<li>Level of aggregation: ***<\/li>\n<li>Ability to raise alerts: ***<\/li>\n<li>Ability to monitor non-docker resources: Supported<\/li>\n<li>Cost: $10 \/ host<\/li>\n<\/ol>\n<h2>Data Dog<\/h2>\n<p>From Scout let\u2019s move to another monitoring service, DataDog, which<br \/>\naddresses several of the shortcomings of Scout as well as all of the<br \/>\nlimitations of CAdvisor. To get started with DataDog, first sign up for<br \/>\na DataDog account at <a href=\"https:\/\/www.datadoghq.com\/\">https:\/\/www.datadoghq.com\/<\/a>. Once you are signed<br \/>\ninto your account you will be presented with a list of supported<br \/>\nintegrations, with instructions for each type. Select <em>docker<\/em> from the<br \/>\nlist and you will be given a docker run command (shown below) to copy<br \/>\nonto your host. The command will have your API key preconfigured and<br \/>\nhence can be run as listed. 
After about 45 seconds your<br \/>\nagent will start reporting metrics to the DataDog system.<\/p>\n<p>docker run -d --privileged --name dd-agent \\<br \/>\n-h `hostname` \\<br \/>\n-v \/var\/run\/docker.sock:\/var\/run\/docker.sock \\<br \/>\n-v \/proc\/mounts:\/host\/proc\/mounts:ro \\<br \/>\n-v \/sys\/fs\/cgroup\/:\/host\/sys\/fs\/cgroup:ro \\<br \/>\n-e API_KEY=YOUR_API_KEY datadog\/docker-dd-agent<\/p>\n<p>Now that your containers are connected, you can go to the <em>Events<\/em> tab in<br \/>\nthe DataDog web console and see all events pertaining to your cluster.<br \/>\nAll container launches and terminations will be part of this event<br \/>\nstream.<\/p>\n<p><a href=\"http:\/\/cdn.rancher.com\/wp-content\/uploads\/2015\/03\/13030215\/Screen-Shot-2015-03-21-at-2.56.04-PM.png\"><img decoding=\"async\" src=\"http:\/\/cdn.rancher.com\/wp-content\/uploads\/2015\/03\/13030215\/Screen-Shot-2015-03-21-at-2.56.04-PM.png\" alt=\"Screen Shot 2015-03-21 at 2.56.04\nPM\" \/><\/a><\/p>\n<p>You can also click the <em>Dashboards<\/em> tab and hit create dashboards to<br \/>\naggregate metrics across your entire cluster. Datadog collects metrics<br \/>\nabout CPU usage, memory and I\/O for all containers running in the<br \/>\nsystem. In addition, you get counts of running and stopped containers as<br \/>\nwell as counts of docker images. The dashboard view allows you to create<br \/>\ngraphs for any metric or set of metrics across the entire deployment or<br \/>\ngrouped by host or container image. 
For example, the graph below shows<br \/>\nthe number of running containers broken down by the image type; I am<br \/>\nrunning 9 ubuntu:14.04 containers in my cluster at the moment.<\/p>\n<p><a href=\"http:\/\/cdn.rancher.com\/wp-content\/uploads\/2015\/03\/13030215\/Screen-Shot-2015-03-21-at-2.35.21-PM.png\"><img decoding=\"async\" src=\"http:\/\/cdn.rancher.com\/wp-content\/uploads\/2015\/03\/13030215\/Screen-Shot-2015-03-21-at-2.35.21-PM.png\" alt=\"Screen Shot 2015-03-21 at 2.35.21\nPM\" \/><\/a><br \/>\nYou could also split the same data by Hosts; as the second graph shows,<br \/>\n7 of the containers are running on my Rancher host and the remaining<br \/>\nones on my local laptop.<\/p>\n<p><a href=\"http:\/\/cdn.rancher.com\/wp-content\/uploads\/2015\/03\/13030215\/Screen-Shot-2015-03-21-at-3.14.10-PM.png\"><img decoding=\"async\" src=\"http:\/\/cdn.rancher.com\/wp-content\/uploads\/2015\/03\/13030215\/Screen-Shot-2015-03-21-at-3.14.10-PM.png\" alt=\"Screen Shot 2015-03-21 at 3.14.10\nPM\" \/><\/a><\/p>\n<p>Data Dog also supports alerting using a feature called <em>Monitors<\/em>. A<br \/>\nmonitor is DataDog\u2019s equivalent to a Scout trigger and allows you to<br \/>\ndefine thresholds for various metrics. DataDog\u2019s alerting system is a<br \/>\nlot more flexible and detailed than Scout\u2019s. 
The example below shows<br \/>\nhow to specify that you are concerned about Ubuntu containers<br \/>\nterminating; hence you would monitor the docker.containers.running metric<br \/>\nfor containers created from the ubuntu:14.04 docker image.<\/p>\n<p><a href=\"http:\/\/cdn.rancher.com\/wp-content\/uploads\/2015\/03\/13030215\/Screen-Shot-2015-03-22-at-6.49.53-PM.png\"><img decoding=\"async\" src=\"http:\/\/cdn.rancher.com\/wp-content\/uploads\/2015\/03\/13030215\/Screen-Shot-2015-03-22-at-6.49.53-PM.png\" alt=\"Screen Shot 2015-03-22 at 6.49.53\nPM\" \/><\/a><\/p>\n<p>Then specify the alert conditions to say that if there are fewer than<br \/>\nten ubuntu containers in our deployment (on average) for the last 5<br \/>\nminutes, you would like to be alerted. Although not shown here, you will<br \/>\nalso be asked to specify the text of the message which is sent out when<br \/>\nthis alert is triggered, as well as the target audience for this alert.<br \/>\nIn the current example I am using a simple absolute threshold. You can<br \/>\nalso specify a delta-based alert, which would trigger if, say, the average<br \/>\nstopped-container count over the last five minutes rose above four.<\/p>\n<p><a href=\"http:\/\/cdn.rancher.com\/wp-content\/uploads\/2015\/03\/13030215\/Screen-Shot-2015-03-22-at-6.49.58-PM.png\"><img decoding=\"async\" src=\"http:\/\/cdn.rancher.com\/wp-content\/uploads\/2015\/03\/13030215\/Screen-Shot-2015-03-22-at-6.49.58-PM.png\" alt=\"Screen Shot 2015-03-22 at 6.49.58\nPM\" \/><\/a><\/p>\n<p>Lastly, using the <em>Metrics Explorer<\/em> tab you can make ad-hoc<br \/>\naggregations over your metrics to help debug issues or extract specific<br \/>\ninformation from your data. This view allows you to graph any metric<br \/>\nover a slice based on container image or host. 
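Conceptually, an averaged threshold alert of the kind described above just reduces the most recent datapoints and compares the result against the threshold. A toy sketch of that logic (the class and the readings are invented for illustration, not part of DataDog's API):

```python
from collections import deque

class ThresholdMonitor:
    """Toy 'average over last N samples below threshold' alert."""

    def __init__(self, window, threshold):
        self.samples = deque(maxlen=window)  # keeps only the last N values
        self.threshold = threshold

    def record(self, value):
        """Record a datapoint; return True if the alert should fire."""
        self.samples.append(value)
        full = len(self.samples) == self.samples.maxlen
        avg = sum(self.samples) / len(self.samples)
        return full and avg < self.threshold

monitor = ThresholdMonitor(window=5, threshold=10)
readings = [12, 11, 10, 9, 8, 7]  # running-container counts, one per minute
fired = [monitor.record(r) for r in readings]
print(fired)  # the alert fires once the 5-sample average drops below 10
```

The hosted services add the hard parts on top of this (persistence, grouping by host or image, notification routing), but the trigger condition itself reduces to this comparison.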
You may combine output<br \/>\ninto a single graph or generate a set of graphs by grouping across<br \/>\nimages or hosts.<\/p>\n<p><a href=\"http:\/\/cdn.rancher.com\/wp-content\/uploads\/2015\/03\/13030215\/Screen-Shot-2015-03-21-at-2.40.30-PM.png\"><img decoding=\"async\" src=\"http:\/\/cdn.rancher.com\/wp-content\/uploads\/2015\/03\/13030215\/Screen-Shot-2015-03-21-at-2.40.30-PM.png\" alt=\"Screen Shot 2015-03-21 at 2.40.30\nPM\" \/><\/a><br \/>\nDataDog is a significant improvement over Scout in terms of feature set,<br \/>\nease of use and user-friendly design. However, this level of polish comes<br \/>\nwith additional cost, as each DataDog agent costs $15.<\/p>\n<p>Score Card:<\/p>\n<ol>\n<li>Ease of deployment: *****<\/li>\n<li>Level of detail: *****<\/li>\n<li>Level of aggregation: *****<\/li>\n<li>Ability to raise alerts: Supported<\/li>\n<li>Ability to monitor non-docker resources: *****<\/li>\n<li>Cost: $15 \/<br \/>\nhost<a href=\"http:\/\/cdn.rancher.com\/wp-content\/uploads\/2015\/03\/13030215\/sensu.png\"><img decoding=\"async\" src=\"http:\/\/cdn.rancher.com\/wp-content\/uploads\/2015\/03\/13030215\/sensu.png\" alt=\"sensu\" \/><\/a><\/li>\n<\/ol>\n<h2>Sensu Monitoring Framework<\/h2>\n<p>Scout and Datadog provide centralized monitoring and alerting; however,<br \/>\nboth are hosted services that can get expensive for large deployments.<br \/>\nIf you need a self-hosted, centralized metrics service, you may<br \/>\nconsider the <a href=\"http:\/\/sensuapp.org\/\">sensu open source monitoring<br \/>\nframework<\/a>. To run the Sensu server you can use<br \/>\nthe <a href=\"https:\/\/registry.hub.docker.com\/u\/hiroakis\/docker-sensu-server\/\">hiroakis\/docker-sensu-server<\/a><br \/>\ncontainer. This container installs sensu-server, the uchiwa web<br \/>\ninterface, redis, rabbitmq-server, and the sensu-api. Unfortunately,<br \/>\nsensu does not have any docker support out of the box. 
However, using<br \/>\nthe plugin system you can configure support for both container metrics<br \/>\nand status checks.<\/p>\n<p>Before launching your sensu server container you must define a check that<br \/>\nyou can load into the server. Create a file called <em>check-docker.json<\/em><br \/>\nand add the following contents to the file. In this file you are<br \/>\ntelling the Sensu server to run a script called <em>load-docker-metrics.sh<\/em><br \/>\nevery ten seconds on all clients which are subscribed to the docker tag.<br \/>\nYou will define this script a little later.<\/p>\n<p>{<br \/>\n\"checks\": {<br \/>\n\"load_docker_metrics\": {<br \/>\n\"type\": \"metric\",<br \/>\n\"command\": \"load-docker-metrics.sh\",<br \/>\n\"subscribers\": [<br \/>\n\"docker\"<br \/>\n],<br \/>\n\"interval\": 10<br \/>\n}<br \/>\n}<br \/>\n}<\/p>\n<p>Now you can run the sensu server docker container with our check<br \/>\nconfiguration file using the command below. Once you run the command you<br \/>\nshould be able to launch the uchiwa dashboard at<br \/>\n<a href=\"http:\/\/YOUR_SERVER_IP:3000\">http:\/\/YOUR_SERVER_IP:3000<\/a> in your browser.<\/p>\n<p>docker run -d --name sensu-server \\<br \/>\n-p 3000:3000 \\<br \/>\n-p 4567:4567 \\<br \/>\n-p 5671:5671 \\<br \/>\n-p 15672:15672 \\<br \/>\n-v $PWD\/check-docker.json:\/etc\/sensu\/conf.d\/check-docker.json \\<br \/>\nhiroakis\/docker-sensu-server<\/p>\n<p>Now that the sensu server is up, you can launch sensu clients on each of<br \/>\nthe hosts running our docker containers. You told the server that the<br \/>\nclients will have a script called <em>load-docker-metrics.sh<\/em>, so let\u2019s<br \/>\ncreate the script and insert it into our client containers. Create the<br \/>\nfile and add the text shown below into the file, replacing HOST_NAME<br \/>\nwith a logical name for your host. 
The script below uses the<br \/>\n<a href=\"https:\/\/docs.docker.com\/reference\/api\/docker_remote_api_v1.17\/\">Docker Remote<br \/>\nAPI<\/a> to<br \/>\npull in the metadata for running containers, all containers and all<br \/>\nimages on the host. It then prints the values out using sensu\u2019s key<br \/>\nvalue notation. The sensu server will read the output values from<br \/>\nSTDOUT and collect those metrics. This example only pulls these three<br \/>\nvalues, but you could make the script as detailed as required. Note that<br \/>\nyou could also add multiple check scripts such as this, as long as you<br \/>\nreference them in the server configuration file you created earlier. You<br \/>\ncan also define that you want the check to fail if the number of running<br \/>\ncontainers ever falls below three. You can make a check fail by<br \/>\nreturning a non-zero value from the check script.<\/p>\n<p>#!\/bin\/bash<br \/>\nset -e<\/p>\n<p># Count all running containers<br \/>\nrunning_containers=$(echo -e \"GET \/containers\/json HTTP\/1.0\\r\\n\" | nc -U \/var\/run\/docker.sock \\<br \/>\n| tail -n +5 \\<br \/>\n| python -m json.tool \\<br \/>\n| grep \"Id\" \\<br \/>\n| wc -l)<br \/>\n# Count all containers<br \/>\ntotal_containers=$(echo -e \"GET \/containers\/json?all=1 HTTP\/1.0\\r\\n\" | nc -U \/var\/run\/docker.sock \\<br \/>\n| tail -n +5 \\<br \/>\n| python -m json.tool \\<br \/>\n| grep \"Id\" \\<br \/>\n| wc -l)<\/p>\n<p># Count all images<br \/>\ntotal_images=$(echo -e \"GET \/images\/json HTTP\/1.0\\r\\n\" | nc -U \/var\/run\/docker.sock \\<br \/>\n| tail -n +5 \\<br \/>\n| python -m json.tool \\<br \/>\n| grep \"Id\" \\<br \/>\n| wc -l)<\/p>\n<p>echo \"docker.HOST_NAME.running_containers $running_containers\"<br \/>\necho \"docker.HOST_NAME.total_containers $total_containers\"<br \/>\necho \"docker.HOST_NAME.total_images $total_images\"<\/p>\n<p>if [ $running_containers -lt 3 ]; then<br \/>\nexit 1;<br \/>\nfi<\/p>\n<p>Now that you have defined your load docker 
metrics check, you need to<br \/>\nstart the sensu client using the<br \/>\n<a href=\"https:\/\/registry.hub.docker.com\/u\/usman\/sensu-client\">usman\/sensu-client<\/a><br \/>\ncontainer I defined for this purpose. You can use the command shown<br \/>\nbelow to launch the sensu client. Note that the container must run as<br \/>\nprivileged in order to be able to access unix sockets, and it must have the<br \/>\ndocker socket mounted in as a volume, as well as the<br \/>\n<em>load-docker-metrics.sh<\/em> script you defined above. Make sure the<br \/>\nload-docker-metrics.sh script is marked as executable on your host<br \/>\nmachine, as the permissions carry through into the container. The<br \/>\ncontainer also takes in SENSU_SERVER_IP, RABIT_MQ_USER,<br \/>\nRABIT_MQ_PASSWORD, CLIENT_NAME and CLIENT_IP as parameters; please<br \/>\nspecify the values of these parameters for your setup. The default values<br \/>\nfor RABIT_MQ_USER and RABIT_MQ_PASSWORD are <em>sensu<\/em> and <em>password<\/em>.<\/p>\n<p>docker run -d --name sensu-client --privileged \\<br \/>\n-v $PWD\/load-docker-metrics.sh:\/etc\/sensu\/plugins\/load-docker-metrics.sh \\<br \/>\n-v \/var\/run\/docker.sock:\/var\/run\/docker.sock \\<br \/>\nusman\/sensu-client SENSU_SERVER_IP RABIT_MQ_USER RABIT_MQ_PASSWORD CLIENT_NAME CLIENT_IP<\/p>\n<p><a href=\"http:\/\/cdn.rancher.com\/wp-content\/uploads\/2015\/03\/13030215\/Screen-Shot-2015-03-28-at-10.13.48-PM.png\"><img decoding=\"async\" src=\"http:\/\/cdn.rancher.com\/wp-content\/uploads\/2015\/03\/13030215\/Screen-Shot-2015-03-28-at-10.13.48-PM.png\" alt=\"Screen Shot 2015-03-28 at 10.13.48\nPM\" \/><\/a><\/p>\n<p>A few seconds after running this command you should see the client<br \/>\ncount increase to 1 in the uchiwa dashboard. If you click the clients<br \/>\nicon you should see a list of your clients, including the client that you<br \/>\njust added. 
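As an aside, the grep-and-wc counting that load-docker-metrics.sh performs can be done more robustly by parsing the /containers/json payload as JSON. A small Python sketch against a canned response (the shape follows the Docker Remote API; the IDs, images, and names are made up):

```python
import json

# Canned /containers/json response; a real script would read this from
# the docker unix socket instead. The entries here are invented.
payload = json.dumps([
    {"Id": "4f66ad9067b2", "Image": "ubuntu:14.04", "Names": ["/web01"]},
    {"Id": "8a193226ea92", "Image": "ubuntu:14.04", "Names": ["/web02"]},
])

def count_containers(body):
    """Equivalent of the script's `... | grep "Id" | wc -l` pipeline."""
    return len(json.loads(body))

running = count_containers(payload)
print(f"docker.HOST_NAME.running_containers {running}")
# A real check would exit non-zero here when `running` < 3,
# which is what marks the check as failed in Sensu.
```

Parsing the JSON avoids the false matches that a plain grep for "Id" could produce if that substring ever appeared in another field.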
I named my client <em>client-1<\/em> and specified the host IP as<br \/>\n192.168.1.1.<\/p>\n<p><a href=\"http:\/\/cdn.rancher.com\/wp-content\/uploads\/2015\/03\/13030215\/Screen-Shot-2015-03-28-at-10.13.54-PM.png\"><img decoding=\"async\" src=\"http:\/\/cdn.rancher.com\/wp-content\/uploads\/2015\/03\/13030215\/Screen-Shot-2015-03-28-at-10.13.54-PM.png\" alt=\"Screen Shot 2015-03-28 at 10.13.54\nPM\" \/><\/a><\/p>\n<p>If you click on the client name you should get further details of the<br \/>\nchecks. You can see that the load_docker_metrics check was run at<br \/>\n10:22 on the 28th of March.<\/p>\n<p><a href=\"http:\/\/cdn.rancher.com\/wp-content\/uploads\/2015\/03\/13030215\/Screen-Shot-2015-03-28-at-10.14.00-PM.png\"><img decoding=\"async\" src=\"http:\/\/cdn.rancher.com\/wp-content\/uploads\/2015\/03\/13030215\/Screen-Shot-2015-03-28-at-10.14.00-PM.png\" alt=\"Screen Shot 2015-03-28 at 10.14.00\nPM\" \/><\/a>If<br \/>\nyou click on the check name you can see further details of check runs.<br \/>\nThe zeros indicate that there were no errors; if the script had failed<br \/>\n(if, for example, your docker daemon died) you would see a non-zero error<br \/>\ncode. Although it is not covered in the current article,<br \/>\nyou can also set up sensu to alert you when these checks fail using<br \/>\n<a href=\"http:\/\/sensuapp.org\/docs\/0.11\/adding_a_handler\">Handlers<\/a>. Furthermore,<br \/>\nuchiwa only shows the values of checks and not the metrics collected.<br \/>\nNote that sensu does not store the collected metrics; they have to be<br \/>\nforwarded to a time series database such as InfluxDB or Graphite. This<br \/>\nis also done through Handlers. 
Please find details of how to configure<br \/>\nmetric forwarding to Graphite<br \/>\n<a href=\"http:\/\/www.joemiller.me\/2012\/02\/02\/sensu-and-graphite\/\">here<\/a>.<\/p>\n<p><a href=\"http:\/\/cdn.rancher.com\/wp-content\/uploads\/2015\/03\/13030215\/Screen-Shot-2015-03-28-at-10.27.59-PM.png\"><img decoding=\"async\" src=\"http:\/\/cdn.rancher.com\/wp-content\/uploads\/2015\/03\/13030215\/Screen-Shot-2015-03-28-at-10.27.59-PM.png\" alt=\"Screen Shot 2015-03-28 at 10.27.59\nPM\" \/><\/a><\/p>\n<p>Sensu ticks all the boxes in our evaluation criteria; you can collect as<br \/>\nmuch detail about your docker containers and hosts as you want. In<br \/>\naddition you are able to aggregate the values from all of your hosts in one<br \/>\nplace and raise alerts over those checks. The alerting is not as<br \/>\nadvanced as DataDog or Scout, as you are only able to alert on checks<br \/>\nfailing on individual hosts. However, the big drawback of Sensu is<br \/>\ndifficulty of deployment. Although I have automated many steps in the<br \/>\ndeployment using docker containers, Sensu remains a complicated system<br \/>\nrequiring us to install, launch and maintain separate processes for<br \/>\nRedis, RabbitMQ, Sensu API, uchiwa and Sensu Core. Furthermore, you would<br \/>\nrequire still more tools such as Graphite to present metric values, and a<br \/>\nproduction deployment would require customizing the containers I have<br \/>\nused today for secure passwords and custom ssl certificates. In addition,<br \/>\nwere you to add more checks after launching the container you would have<br \/>\nto restart the Sensu server, as that is the only way for it to start<br \/>\ncollecting new metrics. 
For these reasons I rate Sensu fairly low for<br \/>\nease of deployment.<\/p>\n<p>Score Card:<\/p>\n<ol>\n<li>Ease of deployment: *<\/li>\n<li>Level of detail: ****<\/li>\n<li>Level of aggregation: ****<\/li>\n<li>Ability to raise alerts: Supported but limited<\/li>\n<li>Ability to monitor non-docker resources: *****<\/li>\n<li>Cost: Free<\/li>\n<\/ol>\n<p><em>I also evaluated two other monitoring services, Prometheus and Sysdig<br \/>\nCloud, in a<br \/>\n<a href=\"http:\/\/rancher.com\/docker-monitoring-continued-prometheus-and-sysdig\/\">second article<\/a>,<br \/>\nand have included them in this post for simplicity.<\/em><\/p>\n<h2>Prometheus<\/h2>\n<p>First, let\u2019s take a look at Prometheus; it is a self-hosted set of tools<br \/>\nwhich collectively provide metrics storage, aggregation, visualization<br \/>\nand alerting. Most of the tools and services we have looked at so far<br \/>\nhave been push based, i.e. agents on the monitored servers talk to a<br \/>\ncentral server (or set of servers) and send out their metrics.<br \/>\nPrometheus, on the other hand, is a pull based server which expects<br \/>\nmonitored servers to provide a web interface from which it can scrape<br \/>\ndata. There are several <a href=\"http:\/\/prometheus.io\/docs\/instrumenting\/exporters\/\">exporters<br \/>\navailable<\/a> for<br \/>\nPrometheus which will capture metrics and then expose them over http for<br \/>\nPrometheus to scrape. In addition there are<br \/>\n<a href=\"http:\/\/prometheus.io\/docs\/instrumenting\/clientlibs\/\">libraries<\/a> which<br \/>\ncan be used to create custom exporters. As we are concerned with<br \/>\nmonitoring docker containers we will use the<br \/>\n<a href=\"https:\/\/github.com\/docker-infra\/container_exporter\">container_exporter<\/a><br \/>\nto capture metrics. 
Use the command shown below to bring up the<br \/>\ncontainer-exporter docker container and browse to<br \/>\n<em><a href=\"http:\/\/MONITORED_SERVER_IP:9104\/metrics\">http:\/\/MONITORED_SERVER_IP:9104\/metrics<\/a><\/em> to see the metrics it has<br \/>\ncollected for you. You should launch exporters on all servers in your<br \/>\ndeployment. Keep track of the respective <em>MONITORED_SERVER_IP<\/em>s as we<br \/>\nwill be using them later in the configuration for Prometheus.<\/p>\n<p>docker run -p 9104:9104 -v \/sys\/fs\/cgroup:\/cgroup -v \/var\/run\/docker.sock:\/var\/run\/docker.sock prom\/container-exporter<\/p>\n<p>Once we have got all our exporters running we can launch the Prometheus<br \/>\nserver. However, before we do, we need to create a configuration file for<br \/>\nPrometheus that tells the server where to scrape the metrics from.<br \/>\nCreate a file called <em>prometheus.conf<\/em> and then add the following text<br \/>\ninside it.<\/p>\n<p>global:<br \/>\nscrape_interval: 15s<br \/>\nevaluation_interval: 15s<br \/>\nlabels:<br \/>\nmonitor: exporter-metrics<\/p>\n<p>rule_files:<\/p>\n<p>scrape_configs:<br \/>\n- job_name: prometheus<br \/>\nscrape_interval: 5s<\/p>\n<p>target_groups:<br \/>\n# These endpoints are scraped via HTTP.<br \/>\n- targets: ['localhost:9090','MONITORED_SERVER_IP:9104']<\/p>\n<p>In this file there are two sections, global and job(s). In the global<br \/>\nsection we set defaults for configuration properties such as the data<br \/>\ncollection interval (scrape_interval). We can also add labels which<br \/>\nwill be appended to all metrics. In the jobs section we can define one<br \/>\nor more jobs that each have a name, an optional override scraping<br \/>\ninterval, as well as one or more targets from which to scrape metrics. We<br \/>\nare adding two targets, one is the Prometheus server itself and the<br \/>\nsecond is the container-exporter we set up earlier. 
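<p><em>As a sketch (not part of the original walkthrough), the same configuration file can be generated for a whole list of exporters at once; the two IPs below are placeholders for your own MONITORED_SERVER_IPs:<\/em><\/p>

```shell
# Sketch: generate prometheus.conf for several container-exporters at
# once. The two IPs are placeholders; substitute the MONITORED_SERVER_IPs
# you noted when launching the exporters.
EXPORTERS="10.0.0.11:9104 10.0.0.12:9104"
TARGETS="'localhost:9090'"
for t in $EXPORTERS; do TARGETS="$TARGETS,'$t'"; done

cat > prometheus.conf <<EOF
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  labels:
    monitor: exporter-metrics

rule_files:

scrape_configs:
  - job_name: prometheus
    scrape_interval: 5s
    target_groups:
      # These endpoints are scraped via HTTP.
      - targets: [$TARGETS]
EOF
```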
If you set up more<br \/>\nthan one exporter you can set up additional targets to pull metrics from<br \/>\nall of them. Note that the job name is available as a label on the<br \/>\nmetric, hence you may want to set up separate jobs for your various types<br \/>\nof servers. Now that we have a configuration file we can start a<br \/>\nPrometheus server using the<br \/>\n<a href=\"https:\/\/registry.hub.docker.com\/u\/prom\/prometheus\/\">prom\/prometheus<\/a><br \/>\ndocker image.<\/p>\n<p>docker run -d --name prometheus-server -p 9090:9090 -v $PWD\/prometheus.conf:\/prometheus.conf prom\/prometheus -config.file=\/prometheus.conf<\/p>\n<p>After launching the container, the Prometheus server should be available in<br \/>\nyour browser on port 9090 in a few moments. Select <em>Graph<\/em> from the<br \/>\ntop menu and select a metric from the drop down box to view its latest<br \/>\nvalue. You can also write queries in the expression box which can find<br \/>\nmatching metrics. Queries take the form<br \/>\nMETRIC_NAME{label_name=\"label_value\"}. You can find more<br \/>\ndetails of the query syntax <a href=\"http:\/\/prometheus.io\/docs\/querying\/\">here<\/a>.<\/p>\n<p><img decoding=\"async\" src=\"http:\/\/i.imgur.com\/k0n8b9k.png\" alt=\"\" \/><\/p>\n<p>We are able to drill down into the data using queries to filter out data<br \/>\nfrom specific server types (jobs) and containers. All metrics from<br \/>\ncontainers are labeled with the image name, container name and the host<br \/>\non which the container is running. Since metric names do not encompass<br \/>\ncontainer or server name we are able to easily aggregate data across<br \/>\nour deployment. For example we can filter for<br \/>\ncontainer_memory_usage_bytes{image=\"ubuntu:14.04\"} to get<br \/>\ninformation about the memory usage of all ubuntu containers in our<br \/>\ndeployment. Using the built-in functions we can also aggregate the<br \/>\nresulting set of metrics. 
For example<br \/>\navg_over_time(container_memory_usage_bytes{image=\"ubuntu:14.04\"}[5m]) will show the memory used by ubuntu<br \/>\ncontainers, averaged over the last five minutes. Once you are happy<br \/>\nwith a query you can click over to the Graph tab and see the variation<br \/>\nof the metric over time.<\/p>\n<p><img decoding=\"async\" src=\"http:\/\/i.imgur.com\/DgrLYLl.png\" alt=\"\" \/><\/p>\n<p>Temporary graphs are great for ad-hoc investigations but you also need<br \/>\nto have persistent graphs for dashboards. For this you can use the<br \/>\n<a href=\"https:\/\/registry.hub.docker.com\/u\/prom\/promdash\/\">Prometheus Dashboard<br \/>\nBuilder<\/a>. To launch<br \/>\nthe Prometheus Dashboard Builder you need access to an SQL database, which<br \/>\nyou can create using the official MySQL <a href=\"https:\/\/registry.hub.docker.com\/_\/mysql\/\">Docker<br \/>\nimage<\/a>. The command to launch<br \/>\nthe MySQL container is shown below; note that you may select any value<br \/>\nfor the database name, user name, user password and root password, but<br \/>\nkeep track of these values as they will be needed later.<\/p>\n<p>docker run -p 3306:3306 --name promdash-mysql<br \/>\n-e MYSQL_DATABASE=&lt;database-name&gt;<br \/>\n-e MYSQL_USER=&lt;database-user&gt;<br \/>\n-e MYSQL_PASSWORD=&lt;user-password&gt;<br \/>\n-e MYSQL_ROOT_PASSWORD=&lt;root-password&gt;<br \/>\n-d mysql<\/p>\n<p>Once you have the database set up, use the rake task inside the<br \/>\npromdash container to initialize the database. You can then run the<br \/>\nDashboard builder by running the same container. 
The commands to<br \/>\ninitialize the database and bring up the Prometheus Dashboard Builder<br \/>\nare shown below.<\/p>\n<p># Initialize Database<br \/>\ndocker run --rm -it --link promdash-mysql:db<br \/>\n-e DATABASE_URL=mysql2:\/\/&lt;database-user&gt;:&lt;user-password&gt;@db:3306\/&lt;database-name&gt; prom\/promdash .\/bin\/rake db:migrate<\/p>\n<p># Run Dashboard<br \/>\ndocker run -d --link promdash-mysql:db -p 3000:3000 --name prometheus-dash<br \/>\n-e DATABASE_URL=mysql2:\/\/&lt;database-user&gt;:&lt;user-password&gt;@db:3306\/&lt;database-name&gt; prom\/promdash<\/p>\n<p><img decoding=\"async\" src=\"http:\/\/i.imgur.com\/JCwRhwx.png\" alt=\"\" \/><\/p>\n<p>Once your container is running you can browse to port 3000 and load up<br \/>\nthe dashboard builder UI. In the UI you need to click <em>Servers<\/em> in the<br \/>\ntop menu and <em>New Server<\/em> to add your Prometheus Server as a datasource<br \/>\nfor the dashboard builder. Add <em><a href=\"http:\/\/PROMETHEUS_SERVER_IP:9090\">http:\/\/PROMETHEUS_SERVER_IP:9090<\/a><\/em> to<br \/>\nthe list of servers and hit <em>Create Server<\/em>.<\/p>\n<p>Now click <em>Dashboards<\/em> in the top menu; here you can create<br \/>\n<em>Directories<\/em> (groups of dashboards) and <em>Dashboards<\/em>. For example we<br \/>\ncreated a directory for Web Nodes and one for Database Nodes, and in each<br \/>\nwe create a dashboard as shown below.<\/p>\n<p><img decoding=\"async\" src=\"http:\/\/i.imgur.com\/ntOQORp.png\" alt=\"\" \/><\/p>\n<p>Once you have created a dashboard you can add metrics by mousing over<br \/>\nthe title bar of a graph and selecting the data sources icon (three<br \/>\nhorizontal lines with a plus sign following them). You can then<br \/>\nselect the server which you added earlier, and a query expression which<br \/>\nyou tested in the Prometheus Server UI. 
You can add multiple data<br \/>\nsources into the same graph in order to see a comparative view.<\/p>\n<p><img decoding=\"async\" src=\"http:\/\/i.imgur.com\/BQM3rkG.png\" alt=\"\" \/><\/p>\n<p>You can add multiple graphs (each with possibly multiple data sources)<br \/>\nby clicking the Add Graph button. In addition you may select the<br \/>\ntime range over which your dashboard displays data, as well as a refresh<br \/>\ninterval for auto-loading data. The dashboard is not as polished as the<br \/>\nones from Scout and DataDog; for example there is no easy way to explore<br \/>\nmetrics or build a query in the dashboard view. Since the dashboard runs<br \/>\nindependently of the Prometheus server we can\u2019t \u2018pin\u2019 graphs<br \/>\ngenerated in the Prometheus server into a dashboard. Furthermore, several<br \/>\ntimes we noticed that the UI would not update based on selected data<br \/>\nuntil we refreshed the page. However, despite its issues the dashboard<br \/>\nis feature-competitive with DataDog, and because Prometheus is under<br \/>\nheavy development, we expect the bugs to be resolved over time. In<br \/>\ncomparison to other self-hosted solutions Prometheus is a lot more user<br \/>\nfriendly than Sensu and allows you to present metric data as graphs without<br \/>\nusing third party visualizations. It is also able to provide much better<br \/>\nanalytical capabilities than CAdvisor.<\/p>\n<p>Prometheus also has the ability to apply alerting rules over the input<br \/>\ndata and display those on the UI. However, to be able to do something<br \/>\nuseful with alerts, such as sending emails or notifying<br \/>\n<a href=\"http:\/\/www.pagerduty.com\/\">pagerduty<\/a>, we need to run the <a href=\"https:\/\/registry.hub.docker.com\/u\/prom\/alertmanager\/\">Alert<br \/>\nManager<\/a>. To run<br \/>\nthe Alert Manager you first need to create a configuration file. 
Create<br \/>\na file called <em>alertmanager.conf<\/em> and add the following text into it:<\/p>\n<p>notification_config {<br \/>\nname: \"ubuntu_notification\"<br \/>\npagerduty_config {<br \/>\nservice_key: \"&lt;PAGER_DUTY_API_KEY&gt;\"<br \/>\n}<br \/>\nemail_config {<br \/>\nemail: \"&lt;TARGET_EMAIL_ADDRESS&gt;\"<br \/>\n}<br \/>\nhipchat_config {<br \/>\nauth_token: \"&lt;HIPCHAT_AUTH_TOKEN&gt;\"<br \/>\nroom_id: 123456<br \/>\n}<br \/>\n}<br \/>\naggregation_rule {<br \/>\nfilter {<br \/>\nname_re: \"image\"<br \/>\nvalue_re: \"ubuntu:14.04\"<br \/>\n}<br \/>\nrepeat_rate_seconds: 300<br \/>\nnotification_config_name: \"ubuntu_notification\"<br \/>\n}<\/p>\n<p>In this configuration we are creating a notification configuration<br \/>\ncalled <em>ubuntu_notification<\/em>, which specifies that alerts must go to<br \/>\nPagerDuty, email and HipChat. We need to specify the relevant API<br \/>\nkeys and\/or access tokens for the HipChat and PagerDuty notifications to<br \/>\nwork. We are also specifying that the alert configuration should only<br \/>\napply to alerts on metrics where the label image has the value<br \/>\nubuntu:14.04. We specify that a triggered alert should not retrigger<br \/>\nfor at least 300 seconds after the first alert is raised. We can bring<br \/>\nup the Alert Manager using the docker image by volume mounting our<br \/>\nconfiguration file into the container using the command shown below.<\/p>\n<p>docker run -d -p 9093:9093 -v $PWD:\/alertmanager prom\/alertmanager -logtostderr -config.file=\/alertmanager\/alertmanager.conf<\/p>\n<p>Once the container is running you should be able to point your browser<br \/>\nto port 9093 and load up the Alert Manager UI. You will be able to see<br \/>\nall the alerts raised here; you can \u2018silence\u2019 them or delete them once<br \/>\nthe issue is resolved. 
In addition to setting up the Alert Manager we<br \/>\nalso need to create a few alerts. Add the line rule_file:<br \/>\n\"\/prometheus.rules\" to the global section of the<br \/>\n<em>prometheus.conf<\/em> file you created earlier. This line tells Prometheus<br \/>\nto look for alerting rules in the <em>prometheus.rules<\/em> file. We now need<br \/>\nto create the rules file and load it into our server container. To do so,<br \/>\ncreate a file called <em>prometheus.rules<\/em> in the same directory where you<br \/>\ncreated <em>prometheus.conf<\/em>, and add the following text to it:<\/p>\n<p>ALERT HighMemoryAlert<br \/>\nIF container_memory_usage_bytes{image=\"ubuntu:14.04\"} &gt; 1000000000<br \/>\nFOR 1m<br \/>\nWITH {}<br \/>\nSUMMARY \"High Memory usage for Ubuntu container\"<br \/>\nDESCRIPTION \"High Memory usage for Ubuntu container on {{$labels.instance}} for container {{$labels.name}} (current value: {{$value}})\"<\/p>\n<p>In this configuration we are telling Prometheus to raise an alert called<br \/>\nHighMemoryAlert if the container_memory_usage_bytes metric<br \/>\nfor containers using the ubuntu:14.04 image goes above 1 GB for 1<br \/>\nminute. The summary and the description of the alerts are also specified<br \/>\nin the rules file. Both of these fields can contain placeholders for<br \/>\nlabel values which are replaced by Prometheus. For example our<br \/>\ndescription will specify the server instance (IP) and the container name<br \/>\nfor the metric raising the alert. After launching the Alert Manager and<br \/>\ndefining your alert rules, you will need to re-run your Prometheus<br \/>\nserver with new parameters. 
The commands to do so are below:<\/p>\n<p># stop and remove current container<br \/>\ndocker stop prometheus-server &amp;&amp; docker rm prometheus-server<\/p>\n<p># start new container<br \/>\ndocker run -d --name prometheus-server -p 9090:9090<br \/>\n-v $PWD\/prometheus.conf:\/prometheus.conf<br \/>\n-v $PWD\/prometheus.rules:\/prometheus.rules<br \/>\nprom\/prometheus<br \/>\n-config.file=\/prometheus.conf<br \/>\n-alertmanager.url=http:\/\/ALERT_MANAGER_IP:9093<\/p>\n<p>Once the Prometheus Server is up again you can click Alerts in the top<br \/>\nmenu of the Prometheus Server UI to bring up a list of alerts and their<br \/>\nstatuses. If and when an alert is fired you will also be able to see it<br \/>\nin the Alert Manager UI and any external service defined in the<br \/>\n<em>alertmanager.conf<\/em> file.<\/p>\n<p><img decoding=\"async\" src=\"http:\/\/i.imgur.com\/uC483G5.png\" alt=\"\" \/><\/p>\n<p>Collectively the Prometheus tool-set\u2019s feature set is on par with<br \/>\nDataDog, which has been our best-rated monitoring tool so far. Prometheus<br \/>\nuses a very simple format for input data and can ingest from any web<br \/>\nendpoint which presents the data. Therefore we can monitor more or less<br \/>\nany resource with Prometheus, and there are already several libraries<br \/>\ndefined to monitor common resources. Where Prometheus is lacking is in<br \/>\nlevel of polish and ease of deployment. The fact that all components are<br \/>\ndockerized is a major plus; however, we had to launch four different<br \/>\ncontainers each with their own configuration files to support the<br \/>\nPrometheus server. The project is also lacking detailed, comprehensive<br \/>\ndocumentation for these various components. However, in comparison to<br \/>\nself-hosted services such as CAdvisor and Sensu, Prometheus is a much<br \/>\nbetter toolset. 
It is significantly easier to set up than Sensu and has the<br \/>\nability to provide visualization of metrics without third party tools.<br \/>\nIt has much more detailed metrics than CAdvisor and is also able<br \/>\nto monitor non-docker resources. The choice of using pull based metric<br \/>\naggregation rather than push is less than ideal, as you would have to<br \/>\nrestart your server when adding new data sources. This could get<br \/>\ncumbersome in a dynamic environment such as cloud based deployments.<br \/>\nPrometheus does offer the <a href=\"https:\/\/github.com\/prometheus\/pushgateway\">Push<br \/>\nGateway<\/a> to bridge the<br \/>\ndisconnect. However, running yet another service will add to the<br \/>\ncomplexity of the setup. For these reasons I still think DataDog is<br \/>\nprobably easier for most users; however, with some polish and better<br \/>\npackaging Prometheus could be a very compelling alternative, and out of<br \/>\nthe self-hosted solutions Prometheus is my pick.<\/p>\n<p>Score Card:<\/p>\n<ol>\n<li>Ease of deployment: **<\/li>\n<li>Level of detail: *****<\/li>\n<li>Level of aggregation: *****<\/li>\n<li>Ability to raise alerts: ****<\/li>\n<li>Ability to monitor non-docker resources: Supported<\/li>\n<li>Cost: Free<\/li>\n<\/ol>\n<h2>Sysdig Cloud<\/h2>\n<p>Sysdig cloud is a hosted service that provides metrics storage,<br \/>\naggregation, visualization and alerting. To get started with sysdig, sign<br \/>\nup for a trial account at <a href=\"https:\/\/app.sysdigcloud.com\">https:\/\/app.sysdigcloud.com<\/a> and complete<br \/>\nthe registration form. Once you complete the registration form and log<br \/>\nin to the account, you will be asked to <em>Setup your Environment<\/em> and be<br \/>\ngiven a curl command similar to the one shown below. Your command will have<br \/>\nyour own secret key after the -s switch. You can run this command on each<br \/>\ndocker host which you need to monitor. 
Note that you should<br \/>\nreplace the [TAGS] placeholder with tags to group your metrics. The<br \/>\ntags are in the format TAG_NAME:VALUE, so you may want to add a tag<br \/>\nrole:web or deployment:production. You may also use the containerized<br \/>\nsysdig agent.<\/p>\n<p># Host install of sysdig agent<br \/>\ncurl -s https:\/\/s3.amazonaws.com\/download.draios.com\/stable\/install-agent | sudo bash -s 12345678-1234-1234-1234-123456789abc [TAGS]<\/p>\n<p># Docker based sysdig agent<br \/>\ndocker run --name sysdig-agent --privileged --net host<br \/>\n-e ACCESS_KEY=12345678-1234-1234-1234-123456789abc<br \/>\n-e TAGS=os:rancher<br \/>\n-v \/var\/run\/docker.sock:\/host\/var\/run\/docker.sock<br \/>\n-v \/dev:\/host\/dev -v \/proc:\/host\/proc:ro<br \/>\n-v \/boot:\/host\/boot:ro<br \/>\n-v \/lib\/modules:\/host\/lib\/modules:ro<br \/>\n-v \/usr:\/host\/usr:ro sysdig\/agent<\/p>\n<p>Even if you use docker you will still need to install kernel headers in<br \/>\nthe host OS. This goes against Docker\u2019s philosophy of isolated<br \/>\nmicroservices. However, installing kernel headers is fairly benign.<br \/>\nInstalling the headers and getting sysdig running is trivial if you are<br \/>\nusing a mainstream distribution such as CentOS, Ubuntu or Debian. Even<br \/>\nAmazon\u2019s custom kernels are supported; however, RancherOS\u2019s custom<br \/>\nkernel presented problems for sysdig, as did the tinycore kernel. So be<br \/>\nwarned: if you would like to use Sysdig cloud on non-mainstream kernels<br \/>\nyou may have to get your hands dirty with some system hacking.<\/p>\n<p>After you run the agent you should see the host in the Sysdig cloud<br \/>\nconsole in the Explore tab. Once you launch docker containers on the<br \/>\nhost, those will also be shown. You can see basic stats about the CPU<br \/>\nusage, memory consumption and network usage. 
The metrics are aggregated for<br \/>\nthe host as well as broken down per container.<\/p>\n<p><a href=\"http:\/\/cdn.rancher.com\/wp-content\/uploads\/2015\/04\/20181622\/Screen-Shot-2015-04-14-at-12.06.36-PM.png\"><img decoding=\"async\" src=\"http:\/\/cdn.rancher.com\/wp-content\/uploads\/2015\/04\/20181622\/Screen-Shot-2015-04-14-at-12.06.36-PM.png\" alt=\"Screen Shot 2015-04-14 at 12.06.36\nPM\" \/><\/a>By<br \/>\nselecting one of the hosts or containers you can get a whole host of<br \/>\nother metrics including everything provided by the docker stats API. Out<br \/>\nof all the systems we have seen so far sysdig certainly has the most<br \/>\ncomprehensive set of metrics out of the box. You can also select from<br \/>\nseveral pre-configured dashboards which present a graphical or tabular<br \/>\nrepresentation of your deployment.<\/p>\n<p><a href=\"http:\/\/cdn.rancher.com\/wp-content\/uploads\/2015\/04\/20181622\/Screen-Shot-2015-04-16-at-11.26.53-AM.png\"><img decoding=\"async\" src=\"http:\/\/cdn.rancher.com\/wp-content\/uploads\/2015\/04\/20181622\/Screen-Shot-2015-04-16-at-11.26.53-AM.png\" alt=\"Screen Shot 2015-04-16 at 11.26.53\nAM\" \/><\/a><\/p>\n<p>You can see live metrics by selecting Real-time Mode (Target Icon),<br \/>\nor select a window of time over which to average values. Furthermore,<br \/>\nyou can also set up comparisons which will highlight the delta between current<br \/>\nvalues and values at a point in the past. For example the table below<br \/>\nshows values compared with those from ten minutes ago. If the CPU usage<br \/>\nis significantly higher than 10 minutes ago you may be experiencing load<br \/>\nspikes and need to scale out. 
The UI is on par with, if not better than,<br \/>\nDataDog for identifying and exploring trends in the data.<a href=\"http:\/\/cdn.rancher.com\/wp-content\/uploads\/2015\/04\/20181622\/Screen-Shot-2015-04-19-at-4.59.09-PM.png\"><img decoding=\"async\" src=\"http:\/\/cdn.rancher.com\/wp-content\/uploads\/2015\/04\/20181622\/Screen-Shot-2015-04-19-at-4.59.09-PM.png\" alt=\"Screen Shot\n2015-04-19 at 4.59.09\nPM\" \/><\/a><\/p>\n<p>In addition to exploring data on an ad-hoc basis you can also create<br \/>\npersistent dashboards. Simply click the pin icon on any graph in the<br \/>\nexplore view and save it to a named dashboard. You can view all the<br \/>\ndashboards and their associated graphs by clicking the Dashboards<br \/>\ntab. You can also select the bell icon on any graph and create an<br \/>\nalert from the data. The Sysdig cloud supports detailed alerting<br \/>\ncriteria and is again one of the best we have seen. The example below<br \/>\nshows an alert which triggers if the count of containers labeled <em>web<\/em><br \/>\nfalls below three on average for the last ten minutes. We are also<br \/>\nsegmenting the data by the <em>region<\/em> tag, so there will be a separate<br \/>\ncheck for web nodes in North America and Europe. Lastly, we also specify<br \/>\na name, description and severity for the alerts. You can control where<br \/>\nalerts go by going to Settings (Gear Icon) &gt; Notifications and adding<br \/>\nemail addresses or SNS Topics to send alerts to. 
Note that all alerts go to<br \/>\nall notification endpoints, which may be problematic if you want to wake<br \/>\nup different people for different alerts.<a href=\"http:\/\/cdn.rancher.com\/wp-content\/uploads\/2015\/04\/20181622\/Screen-Shot-2015-04-19-at-4.55.35-PM.png\"><img decoding=\"async\" src=\"http:\/\/cdn.rancher.com\/wp-content\/uploads\/2015\/04\/20181622\/Screen-Shot-2015-04-19-at-4.55.35-PM.png\" alt=\"Screen Shot 2015-04-19 at\n4.55.35\nPM\" \/><\/a><\/p>\n<p>I am very impressed with Sysdig cloud, as it was trivially easy to set up and<br \/>\nprovides detailed metrics with great visualization tools for real-time<br \/>\nand historical data. The requirement to install kernel headers on the<br \/>\nhost OS is troublesome, though, and the lack of documentation and support for<br \/>\nnon-standard kernels could be problematic in some scenarios. The<br \/>\nalerting system in the Sysdig cloud is among the best we have seen so<br \/>\nfar; however, the inability to target different email addresses for<br \/>\ndifferent alerts is problematic. In a larger team, for example, you would<br \/>\nwant to alert a different team for database issues vs. web server issues.<br \/>\nLastly, since it is in beta, the pricing for Sysdig cloud is not easily<br \/>\navailable. I have reached out to their sales team and will update this<br \/>\narticle if and when they get back to me. 
If sysdig is price-competitive,<br \/>\nthen Datadog has serious competition in the hosted service category.<\/p>\n<p>Score Card:<\/p>\n<ol>\n<li>Ease of deployment: ***<\/li>\n<li>Level of detail: *****<\/li>\n<li>Level of aggregation: *****<\/li>\n<li>Ability to raise alerts: ****<\/li>\n<li>Ability to monitor non-docker resources: Supported<\/li>\n<li>Cost: Must Contact Support<\/li>\n<\/ol>\n<h2>Conclusion<\/h2>\n<p>Today\u2019s article has covered several options for monitoring docker<br \/>\ncontainers, ranging from free options such as docker stats, CAdvisor,<br \/>\nPrometheus and Sensu, to paid services such as Scout, Sysdig Cloud and<br \/>\nDataDog. From my research so far DataDog seems to be the best-in-class<br \/>\nsystem for monitoring docker deployments. The setup was complete in<br \/>\nseconds with a one-line command, all hosts were reporting metrics in one<br \/>\nplace, historical trends were apparent in the UI, and Datadog supports<br \/>\ndeep diving into metrics as well as alerting. However, at $15 per host<br \/>\nthe system can get expensive for large deployments. For larger scale,<br \/>\nself-hosted deployments Sensu is able to fulfill most requirements;<br \/>\nhowever, the complexity in setting up and managing a Sensu cluster may be<br \/>\nprohibitive. Obviously, there are plenty of other self-hosted options,<br \/>\nsuch as Nagios or Icinga, which are similar to Sensu.<\/p>\n<p>Hopefully this gives you an idea of some of the options for monitoring<br \/>\ncontainers available today. I am continuing to investigate other<br \/>\noptions, including a more streamlined self-managed container monitoring<br \/>\nsystem using CollectD, Graphite or InfluxDB and Grafana. Stay tuned for<br \/>\nmore details.<\/p>\n<p>ADDITIONAL INFORMATION: After publishing this article I had some<br \/>\nsuggestions to also evaluate Prometheus and Sysdig Cloud, two other very<br \/>\ngood options for monitoring Docker. 
We\u2019ve now included them in this<br \/>\narticle for ease of discovery. You can find the original second part<br \/>\nof my<br \/>\npost <a href=\"http:\/\/rancher.com\/docker-monitoring-continued-prometheus-and-sysdig\/\">here<\/a>.<\/p>\n<p>To learn more about monitoring and managing Docker, please join us for<br \/>\nour next Rancher online meetup.<\/p>\n<p><em>Usman is a server and infrastructure engineer, with experience in<br \/>\nbuilding large scale distributed services on top of various cloud<br \/>\nplatforms. You can read more of his work at<br \/>\n<a href=\"http:\/\/techtraits.com\/\">techtraits.com<\/a>, or follow him on twitter<br \/>\n<a href=\"https:\/\/twitter.com\/usman_ismail\">@usman_ismail<\/a> or<br \/>\non <a href=\"https:\/\/github.com\/usmanismail\">GitHub<\/a>.<\/em><\/p>\n<p><a href=\"https:\/\/rancher.com\/comparing-monitoring-options-for-docker-deployments\/\" target=\"_blank\" rel=\"noopener\">Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>A Detailed Overview of Rancher&#8217;s Architecture This newly-updated, in-depth guidebook provides a detailed overview of the features and functionality of the new Rancher: an open-source enterprise Kubernetes platform. Get the eBook Update (October 2017): Gord Sissons revisited this topic and compared the top 10 container-monitoring solutions for Rancher in a recent blog post. 
*Update (October &hellip; <\/p>\n<p class=\"link-more\"><a href=\"https:\/\/www.appservgrid.com\/paw93\/index.php\/2019\/02\/19\/docker-monitoring-container-monitoring\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Docker Monitoring | Container Monitoring&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[3],"tags":[],"class_list":["post-1348","post","type-post","status-publish","format-standard","hentry","category-kubernetes"],"_links":{"self":[{"href":"https:\/\/www.appservgrid.com\/paw93\/index.php\/wp-json\/wp\/v2\/posts\/1348","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.appservgrid.com\/paw93\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.appservgrid.com\/paw93\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.appservgrid.com\/paw93\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.appservgrid.com\/paw93\/index.php\/wp-json\/wp\/v2\/comments?post=1348"}],"version-history":[{"count":1,"href":"https:\/\/www.appservgrid.com\/paw93\/index.php\/wp-json\/wp\/v2\/posts\/1348\/revisions"}],"predecessor-version":[{"id":1423,"href":"https:\/\/www.appservgrid.com\/paw93\/index.php\/wp-json\/wp\/v2\/posts\/1348\/revisions\/1423"}],"wp:attachment":[{"href":"https:\/\/www.appservgrid.com\/paw93\/index.php\/wp-json\/wp\/v2\/media?parent=1348"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.appservgrid.com\/paw93\/index.php\/wp-json\/wp\/v2\/categories?post=1348"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.appservgrid.com\/paw93\/index.php\/wp-json\/wp\/v2\/tags?post=1348"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}