Docker Monitoring Continued: Prometheus and Sysdig

I recently compared several Docker monitoring tools and services. Since that article went live we have received feedback about additional tools that should be included in the survey. I would like to highlight two of them:
Prometheus and Sysdig Cloud. Prometheus is a capable self-hosted
solution which is easier to manage than Sensu. Sysdig Cloud, on the other
hand, is another hosted service much like Scout and
Datadog. Together they add more choices to their respective
classes. As before, I will be using the following six criteria to
evaluate Prometheus and Sysdig Cloud: 1) ease of deployment, 2) level of
detail of information presented, 3) level of aggregation of information
from the entire deployment, 4) ability to raise alerts from the data,
5) ability to monitor non-Docker resources and 6) cost.

Prometheus

First let's take a look at Prometheus: a self-hosted set of tools
which collectively provide metrics storage, aggregation, visualization
and alerting. Most of the tools and services we have looked at so far
have been push based, i.e. agents on the monitored servers talk to a
central server (or set of servers) and send out their metrics.
Prometheus, on the other hand, is a pull based server which expects
monitored servers to provide a web interface from which it can scrape
data. There are several exporters available for Prometheus which will
capture metrics and then expose them over HTTP for Prometheus to scrape.
In addition there are client libraries which can be used to create custom
exporters. As we are concerned with monitoring Docker containers we will
use the container_exporter to capture metrics. Use the command shown
below to bring up the container-exporter Docker container and browse to
http://MONITORED_SERVER_IP:9104/metrics to see the metrics it has
collected for you. You should launch exporters on all servers in your
deployment. Keep track of the respective MONITORED_SERVER_IPs as we
will be using them later in the configuration for Prometheus.

docker run -p 9104:9104 -v /sys/fs/cgroup:/cgroup -v /var/run/docker.sock:/var/run/docker.sock prom/container-exporter
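If you prefer the command line to the browser, you can also sanity check an exporter with curl; this simply fetches the same metrics endpoint on the port 9104 mapped above:

curl http://MONITORED_SERVER_IP:9104/metrics | head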

Once we have all our exporters running we can launch the Prometheus
server. However, before we do, we need to create a configuration file for
Prometheus that tells the server where to scrape the metrics from.
Create a file called prometheus.conf and then add the following text
inside it.

global:
  scrape_interval: 15s
  evaluation_interval: 15s
  labels:
    monitor: exporter-metrics

rule_files:

scrape_configs:
  - job_name: prometheus
    scrape_interval: 5s
    target_groups:
      # These endpoints are scraped via HTTP.
      - targets: ['localhost:9090','MONITORED_SERVER_IP:9104']

In this file there are two sections, global and job(s). In the global
section we set defaults for configuration properties such as the data
collection interval (scrape_interval). We can also add labels which
will be appended to all metrics. In the jobs section we define one
or more jobs, each of which has a name, an optional override for the
scraping interval, and one or more targets from which to scrape metrics.
We are adding two targets: one is the Prometheus server itself and the
second is the container-exporter we set up earlier. If you set up more
than one exporter you can add additional targets to pull metrics from
all of them. Note that the job name is available as a label on each
metric, hence you may want to set up separate jobs for your various types
of servers.
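For example, a scrape configuration with separate jobs for web and database nodes might look like the sketch below (the job names and *_NODE_IP placeholders are purely illustrative):

scrape_configs:
  - job_name: web-nodes
    scrape_interval: 5s
    target_groups:
      - targets: ['WEB_NODE_1_IP:9104','WEB_NODE_2_IP:9104']
  - job_name: db-nodes
    scrape_interval: 5s
    target_groups:
      - targets: ['DB_NODE_1_IP:9104']

Now that we have a configuration file we can start a Prometheus server using the prom/prometheus docker image.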

docker run -d --name prometheus-server -p 9090:9090 -v $PWD/prometheus.conf:/prometheus.conf prom/prometheus -config.file=/prometheus.conf

After launching the container, the Prometheus server should be available in
your browser on port 9090 within a few moments. Select Graph from the
top menu and select a metric from the drop-down box to view its latest
value. You can also write queries in the expression box to find
matching metrics. Queries take the form METRIC_NAME, optionally followed
by a set of label filters. You can find more details of the query
syntax here.

We are able to drill down into the data using queries to filter out data
from specific server types (jobs) and containers. All metrics from
containers are labeled with the image name, the container name and the host
on which the container is running. Since metric names do not encompass
the container or server name, we can easily aggregate data across
our deployment. For example, we can filter container_memory_usage_bytes
on the image label to get information about the memory usage of all
Ubuntu containers in our deployment. Using the built-in functions we can
also aggregate the resulting set of metrics. For example,
average_over_time(container_memory_usage_bytes[5m]) will show the memory
used by the matching containers, averaged over the last five minutes.
Once you are happy with a query you can click over to the Graph tab and
see the variation of the metric over time.
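For reference, the two expressions above look like this when typed into the query box; the ubuntu:14.04 image label value is just the example used elsewhere in this article:

# Memory usage of all containers running the ubuntu:14.04 image
container_memory_usage_bytes{image="ubuntu:14.04"}

# The same metrics averaged over the last five minutes
# (recent Prometheus releases name this function avg_over_time)
average_over_time(container_memory_usage_bytes{image="ubuntu:14.04"}[5m])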

Temporary graphs are great for ad-hoc investigations but you also need
persistent graphs for dashboards. For this you can use the Prometheus
Dashboard Builder. To launch the Prometheus Dashboard Builder you need
access to an SQL database, which you can create using the official MySQL
Docker image. The command to launch the MySQL container is shown below.
Note that you may select any value for the database name, user name, user
password and root password, but keep track of these values as they will
be needed later.

docker run -p 3306:3306 --name promdash-mysql \
  -e MYSQL_DATABASE=<database-name> \
  -e MYSQL_USER=<database-user> \
  -e MYSQL_PASSWORD=<user-password> \
  -e MYSQL_ROOT_PASSWORD=<root-password> \
  -d mysql

Once you have the database set up, use the rake task inside the
promdash container to initialize the database. You can then run the
Dashboard Builder by running the same container. The commands to
initialize the database and to bring up the Prometheus Dashboard Builder
are shown below.

# Initialize Database
docker run --rm -it --link promdash-mysql:db \
  -e DATABASE_URL=mysql2://<database-user>:<user-password>@db:3306/<database-name> \
  prom/promdash ./bin/rake db:migrate

# Run Dashboard
docker run -d --link promdash-mysql:db -p 3000:3000 --name prometheus-dash \
  -e DATABASE_URL=mysql2://<database-user>:<user-password>@db:3306/<database-name> \
  prom/promdash
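For example, with the placeholders filled in as database-name=promdash, database-user=prom and user-password=secret (purely illustrative values), the connection string passed to both commands would be:

DATABASE_URL=mysql2://prom:secret@db:3306/promdash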

Once your container is running you can browse to port 3000 and load up
the Dashboard Builder UI. In the UI, click Servers in the top menu and
then New Server to add your Prometheus server as a data source for the
Dashboard Builder. Add http://PROMETHEUS_SERVER_IP:9090 to the list of
servers and hit Create Server.

Now click Dashboards in the top menu; here you can create
Directories (groups of dashboards) and Dashboards. For example, we
created a directory for Web Nodes and one for Database Nodes, and in each
we created a dashboard.

Once you have created a dashboard you can add metrics by mousing over
the title bar of a graph and selecting the data sources icon (three
horizontal lines followed by a plus sign). You can then select the server
which you added earlier and a query expression which you tested in the
Prometheus server UI. You can add multiple data sources to the same
graph in order to see a comparative view.

You can add multiple graphs (each with possibly multiple data sources)
by clicking the Add Graph button. In addition you may select the
time range over which your dashboard displays data, as well as a refresh
interval for auto-loading data. The dashboard is not as polished as the
ones from Scout and Datadog; for example there is no easy way to explore
metrics or build a query in the dashboard view. Since the Dashboard
Builder runs independently of the Prometheus server, we can't 'pin'
graphs generated in the Prometheus server UI into a dashboard.
Furthermore, several times we noticed that the UI would not update based
on the selected data until we refreshed the page. However, despite these
issues the dashboard is feature competitive with Datadog, and because
Prometheus is under heavy development we expect the bugs to be resolved
over time. In comparison to other self-hosted solutions, Prometheus is a
lot more user friendly than Sensu and allows you to present metric data
as graphs without using third-party visualizations. It also provides much
better analytical capabilities than cAdvisor.

Prometheus also has the ability to apply alerting rules to the input
data and display the resulting alerts in its UI. However, to do something
useful with alerts, such as sending emails or notifying PagerDuty, we
need to run the Alert Manager. To run the Alert Manager you first need
to create a configuration file. Create a file called alertmanager.conf
and add the following text into it:

notification_config {
  name: "ubuntu_notification"
  pagerduty_config {
    service_key: "<PAGER_DUTY_API_KEY>"
  }
  email_config {
    email: "<TARGET_EMAIL_ADDRESS>"
  }
  hipchat_config {
    auth_token: "<HIPCHAT_AUTH_TOKEN>"
    room_id: 123456
  }
}

aggregation_rule {
  filter {
    name_re: "image"
    value_re: "ubuntu:14.04"
  }
  repeat_rate_seconds: 300
  notification_config_name: "ubuntu_notification"
}

In this configuration we are creating a notification configuration
called ubuntu_notification, which specifies that alerts must go to
PagerDuty, email and HipChat. We need to specify the relevant API
keys and/or access tokens for the HipChat and PagerDuty notifications to
work. We are also specifying that this notification configuration should
only apply to alerts on metrics where the label image has the value
ubuntu:14.04, and that a triggered alert should not re-trigger for at
least 300 seconds after the first alert is raised. We can bring up the
Alert Manager using its Docker image, volume mounting our configuration
file into the container, using the command shown below.

docker run -d -p 9093:9093 -v $PWD:/alertmanager prom/alertmanager -logtostderr -config.file=/alertmanager/alertmanager.conf

Once the container is running you should be able to point your browser
to port 9093 and load up the Alert Manager UI. You will be able to see
all raised alerts here, and you can 'silence' them or delete them once
the issue is resolved. In addition to setting up the Alert Manager we
also need to create a few alerts. Add "/prometheus.rules" as an entry
under the rule_files section of the prometheus.conf file you created
earlier (i.e. rule_files: ['/prometheus.rules']). This tells Prometheus
to look for alerting rules in the prometheus.rules file. We now need
to create the rules file and load it into our server container. To do so,
create a file called prometheus.rules in the same directory where you
created prometheus.conf and add the following text to it:

ALERT HighMemoryAlert
  IF container_memory_usage_bytes > 1000000000
  FOR 1m
  WITH {}
  SUMMARY "High Memory usage for Ubuntu container"
  DESCRIPTION "High Memory usage for Ubuntu container on {{$labels.instance}} for container {{$labels.name}} (current value: {{$value}})"

In this configuration we are telling Prometheus to raise an alert called
HighMemoryAlert if the container_memory_usage_bytes metric goes above
1 GB for 1 minute; the Alert Manager configuration we created earlier
then routes the resulting notifications for containers using the
ubuntu:14.04 image. The summary and the description of the alert are also
specified in the rules file. Both of these fields can contain
placeholders for label values, which are replaced by Prometheus. For
example, our description will specify the server instance (IP) and the
container name for the metric raising the alert. After launching the
Alert Manager and defining your alert rules, you will need to re-run your
Prometheus server with new parameters. The commands to do so are shown
below:

# stop and remove current container
docker stop prometheus-server && docker rm prometheus-server

# start new container
docker run -d --name prometheus-server -p 9090:9090 \
  -v $PWD/prometheus.conf:/prometheus.conf \
  -v $PWD/prometheus.rules:/prometheus.rules \
  prom/prometheus \
  -config.file=/prometheus.conf \
  -alertmanager.url=http://ALERT_MANAGER_IP:9093

Once the Prometheus server is up again you can click Alerts in the top
menu of the Prometheus server UI to bring up a list of alerts and their
statuses. If and when an alert fires, you will also be able to see it
in the Alert Manager UI and in any external service defined in the
alertmanager.conf file.

Collectively, the Prometheus toolset's feature set is on par with
Datadog, which has been our best rated monitoring tool so far. Prometheus
uses a very simple format for input data and can ingest metrics from any
web endpoint which presents them, so we can monitor more or less any
resource with Prometheus, and there are already several libraries
available to monitor common resources. Where Prometheus is lacking is in
level of polish and ease of deployment. The fact that all components are
dockerized is a major plus; however, we had to launch four different
containers, each with their own configuration files, to support the
Prometheus server. The project is also lacking detailed, comprehensive
documentation for these various components. However, in comparison to
self-hosted alternatives such as cAdvisor and Sensu, Prometheus is a much
better toolset. It is significantly easier to set up than Sensu and can
visualize metrics without third-party tools, and it provides much more
detailed metrics than cAdvisor while also being able to monitor
non-Docker resources. The choice of pull based rather than push based
metric aggregation is less than ideal, as you would have to restart your
server when adding new data sources. This could get cumbersome in a
dynamic environment such as a cloud based deployment. Prometheus does
offer the Push Gateway to bridge the disconnect, but running yet another
service adds to the complexity of the setup. For these reasons I still
think Datadog is probably easier for most users; however, with some
polish and better packaging Prometheus could be a very compelling
alternative, and of the self-hosted solutions Prometheus is my pick.

Score Card:

  1. Ease of deployment: **
  2. Level of detail: *****
  3. Level of aggregation: *****
  4. Ability to raise alerts: ****
  5. Ability to monitor non-docker resources: Supported
  6. Cost: Free

Sysdig Cloud

Sysdig Cloud is a hosted service that provides metrics storage,
aggregation, visualization and alerting. To get started with Sysdig, sign
up for a trial account at https://app.sysdigcloud.com and complete
the registration form. Once you complete the registration form and log
in to the account, you will be asked to set up your environment and will
be given a curl command similar to the one shown below. Your command will
have your own secret key after the -s switch. Run this command on each
host running Docker that you need to monitor. Note that you should
replace the [TAGS] placeholder with tags to group your metrics. The
tags are in the format TAG_NAME:VALUE, so you may want to add a tag such
as role:web or deployment:production. You may also use the containerized
Sysdig agent instead.

# Host install of sysdig agent
curl -s https://s3.amazonaws.com/download.draios.com/stable/install-agent | sudo bash -s 12345678-1234-1234-1234-123456789abc [TAGS]

# Docker based sysdig agent
docker run --name sysdig-agent --privileged --net host \
  -e ACCESS_KEY=12345678-1234-1234-1234-123456789abc \
  -e TAGS=os:rancher \
  -v /var/run/docker.sock:/host/var/run/docker.sock \
  -v /dev:/host/dev \
  -v /proc:/host/proc:ro \
  -v /boot:/host/boot:ro \
  -v /lib/modules:/host/lib/modules:ro \
  -v /usr:/host/usr:ro \
  sysdig/agent

Even if you use the Docker-based agent you will still need to install
kernel headers in the host OS. This goes against Docker's philosophy of
isolated microservices; however, installing kernel headers is fairly
benign. Installing the headers and getting Sysdig running is trivial if
you are using a mainstream distribution such as CentOS, Ubuntu or Debian.
Even Amazon's custom kernels are supported; however, RancherOS's custom
kernel presented problems for Sysdig, as did the TinyCore kernel. So be
warned: if you would like to use Sysdig Cloud on non-mainstream kernels
you may have to get your hands dirty with some system hacking.
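For reference, on an Ubuntu or Debian host installing the matching headers is a single command (package names differ on other distributions):

sudo apt-get install -y linux-headers-$(uname -r)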

After you run the agent you should see the host in the Sysdig Cloud
console under the Explore tab. Once you launch Docker containers on the
host, those will also be shown. You can see basic stats about CPU
usage, memory consumption and network usage. The metrics are aggregated
for the host as well as broken down per container.

By selecting one of the hosts or containers you can get a whole host of
other metrics, including everything provided by the docker stats API. Of
all the systems we have seen so far, Sysdig certainly has the most
comprehensive set of metrics out of the box. You can also select from
several pre-configured dashboards which present a graphical or tabular
representation of your deployment.

You can see live metrics by selecting Real-time Mode (the target icon),
or select a window of time over which to average values. Furthermore,
you can also set up comparisons which will highlight the delta between
current values and values at a point in the past. For example, a
comparison table can show values compared with those from ten minutes
ago; if the CPU usage is significantly higher than it was ten minutes
ago, you may be experiencing load spikes and need to scale out. The UI is
on par with, if not better than, Datadog for identifying and exploring
trends in the data.

In addition to exploring data on an ad-hoc basis, you can also create
persistent dashboards. Simply click the pin icon on any graph in the
Explore view and save it to a named dashboard. You can view all the
dashboards and their associated graphs by clicking the Dashboards
tab. You can also select the bell icon on any graph to create an alert
from the data. Sysdig Cloud supports detailed alerting criteria and is
again one of the best we have seen. As an example, we created an alert
which triggers if the count of containers labeled web falls below three,
on average, over the last ten minutes. We are also segmenting the data by
the region tag, so there will be separate checks for web nodes in North
America and Europe. Lastly, we also specify a name, description and
severity for the alert. You can control where alerts go by going to
Settings (Gear Icon) > Notifications and adding email addresses or SNS
topics to send alerts to. Note that all alerts go to all notification
endpoints, which may be problematic if you want to wake up different
people for different alerts.

I am very impressed with Sysdig Cloud: it was trivially easy to set up
and provides detailed metrics with great visualization tools for
real-time and historical data. The requirement to install kernel headers
on the host OS is troublesome though, and the lack of documentation and
support for non-standard kernels could be problematic in some scenarios.
The alerting system in Sysdig Cloud is among the best we have seen so
far; however, the inability to target different email addresses for
different alerts is problematic. In a larger team, for example, you would
want to alert a different team for database issues than for web server
issues. Lastly, since it is in beta, pricing for Sysdig Cloud is not
readily available. I have reached out to their sales team and will update
this article if and when they get back to me. If Sysdig is price
competitive then Datadog has serious competition in the hosted service
category.

Score Card:

  1. Ease of deployment: ***
  2. Level of detail: *****
  3. Level of aggregation: *****
  4. Ability to raise alerts: ****
  5. Ability to monitor non-docker resources: Supported
  6. Cost: Must Contact Support
