kube-proxy Subtleties: Debugging an Intermittent Connection Reset

I recently came across a bug that causes intermittent connection resets. After some digging, I found it was caused by a subtle combination of several different network subsystems. It helped me understand Kubernetes networking better, and I think it is worth sharing with a wider audience interested in the same topic.

The symptom

We received a user report claiming they were getting connection resets while using a Kubernetes service of type ClusterIP to serve large files to pods running in the same cluster. Initial debugging of the cluster did not yield anything interesting: network connectivity was fine and downloading the files did not hit any issues. However, when we ran the workload in parallel across many clients, we were able to reproduce the problem. Adding to the mystery was the fact that the problem could not be reproduced when the workload was run using VMs without Kubernetes. The problem, which could be easily reproduced by a simple app, clearly has something to do with Kubernetes networking, but what?

Kubernetes networking basics

Before digging into this problem, let’s cover some basics of Kubernetes networking, because Kubernetes handles network traffic from a pod very differently depending on its destination.


In Kubernetes, every pod has its own IP address. The benefit is that applications running inside pods can use their canonical port instead of being remapped to a different random port. Pods have L3 connectivity between each other: they can ping each other and send TCP or UDP packets to each other. CNI is the standard that solves this problem for containers running on different hosts, and many different plugins support it.


For traffic that goes from a pod to external addresses, Kubernetes simply uses SNAT. It replaces the pod’s internal source IP:port with the host’s IP:port. When the return packet comes back to the host, the host rewrites the destination to the pod’s IP:port and sends the packet back to the original pod. The whole process is transparent to the original pod, which is not aware of the address translation at all.
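The SNAT bookkeeping described above can be sketched in a few lines of Python. This is a toy model only: the addresses, the starting port, and the sequential port allocation are assumptions made for illustration, and the real translation is done by the kernel’s netfilter code, which follows its own port-selection logic.

```python
class SnatTable:
    """Toy model of source NAT on a node (illustrative, not the kernel's logic)."""

    def __init__(self, host_ip, first_port=32768):
        self.host_ip = host_ip
        self.next_port = first_port      # assumption: ports are handed out in order
        self.by_pod = {}                 # (pod_ip, pod_port) -> host port
        self.by_host_port = {}           # host port -> (pod_ip, pod_port)

    def outbound(self, pod_ip, pod_port):
        """Rewrite the source of an outgoing packet to the host's IP:port."""
        key = (pod_ip, pod_port)
        if key not in self.by_pod:
            self.by_pod[key] = self.next_port
            self.by_host_port[self.next_port] = key
            self.next_port += 1
        return self.host_ip, self.by_pod[key]

    def inbound(self, host_port):
        """Rewrite the destination of a reply back to the pod's IP:port."""
        return self.by_host_port.get(host_port)


table = SnatTable("192.0.2.10")               # hypothetical host address
src = table.outbound("10.1.1.5", 40000)       # hypothetical pod address
print(src)                                    # ('192.0.2.10', 32768)
print(table.inbound(src[1]))                  # ('10.1.1.5', 40000)
```

The pod only ever sees its own address on both packets, which is why the translation is invisible to it.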


Pods are mortal, yet most people want a reliable service; without one, an application is of little use. So Kubernetes has a concept called “service”, which is simply an L4 load balancer in front of pods. There are several different types of services. The most basic type is called ClusterIP. A service of this type has a unique VIP address that is only routable inside the cluster.

The component in Kubernetes that implements this feature is called kube-proxy. It sits on every node and programs complicated iptables rules to do all kinds of filtering and NAT between pods and services. If you go to a Kubernetes node and type iptables-save, you’ll see the rules that are inserted by Kubernetes or other programs. The most important chains are KUBE-SERVICES, KUBE-SVC-*, and KUBE-SEP-*.

  • KUBE-SERVICES is the entry point for service packets. It matches the destination IP:port and dispatches the packet to the corresponding KUBE-SVC-* chain.
  • A KUBE-SVC-* chain acts as a load balancer and distributes packets equally across its KUBE-SEP-* chains. Every KUBE-SVC-* chain has as many KUBE-SEP-* chains as there are endpoints behind it.
  • A KUBE-SEP-* chain represents a Service EndPoint. It simply does DNAT, replacing the service IP:port with the pod’s endpoint IP:port.
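The equal distribution in a KUBE-SVC-* chain can be illustrated with a small Python sketch. Because iptables evaluates rules in order, giving each of n endpoints the same share requires the rules to match with increasing probability: 1/n for the first, 1/(n-1) for the second, and so on, with the last rule always matching. That is my understanding of how the rules are generated in iptables mode; the function names below are hypothetical.

```python
from fractions import Fraction


def rule_probabilities(n):
    """Match probability of each KUBE-SEP-* jump rule, in evaluation order."""
    return [Fraction(1, n - i) for i in range(n)]


def endpoint_shares(n):
    """Overall share of new connections that each endpoint receives."""
    shares = []
    remaining = Fraction(1)          # probability that a packet reaches this rule
    for p in rule_probabilities(n):
        shares.append(remaining * p)
        remaining *= 1 - p           # packets that fall through to the next rule
    return shares


print(rule_probabilities(3))   # [Fraction(1, 3), Fraction(1, 2), Fraction(1, 1)]
print(endpoint_shares(3))      # [Fraction(1, 3), Fraction(1, 3), Fraction(1, 3)]
```

Exact fractions are used instead of floats so that the equal shares are visible without rounding noise.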

For DNAT, conntrack kicks in and tracks the connection state using a state machine. The state is needed because conntrack has to remember the destination address it changed, and change it back when the returning packet arrives. Iptables can also rely on the conntrack state (ctstate) to decide the fate of a packet. These four conntrack states are especially important:

  • NEW: conntrack knows nothing about this packet, which happens when the SYN packet is received.
  • ESTABLISHED: conntrack knows the packet belongs to an established connection, which happens after handshake is complete.
  • RELATED: The packet doesn’t belong to any connection, but it is affiliated to another connection, which is especially useful for protocols like FTP.
  • INVALID: Something is wrong with the packet, and conntrack doesn’t know how to deal with it. This state plays a central role in this Kubernetes issue.

Here is a diagram of how a TCP connection works between a pod and a service. The sequence of events is:

  • Client pod from left hand side sends a packet to a service:
  • The packet is going through iptables rules in client node and the destination is changed to pod IP,
  • Server pod handles the packet and sends back a packet with destination
  • The packet is going back to the client node, conntrack recognizes the packet and rewrites the source address back to
  • Client pod receives the response packet
Good packet flow

What caused the connection reset?

Enough of the background, so what really went wrong and caused the unexpected connection reset?

As the diagram below shows, the problem is packet 3. When conntrack cannot recognize a returning packet, it marks the packet as INVALID. The most common reasons are that conntrack cannot keep track of a connection because it is out of capacity, or that the packet itself is outside the TCP window. For packets that conntrack has marked as INVALID, there is no iptables rule to drop them, so they are forwarded to the client pod with the source IP address not rewritten (as shown in packet 4)! The client pod doesn’t recognize this packet because it has a different source IP, which is the pod IP, not the service IP. As a result, the client pod says, “Wait a second, I don’t recall this connection to this IP ever existed, why does this dude keep sending this packet to me?” So the client simply sends a RST packet to the server pod IP, which is packet 5. Unfortunately, this is a totally legit pod-to-pod packet, so it is delivered to the server pod. The server pod doesn’t know about the address translations that happened on the client side. From its view, packet 5 is a totally legit packet, like packets 2 and 3. All the server pod knows is, “Well, client pod doesn’t want to talk to me, so let’s close the connection!” Boom! Of course, for all of this to happen, the RST packet has to be legit too, with the right TCP sequence number, etc. But when it happens, both parties agree to close the connection.

Connection reset packet flow

How to address it?

Once we understand the root cause, the fix is not hard. There are at least 2 ways to address it.

  • Make conntrack more liberal with packets, so that it doesn’t mark them as INVALID. In Linux, you can do this with echo 1 > /proc/sys/net/ipv4/netfilter/ip_conntrack_tcp_be_liberal (on newer kernels the same switch is exposed as /proc/sys/net/netfilter/nf_conntrack_tcp_be_liberal).
  • Add an iptables rule that specifically drops packets marked as INVALID, so they never reach the client pod and cause harm.
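The failure mode and the effect of the second fix can be sketched as a toy model in Python. Everything here is an assumption made for illustration: the addresses are made up, the class and function names are hypothetical, and the `invalid` flag stands in for whatever caused conntrack to lose track of the packet. It is not how netfilter is implemented.

```python
SERVICE = ("10.0.0.10", 80)        # hypothetical ClusterIP:port
SERVER_POD = ("10.1.2.3", 8080)    # hypothetical endpoint behind the service
CLIENT_POD = ("10.1.1.5", 40000)   # hypothetical client pod


class ClientNode:
    """Toy model of DNAT plus conntrack on the client's node."""

    def __init__(self, drop_invalid):
        self.drop_invalid = drop_invalid
        self.conntrack = {}        # (client, server pod) -> original service address

    def outbound(self, src, dst):
        """DNAT service -> pod and remember the original destination."""
        if dst == SERVICE:
            self.conntrack[(src, SERVER_POD)] = SERVICE
            dst = SERVER_POD
        return src, dst

    def inbound(self, src, dst, invalid=False):
        """Return the packet as the client pod sees it, or None if dropped."""
        if invalid:
            if self.drop_invalid:
                return None        # the fix: INVALID packets never reach the pod
            return src, dst        # the bug: forwarded without reverse NAT
        original = self.conntrack.get((dst, src))
        return (original or src), dst


def client_reaction(packet):
    """What the client's TCP stack does with an incoming packet."""
    if packet is None:
        return "nothing received"
    src, _ = packet
    return "accept" if src == SERVICE else "send RST to %s:%d" % src


for drop in (False, True):
    node = ClientNode(drop_invalid=drop)
    node.outbound(CLIENT_POD, SERVICE)
    print(client_reaction(node.inbound(SERVER_POD, CLIENT_POD)),
          "|", client_reaction(node.inbound(SERVER_POD, CLIENT_POD, invalid=True)))
```

Without the drop rule the second reaction is a RST aimed at the server pod’s own address; with it, the stray packet simply disappears and TCP retransmission can recover.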

The fix is drafted (https://github.com/kubernetes/kubernetes/pull/74840), but unfortunately it didn’t catch the v1.14 release window. However, for the users that are affected by this bug, there is a way to mitigate the problem by applying the following rule in your cluster.

apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: startup-script
  labels:
    app: startup-script
spec:
  template:
    metadata:
      labels:
        app: startup-script
    spec:
      hostPID: true
      containers:
      - name: startup-script
        image: gcr.io/google-containers/startup-script:v1
        imagePullPolicy: IfNotPresent
        securityContext:
          privileged: true
        env:
        - name: STARTUP_SCRIPT
          value: |
            #! /bin/bash
            echo 1 > /proc/sys/net/ipv4/netfilter/ip_conntrack_tcp_be_liberal
            echo done


Obviously, the bug has existed almost forever. I am surprised that it hasn’t been noticed until recently. I believe the reasons could be: (1) this happens more in a congested server serving large payloads, which might not be a common use case; (2) the application layer handles the retry and so tolerates this kind of reset. Anyway, regardless of how fast Kubernetes has been growing, it’s still a young project. There is no secret other than listening closely to customers’ feedback and digging deep instead of taking anything for granted; that is how we can make it the best platform to run applications.


Istio monitoring explained

Nobody would be surprised if I say “Service Mesh” is a trending topic in the tech community these days. One of the most active projects in this area is Istio. It was jointly created by IBM, Google, and Lyft as a response to known problems with microservice architectures. Containers and Kubernetes greatly help with adopting a microservices architecture. However, at the same time, they bring a new set of problems we didn’t have before.

Nowadays, all our services use HTTP/gRPC APIs to communicate between themselves. In the old monolithic times, these were just function calls flowing through a single application. This means, in a microservice system, that there are a large number of interactions between services which makes observability, security, and monitoring harder.

There are already a lot of resources that explain what Istio looks like and how it works. I don’t want to repeat those here, so I am going to focus on one area – monitoring. The official documentation covers this, but understanding it took me some time. So in this tutorial, I will guide you through it so you can gain a deeper understanding of using Istio for monitoring tasks.

State of the art

One of the main reasons for choosing a service mesh is to improve observability. Up to now, developers had to instrument their applications to expose a series of metrics, often using a common library or a vendor’s agent like New Relic or Datadog. Afterwards, operators were able to scrape the application’s metric endpoints using a monitoring solution, getting a picture of how the system was behaving. But having to modify the code is a pain, especially when there are many changes or additions. And scaling this approach to multiple teams can make it hard to maintain.

The Istio approach is to expose and track application behaviour without touching a single line of code. This is achieved thanks to the ‘sidecar’ concept, which is a container that runs alongside our applications and supplies data to a central telemetry component. The sidecars can sniff a lot of information about the requests, thanks to being able to recognise the protocol being used (redis, mongo, http, grpc, etc.).

Mixer, the Swiss Army Knife

Let’s start by explaining the Mixer component: what it does and what benefits it brings to monitoring. In my opinion, the best way to define ‘Mixer’ is by visualizing it as an attribute processor. Every proxy in the mesh sends a different set of attributes, like request data or environment information, and ‘Mixer’ processes all this data and routes it to the right adapters.

An ‘adapter’ is a handler which is attached to the ‘Mixer’ and is in charge of adapting the attribute data for a backend. A backend can be any external service that is interested in this data, for example a monitoring tool (like Prometheus or Stackdriver), an authorization backend, or a logging stack.

A diagram of how Mixer works


One of the hardest things when entering the Istio world is getting familiar with the new terminology. Just when you think you’ve understood the entire Kubernetes glossary, you realize Istio adds more than fifty new terms to the arena!

Focussing on monitoring, let’s describe the most interesting concepts that will help us benefit from the mixer design:

  • Attribute: A piece of data that is processed by the mixer. Most of the time this comes from a sidecar but it can be produced by an adapter too. Attributes are used in the Instance to map the desired data to the backend.
  • Adapter: Logic embedded in the mixer component which manages the forwarding of data to a specific backend.
  • Handler: Configuration of an adapter. As an adapter can serve multiple use cases, the configuration is decoupled making it possible to run the same adapter with multiple settings.
  • Instance: The entity which binds the data coming from Istio to the adapter model. Istio has a unified set of attributes collected by its sidecar containers. This data has to be translated into the backend language.
  • Template: A common interface to define the instance templates. https://istio.io/docs/reference/config/policy-and-telemetry/templates/

Creating a new monitoring case

After defining all the concepts around Istio observability, the best way to embed it in our minds is with a real-world scenario.

For this exercise, I thought it would be useful to take advantage of Kubernetes label metadata to track the versions of our services. When you move to a microservice architecture, it is common to end up with multiple versions of your services (A/B testing, API versioning, etc.). The Istio sidecar sends all kinds of metadata from your cluster to the mixer. So in our example, we will leverage the deployment labels to identify the service version and observe the usage stats for each version.

For the sake of simplicity let’s take an existing project, the Google microservices demo project, and make some modifications to match our plan. This project simulates a microservice architecture composed of multiple components to build an e-commerce website.

First things first, let’s ensure the project runs correctly in our cluster with Istio. Let’s use the auto-injection feature to deploy all the components in a namespace and have the sidecar injected automatically by Istio.

$ kubectl label namespace mesh istio-injection=enabled

Warning: Ensure the mesh namespace is created beforehand and that your kubectl context points to it.

If you have a pod security policy enabled, you will need to configure some permissions for the init container in order to let it configure the iptables magic correctly. For testing purposes you can use:

$ kubectl create clusterrolebinding mesh --clusterrole cluster-admin --serviceaccount=mesh:default

This binds the default service account to the cluster admin role. Now we can deploy all the components using the all-in-one resources YAML document.

$ kubectl apply -f release/kubernetes-manifests.yaml

Now you should be able to see pods starting in the mesh namespace. Some of them will fail because the Istio resources are not yet added. For example, egress traffic will not be allowed and the currency component will fail. Apply these resources to fix the problem and expose the frontend component through the Istio ingress.

$ kubectl apply -f release/istio-manifests.yaml

Now, we can browse to see the frontend using the IP or domain supplied by your cloud provider (the frontend-external service is exposed via the cloud provider load balancer).

Now that our microservices application is running, let’s go a step further and configure one of the components to have multiple versions. As you can see in the microservices YAML, the deployment has a single label with the application name. If we want to manage canary deployments or run multiple versions of our app, we can add another label with the version.

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: currencyservice
spec:
  template:
    metadata:
      labels:
        app: currencyservice
        version: v1

After applying the changes to our cluster, we can duplicate the deployment, giving the copy a different name and a different version.

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: currencyservice2
spec:
  template:
    metadata:
      labels:
        app: currencyservice
        version: v2

And now submit it to the API again.

$ kubectl apply -f release/kubernetes-manifests.yaml

Note: Although we apply all the manifests again, only the ones that have changed will be updated by the API.

An attentive reader will have noticed that we used a trick here: the service selector only matches the app label, so the traffic is split evenly between the two versions.

From the ground to the sky

Now let’s add the magic. We will need to create three resources to expose the version as a new metric in Prometheus.

First, we’ll create the instance. Here we use the metric instance template to map the values provided by the sidecars to the adapter inputs. We are only interested in the workload name (source) and the version.

apiVersion: "config.istio.io/v1alpha2"
kind: metric
metadata:
  name: versioncount
  namespace: mesh
spec:
  value: "1"
  dimensions:
    source: source.workload.name | "unknown"
    version: destination.labels["version"] | "unknown"
  monitored_resource_type: '"UNSPECIFIED"'

Now let’s configure the adapter. In our case we want to connect the metric to a Prometheus backend. So in the handler configuration we’ll define the metric name and the type of value that the metric will serve to the backend (Prometheus DSL), as well as the label names it will use for the dimensions.

apiVersion: "config.istio.io/v1alpha2"
kind: prometheus
metadata:
  name: versionhandler
  namespace: mesh
spec:
  metrics:
  - name: version_count # Prometheus metric name
    instance_name: versioncount.metric.mesh # Mixer instance name (fully-qualified)
    kind: COUNTER # type of value served to Prometheus
    label_names:
    - source
    - version

Finally, we’ll need to link this particular handler with a specific instance (metric).

apiVersion: "config.istio.io/v1alpha2"
kind: rule
metadata:
  name: versionprom
  namespace: mesh
spec:
  match: destination.service == "currencyservice
  actions:
  - handler: versionhandler.prometheus
    instances:
    - versioncount.metric.mesh
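To make the division of labour between the three resources more tangible, here is a rough Python sketch of the data flow. It is only a mental model, not Mixer’s implementation: the function and class names are hypothetical, the attribute dictionaries are simplified stand-ins for what the sidecars report, and the rule match is reduced to a prefix check on the service name.

```python
from collections import Counter


def rule_matches(attributes):
    """Rule: decide whether a request is forwarded to the handler."""
    return attributes.get("destination.service", "").startswith("currencyservice")


def build_instance(attributes):
    """Instance: map mesh attributes to the metric's value and dimensions."""
    labels = attributes.get("destination.labels", {})
    return {
        "value": 1,
        "source": attributes.get("source.workload.name", "unknown"),
        "version": labels.get("version", "unknown"),
    }


class CounterHandler:
    """Handler: a configured adapter; this stand-in only counts per label set."""

    def __init__(self, label_names):
        self.label_names = label_names
        self.series = Counter()

    def handle(self, instance):
        key = tuple(instance[name] for name in self.label_names)
        self.series[key] += instance["value"]


def process(requests, handler):
    """Run every request's attributes through rule -> instance -> handler."""
    for attributes in requests:
        if rule_matches(attributes):
            handler.handle(build_instance(attributes))
    return dict(handler.series)


requests = [
    {"destination.service": "currencyservice", "source.workload.name": "frontend",
     "destination.labels": {"version": "v1"}},
    {"destination.service": "currencyservice", "source.workload.name": "frontend",
     "destination.labels": {"version": "v2"}},
    {"destination.service": "cartservice", "source.workload.name": "frontend"},
]
print(process(requests, CounterHandler(["source", "version"])))
# {('frontend', 'v1'): 1, ('frontend', 'v2'): 1}
```

Each distinct combination of label values ends up as its own counter, which mirrors how a metric with two labels shows up as several time series in Prometheus.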

Once those definitions are applied, Istio will instruct the Prometheus adapter to start collecting and serving the new metric. If we now search for the new metric in the Prometheus UI, we should be able to see something like:

Prometheus version's graph


Good observability in a microservice architecture is not easy. Istio can help to remove the complexity from developers and leave the work to the operator.

At the beginning it may be hard to deal with all the complexity added by a service mesh. But once you’ve tamed it, you’ll be able to standardize and automate your monitoring configuration and build a great observability system in record time.


An Introduction to Big Data Concepts

Gigantic amounts of data are being generated at high speeds by a variety of sources such as mobile devices, social media, machine logs, and the many sensors surrounding us. All around the world, we produce vast amounts of data, and the volume of generated data is growing at an unprecedented rate. The pace of data generation is being accelerated even further by the growth of new technologies and paradigms such as the Internet of Things (IoT).

What is Big Data and How Is It Changing?

The definition of big data is hidden in the dimensions of the data. Data sets are considered “big data” if they have a high degree of the following three distinct dimensions: volume, velocity, and variety. Value and veracity are two other “V” dimensions that have been added to the big data literature in recent years. Additional Vs are frequently proposed, but these five Vs are widely accepted by the community and can be described as follows:

  • Velocity: the speed at which the data is being generated
  • Volume: the amount of data that is being generated
  • Variety: the diversity or different types of the data
  • Value: the worth of the data or the value it has
  • Veracity: the quality, accuracy, or trustworthiness of the data

Large volumes of data are generally available in either structured or unstructured formats. Structured data can be generated by machines or humans, has a specific schema or model, and is usually stored in databases. Structured data is organized around schemas with clearly defined data types. Numbers, date time, and strings are a few examples of structured data that may be stored in database columns. Alternatively, unstructured data does not have a predefined schema or model. Text files, log files, social media posts, mobile data, and media are all examples of unstructured data.

Based on a report provided by Gartner, an international research and consulting organization, the application of advanced big data analytics is part of the Gartner Top 10 Strategic Technology Trends for 2019, and is expected to drive new business opportunities. The same report also predicts that more than 40% of data science tasks will be automated by 2020, which will likely require new big data tools and paradigms.

By 2017, global internet usage reached 47% of the world’s population based on an infographic provided by DOMO. This indicates that an increasing number of people are starting to use mobile phones and that more and more devices are being connected to each other via smart cities, wearable devices, Internet of Things (IoT), fog computing, and edge computing paradigms. As internet usage spikes and other technologies such as social media, IoT devices, mobile phones, autonomous devices (e.g. robotics, drones, vehicles, appliances, etc) continue to grow, our lives will become more connected than ever and generate unprecedented amounts of data, all of which will require new technologies for processing.

The Scale of Data Generated by Everyday Interactions

At a large scale, the data generated by everyday interactions is staggering. Based on research conducted by DOMO, for every minute in 2018, Google conducted 3,877,140 searches, YouTube users watched 4,333,560 videos, Twitter users sent 473,400 tweets, Instagram users posted 49,380 photos, Netflix users streamed 97,222 hours of video, and Amazon shipped 1,111 packages. This is just a small glimpse of a much larger picture involving other sources of big data. It seems like the internet is pretty busy, doesn’t it? Moreover, it is expected that mobile traffic will experience tremendous growth past its present numbers and that the world’s internet population is growing significantly year-over-year. By 2020, the report anticipates that 1.7MB of data will be created per person per second. Big data is getting even bigger.
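To get a feel for these rates, it helps to convert them to daily totals. The short Python sketch below does only that arithmetic on the figures quoted above; the function names are hypothetical, and the per-person estimate assumes decimal units (1 GB = 1,000 MB).

```python
MINUTES_PER_DAY = 24 * 60
SECONDS_PER_DAY = 24 * 60 * 60

# Per-minute figures for 2018 as quoted in the text
per_minute_2018 = {
    "Google searches": 3_877_140,
    "Tweets sent": 473_400,
    "Amazon packages shipped": 1_111,
}


def per_day(count_per_minute):
    """Scale a per-minute count to a full day."""
    return count_per_minute * MINUTES_PER_DAY


def megabytes_per_person_per_day(mb_per_second=1.7):
    """Daily volume implied by the 1.7 MB per person per second estimate."""
    return mb_per_second * SECONDS_PER_DAY


for name, count in per_minute_2018.items():
    print(f"{name}: {per_day(count):,} per day")

print(f"{megabytes_per_person_per_day() / 1000:.1f} GB per person per day")
```

The search figure alone works out to roughly 5.6 billion per day, and the per-person estimate to about 147 GB per day.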

At a small scale, the data generated on a daily basis by a small business, a startup company, or a single sensor such as a surveillance camera is also huge. For example, a typical IP camera in a surveillance system at a shopping mall or a university campus generates 15 frames per second and requires roughly 100 GB of storage per day. Consider the storage and computing requirements if the number of cameras is scaled to tens or hundreds.
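The camera example can be worked through with a couple of lines of Python. This is back-of-the-envelope arithmetic on the numbers above, assuming decimal units (1 GB = 10^9 bytes, 1 TB = 1,000 GB) and continuous recording; the function names are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def bytes_per_frame(gb_per_day=100, fps=15):
    """Average size of one frame implied by the daily storage figure."""
    return gb_per_day * 1e9 / (fps * SECONDS_PER_DAY)


def fleet_storage_tb_per_day(cameras, gb_per_day=100):
    """Daily storage needed for a number of identical cameras, in TB."""
    return cameras * gb_per_day / 1000


print(f"{bytes_per_frame() / 1000:.0f} KB per frame")
for cameras in (10, 100):
    print(f"{cameras} cameras: {fleet_storage_tb_per_day(cameras)} TB per day")
```

A single frame averages around 77 KB, yet a hundred such cameras already need about 10 TB of new storage every day.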

Big Data in the Scientific Community

Scientific projects such as CERN, which conducts research on what the universe is made of, also generate massive amounts of data. The Large Hadron Collider (LHC) at CERN is the world’s largest and most powerful particle accelerator. It consists of a 27-kilometer ring of superconducting magnets along with some additional structures to accelerate and boost the energy of particles along the way.

During the spin, particles collide with LHC detectors roughly 1 billion times per second, which generates around 1 petabyte of raw digital “collision event” data per second. This unprecedented volume of data is a great challenge that cannot be resolved with CERN’s current infrastructure. To work around this, the generated raw data is filtered and only the “important” events are processed to reduce the volume of data. Consider the challenging processing requirements for this task.

The four big LHC experiments, named ALICE, ATLAS, CMS, and LHCb, are among the biggest generators of data at CERN, and the rate of the data processed and stored on servers by these experiments is expected to reach about 25 GB/s (gigabytes per second). As of June 29, 2017, the CERN Data Center announced that they had passed the 200 petabytes milestone of data archived permanently in their storage units.

Why Big Data Tools are Required

The scale of the data generated by well-known corporations, small-scale organizations, and scientific projects is growing at an unprecedented rate. The scenarios above make this clear, and the scale of this data keeps getting bigger.

On the one hand, the mountain of data generated presents tremendous processing, storage, and analytics challenges that need to be carefully considered and handled. On the other hand, traditional Relational Database Management Systems (RDBMS) and data processing tools cannot manage this massive amount of data efficiently once its scale reaches terabytes or petabytes. Fortunately, big data tools and paradigms such as Hadoop and MapReduce are available to resolve these big data challenges.

Analyzing big data and gaining insights from it can help organizations make smart business decisions and improve their operations. This can be done by uncovering hidden patterns in the data and using them to reduce operational costs and increase profits. Because of this, big data analytics plays a crucial role for many domains such as healthcare, manufacturing, and banking by resolving data challenges and enabling them to move faster.

Big Data Analytics Tools

Since the compute, storage, and network requirements for working with large data sets are beyond the limits of a single computer, there is a need for paradigms and tools to crunch and process data through clusters of computers in a distributed fashion. More and more computing power and massive storage infrastructure are required for processing this massive data either on-premise or, more typically, at the data centers of cloud service providers.

In addition to the required infrastructure, various tools and components must be brought together to solve big data problems. The Hadoop ecosystem is just one of the platforms helping us work with massive amounts of data and discover useful patterns for businesses.

Below is a list of some of the tools available and a description of their roles in processing big data:

  • MapReduce: A distributed computing paradigm developed to process vast amounts of data in parallel by splitting a big task into smaller map and reduce oriented tasks.
  • HDFS: The Hadoop Distributed File System is a distributed storage and file system used by Hadoop applications.
  • YARN: The resource management and job scheduling component in the Hadoop ecosystem.
  • Spark: An in-memory data processing framework that supports both batch and near-real-time workloads.
  • Pig/Hive: Scripting and SQL-like querying tools for data processing that simplify the complexity of MapReduce programs.
  • HBase, MongoDB, Elasticsearch: Examples of a few NoSQL databases.
  • Mahout, Spark ML: Tools for running scalable machine learning algorithms in a distributed fashion.
  • Flume, Sqoop, Logstash: Data integration and ingestion of structured and unstructured data.
  • Kibana: A tool to visualize Elasticsearch data.
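To illustrate the map and reduce split mentioned in the first item of the list, here is the classic word count written as plain Python. It runs in a single process and only mimics the three phases; a real MapReduce framework would distribute the map and reduce calls across a cluster. The function names are hypothetical.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


print(word_count(["big data tools", "Big data"]))
# {'big': 2, 'data': 2, 'tools': 1}
```

Because every map call looks at only one document and every reduce call at only one key, both phases can be spread over many machines without coordination beyond the shuffle.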


To summarize, we are generating a massive amount of data in our everyday life, and that number is continuing to rise. Having the data alone does not improve an organization without analyzing and discovering its value for business intelligence. It is not possible to mine and process this mountain of data with traditional tools, so we use big data pipelines to help us ingest, process, analyze, and visualize these tremendous amounts of data.

Running Kubernetes locally on Linux with Minikube – now with Kubernetes 1.14 support

    A few days ago, the Kubernetes community announced Kubernetes 1.14, the most recent version of Kubernetes. Alongside it, Minikube, a part of the Kubernetes project, recently hit the 1.0 milestone, which supports Kubernetes 1.14 by default.

    Kubernetes is a real winner (and a de facto standard) in the world of distributed Cloud Native computing. While it can handle up to 5000 nodes in a single cluster, local deployment on a single machine (e.g. a laptop, a developer workstation, etc.) is an increasingly common scenario for using Kubernetes.

    This is post #1 in a series about the local deployment options on Linux, and it will cover Minikube, the most popular community-built solution for running Kubernetes on a local machine.

    Minikube is a cross-platform, community-driven Kubernetes distribution, which is targeted to be used primarily in local environments. It deploys a single-node cluster, which is an excellent option for having a simple Kubernetes cluster up and running on localhost.

    Minikube is designed to be used as a virtual machine (VM), and the default VM runtime is VirtualBox. At the same time, extensibility is one of the critical benefits of Minikube, so it’s possible to use it with drivers outside of VirtualBox.

    VirtualBox, the default runtime for the Minikube virtual machine, is a cross-platform solution which can be used on a variety of operating systems, including GNU/Linux, Windows, and macOS.

    At the same time, QEMU/KVM is a Linux-native virtualization solution, which may offer benefits compared to VirtualBox. For example, it’s much easier to use KVM on a GNU/Linux server, so you can run a single-node Minikube cluster not only on a Linux workstation or laptop with a GUI, but also on a remote headless server.

    Unfortunately, VirtualBox and KVM can’t be used simultaneously, so if you are already running KVM workloads on a machine and want to run Minikube there as well, using the KVM Minikube driver is the preferred way to go.

    In this guide, we’ll focus on running Minikube with the KVM driver on Ubuntu 18.04 (I am using a bare metal machine running on packet.com.)

    Minikube architecture (source: kubernetes.io)


    This is not an official guide to Minikube. You may find detailed information on running and using Minikube on its official webpage, where different use cases, operating systems, environments, etc. are covered. Instead, the purpose of this guide is to provide clear and easy guidelines for running Minikube with KVM on Linux.


    • Any Linux you like (in this tutorial we’ll use Ubuntu 18.04 LTS, and all the instructions below are applicable to it. If you prefer using a different Linux distribution, please check out the relevant documentation)
    • libvirt and QEMU-KVM installed and properly configured
    • The Kubernetes CLI (kubectl) for operating the Kubernetes cluster

    QEMU/KVM and libvirt installation

    NOTE: skip if already installed

    Before we proceed, we have to verify if our host can run KVM-based virtual machines. This can be easily checked using the kvm-ok tool, available on Ubuntu.

    sudo apt install cpu-checker && sudo kvm-ok

    If you receive the following output after running kvm-ok, you can use KVM on your machine (otherwise, please check out your configuration):

    $ sudo kvm-ok
    INFO: /dev/kvm exists
    KVM acceleration can be used
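If kvm-ok is not available, a rough approximation of the same check can be sketched in Python. This is an assumption-laden simplification rather than the tool’s actual logic: it only looks for the vmx (Intel VT-x) or svm (AMD-V) CPU flags in /proc/cpuinfo and for the /dev/kvm device, and the function names are hypothetical.

```python
import os


def cpu_supports_virtualization(cpuinfo_text):
    """True if any 'flags' line lists vmx (Intel VT-x) or svm (AMD-V)."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags") and ":" in line:
            flags = line.split(":", 1)[1].split()
            if "vmx" in flags or "svm" in flags:
                return True
    return False


def kvm_usable(cpuinfo_path="/proc/cpuinfo", device="/dev/kvm"):
    """Combine the CPU flag check with the presence of the KVM device node."""
    try:
        with open(cpuinfo_path) as cpuinfo:
            text = cpuinfo.read()
    except OSError:
        return False
    return cpu_supports_virtualization(text) and os.path.exists(device)


print("KVM looks usable" if kvm_usable() else "KVM does not look usable")
```

A positive result here does not guarantee that virtual machines will start, so the virt-host-validate step further down is still worth running.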

    Now let’s install KVM and libvirt and add our current user to the libvirt group to grant sufficient permissions:

    sudo apt install libvirt-clients libvirt-daemon-system qemu-kvm \
        && sudo usermod -a -G libvirt $(whoami) \
        && newgrp libvirt

    After installing libvirt, you may verify that the host is able to run virtual machines with the virt-host-validate tool, which is a part of libvirt.

    sudo virt-host-validate

    kubectl (Kubernetes CLI) installation

    NOTE: skip if already installed

    In order to manage the Kubernetes cluster, we need to install kubectl, the Kubernetes CLI tool.

    The recommended way to install it on Linux is to download the pre-built binary and move it to a directory under the $PATH.

    curl -LO https://storage.googleapis.com/kubernetes-release/release/$(curl -s https://storage.googleapis.com/kubernetes-release/release/stable.txt)/bin/linux/amd64/kubectl \
        && sudo install kubectl /usr/local/bin && rm kubectl

    Alternatively, kubectl can be installed in a variety of other ways (e.g. as a .deb or snap package; check out the kubectl documentation to find the best one for you).

    Minikube installation

    Minikube KVM driver installation

    A VM driver is an essential requirement for local deployment of Minikube. As we’ve chosen to use KVM as the Minikube driver in this tutorial, let’s install the KVM driver with the following command:

    curl -LO https://storage.googleapis.com/minikube/releases/latest/docker-machine-driver-kvm2 \
        && sudo install docker-machine-driver-kvm2 /usr/local/bin/ && rm docker-machine-driver-kvm2

    Minikube installation

    Now let’s install Minikube itself:

    curl -LO https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64 \
        && sudo install minikube-linux-amd64 /usr/local/bin/minikube && rm minikube-linux-amd64

    Verify the Minikube installation

    Before we proceed, we need to verify that Minikube is correctly installed. The simplest way to do this is to check Minikube’s version.

    minikube version

    To use the KVM2 driver:

    Now let’s run the local Kubernetes cluster with Minikube and KVM:

    minikube start --vm-driver kvm2

    Set KVM2 as a default VM driver for Minikube

    If KVM is the only driver used for Minikube on our machine, it’s more convenient to set it as the default driver and run Minikube with fewer command-line arguments. The following command sets the KVM driver as the default:

    minikube config set vm-driver kvm2

    So now let’s run Minikube as usual:

    minikube start

    Verify the Kubernetes installation

    Let’s check if the Kubernetes cluster is up and running:

    kubectl get nodes

    Now let’s run a simple sample app (nginx in our case):

    kubectl create deployment nginx --image=nginx

    Let’s also check that the Kubernetes pods are correctly provisioned:

    kubectl get pods



    Next steps

    At this point, a Kubernetes cluster with Minikube and KVM is adequately set up and configured on your local machine.

    To proceed, you may check out the Kubernetes tutorials on the project website:

    It’s also worth checking out the “Introduction to Kubernetes” course by The Linux Foundation/Cloud Native Computing Foundation, available for free on EDX:


Install OpenShift in a container with Weave Footloose

In this tutorial we will install OpenShift in a container using a new tool called footloose by Weaveworks.

Footloose is a tool built by Weaveworks which builds and runs a container with systemd installed. It can be created in a similar way to a VM but without the overheads.

I wrote this tutorial because I wanted a light-weight environment for testing the OpenFaaS project on OpenShift Origin 3.10. An alternative distribution for testing is Minishift which also allows you to run OpenShift locally, but in a much more heavy-weight VM.

Install Footloose

You can use a Linux or macOS host for this tutorial. ARM and Raspberry Pi are not supported.

  • Install Footloose

Follow the instructions on the official website:


  • Create a config
cluster:
  name: cluster
  privateKey: cluster-key
machines:
- count: 1
  spec:
    image: quay.io/footloose/centos7:0.3.0
    name: os%d
    privileged: true
    portMappings:
    - containerPort: 22
    - containerPort: 8443
      hostPort: 8443
    - containerPort: 53
      hostPort: 53
    - containerPort: 443
      hostPort: 443
    - containerPort: 80
      hostPort: 80
    volumes:
    - type: volume
      destination: /var/lib/docker


Note the additional ports: 8443 and 53 are used by OpenShift Origin, while 80 and 443 are bound for exposing your projects.

If you already have services bound to 80/443 then you can comment out these lines.

  • Start the CentOS container
footloose create
  • Start a root shell
footloose ssh root@os0

Configure Docker

  • Install and start Docker
yum check-update
curl -fsSL https://get.docker.com/ | sh

Instructions from: docker.com

  • Add an insecure registry

Find the subnet:

# ifconfig eth0
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet  netmask  broadcast
  • Create /etc/docker/daemon.json
mkdir -p /etc/docker

cat > /etc/docker/daemon.json <<EOF
{
   "insecure-registries": [
     "<subnet found above, in CIDR notation>"
   ]
}
EOF
  • Now enable / start Docker
systemctl daemon-reload \
 && systemctl enable docker \
 && systemctl start docker

Install OpenShift

  • Grab the OpenShift client tools

Find the latest URL from: https://www.okd.io/download.html

wget https://github.com/openshift/origin/releases/download/v3.11.0/openshift-origin-client-tools-v3.11.0-0cbc58b-linux-64bit.tar.gz \
  && tar -xvf openshift-origin-client-tools-v3.11.0-0cbc58b-linux-64bit.tar.gz \
  && rm -rf openshift-origin-client-tools-v3.11.0-0cbc58b-linux-64bit.tar.gz \
  && mv open* openshift
  • Make oc available via PATH
export PATH=$PATH:`pwd`/openshift
  • Authenticate to the Docker hub
docker login
  • Install OpenShift
oc cluster up --skip-registry-check=true

This will take a few minutes

If you see an error / timeout at run_self_hosted.go:181] Waiting for the kube-apiserver to be ready, then run the command again until it passes.

When done you’ll see this output:

Login to server ...
Creating initial project "myproject" ...

Server Information ...
OpenShift server started.

The server is accessible via web console at:

You are logged in as:
    User:     developer
    Password: <any value>

To login as administrator:
    oc login -u system:admin

You can now install the oc tool on your host machine, or access the web console from the host.


Test your OpenShift cluster

Let’s install OpenFaaS, which makes Serverless Functions Simple through the use of Docker images and Kubernetes. OpenShift is effectively a distribution of Kubernetes, so with some testing and tweaking everything should work almost out of the box.

OpenFaaS supports microservices, functions, scale to zero, source to URL and much more. Today we’ll try out one of the sample functions from the Function Store to check when an SSL certificate will expire.

  • Install OpenFaaS
oc login -u system:admin

oc adm new-project openfaas
oc adm new-project openfaas-fn

oc apply -f https://raw.githubusercontent.com/openfaas/faas-netes/master/yaml/alertmanager-cfg.yml
oc apply -f https://raw.githubusercontent.com/openfaas/faas-netes/master/yaml/alertmanager-dep.yml
oc apply -f https://raw.githubusercontent.com/openfaas/faas-netes/master/yaml/alertmanager-svc.yml
oc apply -f https://raw.githubusercontent.com/openfaas/faas-netes/master/yaml/gateway-dep.yml
oc apply -f https://raw.githubusercontent.com/openfaas/faas-netes/master/yaml/gateway-svc.yml
oc apply -f https://raw.githubusercontent.com/openfaas/faas-netes/master/yaml/nats-dep.yml
oc apply -f https://raw.githubusercontent.com/openfaas/faas-netes/master/yaml/nats-svc.yml
oc apply -f https://raw.githubusercontent.com/openfaas/faas-netes/master/yaml/prometheus-cfg.yml
oc apply -f https://raw.githubusercontent.com/openfaas/faas-netes/master/yaml/prometheus-dep.yml

oc apply -f https://raw.githubusercontent.com/openfaas/faas-netes/master/yaml/prometheus-rbac.yml
oc apply -f https://raw.githubusercontent.com/openfaas/faas-netes/master/yaml/prometheus-svc.yml
oc apply -f https://raw.githubusercontent.com/openfaas/faas-netes/master/yaml/queueworker-dep.yml
oc apply -f https://raw.githubusercontent.com/openfaas/faas-netes/master/yaml/rbac.yml

Now let’s create a route for the gateway:

cat > route.yaml << EOF
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: openfaas
  namespace: openfaas
spec:
  host: footloose-gateway.com
  to:
    kind: Service
    name: gateway
    weight: 100
  wildcardPolicy: None
  tls:
    termination: edge
EOF

oc apply -f route.yaml

Add an entry for footloose-gateway.com to /etc/hosts on the host.

Access the OpenFaaS UI at: https://footloose-gateway.com/


  • Install the CLI and deploy a function
export OPENFAAS_URL=https://footloose-gateway.com

faas-cli store deploy --tls-no-verify certinfo

Deployed. 202 Accepted.
URL: https://footloose-gateway.com/function/certinfo

Once the function shows Ready in the OpenFaaS UI invoke it:

export OPENFAAS_URL=https://footloose-gateway.com

echo -n www.openfaas.com | faas-cli invoke --tls-no-verify certinfo

Port 443
Issuer Let's Encrypt Authority X3
CommonName www.openfaas.com
NotBefore 2019-03-21 12:21:00 +0000 UTC
NotAfter 2019-06-19 12:21:00 +0000 UTC
NotAfterUnix 1560946860
SANs [www.openfaas.com]
TimeRemaining 2 months from now

You can grant your “developer” user access to see the openfaas / openfaas-fn projects through the following command:

oc adm policy add-cluster-role-to-user  cluster-reader developer

Here we are inspecting the Pod created by OpenFaaS for the certinfo function:



If you want to remove the OpenShift cluster you can run: footloose delete in the directory on the host.

Wrapping up

We’ve installed a functional OpenShift Origin cluster into a container and run it on a machine where the only requirement is to have Docker present. It should have taken us around 5 minutes. Once complete we deployed a production-grade application and were able to test workloads.

Whether you use minishift, Vagrant (see the tutorial by Liz Rice), or footloose using this tutorial, testing your application on OpenShift has never been easier.

I want to give acknowledgements to Dale Bingham from Spalding Consulting and Michael Schendel from DESI for helping test and port OpenFaaS to OpenShift. This mainly involved a small patch to add an emptyDir volume for Prometheus.

What’s next?

I’ll continue to work with Dale and Michael to create a dedicated documentation page for installing OpenFaaS on OpenShift. We’ll also be testing the helm chart and all other OpenFaaS features on OpenShift Origin, such as scale-to-zero and, if there is interest, OpenFaaS Cloud.

Note: when using the helm chart authentication is enabled by default – just run faas-cli login.

Damien, the author of Footloose, is looking into how the Footloose tool could be used with a script or provisioning file to carry out all the steps of this tutorial in one single step. If you’d like to help him, check out his project at: https://github.com/weaveworks/footloose

If you’re an OpenShift user or expert, or just want to help out, please join us on Slack.


Kubernetes 1.14: Production-level support for Windows Nodes, Kubectl Updates, Persistent Local Volumes GA

  • We’re pleased to announce the delivery of Kubernetes 1.14, our first release of 2019!

    Kubernetes 1.14 consists of 31 enhancements: 10 moving to stable, 12 in beta, and 7 net new. The main themes of this release are extensibility and supporting more workloads on Kubernetes with three major features moving to general availability, and an important security feature moving to beta.

    More enhancements graduated to stable in this release than any prior Kubernetes release. This represents an important milestone for users and operators in terms of setting support expectations. In addition, there are notable Pod and RBAC enhancements in this release, which are discussed in the “additional notable features” section below.

    Let’s dive into the key features of this release:

    Production-level Support for Windows Nodes

    Up until now Windows Node support in Kubernetes has been in beta, allowing many users to experiment and see the value of Kubernetes for Windows containers. Kubernetes now officially supports adding Windows nodes as worker nodes and scheduling Windows containers, enabling a vast ecosystem of Windows applications to leverage the power of our platform. Enterprises with investments in Windows-based applications and Linux-based applications don’t have to look for separate orchestrators to manage their workloads, leading to increased operational efficiencies across their deployments, regardless of operating system.

    Some of the key features of enabling Windows containers in Kubernetes include:

    • Support for Windows Server 2019 for worker nodes and containers
    • Support for out of tree networking with Azure-CNI, OVN-Kubernetes, and Flannel
    • Improved support for pods, service types, workload controllers, and metrics/quotas to closely match the capabilities offered for Linux containers

    Notable Kubectl Updates

    New Kubectl Docs and Logo

    The documentation for kubectl has been rewritten from the ground up with a focus on managing Resources using declarative Resource Config. The documentation has been published as a standalone site with the format of a book, and it is linked from the main k8s.io documentation (available at https://kubectl.docs.kubernetes.io).

    The new kubectl logo and mascot (pronounced kubee-cuddle) are shown on the new docs site logo.

    Kustomize Integration

    The declarative Resource Config authoring capabilities of kustomize are now available in kubectl through the -k flag (e.g. for commands like apply, get) and the kustomize subcommand. Kustomize helps users author and reuse Resource Config using Kubernetes native concepts. Users can now apply directories with kustomization.yaml to a cluster using kubectl apply -k dir/. Users can also emit customized Resource Config to stdout without applying them via kubectl kustomize dir/. The new capabilities are documented in the new docs at https://kubectl.docs.kubernetes.io

    The kustomize subcommand will continue to be developed in the Kubernetes-owned kustomize repo. The latest kustomize features will be available from a standalone kustomize binary (published to the kustomize repo) at a frequent release cadence, and will be updated in kubectl prior to each Kubernetes release.

    kubectl Plugin Mechanism Graduating to Stable

    The kubectl plugin mechanism allows developers to publish their own custom kubectl subcommands in the form of standalone binaries. This may be used to extend kubectl with new higher-level functionality and with additional porcelain (e.g. adding a set-ns command).

    Plugins must have the kubectl- name prefix and exist on the user’s $PATH. The plugin mechanics have been simplified significantly for GA, and are similar to the git plugin system.
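
To make the naming rule concrete, here is a minimal Python sketch of the discovery idea: scan the directories on $PATH for executable files whose names start with kubectl-. The helper name find_kubectl_plugins and the demo file names are hypothetical, and this is not kubectl's own implementation, only an illustration of the convention described above.

```python
import os
import stat
import tempfile

PLUGIN_PREFIX = "kubectl-"  # naming convention from the release notes


def find_kubectl_plugins(path_dirs):
    """Return {subcommand: path} for executables named kubectl-* in path_dirs.

    Hypothetical helper for illustration only. Earlier directories win,
    mirroring ordinary $PATH precedence.
    """
    plugins = {}
    for directory in path_dirs:
        if not os.path.isdir(directory):
            continue
        for entry in sorted(os.listdir(directory)):
            full = os.path.join(directory, entry)
            if not entry.startswith(PLUGIN_PREFIX):
                continue
            if not (os.path.isfile(full) and os.access(full, os.X_OK)):
                continue  # plugins must be executable
            plugins.setdefault(entry[len(PLUGIN_PREFIX):], full)
    return plugins


def demo():
    """Create one executable plugin and one non-executable file, then scan."""
    with tempfile.TemporaryDirectory() as tmp:
        plugin = os.path.join(tmp, "kubectl-hello")
        with open(plugin, "w") as f:
            f.write("#!/bin/sh\necho hello from a plugin\n")
        os.chmod(plugin, os.stat(plugin).st_mode | stat.S_IXUSR)
        with open(os.path.join(tmp, "kubectl-ignored"), "w") as f:
            f.write("not executable\n")
        return sorted(find_kubectl_plugins([tmp]))
```

On a real machine the function would be called with os.environ["PATH"].split(os.pathsep); a plugin found as kubectl-hello would then be invoked as kubectl hello.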

    Persistent Local Volumes are Now GA

    This feature, graduating to stable, makes locally attached storage available as a persistent volume source. Distributed file systems and databases are the primary use cases for persistent local storage due to performance and cost. On cloud providers, local SSDs give better performance than remote disks. On bare metal, in addition to performance, local storage is typically cheaper and using it is a necessity to provision distributed file systems.

    PID Limiting is Moving to Beta

    Process IDs (PIDs) are a fundamental resource on Linux hosts. It is trivial to hit the task limit without hitting any other resource limits and cause instability to a host machine. Administrators require mechanisms to ensure that user pods cannot induce PID exhaustion that prevents host daemons (runtime, kubelet, etc) from running. In addition, it is important to ensure that PIDs are limited among pods in order to ensure they have limited impact to other workloads on the node.

    Administrators are able to provide pod-to-pod PID isolation by defaulting the number of PIDs per pod as a beta feature. In addition, administrators can enable node-to-pod PID isolation as an alpha feature by reserving a number of allocatable PIDs to user pods via node allocatable. The community hopes to graduate this feature to beta in the next release.

    Additional Notable Feature Updates

    Pod priority and preemption enables the Kubernetes scheduler to schedule more important Pods first and, when the cluster is out of resources, to remove less important Pods to make room for more important ones. The importance is specified by priority.
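
The idea can be illustrated with a toy Python model of a single node. Everything here (the Pod fields, the one-dimensional CPU capacity, and the rule of evicting the lowest-priority Pods first) is a simplifying assumption for illustration and not the actual scheduler algorithm.

```python
from dataclasses import dataclass


@dataclass
class Pod:
    name: str
    priority: int
    cpu: int  # requested CPU units (toy resource model)


def schedule_with_preemption(running, incoming, capacity):
    """Try to place `incoming` on one node with `capacity` CPU units.

    Returns (new_running, evicted). Lower-priority Pods are evicted, lowest
    priority first, only if that frees enough room; otherwise nothing changes
    and the incoming Pod would stay pending.
    """
    used = sum(p.cpu for p in running)
    if used + incoming.cpu <= capacity:
        return running + [incoming], []
    victims = []
    candidates = sorted(
        (p for p in running if p.priority < incoming.priority),
        key=lambda p: p.priority,
    )
    for pod in candidates:
        victims.append(pod)
        used -= pod.cpu
        if used + incoming.cpu <= capacity:
            remaining = [p for p in running if p not in victims]
            return remaining + [incoming], victims
    return running, []


running = [Pod("batch", priority=1, cpu=2), Pod("web", priority=100, cpu=2)]
placed, evicted = schedule_with_preemption(
    running, Pod("critical", priority=1000, cpu=2), capacity=4
)
```

In this example the node is full, so the low-priority batch Pod is removed to make room for the critical one, while web keeps running.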

    Pod Readiness Gates introduce an extension point for external feedback on pod readiness.

    Harden the default RBAC discovery clusterrolebindings removes discovery from the set of APIs which allow for unauthenticated access by default, improving privacy for CRDs and the default security posture of default clusters in general.


    Kubernetes 1.14 is available for download on GitHub. To get started with Kubernetes, check out these interactive tutorials. You can also easily install 1.14 using kubeadm.

    Features Blog Series

    If you’re interested in exploring these features more in depth, check back next week for our 5 Days of Kubernetes series where we’ll highlight detailed walkthroughs of the following features:

    • Day 1 – Windows Server Containers
    • Day 2 – Harden the default RBAC discovery clusterrolebindings
    • Day 3 – Pod Priority and Preemption in Kubernetes
    • Day 4 – PID Limiting
    • Day 5 – Persistent Local Volumes

    Release Team

    This release is made possible through the efforts of hundreds of individuals who contributed both technical and non-technical content. Special thanks to the release team led by Aaron Crickenberger, Senior Test Engineer at Google. The 43 individuals on the release team coordinated many aspects of the release, from documentation to testing, validation, and feature completeness.

    As the Kubernetes community has grown, our release process represents an amazing demonstration of collaboration in open source software development. Kubernetes continues to gain new users at a rapid clip. This growth creates a positive feedback cycle where more contributors commit code creating a more vibrant ecosystem. Kubernetes has had over 28,000 individual contributors to date and an active community of more than 57,000 people.

    Project Velocity

    The CNCF has continued refining DevStats, an ambitious project to visualize the myriad contributions that go into the project. K8s DevStats illustrates the breakdown of contributions from major company contributors, as well as an impressive set of preconfigured reports on everything from individual contributors to pull request lifecycle times. On average over the past year, 381 different companies and over 2,458 individuals contribute to Kubernetes each month. Check out DevStats to learn more about the overall velocity of the Kubernetes project and community.

    User Highlights

    Established, global organizations are using Kubernetes in production at massive scale. Recently published user stories from the community include:

    Is Kubernetes helping your team? Share your story with the community.

    Ecosystem Updates


The world’s largest Kubernetes gathering, KubeCon + CloudNativeCon is coming to Barcelona from May 20-23, 2019 and Shanghai (co-located with Open Source Summit) from June 24-26, 2019. These conferences will feature technical sessions, case studies, developer deep dives, salons, and more! Register today!


Join members of the Kubernetes 1.14 release team on April 23rd at 10am PDT to learn about the major features in this release. Register here.

Get Involved

The simplest way to get involved with Kubernetes is by joining one of the many Special Interest Groups (SIGs) that align with your interests. Have something you’d like to broadcast to the Kubernetes community? Share your voice at our weekly community meeting, and through the channels below.

Thank you for your continued feedback and support.


Advancing Windows Containers with Docker and Kubernetes

Today, the Cloud Native Computing Foundation (CNCF) announced Kubernetes 1.14, which includes support for Windows nodes. Kubernetes supporting Windows is a monumental step for the industry and it further confirms the work Docker has been doing with Microsoft to develop Windows containers over the past five years. It is evidence that containers are not just for Linux; Windows and .NET applications represent an important and sizeable footprint of applications that can benefit from both the Docker platform and Kubernetes.

Docker’s collaboration with Microsoft started five years ago. Today, every version of Windows Server 2016 and later ships with the Docker Engine – Enterprise. In addition, to facilitate a great user experience with Windows containers, Microsoft publishes more than 129 Windows container images of its popular software on Docker Hub. Many Docker Enterprise customers are already running mixed Windows and Linux containers with Swarm, and an upcoming release of Docker Enterprise will allow our customers to expand their Windows options to Kubernetes as well. Today both Docker Enterprise and Docker Desktop users have found that the easiest way to use and manage Kubernetes is with Docker and now these users will have the same benefits with Windows containers as well.

gMSA Support in Kubernetes for Active Directory-Authenticated Applications

In the time since that first Windows Server release, we’ve been working closely with customers to understand how to make containerized Windows applications enterprise-ready for production. Our five years of experience shows us that runtime support for Windows-based containers is only one component of what enterprises require to make Windows containers with Kubernetes operational in their environment.

An additional requirement is support for configuring containerized workloads with domain credentials and identity in an Active Directory environment. Initially we worked with Microsoft to plumb in gMSA credential support for individual containers running on Docker Engine in Windows. Next, we implemented orchestrator-wide support for gMSA credentials in Swarm. Using our experience so far, we are leading the design and implementation of gMSA support for Windows workloads in Kubernetes, delivering alpha-level gMSA support in Kubernetes 1.14.

Many Windows-based applications are run on domain joined hosts and use Service Accounts managed by Active Directory to access other resources and services in the domain. Windows containers, however, are not full-fledged domain joined objects. Instead, Windows containers can run with a special type of service account introduced in Windows Server 2012 called group managed service accounts (gMSA). Windows uses credentials associated with a gMSA (in lieu of individual computer accounts) to enable containerized Windows applications to access other services in an Active Directory domain.

Docker, in collaboration with Microsoft and the Kubernetes community, is working to add support for gMSA in Kubernetes. This encompasses: (1) custom resources to configure credential specs for gMSAs in a cluster-wide fashion, (2) authorizing and resolving references to the gMSA credential specs from pod specifications, and (3) passing down the full gMSA credential spec JSON to the container runtime (like Docker Engine). This feature is in alpha with Kubernetes 1.14 and you can find more about its design and implementation here. We invite you to test it out and contribute to the effort, which will help to further expand the types of applications that can be run in containers. We will continue to work with the community to ensure this reaches general availability.

Kubernetes and Windows: What’s Next

Windows admins and users also need overlay networking and dynamically provisioned storage to become enterprise-ready, and we are also working with the Kubernetes community in these areas and will have more to discuss and demonstrate at DockerCon 2019 in San Francisco. We look forward to sharing some of the progress here and hope to see you there!

To learn more about Docker and Windows containers:



Grafana Logging using Loki

Loki is a Prometheus-inspired logging service for cloud native infrastructure.

What is Loki?

Open sourced by Grafana Labs during KubeCon Seattle 2018, Loki is a logging backend optimized for users running Prometheus and Kubernetes with great logs search and visualization in Grafana 6.0.

Loki was built for efficiency alongside the following goals:

  • Logs should be cheap. Nobody should be asked to log less.
  • Easy to operate and scale.
  • Metrics, logs (and traces later) need to work together.

Loki vs other logging solutions

Loki is designed for efficiency and to work well in a Kubernetes context in combination with Prometheus metrics.

The idea is to switch easily between metrics and logs based on Kubernetes labels you already use with Prometheus.

Unlike most logging solutions, Loki does not parse incoming logs or do full-text indexing.

Instead, it indexes and groups log streams using the same labels you’re already using with Prometheus. This makes it significantly more efficient to scale and operate.
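
A toy Python model may help show what indexing by labels only means in practice. The class and method names below are hypothetical and bear no relation to Loki's real storage code; the point is simply that lookups go through the label set and the log text itself is never indexed.

```python
from collections import defaultdict


class LabelIndexedLogStore:
    """Toy model: index only the label sets, never the log text itself."""

    def __init__(self):
        # frozenset of (label, value) pairs -> list of raw log lines
        self._streams = defaultdict(list)

    def push(self, labels, line):
        """Append a line to the stream identified by its full label set."""
        self._streams[frozenset(labels.items())].append(line)

    def select(self, **matchers):
        """Return lines from every stream whose labels include all matchers."""
        wanted = set(matchers.items())
        result = []
        for key, lines in self._streams.items():
            if wanted <= key:
                result.extend(lines)
        return result


store = LabelIndexedLogStore()
store.push({"namespace": "kube-system", "app": "coredns"}, "query ok")
store.push({"namespace": "default", "app": "nginx"}, "GET / 200")
store.push({"namespace": "default", "app": "nginx"}, "GET /missing 404")
nginx_lines = store.select(app="nginx")
```

Because only the small set of label combinations is indexed, the index stays tiny compared with a full-text index; searching inside the selected lines is then a plain scan.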

Loki components

Loki is a TSDB (time-series database). It stores logs as split and gzipped chunks of data.

The logs are ingested via the API and an agent, called Promtail (Tailing logs in Prometheus format), will scrape Kubernetes logs and add label metadata before sending it to Loki.

This metadata addition is exactly the same as Prometheus, so you will end up with the exact same labels for your resources.
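
As a rough illustration of "split and gzipped chunks", the following Python sketch cuts a stream of log lines into fixed-size groups and compresses each one. The chunking rule (a fixed number of lines per chunk) and the function names are assumptions made for the example, not Loki's actual chunk format.

```python
import gzip


def to_chunks(lines, max_lines_per_chunk):
    """Split log lines into groups and gzip each group independently."""
    chunks = []
    for start in range(0, len(lines), max_lines_per_chunk):
        block = "\n".join(lines[start:start + max_lines_per_chunk])
        chunks.append(gzip.compress(block.encode("utf-8")))
    return chunks


def from_chunks(chunks):
    """Decompress chunks back into the original list of lines."""
    lines = []
    for chunk in chunks:
        lines.extend(gzip.decompress(chunk).decode("utf-8").split("\n"))
    return lines


log_lines = [f"request {i} served" for i in range(10)]
chunks = to_chunks(log_lines, max_lines_per_chunk=4)
```

Each chunk can be stored and fetched on its own, so a query only needs to decompress the chunks belonging to the selected streams and time range.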


How to deploy Loki on your Kubernetes cluster

  1. Deploy Loki on your cluster

The easiest way to deploy Loki on your Kubernetes cluster is by using the Helm chart available in the official repository.

You can follow the setup guide from the official repo.

This will deploy Loki and Promtail.

  2. Add the Loki data source in Grafana (built-in support for Loki is in 6.0 and newer releases)
    1. Log in to your Grafana instance.
    2. Go to Configuration > Data Sources via the cog icon in the left sidebar.
    3. Click the big + Add data source button.
    4. Choose Loki from the list.
    5. The http URL field should be the address of your Loki server: http://loki:3100
  3. See your logs in the “Explore” view
    1. Select the “Explore” view on the sidebar.
    2. Select the Loki data source.
    3. Choose a log stream using the “Log labels” button.

Promtail configuration

Promtail is the metadata appender and log-sending agent.

The Promtail configuration you get from the Helm chart is already configured to get all the logs from your Kubernetes cluster and append labels to them, as Prometheus does for metrics.

However, you can tune the configuration for your needs.

Here are two examples:

  1. Get logs only for specific namespace

You can use action: keep for your namespace by adding a new relabel_configs entry to each scrape_config in promtail/configmap.yaml.

For example, if you want to get logs only for the kube-system namespace:

scrape_configs:
- job_name: kubernetes-pods
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - source_labels: [__meta_kubernetes_namespace]
    action: keep
    regex: kube-system

# [...]

- job_name: kubernetes-pods-app
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - source_labels: [__meta_kubernetes_namespace]
    action: keep
    regex: kube-system

  2. Exclude logs from specific namespace

You can use action: drop for your namespace by adding a new relabel_configs entry to each scrape_config in promtail/configmap.yaml.

For example, if you want to exclude logs from the kube-system namespace:

scrape_configs:
- job_name: kubernetes-pods
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - source_labels: [__meta_kubernetes_namespace]
    action: drop
    regex: kube-system

# [...]

- job_name: kubernetes-pods-app
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - source_labels: [__meta_kubernetes_namespace]
    action: drop
    regex: kube-system

For more info on the configuration, you can refer to the official Prometheus configuration documentation.
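
The semantics of the keep and drop actions can be sketched in a few lines of Python. This is a simplified model with a single source label, and the function name apply_relabel is hypothetical; it relies on the documented Prometheus behaviour that relabel regexes are anchored, so the pattern must match the whole label value.

```python
import re


def apply_relabel(targets, source_label, regex, action):
    """Filter targets the way a keep/drop relabel rule would.

    Simplified: one source label only, and the regex must match the entire
    value (Prometheus anchors relabel regexes at both ends).
    """
    if action not in ("keep", "drop"):
        raise ValueError("only keep and drop are modelled here")
    pattern = re.compile(regex)
    kept = []
    for labels in targets:
        matched = pattern.fullmatch(labels.get(source_label, "")) is not None
        # keep: retain matching targets; drop: retain non-matching targets
        if matched == (action == "keep"):
            kept.append(labels)
    return kept


pods = [
    {"__meta_kubernetes_namespace": "kube-system", "pod": "coredns"},
    {"__meta_kubernetes_namespace": "default", "pod": "nginx"},
]
only_system = apply_relabel(pods, "__meta_kubernetes_namespace", "kube-system", "keep")
without_system = apply_relabel(pods, "__meta_kubernetes_namespace", "kube-system", "drop")
```

Because of the anchoring, a regex of kube would not match kube-system; use something like kube-.* to select a family of namespaces.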

Use fluentd output plugin

Fluentd is a well-known log forwarder that is also a CNCF project (https://www.cncf.io/projects/). It has a lot of input plugins and good filtering built in. For example, forwarding journald logs to Loki is not possible via Promtail, so you can use the fluentd syslog input plugin with the fluentd Loki output plugin to get those logs into Loki.

You can refer to the installation guide on how to use the fluentd Loki plugin.

There’s also an example of how to forward API server audit logs to Loki with fluentd.

Here is the fluentd configuration:

<match fluent.**>
  @type null
</match>
<source>
  @type tail
  path /var/log/apiserver/audit.log
  pos_file /var/log/fluentd-audit.log.pos
  time_format %Y-%m-%dT%H:%M:%S.%NZ
  tag audit.*
  format json
  read_from_head true
</source>
<filter kubernetes.**>
  @type kubernetes_metadata
</filter>
<match audit.**>
  @type loki
  url "#"
  username "#"
  password "#"
  extra_labels {"env":"dev"}
  flush_interval 10s
  flush_at_shutdown true
  buffer_chunk_limit 1m
</match>

Promtail as a sidecar

By default, Promtail is configured to automatically scrape logs from containers and send them to Loki. Those logs come from stdout.

But sometimes, you may like to be able to send logs from an external file to Loki.

In this case, you can set up Promtail as a sidecar, i.e. a second container in your pod, share the log file with it through a shared volume, and scrape the data to send it to Loki.

Assume you have an application, simple-logger, that logs to /home/slog/creator.log.

Your Kubernetes deployment will look like this:

  1. Add Promtail as a sidecar

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: my-app
    spec:
      template:
        metadata:
          name: my-app
        spec:
          containers:
          - name: simple-logger
            image: giantswarm/simple-logger:latest

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: my-app
    spec:
      template:
        metadata:
          name: my-app
        spec:
          containers:
          - name: simple-logger
            image: giantswarm/simple-logger:latest
          - name: promtail
            image: grafana/promtail:master
            args:
            - "-config.file=/etc/promtail/promtail.yaml"
            - "-client.url=http://loki:3100/api/prom/push"

  2. Use a shared data volume containing the log file

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: my-app
    spec:
      template:
        metadata:
          name: my-app
        spec:
          containers:
          - name: simple-logger
            image: giantswarm/simple-logger:latest
            volumeMounts:
            - name: shared-data
              mountPath: /home/slog
          - name: promtail
            image: grafana/promtail:master
            args:
            - "-config.file=/etc/promtail/promtail.yaml"
            - "-client.url=http://loki:3100/api/prom/push"
            volumeMounts:
            - name: shared-data
              mountPath: /home/slog
          volumes:
          - name: shared-data
            emptyDir: {}

  3. Configure Promtail to read your log file

As Promtail uses the same config as Prometheus, you can use the scrape_config type static_configs to read the file you want.

scrape_configs:
- job_name: system
  entry_parser: raw
  static_configs:
  - targets:
    - localhost
    labels:
      job: my-app
      my-label: awesome
      __path__: /home/slog/creator.log

And you’re done.

A running example can be found here


So Loki looks very promising. The footprint is very low. It integrates nicely with Grafana and Prometheus. Having the same labels as in Prometheus is very helpful to map incidents together and quickly find logs related to metrics. Another big point is the simple scalability: Loki is horizontally scalable by design.

As Loki is currently alpha software, install it and play with it. Then, join us on grafana.slack.com and add your feedback to make it better.

Interested in finding out how Giant Swarm handles the entire cloud native stack including Loki? Request your free trial of the Giant Swarm Infrastructure here.


Kubernetes End-to-end Testing for Everyone

More and more components that used to be part of Kubernetes are now being developed outside of Kubernetes. For example, storage drivers used to be compiled into Kubernetes binaries, then were moved into stand-alone Flexvolume binaries on the host, and now are delivered as Container Storage Interface (CSI) drivers that get deployed in pods inside the Kubernetes cluster itself.

This poses a challenge for developers who work on such components: how can end-to-end (E2E) testing on a Kubernetes cluster be done for such external components? The E2E framework that is used for testing Kubernetes itself has all the necessary functionality. However, trying to use it outside of Kubernetes was difficult and only possible by carefully selecting the right versions of a large number of dependencies. E2E testing has become a lot simpler in Kubernetes 1.13.

This blog post summarizes the changes that went into Kubernetes 1.13. For CSI driver developers, it will cover the ongoing effort to also make the storage tests available for testing of third-party CSI drivers. How to use them will be shown based on two Intel CSI drivers:

Testing those drivers was the main motivation behind most of these enhancements.

E2E overview

E2E testing consists of several phases:

  • Implementing a test suite. This is the main focus of this blog post. The Kubernetes E2E framework is written in Go. It relies on Ginkgo for managing tests and Gomega for assertions. These tools support “behavior driven development”, which describes expected behavior in “specs”. In this blog post, “test” is used to reference an individual Ginkgo.It spec. Tests interact with the Kubernetes cluster using client-go.
  • Bringing up a test cluster. Tools like kubetest can help here.
  • Running an E2E test suite against that cluster. Ginkgo test suites can be run with the ginkgo tool or as a normal Go test with go test. Without any parameters, a Kubernetes E2E test suite will connect to the default cluster based on environment variables like KUBECONFIG, exactly like kubectl. Kubetest also knows how to run the Kubernetes E2E suite.

E2E framework enhancements in Kubernetes 1.13

All of the following enhancements follow the same basic pattern: they make the E2E framework more useful and easier to use outside of Kubernetes, without changing the behavior of the original Kubernetes e2e.test binary.

Splitting out provider support

The main reason why using the E2E framework from Kubernetes <= 1.12 was difficult was its dependency on provider-specific SDKs, which pulled in a large number of packages. Just getting it compiled was non-trivial.

Many of these packages are only needed for certain tests. For example, testing the mounting of a pre-provisioned volume must first provision such a volume the same way as an administrator would, by talking directly to a specific storage backend via some non-Kubernetes API.

There is an effort to remove cloud provider-specific tests from core Kubernetes. The approach taken in PR #68483 can be seen as an incremental step towards that goal: instead of ripping out the code immediately and breaking all tests that depend on it, all cloud provider-specific code was moved into optional packages under test/e2e/framework/providers. The E2E framework then accesses it via an interface that gets implemented separately by each vendor package.

The author of an E2E test suite decides which of these packages get imported into the test suite. The vendor support is then activated via the --provider command line flag. The Kubernetes e2e.test binary in 1.13 and 1.14 still contains support for the same providers as in 1.12. It is also okay to include no packages, which means that only the generic providers will be available:

  • “skeleton”: cluster is accessed via the Kubernetes API and nothing else
  • “local”: like “skeleton”, but in addition the scripts in kubernetes/kubernetes/cluster can retrieve logs via ssh after a test suite is run
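The optional-package pattern described above can be sketched in a few lines of Go. This is only an illustration under stated assumptions: the `provider` interface, `registerProvider`, and `setupProvider` are hypothetical names chosen for this sketch, not the framework's actual API, and the interface is reduced to a single provider-specific operation.

```go
package main

import "fmt"

// provider is a hypothetical, heavily reduced stand-in for the interface
// through which a framework could reach provider-specific code.
type provider interface {
	Name() string
	// CreateVolume provisions storage via some non-Kubernetes API.
	CreateVolume() (string, error)
}

// factory creates a provider on demand.
type factory func() (provider, error)

// providers holds everything that was registered by imported packages.
var providers = map[string]factory{}

// registerProvider would be called from the init code of a vendor package.
func registerProvider(name string, f factory) { providers[name] = f }

// skeleton only relies on the Kubernetes API and therefore cannot
// provision volumes on a storage backend.
type skeleton struct{}

func (skeleton) Name() string { return "skeleton" }
func (skeleton) CreateVolume() (string, error) {
	return "", fmt.Errorf("provider skeleton cannot create volumes")
}

func init() {
	// The generic provider is always available.
	registerProvider("skeleton", func() (provider, error) { return skeleton{}, nil })
}

// setupProvider resolves the value of a --provider style flag.
func setupProvider(name string) (provider, error) {
	f, ok := providers[name]
	if !ok {
		return nil, fmt.Errorf("unknown provider %q, was its package imported?", name)
	}
	return f()
}

func main() {
	p, err := setupProvider("skeleton")
	fmt.Println(p.Name(), err)

	// No vendor package registered "gce" in this sketch, so lookup fails.
	_, err = setupProvider("gce")
	fmt.Println(err)
}
```

The point of the design is that a vendor SDK only becomes a dependency of the test binary when the suite author imports the package that registers it.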

External files

Tests may have to read additional files at runtime, like .yaml manifests. But the Kubernetes e2e.test binary is supposed to be usable entirely stand-alone because that simplifies shipping and running it. The solution in the Kubernetes build system is to link all files under test/e2e/testing-manifests into the binary with go-bindata. The E2E framework used to have a hard dependency on the output of go-bindata; now bindata support is optional. When accessing a file via the testfiles package, files will be retrieved from different sources:

  • relative to the directory specified with --repo-root parameter
  • zero or more bindata chunks
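A lookup across several sources, as listed above, might look roughly like the following sketch. The `fileSource` interface, `rootSource`, `embeddedSource`, and `readTestFile` are hypothetical names used for illustration; they are assumptions and not the actual API of the testfiles package. The in-memory map stands in for a bindata chunk.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// fileSource is a hypothetical abstraction over one place where test
// files may live. found is false when the source does not have the file.
type fileSource interface {
	read(path string) (data []byte, found bool, err error)
}

// rootSource looks up files relative to a directory, similar in spirit
// to a --repo-root parameter.
type rootSource struct{ root string }

func (r rootSource) read(path string) ([]byte, bool, error) {
	data, err := os.ReadFile(filepath.Join(r.root, path))
	if errors.Is(err, os.ErrNotExist) {
		return nil, false, nil
	}
	if err != nil {
		return nil, false, err
	}
	return data, true, nil
}

// embeddedSource stands in for files linked into the binary.
type embeddedSource map[string][]byte

func (e embeddedSource) read(path string) ([]byte, bool, error) {
	data, ok := e[path]
	return data, ok, nil
}

// readTestFile asks each source in order and returns the first hit.
func readTestFile(sources []fileSource, path string) ([]byte, error) {
	for _, s := range sources {
		data, found, err := s.read(path)
		if err != nil {
			return nil, err
		}
		if found {
			return data, nil
		}
	}
	return nil, fmt.Errorf("test file %q not found in any source", path)
}

func main() {
	sources := []fileSource{
		rootSource{root: "/path/that/does/not/exist"},
		embeddedSource{"manifests/pod.yaml": []byte("kind: Pod")},
	}
	data, err := readTestFile(sources, "manifests/pod.yaml")
	fmt.Printf("%s %v\n", data, err)
}
```

With this shape, a stand-alone binary works from its embedded files alone, while a developer in a source checkout can override them by pointing the first source at the checkout.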

Test parameters

The e2e.test binary takes additional parameters which control test execution. In 2016, an effort was started to replace all E2E command line parameters with a Viper configuration file. But that effort stalled, which left developers without clear guidance on how they should handle test-specific parameters.

The approach in v1.12 was to add all flags to the central test/e2e/framework/test_context.go, which does not work for tests developed independently from the framework. Since PR #69105 the recommendation has been for each test to define its parameters in its own source code, using the normal flag package. Flag names must be hierarchical with dots separating different levels, for example my.test.parameter, and must be unique. Uniqueness is enforced by the flag package, which panics when a flag is registered a second time. The new config package simplifies the definition of multiple options, which are stored in a single struct.
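The following minimal sketch uses only Go's standard flag package to show both properties mentioned above: a dotted, hierarchical flag name and the panic on duplicate registration. It uses a private flag set instead of the global one so that it stays self-contained; the helper names `parseParameter` and `registeringTwicePanics` are made up for this illustration.

```go
package main

import (
	"flag"
	"fmt"
	"io"
)

// parseParameter registers the example flag from the text on a private
// flag set and parses the given arguments.
func parseParameter(args []string) (string, error) {
	fs := flag.NewFlagSet("e2e", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep the sketch quiet on errors
	param := fs.String("my.test.parameter", "default", "hypothetical test parameter")
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	return *param, nil
}

// registeringTwicePanics reports whether defining the same flag name a
// second time makes the flag package panic.
func registeringTwicePanics(name string) (panicked bool) {
	defer func() {
		if recover() != nil {
			panicked = true
		}
	}()
	fs := flag.NewFlagSet("e2e", flag.ContinueOnError)
	fs.SetOutput(io.Discard)
	fs.String(name, "", "first definition")
	fs.String(name, "", "second definition") // the flag package panics here
	return false
}

func main() {
	value, err := parseParameter([]string{"-my.test.parameter=value"})
	fmt.Println(value, err)
	fmt.Println(registeringTwicePanics("my.test.parameter"))
}
```

Because the panic happens at registration time, a name clash between two independently developed test packages shows up as soon as the test binary starts, not in the middle of a test run.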

To summarize, this is how parameters are handled now:

  • The init code in test packages defines tests and parameters. The actual parameter values are not available yet, so test definitions cannot use them.
  • The init code of the test suite parses parameters and (optionally) the configuration file.
  • The tests run and now can use parameter values.

However, it was recently pointed out that it is desirable, and was previously possible, to avoid exposing test settings as command line flags and to set them only via a configuration file. There is an open bug and a pending PR about this.

Viper support has been enhanced. Like the provider support, it is completely optional. It gets pulled into an e2e.test binary by importing the viperconfig package and calling it after parsing the normal command line flags. This has been implemented so that all variables which can be set via command line flags are also set when the flag appears in a Viper config file. For example, the Kubernetes v1.13 e2e.test binary accepts --viper-config=/tmp/my-config.yaml and that file will set my.test.parameter to value when it has this content:

    my:
      test:
        parameter: value

In older Kubernetes releases, that option could only load a file from the current directory, the suffix had to be left out, and only a few parameters actually could be set this way. Beware that one limitation of Viper still exists: it works by matching config file entries against known flags, without warning about unknown config file entries and thus leaving typos undetected. A better config file parser for Kubernetes is still a work in progress.

Creating items from .yaml manifests

In Kubernetes 1.12, there was some support for loading individual items from a .yaml file, but then creating that item had to be done by hand-written code. Now the framework has new methods for loading a .yaml file that has multiple items, patching those items (for example, setting the namespace created for the current test), and creating them. This is currently used to deploy CSI drivers anew for each test from exactly the same .yaml files that are also used for deployment via kubectl. If the CSI driver supports running under different names, then tests are completely independent and can run in parallel.

However, redeploying a driver slows down test execution and it does not cover concurrent operations against the driver. A more realistic test scenario is to deploy a driver once when bringing up the test cluster, then run all tests against that deployment. Eventually the Kubernetes E2E testing will move to that model, once it is clearer how test cluster bringup can be extended such that it also includes installing additional entities like CSI drivers.

Upcoming enhancements in Kubernetes 1.14

Reusing storage tests

Being able to use the framework outside of Kubernetes enables building a custom test suite. But a test suite without tests is still useless. Several of the existing tests, in particular for storage, can also be applied to out-of-tree components. Thanks to the work done by Masaki Kimura, storage tests in Kubernetes 1.13 are defined such that they can be instantiated multiple times for different drivers.

But history has a habit of repeating itself. As with providers, the package defining these tests also pulled in driver definitions for all in-tree storage backends, which in turn pulled in more additional packages than were needed. This has been fixed for the upcoming Kubernetes 1.14.

Skipping unsupported tests

Some of the storage tests depend on features of the cluster (like running on a host that supports XFS) or of the driver (like supporting block volumes). These conditions are checked while the test runs, leading to skipped tests when they are not satisfied. The good thing is that this records an explanation why the test did not run.

Starting a test is slow, in particular when it must first deploy the CSI driver, but also in other scenarios. Creating the namespace for a test has been measured at 5 seconds on a fast cluster, and it produces a lot of noisy test output. It would have been possible to address that by skipping the definition of unsupported tests, but then reporting why a test isn’t even part of the test suite becomes tricky. This approach has been dropped in favor of reorganizing the storage test suite such that it first checks conditions before doing the more expensive test setup steps.
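The reorganized order of operations can be sketched as follows. The types and names (`driverInfo`, `testPattern`, `skipReason`, `runTest`) are hypothetical and exist only for this illustration; the real storage test suite is structured differently. What the sketch shows is the idea from the paragraph above: the cheap capability checks run first, and the expensive setup only runs when the test is actually going to execute.

```go
package main

import "fmt"

// driverInfo describes what a (hypothetical) driver supports.
type driverInfo struct {
	name          string
	supportsBlock bool
	fsTypes       map[string]bool
}

// testPattern describes what a (hypothetical) test needs.
type testPattern struct {
	name       string
	needsBlock bool
	fsType     string
}

// skipReason returns a non-empty explanation when the test cannot run.
func skipReason(d driverInfo, p testPattern) string {
	if p.needsBlock && !d.supportsBlock {
		return fmt.Sprintf("driver %s does not support block volumes", d.name)
	}
	if p.fsType != "" && !d.fsTypes[p.fsType] {
		return fmt.Sprintf("driver %s does not support %s", d.name, p.fsType)
	}
	return ""
}

// runTest checks conditions before paying for the expensive setup, such
// as creating a namespace or deploying a driver.
func runTest(d driverInfo, p testPattern, expensiveSetup func()) string {
	if reason := skipReason(d, p); reason != "" {
		return "skipped: " + reason
	}
	expensiveSetup()
	return "ran"
}

func main() {
	d := driverInfo{name: "demo", fsTypes: map[string]bool{"ext4": true}}
	setups := 0
	for _, p := range []testPattern{
		{name: "block volume", needsBlock: true},
		{name: "xfs", fsType: "xfs"},
		{name: "ext4", fsType: "ext4"},
	} {
		fmt.Println(p.name+":", runTest(d, p, func() { setups++ }))
	}
	fmt.Println("expensive setups:", setups)
}
```

The skipped tests still report why they did not run, but they no longer pay the setup cost first.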

More readable test definitions

The same PR also rewrites the tests to operate like conventional Ginkgo tests, with test cases and their local variables in a single function.

Testing external drivers

Building a custom E2E test suite is still quite a bit of work. The e2e.test binary that will get distributed in the Kubernetes 1.14 test archive will have the ability to test already installed storage drivers without rebuilding the test suite. See this README for further instructions.

E2E test suite HOWTO

Test suite initialization

The first step is to set up the necessary boilerplate code that defines the test suite. In Kubernetes E2E, this is done in the e2e.go and e2e_test.go files. It could also be done in a single e2e_test.go file. Kubernetes imports all of the various providers, in-tree tests, Viper configuration support, and bindata file lookup in e2e_test.go. e2e.go controls the actual execution, including some cluster preparations and metrics collection.

A simpler starting point is the pair of e2e_[test].go files from PMEM-CSI. They use no providers, no Viper, and no bindata, and they import just the storage tests.

Like PMEM-CSI, OIM drops all of the extra features, but is a bit more complex because it integrates a custom cluster startup directly into the test suite, which was useful in this case because some additional components have to run on the host side. By running them directly in the E2E binary, interactive debugging with dlv becomes easier.

Both CSI drivers follow the Kubernetes example and use the test/e2e directory for their test suites, but any other directory and other file names would also work.

Adding E2E storage tests

Tests are defined by packages that get imported into a test suite. The only thing specific to E2E tests is that they instantiate a framework.Framework pointer (usually called f) with framework.NewDefaultFramework. This variable gets initialized anew in a BeforeEach for each test and freed in an AfterEach. It has an f.ClientSet and f.Namespace at runtime (and only at runtime!) which can be used by a test.
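The "only at runtime" caveat is easy to trip over, so here is a toy model of it. This is deliberately not Ginkgo and not the Kubernetes framework: `suite`, `framework`, `newDefaultFramework`, `it`, and `run` are simplified stand-ins invented for this sketch. It only models the lifecycle in which the per-test fields are populated before each test body and cleared afterwards.

```go
package main

import "fmt"

// framework is a stand-in for the real framework object. Only the
// lifecycle of a per-test namespace is modeled.
type framework struct {
	baseName  string
	namespace string // populated before each test, cleared after it
}

// suite is a tiny stand-in for a test runner.
type suite struct {
	fw      *framework
	tests   []func() string
	counter int
}

// newDefaultFramework is called while tests are being defined. It
// returns a pointer whose per-test fields are still empty.
func (s *suite) newDefaultFramework(baseName string) *framework {
	s.fw = &framework{baseName: baseName}
	return s.fw
}

// it records a test body without running it.
func (s *suite) it(body func() string) { s.tests = append(s.tests, body) }

// run executes each body between a "before each" and "after each" step.
func (s *suite) run() []string {
	var results []string
	for _, body := range s.tests {
		s.counter++
		s.fw.namespace = fmt.Sprintf("%s-%d", s.fw.baseName, s.counter) // before each
		results = append(results, body())
		s.fw.namespace = "" // after each
	}
	return results
}

func main() {
	s := &suite{}
	f := s.newDefaultFramework("demo")

	atDefinition := f.namespace // read too early: still empty
	s.it(func() string { return f.namespace })
	s.it(func() string { return f.namespace })

	fmt.Printf("at definition time: %q\n", atDefinition)
	fmt.Println("inside tests:", s.run())
}
```

The practical consequence is that anything derived from the namespace or the client must be computed inside the test body (or a setup step that runs per test), never while the tests are being defined.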

The PMEM-CSI storage test imports the Kubernetes storage test suite and sets up one instance of the provisioning tests for a PMEM-CSI driver which must be already installed in the test cluster. The storage test suite changes the storage class to run tests with different filesystem types. Because of this requirement, the storage class is created from a .yaml file.

Explaining all the various utility methods available in the framework is out of scope for this blog post. Reading existing tests and the source code of the framework is a good way to get started.


Vendoring

Vendoring Kubernetes code is still not trivial, even after eliminating many of the unnecessary dependencies. k8s.io/kubernetes is not meant to be included in other projects and does not define its dependencies in a way that is understood by tools like dep. The other k8s.io packages are meant to be included, but don’t follow semantic versioning yet or don’t tag any releases (k8s.io/kube-openapi, k8s.io/utils).

PMEM-CSI uses dep. Its Gopkg.toml file is a good starting point. It enables pruning (not enabled in dep by default) and locks certain projects onto versions that are compatible with the Kubernetes version that is used. When dep doesn’t pick a compatible version, checking Kubernetes’ Godeps.json helps to determine which revision might be the right one.

Compiling and running the test suite

go test ./test/e2e -args -help is the fastest way to test that the test suite compiles.

Once it does compile and a cluster has been set up, the command go test -timeout=0 -v ./test/e2e -ginkgo.v runs all tests. In order to run tests in parallel, use the ginkgo -p ./test/e2e command instead.

Getting involved

The Kubernetes E2E framework is owned by the testing-commons sub-project in SIG-testing. See that page for contact information.

There are various tasks that could be worked on, including but not limited to:

  • Moving test/e2e/framework into a staging repo and restructuring it so that it is more modular (#74352).
  • Simplifying e2e.go by moving more of its code into test/e2e/framework (#74353).
  • Removing provider-specific code from the Kubernetes E2E test suite (#70194).

Special thanks to the reviewers of this article:


Comparing Kubernetes CNI Providers: Flannel, Calico, Canal, and Weave


Network architecture is one of the more complicated aspects of many Kubernetes installations. The Kubernetes networking model itself demands certain network features but allows for some flexibility regarding the implementation. As a result, various projects have been released to address specific environments and requirements.

In this article, we’ll explore the most popular CNI plugins: flannel, calico, weave, and canal (technically a combination of multiple plugins). CNI stands for container network interface, a standard designed to make it easy to configure container networking when containers are created or destroyed. These plugins do the work of making sure that Kubernetes’ networking requirements are satisfied and providing the networking features that cluster administrators require.


Container networking is the mechanism through which containers can optionally connect to other containers, the host, and outside networks like the internet. Container runtimes offer various networking modes, each of which results in a different experience. For example, Docker can configure the following networks for a container by default:

  • none: Adds the container to a container-specific network stack with no connectivity.
  • host: Adds the container to the host machine’s network stack, with no isolation.
  • default bridge: The default networking mode. Each container can connect with one another by IP address.
  • custom bridge: User-defined bridge networks with additional flexibility, isolation, and convenience features.

Docker also allows you to configure more advanced networking, including multi-host overlay networking, with additional drivers and plugins.

The idea behind the CNI initiative is to create a framework for dynamically configuring the appropriate network configuration and resources when containers are provisioned or destroyed. The CNI spec outlines a plugin interface for container runtimes to coordinate with plugins to configure networking.

Plugins are responsible for assigning an IP address to the interface and managing it, and usually provide functionality related to IP management, IP-per-container assignment, and multi-host connectivity. The container runtime calls the networking plugins to allocate IP addresses and configure networking when the container starts and calls them again when the container is deleted to clean up those resources.

The runtime or orchestrator decides on the network a container should join and the plugin that it needs to call. The plugin then adds the interface into the container network namespace as one side of a veth pair. It then makes changes on the host machine, including wiring up the other part of the veth to a network bridge. Afterwards, it allocates an IP address and sets up routes by calling a separate IPAM (IP Address Management) plugin.
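As a rough illustration of how a runtime knows which plugins to call, the sketch below parses a small CNI network configuration with Go's standard JSON package. The top-level `type` field names the main plugin and `ipam.type` names the separate IPAM plugin. The network name and subnet are made-up example values, and `netConf`, `parseNetConf`, and `pluginsToCall` are hypothetical helpers for this sketch, not part of any CNI library.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// netConf models only the fields of a CNI network configuration that
// this sketch needs.
type netConf struct {
	CNIVersion string `json:"cniVersion"`
	Name       string `json:"name"`
	Type       string `json:"type"` // main plugin to execute
	IPAM       struct {
		Type   string `json:"type"` // separate IPAM plugin
		Subnet string `json:"subnet"`
	} `json:"ipam"`
}

// exampleConf uses illustrative values for the name and subnet.
const exampleConf = `{
  "cniVersion": "0.3.1",
  "name": "examplenet",
  "type": "bridge",
  "ipam": {
    "type": "host-local",
    "subnet": "10.22.0.0/16"
  }
}`

// parseNetConf decodes a configuration and checks that a plugin is named.
func parseNetConf(data []byte) (netConf, error) {
	var conf netConf
	if err := json.Unmarshal(data, &conf); err != nil {
		return netConf{}, err
	}
	if conf.Type == "" {
		return netConf{}, fmt.Errorf("network %q names no plugin type", conf.Name)
	}
	return conf, nil
}

// pluginsToCall lists the main plugin followed by the IPAM plugin it
// delegates address allocation to.
func pluginsToCall(conf netConf) []string {
	return []string{conf.Type, conf.IPAM.Type}
}

func main() {
	conf, err := parseNetConf([]byte(exampleConf))
	if err != nil {
		fmt.Println("invalid config:", err)
		return
	}
	fmt.Println("network:", conf.Name)
	fmt.Println("plugins:", pluginsToCall(conf))
}
```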

In the context of Kubernetes, this relationship allows kubelet to automatically configure networking for the pods it starts by calling the plugins it finds at appropriate times.


Before we take a look at the available CNI plugins, it’s helpful to go over some terminology that you might see while reading this or other sources discussing CNI.

Some of the most common terms include:

  • Layer 2 networking: The “data link” layer of the OSI (Open Systems Interconnection) networking model. Layer 2 deals with delivery of frames between two adjacent nodes on a network. Ethernet is a noteworthy example of Layer 2 networking, with MAC represented as a sublayer.
  • Layer 3 networking: The “network” layer of the OSI networking model. Layer 3’s primary concern involves routing packets between hosts on top of the layer 2 connections. IPv4, IPv6, and ICMP are examples of Layer 3 networking protocols.
  • VXLAN: Stands for “virtual extensible LAN”. Primarily, VXLAN is used to help large cloud deployments scale by encapsulating layer 2 Ethernet frames within UDP datagrams. VXLAN virtualization is similar to VLAN, but offers more flexibility and power (VLANs were limited to only 4,096 network IDs). VXLAN is an encapsulation and overlay protocol that runs on top of existing networks.
  • Overlay network: An overlay network is a virtual, logical network built on top of an existing network. Overlay networks are often used to provide useful abstractions on top of existing networks and to separate and secure different logical networks.
  • Encapsulation: Encapsulation is the process of wrapping network packets in an additional layer to provide additional context and information. In overlay networks, encapsulation is used to translate from the virtual network to the underlying address space to route to a different location (where the packet can be de-encapsulated and continue to its destination).
  • Mesh network: A mesh network is one in which each node connects to many other nodes to cooperate on routing and achieve greater connectivity. Network meshes provide more reliable networking by allowing routing through multiple paths. The downside of a network mesh is that each additional node can add significant overhead.
  • BGP: Stands for “border gateway protocol” and is used to manage how packets are routed between edge routers. BGP helps figure out how to send a packet from one network to another by taking into account available paths, routing rules, and specific network policies. BGP is sometimes used as the routing mechanism in CNI plugins instead of encapsulated overlay networks.
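Two of the numbers behind the VXLAN and encapsulation entries can be worked out directly. The short sketch below compares the 12-bit VLAN ID space with the 24-bit VXLAN network identifier, and computes how much room is left for the inner packet when VXLAN runs over IPv4 on an underlay with a given MTU. The 1500-byte underlay MTU in the example is an assumption (the common Ethernet default), and the function names are made up for this sketch.

```go
package main

import "fmt"

const (
	vlanIDBits   = 12 // VLAN ID field width
	vxlanVNIBits = 24 // VXLAN network identifier width

	// Bytes added in front of the inner IP packet when it travels in a
	// VXLAN tunnel over IPv4, counted against the underlay MTU.
	outerIPv4Header = 20
	outerUDPHeader  = 8
	vxlanHeader     = 8
	innerEthHeader  = 14
)

// idSpace returns how many identifiers fit into the given number of bits.
func idSpace(bits uint) int { return 1 << bits }

// vxlanOverhead is the total per-packet cost of the encapsulation.
func vxlanOverhead() int {
	return outerIPv4Header + outerUDPHeader + vxlanHeader + innerEthHeader
}

// innerMTU is the largest inner IP packet that still fits.
func innerMTU(underlayMTU int) int { return underlayMTU - vxlanOverhead() }

func main() {
	fmt.Println("VLAN IDs:  ", idSpace(vlanIDBits))
	fmt.Println("VXLAN VNIs:", idSpace(vxlanVNIBits))
	fmt.Println("overhead:  ", vxlanOverhead(), "bytes")
	fmt.Println("inner MTU: ", innerMTU(1500))
}
```

This overhead is why overlay interfaces are usually configured with a smaller MTU than the physical network underneath them.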

Now that we’ve introduced some of the technology that enables various plugins, we’re ready to explore some of the most popular CNI options.

CNI Comparison


Flannel

Flannel, a project developed by CoreOS, is perhaps the most straightforward and popular CNI plugin available. It is one of the most mature examples of networking fabric for container orchestration systems, intended to allow for better inter-container and inter-host networking. As the CNI concept took off, a CNI plugin for Flannel was an early entry.

Compared to some other options, Flannel is relatively easy to install and configure. It is packaged as a single binary called flanneld and can be installed by default by many common Kubernetes cluster deployment tools and in many Kubernetes distributions. Flannel can use the Kubernetes cluster’s existing etcd cluster to store its state information using the API to avoid having to provision a dedicated data store.

Flannel configures a layer 3 IPv4 overlay network. A large internal network is created that spans across every node within the cluster. Within this overlay network, each node is given a subnet to allocate IP addresses internally. As pods are provisioned, the Docker bridge interface on each node allocates an address for each new container. Pods within the same host can communicate using the Docker bridge, while pods on different hosts will have their traffic encapsulated in UDP packets by flanneld for routing to the appropriate destination.

Flannel has several different types of backends available for encapsulation and routing. The default and recommended approach is to use VXLAN, as it offers good performance and requires less manual intervention than other options.

Overall, Flannel is a good choice for most users. From an administrative perspective, it offers a simple networking model that sets up an environment that’s suitable for most use cases when you only need the basics. In general, it’s a safe bet to start out with Flannel until you need something that it cannot provide.


Calico

Project Calico, or just Calico, is another popular networking option in the Kubernetes ecosystem. While Flannel is positioned as the simple choice, Calico is best known for its performance, flexibility, and power. Calico takes a more holistic view of networking, concerning itself not only with providing network connectivity between hosts and pods, but also with network security and administration. The Calico CNI plugin wraps Calico functionality within the CNI framework.

On a freshly provisioned Kubernetes cluster that meets the system requirements, Calico can be deployed quickly by applying a single manifest file. If you are interested in Calico’s optional network policy capabilities, you can enable them by applying an additional manifest to your cluster.

Although the actions needed to deploy Calico seem fairly straightforward, the network environment it creates has both simple and complex attributes. Unlike Flannel, Calico does not use an overlay network. Instead, Calico configures a layer 3 network that uses the BGP routing protocol to route packets between hosts. This means that packets do not need to be wrapped in an extra layer of encapsulation when moving between hosts: the BGP routing mechanism can direct them natively.

Besides the performance that this offers, one side effect of this is that it allows for more conventional troubleshooting when network problems arise. While encapsulated solutions using technologies like VXLAN work well, the process manipulates packets in a way that can make tracing difficult. With Calico, the standard debugging tools have access to the same information they would in simple environments, making it easier for a wider range of developers and administrators to understand behavior.

In addition to networking connectivity, Calico is well-known for its advanced network features. Network policy is one of its most sought after capabilities. In addition, Calico can also integrate with Istio, a service mesh, to interpret and enforce policy for workloads within the cluster both at the service mesh layer and the network infrastructure layer. This means that you can configure powerful rules describing how pods should be able to send and accept traffic, improving security and control over your networking environment.

Project Calico is a good choice for environments that support its requirements and when performance and features like network policy are important. Additionally, Calico offers commercial support if you’re seeking a support contract or want to keep that option open for the future. In general, it’s a good choice for when you want to be able to control your network instead of just configuring it once and forgetting about it.


Canal

Canal is an interesting option for quite a few reasons.

First of all, Canal was the name for a project that sought to integrate the networking layer provided by Flannel with the networking policy capabilities of Calico. As the contributors worked through the details, however, it became apparent that a full integration was not necessarily needed if work was done on both projects to ensure standardization and flexibility. As a result, the official project became somewhat defunct, but the intended ability to deploy the two technologies together was achieved. For this reason, it’s still sometimes easiest to refer to the combination as “Canal” even if the project no longer exists.

Because Canal is a combination of Flannel and Calico, its benefits are also at the intersection of these two technologies. The networking layer is the simple overlay provided by Flannel that works across many different deployment environments without much additional configuration. The network policy capabilities layered on top supplement the base network with Calico’s powerful networking rule evaluation to provide additional security and control.

After ensuring that the cluster fulfills the necessary system requirements, Canal can be deployed by applying two manifests, making it no more difficult to configure than either of the projects on their own. Canal is a good way for teams to start to experiment and gain experience with network policy before they’re ready to experiment with changing their actual networking.

In general, Canal is a good choice if you like the networking model that Flannel provides but find some of Calico’s features enticing. The ability to define network policy rules is a huge advantage from a security perspective and is, in many ways, Calico’s killer feature. Being able to apply that technology onto a familiar networking layer means that you can get a more capable environment without having to go through much of a transition.

Weave Net

Weave Net by Weaveworks is a CNI-capable networking option for Kubernetes that offers a different paradigm than the others we’ve discussed so far. Weave creates a mesh overlay network between each of the nodes in the cluster, allowing for flexible routing between participants. This, coupled with a few other unique features, allows Weave to intelligently route in situations that might otherwise cause problems.

To create its network, Weave relies on a routing component installed on each host in the network. These routers then exchange topology information to maintain an up-to-date view of the available network landscape. When looking to send traffic to a pod located on a different node, the weave router makes an automatic decision whether to send it via “fast datapath” or to fall back on the “sleeve” packet forwarding method.

Fast datapath is an approach that relies on the kernel’s native Open vSwitch datapath module to forward packets to the appropriate pod without moving in and out of userspace multiple times. The Weave router updates the Open vSwitch configuration to ensure that the kernel layer has accurate information about how to route incoming packets. In contrast, sleeve mode is available as a backup when the networking topology isn’t suitable for fast datapath routing. It is a slower encapsulation mode that can route packets in instances where fast datapath does not have the necessary routing information or connectivity. As traffic flows through the routers, they learn which peers are associated with which MAC addresses, allowing them to route more intelligently with fewer hops for subsequent traffic. This same mechanism helps each node self-correct when a network change alters the available routes.
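The learning behavior described at the end of the paragraph above can be pictured with a very small model. The sketch below is not Weave's implementation: `router`, `learn`, and `nextHops` are hypothetical names, and peers and MAC addresses are plain strings. It only shows the general idea of remembering which peer a MAC address was last seen behind, falling back to all peers when nothing is known, and correcting itself when an address moves.

```go
package main

import "fmt"

// router is a toy model of one routing component in a mesh.
type router struct {
	peers     []string          // all peers this router can reach
	macToPeer map[string]string // learned: MAC address -> peer behind which it lives
}

func newRouter(peers ...string) *router {
	return &router{peers: peers, macToPeer: map[string]string{}}
}

// learn records that traffic from srcMAC arrived via fromPeer. A later
// observation overwrites an older one, which is how the table follows
// topology changes.
func (r *router) learn(srcMAC, fromPeer string) { r.macToPeer[srcMAC] = fromPeer }

// nextHops returns the single known peer for dstMAC, or every peer when
// the destination has not been seen yet.
func (r *router) nextHops(dstMAC string) []string {
	if peer, ok := r.macToPeer[dstMAC]; ok {
		return []string{peer}
	}
	return r.peers
}

func main() {
	r := newRouter("node-b", "node-c", "node-d")
	mac := "02:00:00:00:00:01"

	fmt.Println("before learning:", r.nextHops(mac))
	r.learn(mac, "node-c")
	fmt.Println("after learning: ", r.nextHops(mac))
	r.learn(mac, "node-d") // the workload moved to another node
	fmt.Println("after a change: ", r.nextHops(mac))
}
```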

Like Calico, Weave also provides network policy capabilities for your cluster. This is automatically installed and configured when you set up Weave, so no additional configuration is necessary beyond adding your network rules. One thing that Weave provides that the other options do not is easy encryption for the entire network. While it adds quite a bit of network overhead, Weave can be configured to automatically encrypt all routed traffic by using NaCl encryption for sleeve traffic and, since it needs to encrypt VXLAN traffic in the kernel, IPsec ESP for fast datapath traffic.

Weave is a great option for those looking for feature-rich networking without adding a large amount of complexity or management. It is relatively easy to set up, offers many built-in and automatically configured features, and can provide routing in scenarios where other solutions might fail. The mesh topology does put a limit on the size of the network that can be reasonably accommodated, but for most users, this won’t be a problem. Additionally, Weave offers paid support for organizations that prefer to have someone to contact for help and troubleshooting.


Kubernetes’ adoption of the CNI standard allows for many different network solutions to exist within the same ecosystem. The diversity of options available means that most users will be able to find a CNI plugin that suits their current needs and deployment environment, while also providing solutions when their circumstances change. Operating requirements vary immensely between organizations, so having a number of mature solutions with different levels of complexity and feature richness helps Kubernetes satisfy unique requirements while still offering a fairly consistent user experience.