We are proud to introduce Tarmak, an open source toolkit for Kubernetes cluster lifecycle management that focuses on best-practice cluster security, management and operation. It has been built from the ground up to be cloud provider-agnostic, and provides a means for consistent and reliable cluster deployment and management across clouds and on-premises environments.
This blog post is a follow-up to a talk Matt Bates and I gave at PuppetConf 2017. The slides can be found here, and a recording of the session can be found at the end of this post (click here to watch).
Jetstack have extensive experience deploying Kubernetes into production with many different clients. We have learned what works (and, importantly, what doesn’t work so well) and worked through several generations of cluster deployment. In the talk, we described these challenges. To summarise:
- Immutable infrastructure isn’t always that desirable.
- Ability to test and debug is critical for development and operations.
- Dependencies need to be versioned.
- Cluster PKI management in dynamic environments is not easy.
Tarmak and its underlying components are the product of Jetstack’s work with
its customers to build and deploy Kubernetes in production at scale.
In this post, we’ll explore the lessons we learned and the motivations for Tarmak, and
dive into the tools and the provisioning mechanics. Firstly, the motivations that were born out of the lessons learned:
Improved developer and operator experience
A major goal for the tooling was to provide an easy-to-use and natural UX – for both
developers and operators.
In previous generations of cluster deployment, one area of concern with replacing instances wholesale on every configuration change was the long and expensive feedback loop. It took significant time for a code change to be deployed into a real-world cluster, and a simple careless mistake in a JSON file could take up to 30 minutes to realise and fix. Using tests at multiple levels (unit, integration) on all the code involved helps to catch errors that would prevent a cluster from building, early in the cycle.
Another problem, especially with the Bash scripts, was that whilst they would work fine with one specific configuration, they were really hard to maintain once input parameters were introduced. Scripts were modified and duplicated, and this quickly became difficult to maintain effectively. So our goal for the new project was to follow coding best practices: “Don’t repeat yourself” (DRY) and “Keep it simple, stupid” (KISS). This helps to reduce the complexity of later changes and helps to achieve a modular design.
When instances are replaced on every configuration change, it is not easy to get an idea of what changes are about to happen to an instance’s configuration. It would be far better to have insight into the changes that will be performed, by having a dry-run capability.
Another important observation was that using a more traditional approach to running software helps engineers to transition more smoothly into a container-centric world. Whilst Kubernetes can be used to “self-host” its own components, we recognised that there is greater familiarity (at this stage) with tried-and-tested, traditional tools in operations teams, so we adopted systemd and use the vanilla open source Kubernetes binaries.
Less disruptive cluster upgrades
In many cases with existing tooling, cluster upgrades involve replacing instances; when you want to change something, the entire instance is replaced with a new one that contains the new configuration. A number of limitations started to emerge from this strategy.
- Replacing instances can become expensive in both time and cost, especially in large clusters.
- There is no control over our rolled-out instances – their actual state might have diverged from the desired state.
- Draining Kubernetes worker instances is often quite a manual process.
- Every replacement comes with risks: someone might be relying on ‘latest’ image tags, or configuration might no longer be valid.
- Cached content is lost throughout the whole cluster and needs to be rebuilt.
- Stateful applications need to migrate data over to other instances (and this is often
a resource intensive process for some applications).
Tarmak has been designed with these factors in mind. We support both in-place upgrades and full instance replacement. This allows operators to choose how they would like their clusters to be upgraded, to ensure that whatever cluster-level operation they are undertaking, it is performed in the least disruptive way possible.
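As an illustration of the kind of step that can be automated during an instance replacement, the sketch below cordons a worker node via the Kubernetes API before it would be drained and upgraded. It is a minimal example, assuming a recent client-go and a hypothetical node name – it is not Tarmak’s own upgrade code.

```go
package main

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the kubeconfig from its default location; illustrative only.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	ctx := context.Background()
	nodeName := "worker-0" // hypothetical node name

	// Cordon the node so no new pods are scheduled on it before maintenance.
	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		log.Fatal(err)
	}
	node.Spec.Unschedulable = true
	if _, err := client.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{}); err != nil {
		log.Fatal(err)
	}
	log.Printf("node %s cordoned; pods can now be evicted before the upgrade", nodeName)
}
```

Evicting the node’s pods and uncordoning it after the upgrade would follow the same API-driven pattern.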
Consistency between environments
Another benefit of the new tooling is that it has been designed to provide a consistent deployment experience across different cloud providers and on-premises setups. We consistently hear from customers that they do not wish to skill-up operations teams with a multitude of provisioning tools and techniques, not least because of the operational risk it poses when trying to reason about cluster configuration and health at times of failure.
With Tarmak, we have developed the right tool to be able to address these
challenges.
We identified Infrastructure, Configuration and Application as the three core
layers of set-up in a Kubernetes cluster.
- Infrastructure: all core resources (like compute, network, storage) are created and configured to be able to work together. We use Terraform to plan and execute these changes. At the end of this stage, the infrastructure is ready to run our own bespoke ‘Tarmak instance agent’ (Wing), required for the configuration stage.
- Configuration: the Wing agent is at the core of the configuration layer and uses Puppet manifests to configure all instances in a cluster accordingly. After Wing has run, it sends reports back to the Wing apiserver, which can be run in a highly available configuration. Once all instances in a cluster have successfully executed Wing, the Kubernetes cluster is up and running and provides its API as an interface.
- Applications: the core cluster add-ons are deployed with the help of Puppet. Any other tool, like kubectl or Helm, can also be used to manage the lifecycle of these applications on the cluster.
Abstractions and chosen tools
Infrastructure
As part of the Infrastructure provisioning stage, we use Terraform to set up instances that later get configured to fulfil one of the following roles:
- Bastion is the only node that has a public IP address assigned. It is used as a “jump host” to connect to services on the private networks of clusters. It also runs the Wing apiserver, responsible for aggregating the state information of instances.
- Vault instances provide a dynamic CA (Certificate Authority)-as-a-service for the various cluster components that rely on TLS authentication (a brief sketch of requesting a certificate from Vault follows this list). They also run Consul as a backend for Vault and store its data on persistent disks, encrypted and secured.
- etcd instances store the state for the Kubernetes control plane. They have persistent disks and run etcd in HA (i.e. 3+ instances): one cluster for Kubernetes, another dedicated to Kubernetes’ events and a third for the overlay network (Calico, by default).
- Kubernetes Masters run the Kubernetes control plane components in a highly available configuration.
- Kubernetes Workers run your organisation’s application workloads.
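To make the Vault role a little more concrete, the following sketch requests a short-lived certificate from a Vault PKI secrets engine using the official Go client. The mount path, role name, common name and TTL are illustrative assumptions, not Tarmak’s actual configuration.

```go
package main

import (
	"fmt"
	"log"

	vault "github.com/hashicorp/vault/api"
)

func main() {
	// Client configuration is read from VAULT_ADDR / VAULT_TOKEN by default.
	client, err := vault.NewClient(vault.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Request a short-lived certificate from a PKI secrets engine.
	// The mount path "pki" and role "kubelet" are placeholders.
	secret, err := client.Logical().Write("pki/issue/kubelet", map[string]interface{}{
		"common_name": "worker-0.cluster.local",
		"ttl":         "24h",
	})
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println(secret.Data["certificate"])
}
```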
In addition to the creation of these instances, an object store is populated
with Puppet manifests that are later used to spin up services on the
instances. The same manifests are distributed to all nodes in the cluster.
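As a rough illustration of this step, the sketch below downloads a manifest archive from an S3 object store using the AWS SDK for Go. The bucket name, key, region and local path are placeholders rather than the names Tarmak actually generates.

```go
package main

import (
	"io"
	"log"
	"os"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

func main() {
	// Region, bucket and key are illustrative only.
	sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("eu-west-1")}))
	svc := s3.New(sess)

	out, err := svc.GetObject(&s3.GetObjectInput{
		Bucket: aws.String("example-tarmak-manifests"),
		Key:    aws.String("puppet-manifests.tar.gz"),
	})
	if err != nil {
		log.Fatal(err)
	}
	defer out.Body.Close()

	// Write the manifest archive locally for the configuration stage to unpack.
	f, err := os.Create("/var/lib/wing/puppet-manifests.tar.gz")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	if _, err := io.Copy(f, out.Body); err != nil {
		log.Fatal(err)
	}
}
```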
Infrastructure layer
Configuration
The configuration phase starts when an instance is first started or a re-run is requested using Tarmak. Wing fetches the latest Puppet manifests from the object store and applies them on the instance until the manifests have converged. Meanwhile, Wing sends status updates to the Wing apiserver.
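A minimal sketch of what such a converge loop might look like is shown below: it re-runs puppet apply with --detailed-exitcodes until a run reports no further changes. It is illustrative only and not Wing’s actual implementation; the manifest path and retry policy are assumptions.

```go
package main

import (
	"fmt"
	"log"
	"os/exec"
	"time"
)

// converge re-runs `puppet apply` until a run reports no further changes.
// With --detailed-exitcodes, puppet exits 0 for "no changes", 2 for
// "changes applied", and 4 or 6 when failures occurred.
func converge(manifestDir string, maxRuns int) error {
	for run := 1; run <= maxRuns; run++ {
		err := exec.Command("puppet", "apply", "--detailed-exitcodes", manifestDir).Run()

		exitCode := 0
		if exitErr, ok := err.(*exec.ExitError); ok {
			exitCode = exitErr.ExitCode()
		} else if err != nil {
			return err // puppet could not be started at all
		}

		switch exitCode {
		case 0:
			log.Printf("converged after %d run(s)", run)
			return nil
		case 2:
			log.Printf("run %d applied changes, re-running to confirm convergence", run)
		default:
			log.Printf("run %d reported failures (exit code %d), retrying", run, exitCode)
		}

		// A status report to a Wing-style apiserver could be sent here.
		time.Sleep(10 * time.Second)
	}
	return fmt.Errorf("did not converge after %d runs", maxRuns)
}

func main() {
	// The manifest path is a placeholder, not Wing's real layout.
	if err := converge("/var/lib/wing/puppet/manifests", 5); err != nil {
		log.Fatal(err)
	}
}
```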
The Puppet manifests are designed so as not to require Puppet once any required changes have been applied. The startup of the services is managed using standard systemd units, and timers are used for recurring tasks, like the renewal of certificates.
The Puppet modules powering these configuration steps have been implemented in
cooperation with Compare the Market — this
should also explain the ‘Meerkats’ in the talk title! 🙂
Configuration layer
You can get started with Tarmak by following our AWS getting started
guide.
We’d love to hear feedback and take contributions in the Tarmak project (Apache 2.0 licensed) on GitHub.
We are actively working on making Tarmak more accessible to external
contributors. Our next steps are:
- Splitting out the Puppet modules into separate repositories.
- Moving issue tracking (GitHub) and CI (Travis CI) out into the open.
- Improving documentation.
In our next blog post we’ll explain why Tarmak excels at quick and non-disruptive Kubernetes cluster upgrades, using the power of Wing – stay tuned!