{"id":388,"date":"2018-10-16T17:32:44","date_gmt":"2018-10-16T17:32:44","guid":{"rendered":"https:\/\/www.appservgrid.com\/paw93\/?p=388"},"modified":"2018-10-17T09:07:09","modified_gmt":"2018-10-17T09:07:09","slug":"introducing-tarmak-the-toolkit-for-kubernetes-cluster-provisioning-and-management","status":"publish","type":"post","link":"https:\/\/www.appservgrid.com\/paw93\/index.php\/2018\/10\/16\/introducing-tarmak-the-toolkit-for-kubernetes-cluster-provisioning-and-management\/","title":{"rendered":"Introducing Tarmak &#8211; the toolkit for Kubernetes cluster provisioning and management \/"},"content":{"rendered":"<p>By <a target=\"\">Christian Simon<\/a><\/p>\n<p>We are proud to introduce <a href=\"https:\/\/github.com\/jetstack\/tarmak\">Tarmak<\/a>, an open source toolkit for Kubernetes cluster lifecycle management that focuses on best practice cluster security, management<br \/>\nand operation. It has been built from the ground-up to be cloud provider-agnostic and provides a means for consistent and reliable cluster deployment and management, across clouds and on-premises environments.<\/p>\n<p>This blog post is a follow-up to a talk <a href=\"https:\/\/twitter.com\/mattbates25\">Matt<br \/>\nBates<\/a> and I gave at <a href=\"https:\/\/puppetconf17.sched.com\/event\/B4wW\">PuppetConf<br \/>\n2017<\/a>. The slides can be<br \/>\nfound<br \/>\n<a href=\"https:\/\/www.slideshare.net\/mjbarks\/from-rollercoasters-to-meerkats-3-generations-of-production-kubernetes-clusters-80772675\">here<\/a><br \/>\nand a recording of the session can be found at the end of this post (click <a href=\"#watch-recording\">here<\/a> to watch).<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/blog.jetstack.io\/blog\/introducing-tarmak\/logo-400px.png\" alt=\"Tarmak logo\" \/><\/p>\n<p>Jetstack have extensive experience deploying Kubernetes into production with many different<br \/>\nclients. 
We have learned what works (and importantly, what does not work so well)<br \/>\nand worked through several generations of cluster deployment. In the <a href=\"https:\/\/www.slideshare.net\/mjbarks\/from-rollercoasters-to-meerkats-3-generations-of-production-kubernetes-clusters-80772675\">talk<\/a>, we described these<br \/>\nchallenges. To summarise:<\/p>\n<ul>\n<li>Immutable infrastructure isn\u2019t always that desirable.<\/li>\n<li>Ability to test and debug is critical for development and operations.<\/li>\n<li>Dependencies need to be versioned.<\/li>\n<li>Cluster PKI management in dynamic environments is not easy.<\/li>\n<\/ul>\n<p>Tarmak and its underlying components are the product of Jetstack\u2019s work with<br \/>\nits customers to build and deploy Kubernetes in production at scale.<\/p>\n<p>In this post, we\u2019ll explore the lessons we learned and the motivations for Tarmak, and<br \/>\ndive into the tools and the provisioning mechanics. Firstly, the motivations that were born out of the lessons learned:<\/p>\n<h2>Improved developer and operator experience<\/h2>\n<p>A major goal for the tooling was to provide an easy-to-use and natural UX &#8211; for both<br \/>\ndevelopers and operators.<\/p>\n<p>In previous generations of cluster deployment, one area of concern with replacing<br \/>\ninstances on every configuration change was the long and expensive<br \/>\nfeedback loop. It took significant time for a code change to be deployed into a real-world<br \/>\ncluster, and a simple careless mistake in a JSON file could take up to 30 minutes to notice<br \/>\nand fix. Testing all the code involved at multiple levels (unit, integration) helps to<br \/>\ncatch, early on, errors that would prevent a cluster from building.<\/p>\n<p>Another problem, especially with the Bash scripts, was that whilst they would<br \/>\nwork fine with one specific configuration, once you had some input<br \/>\nparameters they were really hard to maintain. 
Scripts were modified and duplicated,<br \/>\nand this quickly became difficult to maintain effectively. So our goal for the new project was<br \/>\nto follow coding best practices: \u201cDon\u2019t repeat yourself\u201d<br \/>\n(<a href=\"https:\/\/en.wikipedia.org\/wiki\/Don%27t_repeat_yourself\">DRY<\/a>) and \u201cKeep it<br \/>\nsimple, stupid\u201d (<a href=\"https:\/\/en.wikipedia.org\/wiki\/KISS_principle\">KISS<\/a>). This<br \/>\nhelps to reduce the complexity of later changes and helps to achieve a modular<br \/>\ndesign.<\/p>\n<p>When instances are replaced on every configuration change, it\u2019s not easy to<br \/>\nsee which changes are about to happen to an instance\u2019s configuration. A dry-run<br \/>\ncapability would give better insight into the changes that will be performed.<\/p>\n<p>Another important observation was that using a more traditional approach of running<br \/>\nsoftware helps engineers to transition more smoothly into a container-centric<br \/>\nworld. Whilst Kubernetes can be used to \u201cself-host\u201d its own components, we recognised<br \/>\nthat there is greater familiarity (at this stage) with tried-and-tested, traditional tools in<br \/>\noperations teams, so we adopted systemd and use the vanilla open source Kubernetes<br \/>\nbinaries.<\/p>\n<h2>Less disruptive cluster upgrades<\/h2>\n<p>In many cases with existing tooling, cluster upgrades involve replacing instances; when you want to change something, the entire instance is replaced with a new one that contains the new configuration. 
A number of limitations started to emerge from this strategy.<\/p>\n<ul>\n<li>Replacing instances can be expensive in both time and cost, especially in large clusters.<\/li>\n<li>There is no control over our rolled-out instances &#8211; their actual state might<br \/>\nhave diverged from the desired state.<\/li>\n<li>Draining Kubernetes worker instances is often quite a manual process.<\/li>\n<li>Every replacement comes with risks: someone might use latest tags, or the configuration<br \/>\nmight no longer be valid.<\/li>\n<li>Cached content is lost throughout the whole cluster and needs to be rebuilt.<\/li>\n<li>Stateful applications need to migrate data over to other instances (and this is often<br \/>\na resource-intensive process for some applications).<\/li>\n<\/ul>\n<p>Tarmak has been designed with these factors in mind. We support both in-place upgrades and full instance replacement. This allows operators to choose how they would like their clusters to be upgraded, to ensure that whatever cluster-level operation they are undertaking, it is performed in the least disruptive way possible.<\/p>\n<h2>Consistency between environments<\/h2>\n<p>Another goal for the new tools was that they should<br \/>\nprovide a consistent deployment across different cloud providers and<br \/>\non-premises setups. 
We consistently hear from customers that they do not wish to skill up<br \/>\noperations teams in a multitude of provisioning tools and techniques, not least because<br \/>\nof the operational risk this poses when trying to reason about cluster configuration and<br \/>\nhealth at times of failure.<\/p>\n<p>With Tarmak, we have developed a tool to address these<br \/>\nchallenges.<\/p>\n<p>We identified Infrastructure, Configuration and Application as the three core<br \/>\nlayers of set-up in a Kubernetes cluster.<\/p>\n<ul>\n<li>Infrastructure: all core resources (like compute, network,<br \/>\nstorage) are created and configured to be able to work together. We use<br \/>\nTerraform to plan and execute these changes. At the end of this stage, the infrastructure is<br \/>\nready to run our own bespoke \u2018Tarmak instance agent\u2019 (Wing), required for the<br \/>\nconfiguration stage.<\/li>\n<li>Configuration: The Wing agent is at the core of the configuration layer and<br \/>\nuses Puppet manifests to configure all instances in a cluster accordingly. After<br \/>\nWing has run, it sends reports back to the Wing apiserver, which can be run in<br \/>\na highly available configuration. Once all instances in a cluster have successfully<br \/>\nexecuted Wing, the Kubernetes cluster is up and running and provides its API as an interface.<\/li>\n<li>Applications: The core cluster add-ons are deployed with the help of Puppet. 
Any other tool<br \/>\nlike kubectl or Helm can also be used to manage the lifecycle of these applications on the<br \/>\ncluster.<\/li>\n<\/ul>\n<p><img decoding=\"async\" src=\"https:\/\/blog.jetstack.io\/blog\/introducing-tarmak\/abstractions-solutions.png\" alt=\"Abstractions and chosen solutions\" \/><\/p>\n<p>Abstractions and chosen tools<\/p>\n<h2>Infrastructure<\/h2>\n<p>As part of the Infrastructure provisioning stage, we use Terraform to set up<br \/>\ninstances that are later configured to fulfil one of the following roles:<\/p>\n<ul>\n<li>Bastion is the only node that has a public IP address assigned. It is<br \/>\nused as a \u201cjump host\u201d to connect to services on the private networks of clusters.<br \/>\nIt also runs the Wing apiserver responsible for aggregating the state information of<br \/>\ninstances.<\/li>\n<li>Vault instances provide a dynamic CA (Certificate Authority)-as-a-service for the<br \/>\nvarious cluster components that rely on TLS authentication. They also run Consul as a backend<br \/>\nfor Vault and store its data on persistent disks, encrypted and secured.<\/li>\n<li>etcd instances store the state for the Kubernetes control plane. They<br \/>\nhave persistent disks and run three separate etcd clusters in HA (i.e. 3+ instances): one for Kubernetes,<br \/>\nanother one dedicated to Kubernetes\u2019 events and the third for the overlay<br \/>\nnetwork (<a href=\"https:\/\/github.com\/projectcalico\">Calico<\/a>, by default).<\/li>\n<li>Kubernetes Masters run the Kubernetes control plane components in a highly available<br \/>\nconfiguration.<\/li>\n<li>Kubernetes Workers run your organisation\u2019s application workloads.<\/li>\n<\/ul>\n<p>In addition to the creation of these instances, an object store is populated<br \/>\nwith Puppet manifests that are later used to spin up services on the<br \/>\ninstances. 
The same manifests are distributed to all nodes in the cluster.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/blog.jetstack.io\/blog\/introducing-tarmak\/infra-layer.png\" alt=\"Infrastructure layer\" \/><\/p>\n<p>Infrastructure layer<\/p>\n<h2>Configuration<\/h2>\n<p>The configuration phase starts when an instance is started or a re-run is<br \/>\nrequested using Tarmak. Wing fetches the latest Puppet manifests from the<br \/>\nobject store and applies them to the instance until its state has<br \/>\nconverged. Meanwhile, Wing sends status updates to the Wing apiserver.<\/p>\n<p>The Puppet manifests are designed so as not to require Puppet once any required<br \/>\nchanges have been applied. The startup of the services is managed using standard<br \/>\nsystemd units, and timers are used for recurring tasks like the renewal of<br \/>\ncertificates.<\/p>\n<p>The Puppet modules powering these configuration steps have been implemented in<br \/>\ncooperation with <a href=\"https:\/\/www.comparethemarket.com\/\">Compare the Market<\/a> &#8211; this<br \/>\nshould also explain the \u2018Meerkats\u2019 in the talk title! \ud83d\ude42<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/blog.jetstack.io\/blog\/introducing-tarmak\/config-layer.png\" alt=\"Configuration layer\" \/><\/p>\n<p>Configuration layer<\/p>\n<p>You can get started with Tarmak by following our <a href=\"http:\/\/docs.tarmak.io\/en\/latest\/user-guide.html#getting-started-with-aws\">AWS getting started<br \/>\nguide<\/a>.<\/p>\n<p>We\u2019d love to hear feedback and take contributions in the <a href=\"https:\/\/github.com\/jetstack\/tarmak\">Tarmak<\/a> project (Apache 2.0 licensed) on GitHub.<\/p>\n<p>We are actively working on making Tarmak more accessible to external<br \/>\ncontributors. 
Our next steps are:<\/p>\n<ul>\n<li>Splitting out the Puppet modules into separate repositories.<\/li>\n<li>Moving issue tracking (GitHub) and CI (Travis CI) out into the open.<\/li>\n<li>Improving the documentation.<\/li>\n<\/ul>\n<p>In our next blog post we\u2019ll explain why Tarmak excels at quick and non-disruptive Kubernetes cluster upgrades, using the power of Wing &#8211; stay tuned!<\/p>\n<p><a href=\"https:\/\/blog.jetstack.io\/blog\/introducing-tarmak\/\" target=\"_blank\" rel=\"noopener\">Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>By Christian Simon We are proud to introduce Tarmak, an open source toolkit for Kubernetes cluster lifecycle management that focuses on best practice cluster security, management and operation. It has been built from the ground-up to be cloud provider-agnostic and provides a means for consistent and reliable cluster deployment and management, across clouds and on-premises &hellip; <\/p>\n<p class=\"link-more\"><a href=\"https:\/\/www.appservgrid.com\/paw93\/index.php\/2018\/10\/16\/introducing-tarmak-the-toolkit-for-kubernetes-cluster-provisioning-and-management\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Introducing Tarmak &#8211; the toolkit for Kubernetes cluster provisioning and management 
\/&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[3],"tags":[],"class_list":["post-388","post","type-post","status-publish","format-standard","hentry","category-kubernetes"],"_links":{"self":[{"href":"https:\/\/www.appservgrid.com\/paw93\/index.php\/wp-json\/wp\/v2\/posts\/388","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.appservgrid.com\/paw93\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.appservgrid.com\/paw93\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.appservgrid.com\/paw93\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.appservgrid.com\/paw93\/index.php\/wp-json\/wp\/v2\/comments?post=388"}],"version-history":[{"count":1,"href":"https:\/\/www.appservgrid.com\/paw93\/index.php\/wp-json\/wp\/v2\/posts\/388\/revisions"}],"predecessor-version":[{"id":534,"href":"https:\/\/www.appservgrid.com\/paw93\/index.php\/wp-json\/wp\/v2\/posts\/388\/revisions\/534"}],"wp:attachment":[{"href":"https:\/\/www.appservgrid.com\/paw93\/index.php\/wp-json\/wp\/v2\/media?parent=388"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.appservgrid.com\/paw93\/index.php\/wp-json\/wp\/v2\/categories?post=388"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.appservgrid.com\/paw93\/index.php\/wp-json\/wp\/v2\/tags?post=388"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}