{"id":576,"date":"2018-10-17T17:08:01","date_gmt":"2018-10-17T17:08:01","guid":{"rendered":"https:\/\/www.appservgrid.com\/paw93\/?p=576"},"modified":"2018-10-17T17:28:35","modified_gmt":"2018-10-17T17:28:35","slug":"recover-rancher-kubernetes-cluster-from-a-backup","status":"publish","type":"post","link":"https:\/\/www.appservgrid.com\/paw93\/index.php\/2018\/10\/17\/recover-rancher-kubernetes-cluster-from-a-backup\/","title":{"rendered":"Recover Rancher Kubernetes cluster from a Backup"},"content":{"rendered":"<h5>Take a deep dive into Best Practices in Kubernetes Networking<\/h5>\n<p>From overlay networking and SSL to ingress controllers and network security policies, we&#8217;ve seen many users get hung up on Kubernetes networking challenges. In this video recording, we dive into Kubernetes networking, and discuss best practices for a wide variety of deployment options.<\/p>\n<p><a href=\"https:\/\/rancher.com\/events\/2018\/kubernetes-networking-masterclass-june-online-meetup\/\" target=\"_blank\">Watch the video<\/a><\/p>\n<p>Etcd is a highly available distributed key-value store that provides a reliable way to store data across machines; more importantly, it is used as Kubernetes\u2019 backing store for all of a cluster\u2019s data.<\/p>\n<p>In this post we are going to discuss how to back up etcd and how to recover from a backup to restore operations to a Kubernetes cluster.<\/p>\n<h2>Etcd in Rancher 1.6<\/h2>\n<p>In Rancher 1.6 we use our own Docker <a href=\"https:\/\/github.com\/rancher\/rancher-etcd\">image<\/a> for etcd, which pulls the official etcd image and adds some scripts and Go binaries for orchestration, backup, disaster recovery, and healthchecks.<\/p>\n<p>The scripts communicate with Rancher\u2019s metadata service to get important information, such as how many etcd instances are running in the cluster and which one is the leader. In Rancher 1.6, we introduced etcd backup, which runs alongside the main etcd in the background. 
This service is responsible for backup operations.<\/p>\n<p>The backup service performs rolling backups of etcd at a specified interval and also supports retention of old backups. Rancher-etcd does this by exposing three environment variables on the Docker image:<\/p>\n<ul>\n<li>EMBEDDED_BACKUPS: boolean variable to enable\/disable backups.<\/li>\n<li>BACKUP_PERIOD: etcd will perform backups at this time interval.<\/li>\n<li>BACKUP_RETENTION: etcd will retain backups for this time interval.<\/li>\n<\/ul>\n<p>Backups are stored under \/var\/etcd\/backups on the host and are taken using the following command:<\/p>\n<p>etcdctl backup --data-dir &lt;dataDir&gt; --backup-dir &lt;backupDir&gt;<\/p>\n<p>To configure the backup operations for etcd in Rancher 1.6, you must supply the environment variables mentioned above in the Kubernetes configuration template.<\/p>\n<p>After configuring and launching Kubernetes, etcd should automatically take backups every 15 minutes by default.<\/p>\n<h2>Restoring a backup<\/h2>\n<p>Recovering etcd from a backup in Rancher 1.6 requires that backup data be present on the host so it can be copied into the volume created for etcd. 
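<p>Concretely, the rolling backup described in the previous section boils down to an etcdctl invocation like the following (a sketch; the data-dir path is an assumption, and the backup-dir naming mirrors the timestamped directories shown in the listing that follows):<\/p>

```shell
# Sketch of the rolling backup rancher-etcd performs in the background.
# DATA_DIR is an assumed path; the backup-dir pattern matches the
# timestamped directory names produced by the backup service.
DATA_DIR=/var/etcd/data
BACKUP_DIR=/var/etcd/backups/$(date -u +%Y-%m-%dT%H:%M:%SZ)_etcd_1
etcdctl backup --data-dir $DATA_DIR --backup-dir $BACKUP_DIR
```

<p>Note that the v2 etcdctl backup command rewrites node and cluster IDs in the copied data directory, which is what allows the result to safely seed a fresh single-node cluster later.<\/p>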
For example, if you have 3 nodes and you have backups created in the \/var\/etcd\/backups directory:<\/p>\n<p># ls \/var\/etcd\/backups\/ -l<br \/>\ntotal 44<br \/>\ndrwx------ 3 root root 4096 Apr 9 15:03 2018-04-09T15:03:54Z_etcd_1<br \/>\ndrwx------ 3 root root 4096 Apr 9 15:05 2018-04-09T15:05:54Z_etcd_1<br \/>\ndrwx------ 3 root root 4096 Apr 9 15:07 2018-04-09T15:07:54Z_etcd_1<br \/>\ndrwx------ 3 root root 4096 Apr 9 15:09 2018-04-09T15:09:54Z_etcd_1<br \/>\ndrwx------ 3 root root 4096 Apr 9 15:11 2018-04-09T15:11:54Z_etcd_1<br \/>\ndrwx------ 3 root root 4096 Apr 9 15:13 2018-04-09T15:13:54Z_etcd_1<br \/>\ndrwx------ 3 root root 4096 Apr 9 15:15 2018-04-09T15:15:54Z_etcd_1<br \/>\ndrwx------ 3 root root 4096 Apr 9 15:17 2018-04-09T15:17:54Z_etcd_1<br \/>\ndrwx------ 3 root root 4096 Apr 9 15:19 2018-04-09T15:19:54Z_etcd_1<br \/>\ndrwx------ 3 root root 4096 Apr 9 15:21 2018-04-09T15:21:54Z_etcd_1<br \/>\ndrwx------ 3 root root 4096 Apr 9 15:23 2018-04-09T15:23:54Z_etcd_1<\/p>\n<p>Then you should be able to restore etcd. Start with only one node, so that a single etcd instance restores from the backup; the remaining etcd instances will then join the cluster. To begin the restoration, use the following steps:<\/p>\n<p>target=2018-04-09T15:23:54Z_etcd_1<br \/>\ndocker volume create --name etcd<br \/>\ndocker run -d -v etcd:\/data --name etcd-restore busybox<br \/>\ndocker cp \/var\/etcd\/backups\/$target etcd-restore:\/data\/data.current<br \/>\ndocker rm etcd-restore<\/p>\n<p>The next step is to start Kubernetes on this node normally.<\/p>\n<p>After that you can add new hosts to the setup. 
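<p>The manual seeding steps above can be wrapped in a small script that picks the most recent backup automatically (a sketch, assuming the default \/var\/etcd\/backups path and the etcd-restore container name from the example):<\/p>

```shell
# Seed a fresh Docker volume with the newest rolling backup before starting
# the first node. Backup directory names begin with an ISO-8601 UTC
# timestamp, so a plain lexicographic sort puts the most recent one last.
BACKUP_DIR=/var/etcd/backups
target=$(ls $BACKUP_DIR | sort | tail -n 1)
echo Seeding etcd volume from $target

docker volume create --name etcd
docker run -d -v etcd:/data --name etcd-restore busybox
docker cp $BACKUP_DIR/$target etcd-restore:/data/data.current
docker rm -f etcd-restore
```

<p>Because the timestamps sort lexicographically, no date parsing is needed. On a multi-node 1.6 setup, run this only on the single node you bring up first.<\/p>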
Note that you have to make sure that new hosts don\u2019t have etcd volumes.<\/p>\n<p>It\u2019s also preferable to mount the etcd backup directory on an NFS mount point so that if the hosts go down for any reason, the backups created for etcd are unaffected.<\/p>\n<h2>Etcd in Rancher 2.0<\/h2>\n<p>Rancher recently announced <a href=\"https:\/\/rancher.com\/blog\/2018\/2018-05-01-rancher-ga-announcement-sheng-liang\/\">GA for Rancher 2.0<\/a>, which is now ready for production deployments. Rancher 2.0 provides unified cluster management for different cloud providers including GKE, AKS, and EKS, as well as providers that do not yet support a managed Kubernetes service.<\/p>\n<p>Starting with RKE v0.1.7, users can enable automatic recurring etcd snapshots and can restore etcd from a snapshot stored on the cluster instances.<\/p>\n<p>In this section we will explain how to back up and restore your Rancher installation on an RKE-managed cluster. The steps for this kind of Rancher installation are explained in the <a href=\"https:\/\/rancher.com\/docs\/rancher\/v2.x\/en\/installation\/ha-server-install\/\">official documentation<\/a> in more detail.<\/p>\n<h3>After Rancher Installation<\/h3>\n<p>After you install Rancher using RKE as explained in the documentation, you should see output similar to the following when you execute this command:<\/p>\n<p># kubectl get pods --all-namespaces<br \/>\nNAMESPACE NAME READY STATUS RESTARTS AGE<br \/>\ncattle-system cattle-859b6cdc6b-tns6g 1\/1 Running 0 19s<br \/>\ningress-nginx default-http-backend-564b9b6c5b-7wbkx 1\/1 Running 0 25s<br \/>\ningress-nginx nginx-ingress-controller-shpn4 1\/1 Running 0 25s<br \/>\nkube-system canal-5xj2r 3\/3 Running 0 37s<br \/>\nkube-system kube-dns-5ccb66df65-c72t9 3\/3 Running 0 31s<br \/>\nkube-system kube-dns-autoscaler-6c4b786f5-xtj26 1\/1 Running 0 30s<\/p>\n<p>You will notice that the cattle pod is up and running in the cattle-system namespace; this pod is the Rancher server installed as a 
Kubernetes deployment.<\/p>\n<h3>RKE etcd Snapshots<\/h3>\n<p>RKE introduced two commands to save and restore etcd snapshots of a running RKE cluster; the two commands are:<\/p>\n<p>rke etcd snapshot-save --config &lt;config-path&gt; --name &lt;snapshot-name&gt;<\/p>\n<p>and<\/p>\n<p>rke etcd snapshot-restore --config &lt;config-path&gt; --name &lt;snapshot-name&gt;<\/p>\n<p>For more information about etcd snapshot save\/restore in RKE, please refer to the official <a href=\"https:\/\/rancher.com\/docs\/rancher\/v2.x\/en\/installation\/backups-and-restoration\/ha-backup-and-restoration\/\">documentation<\/a>.<\/p>\n<p>First, we will take a snapshot of the etcd running on the cluster. To do that, let\u2019s run the following command:<\/p>\n<p># rke etcd snapshot-save --name rancher.snapshot --config cluster.yml<br \/>\nINFO[0000] Starting saving snapshot on etcd hosts<br \/>\nINFO[0000] [dialer] Setup tunnel for host [x.x.x.x]<br \/>\nINFO[0003] [etcd] Saving snapshot [rancher.snapshot] on host [x.x.x.x]<br \/>\nINFO[0004] [etcd] Successfully started [etcd-snapshot-once] container on host [x.x.x.x]<br \/>\nINFO[0010] Finished saving snapshot [rancher.snapshot] on all etcd hosts<\/p>\n<h3>RKE etcd snapshot restore<\/h3>\n<p>If the Kubernetes cluster fails for any reason, we can restore it from the snapshot we took, using the following command:<\/p>\n<p># rke etcd snapshot-restore --name rancher.snapshot --config cluster.yml<\/p>\n<p>INFO[0000] Starting restoring snapshot on etcd hosts<br \/>\nINFO[0000] [dialer] Setup tunnel for host [x.x.x.x]<br \/>\nINFO[0001] [remove\/etcd] Successfully removed container on host [x.x.x.x]<br \/>\nINFO[0001] [hosts] Cleaning up host [x.x.x.x]<br \/>\nINFO[0001] [hosts] Running cleaner container on host [x.x.x.x]<br \/>\nINFO[0002] [kube-cleaner] Successfully started [kube-cleaner] container on host [x.x.x.x]<br \/>\nINFO[0002] [hosts] Removing cleaner container on host [x.x.x.x]<br 
\/>\nINFO[0003] [hosts] Successfully cleaned up host [x.x.x.x]<br \/>\nINFO[0003] [etcd] Restoring [rancher.snapshot] snapshot on etcd host [x.x.x.x]<br \/>\nINFO[0003] [etcd] Successfully started [etcd-restore] container on host [x.x.x.x]<br \/>\nINFO[0004] [etcd] Building up etcd plane..<br \/>\nINFO[0004] [etcd] Successfully started [etcd] container on host [x.x.x.x]<br \/>\nINFO[0005] [etcd] Successfully started [rke-log-linker] container on host [x.x.x.x]<br \/>\nINFO[0006] [remove\/rke-log-linker] Successfully removed container on host [x.x.x.x]<br \/>\nINFO[0006] [etcd] Successfully started etcd plane..<br \/>\nINFO[0007] Finished restoring snapshot [rancher.snapshot] on all etcd hosts<\/p>\n<blockquote><p>Notes<br \/>\nThere are some important notes about the etcd restore process in RKE:<\/p><\/blockquote>\n<h4>1. Restarting Kubernetes components<\/h4>\n<p>After restoring the cluster, you have to restart the Kubernetes components on all nodes; otherwise, there will be conflicts with the resource versions of objects stored in etcd. This includes restarting both the Kubernetes components and the network components. For more information, please refer to the <a href=\"https:\/\/Kubernetes.io\/docs\/tasks\/administer-cluster\/configure-upgrade-etcd\/#etcd-upgrade-requirements\">Kubernetes documentation<\/a>. To restart the Kubernetes components, you can run the following on each node:<\/p>\n<p>docker restart kube-apiserver kubelet kube-controller-manager kube-scheduler kube-proxy<br \/>\ndocker ps | grep flannel | cut -f 1 -d &quot; &quot; | xargs docker restart<br \/>\ndocker ps | grep calico | cut -f 1 -d &quot; &quot; | xargs docker restart<\/p>\n<h4>2. 
Restoring etcd on a multi-node cluster<\/h4>\n<p>If you are restoring etcd on a cluster with multiple etcd nodes, the exact same snapshot must be present in \/opt\/rke\/etcd-snapshots on every node. Because rke etcd snapshot-save takes a different snapshot on each node, you will need to copy one of the created snapshots manually to all nodes before restoring.<\/p>\n<h4>3. Invalidated service account tokens<\/h4>\n<p>Restoring etcd on a new Kubernetes cluster with new certificates is not currently supported, because the new cluster will contain different private keys, which are used to sign the tokens of all service accounts. This may cause a lot of problems for pods that communicate directly with the kube-apiserver.<\/p>\n<h2>Conclusion<\/h2>\n<p>In this post we saw how backups can be created and restored for etcd in Kubernetes clusters in both Rancher 1.6.x and 2.0.x. Etcd snapshots can be managed in 1.6 using Rancher\u2019s etcd image and in 2.0 using the RKE CLI.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/rancher.com\/img\/bio\/bio-user.jpg\" alt=\"Hussein Galal\" width=\"100\" height=\"100\" \/><\/p>\n<p>Hussein Galal<\/p>\n<p>DevOps Engineer<\/p>\n<p><a href=\"https:\/\/rancher.com\/blog\/2018\/2018-05-30-recover-rancher-kubernetes-cluster-from-backup\/\" target=\"_blank\" rel=\"noopener\">Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Take a deep dive into Best Practices in Kubernetes Networking From overlay networking and SSL to ingress controllers and network security policies, we&#8217;ve seen many users get hung up on Kubernetes networking challenges. In this video recording, we dive into Kubernetes networking, and discuss best practices for a wide variety of deployment options. 
Watch the &hellip; <\/p>\n<p class=\"link-more\"><a href=\"https:\/\/www.appservgrid.com\/paw93\/index.php\/2018\/10\/17\/recover-rancher-kubernetes-cluster-from-a-backup\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Recover Rancher Kubernetes cluster from a Backup&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[3],"tags":[],"class_list":["post-576","post","type-post","status-publish","format-standard","hentry","category-kubernetes"],"_links":{"self":[{"href":"https:\/\/www.appservgrid.com\/paw93\/index.php\/wp-json\/wp\/v2\/posts\/576","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.appservgrid.com\/paw93\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.appservgrid.com\/paw93\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.appservgrid.com\/paw93\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.appservgrid.com\/paw93\/index.php\/wp-json\/wp\/v2\/comments?post=576"}],"version-history":[{"count":1,"href":"https:\/\/www.appservgrid.com\/paw93\/index.php\/wp-json\/wp\/v2\/posts\/576\/revisions"}],"predecessor-version":[{"id":578,"href":"https:\/\/www.appservgrid.com\/paw93\/index.php\/wp-json\/wp\/v2\/posts\/576\/revisions\/578"}],"wp:attachment":[{"href":"https:\/\/www.appservgrid.com\/paw93\/index.php\/wp-json\/wp\/v2\/media?parent=576"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.appservgrid.com\/paw93\/index.php\/wp-json\/wp\/v2\/categories?post=576"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.appservgrid.com\/paw93\/index.php\/wp-json\/wp\/v2\/tags?post=576"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}