April 2019 – Art2Dec SoftLab

The Local Persistent Volumes feature has been promoted to GA in Kubernetes 1.14. It was first introduced as alpha in Kubernetes 1.7, and then beta in Kubernetes 1.10. The GA milestone indicates that Kubernetes users may depend on the feature and its API for production use. GA features are protected by the Kubernetes deprecation policy.

What is a Local Persistent Volume?

A local persistent volume represents a local disk directly-attached to a single Kubernetes Node.

Kubernetes provides a powerful volume plugin system that enables Kubernetes workloads to use a wide variety of block and file storage to persist data. Most of these plugins enable remote storage – these remote storage systems persist data independent of the Kubernetes node where the data originated. Remote storage usually can not offer the consistent high performance guarantees of local directly-attached storage. With the Local Persistent Volume plugin, Kubernetes workloads can now consume high performance local storage using the same volume APIs that app developers have become accustomed to.

How is it different from a HostPath Volume?

To better understand the benefits of a Local Persistent Volume, it is useful to compare it to a HostPath volume. HostPath volumes mount a file or directory from the host node’s filesystem into a Pod. Similarly a Local Persistent Volume mounts a local disk or partition into a Pod.

The biggest difference is that the Kubernetes scheduler understands which node a Local Persistent Volume belongs to. With HostPath volumes, a pod referencing a HostPath volume may be moved by the scheduler to a different node resulting in data loss. But with Local Persistent Volumes, the Kubernetes scheduler ensures that a pod using a Local Persistent Volume is always scheduled to the same node.

While HostPath volumes may be referenced via a Persistent Volume Claim (PVC) or directly inline in a pod definition, Local Persistent Volumes can only be referenced via a PVC. This provides additional security benefits since Persistent Volume objects are managed by the administrator, preventing Pods from being able to access any path on the host.

Additional benefits include support for formatting of block devices during mount, and volume ownership using fsGroup.

What’s New With GA?

Since 1.10, we have mainly focused on improving stability and scalability of the feature so that it is production ready.

The only major feature addition is the ability to specify a raw block device and have Kubernetes automatically format and mount the filesystem. This reduces the previous burden of having to format and mount devices before giving it to Kubernetes.

Limitations of GA

At GA, Local Persistent Volumes do not support dynamic volume provisioning. However there is an external controller available to help manage the local PersistentVolume lifecycle for individual disks on your nodes. This includes creating the PersistentVolume objects, cleaning up and reusing disks once they have been released by the application.

How to Use a Local Persistent Volume?

Workloads can request a local persistent volume using the same PersistentVolumeClaim interface as remote storage backends. This makes it easy to swap out the storage backend across clusters, clouds, and on-prem environments.

First, a StorageClass should be created that sets volumeBindingMode: WaitForFirstConsumer to enable volume topology-aware scheduling. This mode instructs Kubernetes to wait to bind a PVC until a Pod using it is scheduled.

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: local-storage
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer

Then, the external static provisioner can be configured and run to create PVs for all the local disks on your nodes.

$ kubectl get pv
NAME                CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS      CLAIM  STORAGECLASS   REASON      AGE
local-pv-27c0f084   368Gi      RWO            Delete           Available          local-storage              8s
local-pv-3796b049   368Gi      RWO            Delete           Available          local-storage              7s
local-pv-3ddecaea   368Gi      RWO            Delete           Available          local-storage              7s

Afterwards, workloads can start using the PVs by creating a PVC and Pod or a StatefulSet with volumeClaimTemplates.

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: local-test
spec:
  serviceName: "local-service"
  replicas: 3
  selector:
    matchLabels:
      app: local-test
  template:
    metadata:
      labels:
        app: local-test
    spec:
      containers:
      - name: test-container
        image: k8s.gcr.io/busybox
        command:
        - "/bin/sh"
        args:
        - "-c"
        - "sleep 100000"
        volumeMounts:
        - name: local-vol
          mountPath: /usr/test-pod
  volumeClaimTemplates:
  - metadata:
      name: local-vol
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: "local-storage"
      resources:
        requests:
          storage: 368Gi

Once the StatefulSet is up and running, the PVCs are all bound:

$ kubectl get pvc
NAME                     STATUS   VOLUME              CAPACITY   ACCESS MODES   STORAGECLASS      AGE
local-vol-local-test-0   Bound    local-pv-27c0f084   368Gi      RWO            local-storage     3m45s
local-vol-local-test-1   Bound    local-pv-3ddecaea   368Gi      RWO            local-storage     3m40s
local-vol-local-test-2   Bound    local-pv-3796b049   368Gi      RWO            local-storage     3m36s

When the disk is no longer needed, the PVC can be deleted. The external static provisioner will clean up the disk and make the PV available for use again.

$ kubectl patch sts local-test -p '{"spec":{"replicas":2}}'
statefulset.apps/local-test patched

$ kubectl delete pvc local-vol-local-test-2
persistentvolumeclaim "local-vol-local-test-2" deleted

$ kubectl get pv
NAME                CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS      CLAIM                            STORAGECLASS   REASON      AGE
local-pv-27c0f084   368Gi      RWO            Delete           Bound       default/local-vol-local-test-0   local-storage              11m
local-pv-3796b049   368Gi      RWO            Delete           Available                                    local-storage              7s
local-pv-3ddecaea   368Gi      RWO            Delete           Bound       default/local-vol-local-test-1   local-storage              19m

You can find full documentation for the feature on the Kubernetes website.

What Are Suitable Use Cases?

The primary benefit of Local Persistent Volumes over remote persistent storage is performance: local disks usually offer higher IOPS and throughput and lower latency compared to remote storage systems.

However, there are important limitations and caveats to consider when using Local Persistent Volumes:

Using local storage ties your application to a specific node, making your application harder to schedule. Applications which use local storage should specify a high priority so that lower priority pods, that don’t require local storage, can be preempted if necessary.
If that node or local volume encounters a failure and becomes inaccessible, then that pod also becomes inaccessible. Manual intervention, external controllers, or operators may be needed to recover from these situations.
While most remote storage systems implement synchronous replication, most local disk offerings do not provide data durability guarantees. Meaning loss of the disk or node may result in loss of all the data on that disk

For these reasons, local persistent storage should only be considered for workloads that handle data replication and backup at the application layer, thus making the applications resilient to node or data failures and unavailability despite the lack of such guarantees at the individual disk level.

Examples of good workloads include software defined storage systems and replicated databases. Other types of applications should continue to use highly available, remotely accessible, durable storage.

How Uber Uses Local Storage

M3, Uber’s in-house metrics platform, piloted Local Persistent Volumes at scale in an effort to evaluate M3DB — an open-source, distributed timeseries database created by Uber. One of M3DB’s notable features is its ability to shard its metrics into partitions, replicate them by a factor of three, and then evenly disperse the replicas across separate failure domains.

Prior to the pilot with local persistent volumes, M3DB ran exclusively in Uber-managed environments. Over time, internal use cases arose that required the ability to run M3DB in environments with fewer dependencies. So the team began to explore options. As an open-source project, we wanted to provide the community with a way to run M3DB as easily as possible, with an open-source stack, while meeting M3DB’s requirements for high throughput, low-latency storage, and the ability to scale itself out.

The Kubernetes Local Persistent Volume interface, with its high-performance, low-latency guarantees, quickly emerged as the perfect abstraction to build on top of. With Local Persistent Volumes, individual M3DB instances can comfortably handle up to 600k writes per-second. This leaves plenty of headroom for spikes on clusters that typically process a few million metrics per-second.

Because M3DB also gracefully handles losing a single node or volume, the limited data durability guarantees of Local Persistent Volumes are not an issue. If a node fails, M3DB finds a suitable replacement and the new node begins streaming data from its two peers.

Thanks to the Kubernetes scheduler’s intelligent handling of volume topology, M3DB is able to programmatically evenly disperse its replicas across multiple local persistent volumes in all available cloud zones, or, in the case of on-prem clusters, across all available server racks.

Uber’s Operational Experience

As mentioned above, while Local Persistent Volumes provide many benefits, they also require careful planning and careful consideration of constraints before committing to them in production. When thinking about our local volume strategy for M3DB, there were a few things Uber had to consider.

For one, we had to take into account the hardware profiles of the nodes in our Kubernetes cluster. For example, how many local disks would each node cluster have? How would they be partitioned?

The local static provisioner README provides guidance to help answer these questions. It’s best to be able to dedicate a full disk to each local volume (for IO isolation) and a full partition per-volume (for capacity isolation). This was easier in our cloud environments where we could mix and match local disks. However, if using local volumes on-prem, hardware constraints may be a limiting factor depending on the number of disks available and their characteristics.

When first testing local volumes, we wanted to have a thorough understanding of the effect disruptions (voluntary and involuntary) would have on pods using local storage, and so we began testing some failure scenarios. We found that when a local volume becomes unavailable while the node remains available (such as when performing maintenance on the disk), a pod using the local volume will be stuck in a ContainerCreating state until it can mount the volume. If a node becomes unavailable, for example if it is removed from the cluster or is drained, then pods using local volumes on that node are stuck in an Unknown or Pending state depending on whether or not the node was removed gracefully.

Recovering pods from these interim states means having to delete the PVC binding the pod to its local volume and then delete the pod in order for it to be rescheduled (or wait until the node and disk are available again). We took this into account when building our operator for M3DB, which makes changes to the cluster topology when a pod is rescheduled such that the new one gracefully streams data from the remaining two peers. Eventually we plan to automate the deletion and rescheduling process entirely.

Alerts on pod states can help call attention to stuck local volumes, and workload-specific controllers or operators can remediate them automatically. Because of these constraints, it’s best to exclude nodes with local volumes from automatic upgrades or repairs, and in fact some cloud providers explicitly mention this as a best practice.

Portability Between On-Prem and Cloud

Local Volumes played a big role in Uber’s decision to build orchestration for M3DB using Kubernetes, in part because it is a storage abstraction that works the same across on-prem and cloud environments. Remote storage solutions have different characteristics across cloud providers, and some users may prefer not to use networked storage at all in their own data centers. On the other hand, local disks are relatively ubiquitous and provide more predictable performance characteristics.

By orchestrating M3DB using local disks in the cloud, where it was easier to get up and running with Kubernetes, we gained confidence that we could still use our operator to run M3DB in our on-prem environment without any modifications. As we continue to work on how we’d run Kubernetes on-prem, having solved such an important pending question is a big relief.

What’s Next for Local Persistent Volumes?

As we’ve seen with Uber’s M3DB, local persistent volumes have successfully been used in production environments. As adoption of local persistent volumes continues to increase, SIG Storage continues to seek feedback for ways to improve the feature.

One of the most frequent asks has been for a controller that can help with recovery from failed nodes or disks, which is currently a manual process (or something that has to be built into an operator). SIG Storage is investigating creating a common controller that can be used by workloads with simple and similar recovery processes.

Another popular ask has been to support dynamic provisioning using lvm. This can simplify disk management, and improve disk utilization. SIG Storage is evaluating the performance tradeoffs for the viability of this feature.

Source

The first release of Kubernetes in 2019 brings a highly anticipated feature – production-level support for Windows workloads. Up until now Windows node support in Kubernetes has been in beta, allowing many users to experiment and see the value of Kubernetes for Windows containers. While in beta, developers in the Kubernetes community and Windows Server team worked together to improve the container runtime, build a continuous testing process, and complete features needed for a good user experience. Kubernetes now officially supports adding Windows nodes as worker nodes and scheduling Windows containers, enabling a vast ecosystem of Windows applications to leverage the power of our platform.

As Windows developers and devops engineers have been adopting containers over the last few years, they’ve been looking for a way to manage all their workloads with a common interface. Kubernetes has taken the lead for container orchestration, and this gives users a consistent way to manage their container workloads whether they need to run on Linux or Windows.

The journey to a stable release of Windows in Kubernetes was not a walk in the park. The community has been working on Windows support for 3 years, delivering an alpha release with v1.5, a beta with v1.9, and now a stable release with v1.14. We would not be here today without rallying broad support and getting significant contributions from companies including Microsoft, Docker, VMware, Pivotal, Cloudbase Solutions, Google and Apprenda. During this journey, there were 3 critical points in time that significantly advanced our progress.

Advancements in Windows Server container networking that provided the infrastructure to create CNI (Container Network Interface) plugins
Enhancements shipped in Windows Server semi-annual channel releases enabled Kubernetes development to move forward – culminating with Windows Server 2019 on the Long-Term Servicing Channel. This is the best release of Windows Server for running containers.
The adoption of the KEP (Kubernetes Enhancement Proposals) process. The Windows KEP outlined a clear and agreed upon set of goals, expectations, and deliverables based on review and feedback from stakeholders across multiple SIGs. This created a clear plan that SIG-Windows could follow, paving the path towards this stable release.

With v1.14, we’re declaring that Windows node support is stable, well-tested, and ready for adoption in production scenarios. This is a huge milestone for many reasons. For Kubernetes, it strengthens its position in the industry, enabling a vast ecosystem of Windows-based applications to be deployed on the platform. For Windows operators and developers, this means they can use the same tools and processes to manage their Windows and Linux workloads, taking full advantage of the efficiencies of the cloud-native ecosystem powered by Kubernetes. Let’s dig in a little bit into these.

Operator Advantages

Gain operational efficiencies by leveraging existing investments in solutions, tools, and technologies to manage Windows containers the same way as Linux containers
Knowledge, training and expertise on container orchestration transfers to Windows container support
IT can deliver a scalable self-service container platform to Linux and Windows developers

Developer Advantages

Containers simplify packaging and deploying applications during development and test. Now you also get to take advantage of Kubernetes’ benefits in creating reliable, secure, and scalable distributed applications.
Windows developers can now take advantage of the growing ecosystem of cloud and container-native tools to build and deploy faster, resulting in a faster time to market for their applications
Taking advantage of Kubernetes as the leader in container orchestration, developers only need to learn how to use Kubernetes and that skillset will transfer across development environments and across clouds

CIO Advantages

Leverage the operational and cost efficiencies that are introduced with Kubernetes
Containerize existing.NET applications or Windows-based workloads to eliminate old hardware or underutilized virtual machines, and streamline migration from end-of-support OS versions. You retain the benefit your application brings to the business, but decrease the cost of keeping it running

“Using Kubernetes on Windows allows us to run our internal web applications as microservices. This provides quick scaling in response to load, smoother upgrades, and allows for different development groups to build without worry of other group’s version dependencies. We save money because development times are shorter and operation’s time is not spent maintaining multiple virtual machine environments,” said Jeremy, a lead devops engineer working for a top multinational legal firm, one of the early adopters of Windows on Kubernetes.

There are many features that are surfaced with this release. We want to turn your attention to a few key features and enablers of Windows support in Kubernetes. For a detailed list of supported functionality, you can read our documentation.

You can now add Windows Server 2019 worker nodes
You can now schedule Windows containers utilizing deployments, pods, services, and workload controllers
Out of tree CNI plugins are provided for Azure, OVN-Kubernetes, and Flannel
Containers can utilize a variety of in and out-of-tree storage plugins
Improved support for metrics/quotas closely matches the capabilities offered for Linux containers

When looking at Windows support in Kubernetes, many start drawing comparisons to Linux containers. Although some of the comparisons that highlight limitations are fair, it is important to distinguish between operational limitations and differences between the Windows and Linux operating systems. From a container management standpoint, we must strike a balance between preserving OS-specific behaviors required for application compatibility, and reaching operational consistency in Kubernetes across multiple operating systems. For example, some Linux-specific file system features, user IDs and permissions exposed through Kubernetes will not work on Windows today, and users are familiar with these fundamental differences. We will also be adding support for Windows-specific configurations to meet the needs of Windows customers that may not exist on Linux. The alpha support for Windows Group Managed Service Accounts is one example. Other areas such as memory reservations for Windows pods and the Windows kubelet are a work in progress and highlight an operational limitation. We will continue working on operational limitations based on what’s important to our community in future releases.

Today, Kubernetes master components will continue to run on Linux. That way users can add Windows nodes without having to create a separate Kubernetes cluster. As always, our future direction is set by the community, so more components, features and deployment methods will come over time. Users should understand the differences between Windows and Linux and utilize the advantages of each platform. Our goal with this release is not to make Windows interchangeable with Linux or to answer the question of Windows vs Linux. We offer consistency in management. Managing workloads without automation is tedious and expensive. Rewriting or re-architecting workloads is even more expensive. Containers provide a clear path forward whether your app runs on Linux or Windows, and Kubernetes brings an IT organization operational consistency.

As a community, our work is not complete. As already mentioned , we still have a fair bit of limitations and a healthy roadmap. We will continue making progress and enhancing Windows container support in Kubernetes, with some notable upcoming features including:

Support for CRI-ContainerD and Hyper-V isolation, bringing hypervisor-level isolation between pods for additional security and extending our container-to-node compatibility matrix
Additional network plugins, including the stable release of Flannel overlay support
Simple heterogeneous cluster creation using kubeadm on Windows

We welcome you to get involved and join our community to share feedback and deployment stories, and contribute to code, docs, and improvements of any kind.

Read our getting started and contributor guides, which include links to the community meetings and past recordings, at https://github.com/kubernetes/community/tree/master/sig-windows
Explore our documentation at https://kubernetes.io/docs/setup/windows
Join us on Slack or the Kubernetes Community Forums to chat about Windows containers on Kubernetes.

Thank you and feel free to reach us individually if you have any questions.

Michael Michael
SIG-Windows Chair
Director of Product Management, VMware
@michmike77 on Twitter
@m2 on Slack

Patrick Lang
SIG-Windows Chair
Senior Software Engineer, Microsoft
@PatrickLang on Slack