Hidden Gems – Jetstack Blog

27/Mar 2018

By Matthew Bates

Coming up to four years since its initial launch, Kubernetes is now at version 1.10. Congratulations to the many contributors and the release team on another excellent release!

At Jetstack, we push Kubernetes to its limits, whether engaging with customers on their own K8s projects, training K8s users of all levels, or contributing our open source developments to the K8s community. We follow the project day-to-day, and track its development closely.

You can read all about the headline features of 1.10 at the official blog post. But, in keeping with our series of release gem posts, we asked our team of engineers to share a feature of 1.10 that they find particularly exciting, and that they’ve been watching and waiting for (or have even been involved in!)

Device Plugins

Matt Turner

The Device Plugin system is now beta in Kubernetes 1.10. This essentially allows Nodes to be sized along extra, arbitrary dimensions. These represent any special hardware they might have over and above CPU and RAM capacity. For example, a Node might specify that it has 3 GPUs and a high-performance NIC. A Pod could then request one of those GPUs through the standard resources stanza, causing it to be scheduled on a node with a free one. A system of plugins and APIs handles advertising and initialising these resources before they are handed over to Pods.

NVIDIA has already released a device plugin for managing its GPUs. A request for 2 GPUs would look like:

resources:
  limits:
    nvidia.com/gpu: 2
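
In context, a full Pod manifest using such a request might look like the following (a sketch; the Pod name and image are placeholders, not taken from the NVIDIA plugin documentation):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-worker
spec:
  containers:
  - name: cuda-container
    # placeholder image; any GPU-aware image would do
    image: nvidia/cuda:9.0-base
    resources:
      limits:
        # the scheduler will only place this Pod on a Node advertising
        # at least two free nvidia.com/gpu devices
        nvidia.com/gpu: 2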

CoreDNS

Charlie Egan

1.10 makes cluster DNS a pluggable component. This makes it easier to use other tools for service discovery. One such option is CoreDNS, a fellow CNCF project, which has a native ‘plugin’ that implements the Kubernetes service discovery spec. It also runs as a single process that supports caching and health checks (meaning there’s no need for the dnsmasq or healthz containers in the DNS pod).

The CoreDNS plugin was promoted to beta in 1.10 and will eventually become the Kubernetes default. Read more about using CoreDNS here.
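
For a flavour of the configuration, a Corefile for cluster DNS using the kubernetes plugin typically looks something like this (an illustrative sketch rather than a recommended setup):

.:53 {
    errors
    health
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        upstream
        fallthrough in-addr.arpa ip6.arpa
    }
    prometheus :9153
    proxy . /etc/resolv.conf
    cache 30
}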

PIDs per Pod limit

Luke Addison

A new alpha feature in 1.10 is the ability to control the total number of PIDs per Pod. The Linux kernel provides the process number controller, which can be attached to a cgroup hierarchy in order to stop any new tasks from being created after a specified limit is reached. This kernel feature is now exposed to cluster operators. It is vital for avoiding malicious or accidental fork bombs, which can devastate clusters.

In order to enable this feature, operators should set SupportPodPidsLimit=true in the kubelet's --feature-gates= parameter. The feature currently only allows operators to define a single maximum limit per Node by specifying the --pod-max-pids flag on the kubelet. This may be a limitation for some operators, as this static limit cannot work for all workloads and there may be legitimate use cases for exceeding it. For this reason, we may see the addition of new flags and fields in the future to make this limit more dynamic; one possible addition is the ability for operators to specify a low and a high PID limit, and allowing users to choose which one they want by setting a boolean field on the Pod spec.
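
As a minimal sketch, assuming a per-Pod limit of 1024 is appropriate for your workloads, the kubelet would be started with flags along these lines:

# enable the alpha feature gate and cap each Pod at 1024 process IDs
kubelet --feature-gates=SupportPodPidsLimit=true --pod-max-pids=1024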

It will be very exciting to see how this feature develops in subsequent releases as it provides another important isolation mechanism for workloads.

Shared Process Namespace

Louis Taylor

1.10 adds alpha support for shared process namespaces in a pod. To try it out, operators must enable it with the PodShareProcessNamespace feature flag set on both the apiserver and kubelet.

When enabled, users can set shareProcessNamespace on a pod spec:

apiVersion: v1
kind: Pod
metadata:
  name: shared-pid
spec:
  shareProcessNamespace: true

Sharing the PID namespace inside a pod has a few effects. Most prominently, processes inside containers are visible to all other containers in the pod, and signals can be sent to processes across container boundaries. This makes sidecar containers more powerful (for example, sending a SIGHUP signal to reload configuration for an application running in a separate container is now possible).
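
As a rough sketch of that sidecar pattern (the container names, images and process name below are placeholders, not from the original post), the pod might carry an app and a sidecar, and the sidecar can then signal the app directly:

apiVersion: v1
kind: Pod
metadata:
  name: shared-pid
spec:
  shareProcessNamespace: true
  containers:
  - name: app
    image: nginx
  - name: sidecar
    image: busybox
    command: ["sleep", "3600"]

With the pod running, a SIGHUP can be sent across the container boundary:

# pkill sees the nginx process because both containers share one PID namespace
kubectl exec shared-pid -c sidecar -- pkill -HUP nginx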

CRD Sub-resources

Josh Van Leeuwen

With 1.10 comes a new alpha feature that adds subresources to Custom Resources, namely /status and /scale. Just like other resource types, they provide separate API endpoints to modify their contents. This not only means that your resource can now interact with systems such as the HorizontalPodAutoscaler, but it also enables finer-grained access control over user-managed spec and controller-managed status data. This is a great feature for ensuring users are unable to change or destroy resource state that is needed by your custom controllers.

To enable both the /status and /scale subresources, include the following in your Custom Resource Definition:

subresources:
  status: {}
  scale:
    specReplicasPath: .spec.replicas
    statusReplicasPath: .status.replicas
    labelSelectorPath: .status.labelSelector
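
For context, here is a sketch of where that stanza sits in a full CustomResourceDefinition (the group and kind names are hypothetical):

apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: widgets.example.com
spec:
  group: example.com
  version: v1alpha1
  scope: Namespaced
  names:
    plural: widgets
    singular: widget
    kind: Widget
  subresources:
    status: {}
    scale:
      specReplicasPath: .spec.replicas
      statusReplicasPath: .status.replicas
      labelSelectorPath: .status.labelSelector

As the feature is alpha in 1.10, it also needs the CustomResourceSubresources feature gate enabled on the apiserver.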

External Custom Metrics

Matt Bates

The first version of HPA (v1) was only able to scale based on observed CPU utilisation. Although useful for some cases, CPU is not always the most suitable or relevant metric for autoscaling an application. HPA v2, introduced in 1.6, is able to scale based on custom metrics. Read more about the Resource Metrics API, the Custom Metrics API and HPA v2 in this blog post from our Kubernetes 1.8 Hidden Gems series.

Custom metrics can describe metrics from the pods that are targeted by the HPA, resources (e.g. CPU or memory), or objects (say, a Service or Ingress). But these options are not suited to metrics that relate to infrastructure outside of a cluster. In a recent customer engagement, there was a desire to scale pods based on Google Cloud Pub/Sub queue length, for example.

In 1.10, there is now an extension (in alpha) to the HPA v2 API to support external metrics. So, for example, we may have an HPA to serve the aforementioned Pub/Sub autoscaling requirement that looks like the following:

kind: HorizontalPodAutoscaler
apiVersion: autoscaling/v2beta1
spec:
  scaleTargetRef:
    kind: ReplicationController
    name: Worker
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: External
    external:
      metricName: queue_messages_ready
      metricSelector:
        matchLabels:
          queue: worker_tasks
      targetAverageValue: 30

This HPA would require an add-on API server, registered as an APIService, which implements the External Metrics API and queries Pub/Sub for the metric.
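
For illustration, the registration of such an adapter might look roughly like this (the backing service name and namespace are placeholders):

apiVersion: apiregistration.k8s.io/v1beta1
kind: APIService
metadata:
  name: v1beta1.external.metrics.k8s.io
spec:
  group: external.metrics.k8s.io
  version: v1beta1
  service:
    # the Service in front of the metrics adapter Deployment
    name: pubsub-metrics-adapter
    namespace: custom-metrics
  # for illustration only; provide a caBundle in production
  insecureSkipTLSVerify: true
  groupPriorityMinimum: 100
  versionPriority: 100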

Custom kubectl ‘get’ and ‘describe’ output

James Munnelly

Kubernetes 1.10 brings a small but important change to the way the output for kubectl get and kubectl describe is generated.

In the past, third party extensions to Kubernetes like Cert-Manager and Navigator would always display something like this:

$ kubectl get certificates
NAME       AGE
prod-tls   4h

With this change, however, we can configure our extensions to display more helpful output when querying our custom resource types. For example:

$ kubectl get certificates
NAME       STATUS   EXPIRY       ISSUER
prod-tls   Valid    2018-05-03   letsencrypt-prod

$ kubectl get elasticsearchclusters
NAME      HEALTH   LEADERS   DATA   INGEST
logging   Green    3/3       4/4    2/2

This brings a native feel to API extensions, and provides users an easy way to quickly identify meaningful data points about their resources at a glance.

Volume Scheduling and Local Storage

Richard Wall

We’re excited to see that local storage is promoted to a beta API group and volume scheduling is enabled by default in 1.10.

There are a couple of related API changes:

  1. PV has a new PersistentVolume.Spec.NodeAffinity field, whose value is a node selector that, for a local volume, matches the hostname label of the Node the disk is attached to.
  2. StorageClass has a new StorageClass.volumeBindingMode: WaitForFirstConsumer option, which makes Kubernetes delay the binding of the volume until it has considered and resolved all of the pod's scheduling constraints, including the constraints on the PVs that match the volume claim. Both fields are illustrated in the sketch after this list.
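
As a sketch (the disk path, node name and class name are placeholders), a local PV and its StorageClass might look like this:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-ssd
provisioner: kubernetes.io/no-provisioner
# delay binding until a consuming pod has been scheduled
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-ssd-node1-0
spec:
  capacity:
    storage: 100Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-ssd
  local:
    path: /mnt/disks/ssd0
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - node-1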

We’re already thinking about how we can use these features in a Navigator-managed Cassandra database on Kubernetes. With some tweaks to the Navigator code, it will now be much simpler to run C* nodes with their commit log and sstables on dedicated local SSDs. If you add a PV for each available SSD, and if the PV has the necessary NodeAffinity configuration, Kubernetes will factor the locations of those PVs into its scheduling decisions and ensure that C* pods are scheduled to nodes with an unused SSD. We’ll write more about this in an upcoming blog post!

PS: I’d recommend reading Provisioning Kubernetes Local Persistent Volumes, which describes a really elegant mechanism for automatically discovering and preparing local volumes using a DaemonSet and the experimental local PV provisioner.
