{"id":1530,"date":"2019-03-30T12:38:57","date_gmt":"2019-03-30T12:38:57","guid":{"rendered":"https:\/\/www.appservgrid.com\/paw93\/?p=1530"},"modified":"2019-04-06T01:43:18","modified_gmt":"2019-04-06T01:43:18","slug":"kube-proxy-subtleties-debugging-an-intermittent-connection-reset","status":"publish","type":"post","link":"https:\/\/www.appservgrid.com\/paw93\/index.php\/2019\/03\/30\/kube-proxy-subtleties-debugging-an-intermittent-connection-reset\/","title":{"rendered":"kube-proxy Subtleties: Debugging an Intermittent Connection Reset"},"content":{"rendered":"<p>I recently came across a bug that causes intermittent connection resets. After some digging, I found it was caused by a subtle combination of several different network subsystems. It helped me understand Kubernetes networking better, and I think it\u2019s worthwhile to share with a wider audience who are interested in the same topic.<\/p>\n<h2 id=\"the-symptom\"><strong>The symptom<\/strong><\/h2>\n<p>We received a user report claiming they were getting connection resets while using a Kubernetes service of type ClusterIP to serve large files to pods running in the same cluster. Initial debugging of the cluster did not yield anything interesting: network connectivity was fine and downloading the files did not hit any issues. However, when we ran the workload in parallel across many clients, we were able to reproduce the problem. Adding to the mystery was the fact that the problem could not be reproduced when the workload was run using VMs without Kubernetes. 
The problem, which could easily be reproduced by\u00a0<a href=\"https:\/\/github.com\/tcarmet\/k8s-connection-reset\" target=\"_blank\" rel=\"noopener\">a simple app<\/a>, clearly has something to do with Kubernetes networking, but what?<\/p>\n<h2 id=\"kubernetes-networking-basics\"><strong>Kubernetes networking basics<\/strong><\/h2>\n<p>Before digging into this problem, let\u2019s talk a little bit about some basics of Kubernetes networking, as Kubernetes handles traffic from a pod very differently depending on the destination.<\/p>\n<h3 id=\"pod-to-pod\"><strong>Pod-to-Pod<\/strong><\/h3>\n<p>In Kubernetes, every pod has its own IP address. The benefit is that the applications running inside pods can use their canonical ports, instead of being remapped to different random ports. Pods have L3 connectivity to each other: they can ping each other, and send TCP or UDP packets to each other.\u00a0<a href=\"https:\/\/github.com\/containernetworking\/cni\" target=\"_blank\" rel=\"noopener\">CNI<\/a>\u00a0is the standard that solves this problem for containers running on <strong>different hosts<\/strong>, and there are tons of different plugins that support CNI.<\/p>\n<h3 id=\"pod-to-external\"><strong>Pod-to-external<\/strong><\/h3>\n<p>For traffic that goes from a pod to an external address, Kubernetes simply uses\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Network_address_translation\" target=\"_blank\" rel=\"noopener\">SNAT<\/a>: it replaces the pod\u2019s internal source IP:port with the host\u2019s IP:port. When the return packet comes back to the host, it rewrites the pod\u2019s IP:port as the destination and sends it back to the original pod. The whole process is transparent to the original pod, which is not aware of the address translation at all.<\/p>\n<h3 id=\"pod-to-service\"><strong>Pod-to-Service<\/strong><\/h3>\n<p>Pods are mortal, but people most likely want a reliable service; a pod IP that can disappear at any time is pretty much useless on its own. 
So Kubernetes has a concept called \u201cservice\u201d, which is simply an L4 load balancer in front of pods. There are several different types of services; the most basic one is called ClusterIP. This type of service has a unique VIP address that is only routable inside the cluster.<\/p>\n<p>The component in Kubernetes that implements this feature is called kube-proxy. It sits on every node and programs complicated iptables rules to do all kinds of filtering and NAT between pods and services. If you go to a Kubernetes node and type\u00a0<code>iptables-save<\/code>, you\u2019ll see the rules inserted by Kubernetes or other programs. The most important chains are\u00a0<code>KUBE-SERVICES<\/code>,\u00a0<code>KUBE-SVC-*<\/code>\u00a0and\u00a0<code>KUBE-SEP-*<\/code>.<\/p>\n<ul>\n<li><code>KUBE-SERVICES<\/code>\u00a0is the entry point for service packets. It matches the destination IP:port and dispatches the packet to the corresponding\u00a0<code>KUBE-SVC-*<\/code>\u00a0chain.<\/li>\n<li>A\u00a0<code>KUBE-SVC-*<\/code>\u00a0chain acts as a load balancer, distributing packets across its\u00a0<code>KUBE-SEP-*<\/code>\u00a0chains equally. Every\u00a0<code>KUBE-SVC-*<\/code>\u00a0chain has the same number of\u00a0<code>KUBE-SEP-*<\/code>\u00a0chains as there are endpoints behind it.<\/li>\n<li>A\u00a0<code>KUBE-SEP-*<\/code>\u00a0chain represents a Service EndPoint. It simply does DNAT, replacing the service IP:port with the pod endpoint\u2019s IP:port.<\/li>\n<\/ul>\n<p>For DNAT, conntrack kicks in and tracks the connection state using a state machine. The state is needed because conntrack has to remember the destination address it changed the packet to, and change it back when the return packet comes in. Iptables can also rely on the conntrack state (ctstate) to decide the fate of a packet. 
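The equal split performed by a KUBE-SVC-* chain relies on the iptables statistic module: rule i in the chain matches with probability 1/(n-i), and the last rule matches unconditionally. The following Python sketch models that math (a toy model with made-up function names, not kube-proxy's actual code) to show why every endpoint ends up equally likely:

```python
# Toy model of how a KUBE-SVC-* chain splits traffic equally using the
# iptables "statistic" module. Rule i is evaluated in order and matches
# with probability 1/(n - i); the last rule always matches.
# (Illustrative only; function names are made up, not kube-proxy code.)
from fractions import Fraction

def kube_svc_probabilities(n_endpoints):
    """Overall probability that each KUBE-SEP-* chain receives a packet."""
    reach = Fraction(1)  # probability that evaluation reaches rule i
    overall = []
    for i in range(n_endpoints):
        p_rule = Fraction(1, n_endpoints - i)  # --probability 1/(n-i)
        overall.append(reach * p_rule)         # packet jumps to KUBE-SEP-i
        reach *= 1 - p_rule                    # otherwise fall through
    return overall

print(kube_svc_probabilities(3))  # three endpoints, each gets 1/3
```

With three endpoints the per-rule probabilities are 1/3, then 1/2, then 1, which multiplies out to exactly 1/3 per endpoint.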
These four conntrack states are especially important:<\/p>\n<ul>\n<li><em>NEW<\/em>: conntrack knows nothing about this packet; this happens when a SYN packet is received.<\/li>\n<li><em>ESTABLISHED<\/em>: conntrack knows the packet belongs to an established connection; this happens after the handshake completes.<\/li>\n<li><em>RELATED<\/em>: the packet doesn\u2019t belong to any connection, but it is affiliated with another connection; this is especially useful for protocols like FTP.<\/li>\n<li><em>INVALID<\/em>: something is wrong with the packet, and conntrack doesn\u2019t know how to deal with it. This state plays a central role in this Kubernetes issue.<\/li>\n<\/ul>\n<p>Here is a diagram of how a TCP connection works between a pod and a service. The sequence of events is:<\/p>\n<ul>\n<li>The client pod on the left-hand side sends a packet to a service: 192.168.0.2:80<\/li>\n<li>The packet goes through the iptables rules on the client node, and the destination is changed to the pod IP, 10.0.1.2:80<\/li>\n<li>The server pod handles the packet and sends back a packet with destination 10.0.0.2<\/li>\n<li>The packet goes back to the client node; conntrack recognizes the packet and rewrites the source address back to 192.168.0.2:80<\/li>\n<li>The client pod receives the response packet<\/li>\n<\/ul>\n<figure><img decoding=\"async\" src=\"https:\/\/d33wubrfki0l68.cloudfront.net\/3daa7bdcf874a16d1541edc8be8c298da316f0a1\/d75c0\/images\/blog\/2019-03-26-kube-proxy-subtleties-debugging-an-intermittent-connection-resets\/good-packet-flow.png\" alt=\"Good packet flow\" width=\"100%\" \/><figcaption>Good packet flow<\/figcaption><\/figure>\n<h2 id=\"what-caused-the-connection-reset\"><strong>What caused the connection reset?<\/strong><\/h2>\n<p>Enough of the background; what really went wrong and caused the unexpected connection reset?<\/p>\n<p>As the diagram below shows, the problem is packet 3. 
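The healthy round trip in the numbered steps above can be sketched as a toy simulation, with the addresses taken from the diagram. This is illustrative Python only, not real kernel code; the conntrack table here is just a dict, and the client port is an arbitrary made-up value:

```python
# Toy simulation of the healthy round trip in the numbered steps above.
# Addresses come from the diagram; everything else is made up for clarity.
SERVICE = ("192.168.0.2", 80)   # ClusterIP VIP
ENDPOINT = ("10.0.1.2", 80)     # server pod picked by a KUBE-SEP-* chain
CLIENT = ("10.0.0.2", 34567)    # client pod IP and an arbitrary source port

conntrack = {}  # (client, original_dst) -> translated_dst

def dnat_outbound(src, dst):
    """Packets 1 -> 2: rewrite the service VIP to the endpoint, remember it."""
    conntrack[(src, dst)] = ENDPOINT
    return src, ENDPOINT

def un_dnat_inbound(src, dst):
    """Packets 3 -> 4: rewrite the endpoint source back to the service VIP."""
    for (client, original_dst), translated in conntrack.items():
        if client == dst and translated == src:
            return original_dst, dst   # client sees the VIP again
    return src, dst                    # no entry: packet left untouched

dnat_outbound(CLIENT, SERVICE)             # packet 1 becomes packet 2
reply = un_dnat_inbound(ENDPOINT, CLIENT)  # packet 3 becomes packet 4
print(reply[0])  # ('192.168.0.2', 80): the source the client pod expects
```

The key detail is the final lookup: the reverse translation only happens if conntrack still has an entry for the connection, which is exactly what breaks in the failure case below.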
Conntrack cannot recognize this returning packet and marks it as\u00a0<em>INVALID<\/em>. The most common reasons are that conntrack cannot keep track of the connection because it has run out of capacity, or that the packet itself falls outside the TCP window. For packets that conntrack has marked as\u00a0<em>INVALID<\/em>, we don\u2019t have an iptables rule to drop them, so they are forwarded to the client pod with the source IP address not rewritten (as shown in packet 4)! The client pod doesn\u2019t recognize this packet because it has a different source IP: the pod IP, not the service IP. As a result, the client pod says, \u201cWait a second, I don\u2019t recall this connection to this IP ever existing, why does this dude keep sending packets to me?\u201d Basically, what the client does is simply send a RST packet to the server pod IP, which is packet 5. Unfortunately, this is a totally legit pod-to-pod packet, and it gets delivered to the server pod. The server pod doesn\u2019t know about all the address translations that happened on the client side. From its point of view, packet 5 is as legit as packets 2 and 3. All the server pod knows is, \u201cWell, the client pod doesn\u2019t want to talk to me, so let\u2019s close the connection!\u201d Boom! Of course, for all of this to happen, the RST packet has to be legit too, with the right TCP sequence number, etc. But when it happens, both parties agree to close the connection.<\/p>\n<figure><img decoding=\"async\" src=\"https:\/\/d33wubrfki0l68.cloudfront.net\/caeba7f9b1115eda6863972df691e68364256312\/c41c0\/images\/blog\/2019-03-26-kube-proxy-subtleties-debugging-an-intermittent-connection-resets\/connection-reset-packet-flow.png\" alt=\"Connection reset packet flow\" width=\"100%\" \/><figcaption>Connection reset packet flow<\/figcaption><\/figure>\n<h2 id=\"how-to-address-it\"><strong>How to address it?<\/strong><\/h2>\n<p>Once we understand the root cause, the fix is not hard. 
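The failure just described can be condensed into a toy model (illustrative Python with made-up names, not real kernel or kube-proxy code): when conntrack has no entry for the reply, the source stays as the raw pod IP, and the client answers with a RST:

```python
# Toy model of the reset: an INVALID reply is forwarded without rewriting
# its source, and the client only accepts replies from the address it
# originally connected to. (Illustrative only; names are made up.)
SERVICE_VIP = ("192.168.0.2", 80)
SERVER_POD = ("10.0.1.2", 80)

def reply_source_seen_by_client(conntrack_entry):
    """ESTABLISHED replies are rewritten to the VIP; INVALID ones are not."""
    if conntrack_entry is not None:
        return conntrack_entry          # source rewritten back to the VIP
    return SERVER_POD                   # packet 4: raw pod IP leaks through

def client_reaction(reply_src):
    """A reply from an unknown source draws a RST (packet 5)."""
    return "ACK" if reply_src == SERVICE_VIP else "RST"

print(client_reaction(reply_source_seen_by_client(SERVICE_VIP)))  # ACK
print(client_reaction(reply_source_seen_by_client(None)))         # RST
```

Both of the fixes below attack the first branch point in this model: either conntrack keeps the entry (and the rewrite happens), or the unrewritten packet is dropped before the client ever sees it.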
There are at least two ways to address it:<\/p>\n<ul>\n<li>Make conntrack more liberal about such packets, and don\u2019t mark them as\u00a0<em>INVALID<\/em>. On Linux, you can do this with\u00a0<code>echo 1 &gt; \/proc\/sys\/net\/ipv4\/netfilter\/ip_conntrack_tcp_be_liberal<\/code>.<\/li>\n<li>Specifically add an iptables rule to drop packets that are marked as\u00a0<em>INVALID<\/em>, so they never reach the client pod and cause harm.<\/li>\n<\/ul>\n<p>The fix is drafted (<a href=\"https:\/\/github.com\/kubernetes\/kubernetes\/pull\/74840\" target=\"_blank\" rel=\"noopener\">https:\/\/github.com\/kubernetes\/kubernetes\/pull\/74840<\/a>), but unfortunately it didn\u2019t make the v1.14 release window. However, users who are affected by this bug can mitigate the problem by applying the following workaround in their cluster.<\/p>\n<div class=\"highlight\">\n<pre><code class=\"language-yaml\" data-lang=\"yaml\">apiVersion: extensions\/v1beta1\r\nkind: DaemonSet\r\nmetadata:\r\n  name: startup-script\r\n  labels:\r\n    app: startup-script\r\nspec:\r\n  template:\r\n    metadata:\r\n      labels:\r\n        app: startup-script\r\n    spec:\r\n      hostPID: true\r\n      containers:\r\n      - name: startup-script\r\n        image: gcr.io\/google-containers\/startup-script:v1\r\n        imagePullPolicy: IfNotPresent\r\n        securityContext:\r\n          privileged: true\r\n        env:\r\n        - name: STARTUP_SCRIPT\r\n          value: |\r\n            #! \/bin\/bash\r\n            echo 1 &gt; \/proc\/sys\/net\/ipv4\/netfilter\/ip_conntrack_tcp_be_liberal\r\n            echo done<\/code><\/pre>\n<\/div>\n<h2 id=\"summary\"><strong>Summary<\/strong><\/h2>\n<p>Obviously, the bug has existed almost forever; I am surprised that it wasn\u2019t noticed until recently. 
I believe the reasons could be: (1) this happens more on a congested server serving large payloads, which might not be a common use case; (2) the application layer handles retries, making it tolerant of this kind of reset. Regardless of how fast Kubernetes has been growing, it is still a young project. There is no secret other than listening closely to customers\u2019 feedback and not taking anything for granted but digging deep; that is how we can make it the best platform to run applications.<\/p>\n<p><a href=\"https:\/\/kubernetes.io\/blog\/2019\/03\/29\/kube-proxy-subtleties-debugging-an-intermittent-connection-reset\/\" target=\"_blank\" rel=\"noopener\">Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>I recently came across a bug that causes intermittent connection resets. After some digging, I found it was caused by a subtle combination of several different network subsystems. It helped me understand Kubernetes networking better, and I think it\u2019s worthwhile to share with a wider audience who are interested in the same topic. 
The symptom &hellip; <\/p>\n<p class=\"link-more\"><a href=\"https:\/\/www.appservgrid.com\/paw93\/index.php\/2019\/03\/30\/kube-proxy-subtleties-debugging-an-intermittent-connection-reset\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;kube-proxy Subtleties: Debugging an Intermittent Connection Reset&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[3],"tags":[],"class_list":["post-1530","post","type-post","status-publish","format-standard","hentry","category-kubernetes"],"_links":{"self":[{"href":"https:\/\/www.appservgrid.com\/paw93\/index.php\/wp-json\/wp\/v2\/posts\/1530","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.appservgrid.com\/paw93\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.appservgrid.com\/paw93\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.appservgrid.com\/paw93\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.appservgrid.com\/paw93\/index.php\/wp-json\/wp\/v2\/comments?post=1530"}],"version-history":[{"count":2,"href":"https:\/\/www.appservgrid.com\/paw93\/index.php\/wp-json\/wp\/v2\/posts\/1530\/revisions"}],"predecessor-version":[{"id":1600,"href":"https:\/\/www.appservgrid.com\/paw93\/index.php\/wp-json\/wp\/v2\/posts\/1530\/revisions\/1600"}],"wp:attachment":[{"href":"https:\/\/www.appservgrid.com\/paw93\/index.php\/wp-json\/wp\/v2\/media?parent=1530"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.appservgrid.com\/paw93\/index.php\/wp-json\/wp\/v2\/categories?post=1530"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.appservgrid.com\/paw93\/index.php\/wp-json\/wp\/v2\/tags?post=1530"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}