Docker containers provide an isolated sandbox for the containerized program to execute. One-shot containers accomplish a particular task and stop, while long-running containers run for an indefinite period until they are either stopped by the user or the root process inside the container crashes. It is necessary to handle a container's death gracefully and to make sure that the job running as a container is not impacted in an unexpected manner. When containers are run with Swarm orchestration, Swarm monitors the containers' health, exit status and the entire lifecycle, including upgrade and rollback. This will be a pretty long blog. I did not want to split it since it makes sense to look at this holistically. You can jump to specific sections by clicking on the links below if needed. In this blog, I will cover the following topics with examples:
- Handling Signals and exit codes
- Container restart policy
- Container health check
- Service restart with Swarm
- Service health check
- Service rolling upgrade and rollback
Handling Signals and exit codes
When we pass a signal to a container using the Docker CLI, Docker delivers the signal to the main process running inside the container (PID 1). This link has the list of all Linux signals. Docker-defined exit codes follow the chroot exit standard; other exit codes come from the program running inside the container. The container exit code can be seen in the container events coming from the Docker daemon when the container exits. For containers that have not been cleaned up, the exit code can also be found from "docker ps -a".
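Exit codes can also be watched live as containers die. A minimal sketch, assuming the exitCode attribute that recent Docker versions attach to "die" events:
docker events --filter event=die --format 'container {{.Actor.Attributes.name}} exited with code {{.Actor.Attributes.exitCode}}'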
Following is a sample "docker ps -a" output where an nginx container exited with exit code 0. Here, I used "docker stop" to stop the container.
$ docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
32d675260384 nginx "nginx -g 'daemon …" 18 seconds ago Exited (0) 7 seconds ago web
Following is a sample "docker ps -a" output where an nginx container exited with exit code 137. Here, I used "docker kill" to stop the container.
$ docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
9b5d8348cb89 nginx "nginx -g 'daemon …" 11 seconds ago Exited (137) 2 seconds ago web
Following is the list of standard and Docker defined exit codes:
0: Success
125: Docker run itself fails
126: Contained command cannot be invoked
127: Contained command cannot be found
128 + n: Fatal error signal n
130: (128+2) Container terminated by Control-C
137: (128+9) Container received a SIGKILL
143: (128+15) Container received a SIGTERM
255: Exit status out of range (for example, exit(-1))
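If the exited container is still around, the exit code can also be read programmatically with "docker inspect". A minimal sketch (the container name "exitdemo" is just an example):
docker run --name exitdemo alpine sh -c "exit 3"
docker inspect --format '{{.State.ExitCode}}' exitdemo
This prints 3, the code the containerized command exited with.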
Following is a simple Python program that handles signals. This program will be run as a Docker container to illustrate Docker signals and exit codes.
#!/usr/bin/python
import sys
import signal
import time

def signal_handler_int(sigid, frame):
    print "signal", sigid, ",", "Handling Ctrl+C/SIGINT!"
    sys.exit(signal.SIGINT)

def signal_handler_term(sigid, frame):
    print "signal", sigid, ",", "Handling SIGTERM!"
    sys.exit(signal.SIGTERM)

def signal_handler_usr(sigid, frame):
    print "signal", sigid, ",", "Handling SIGUSR1!"
    sys.exit(0)

def main():
    # Register signal handlers
    signal.signal(signal.SIGINT, signal_handler_int)
    signal.signal(signal.SIGTERM, signal_handler_term)
    signal.signal(signal.SIGUSR1, signal_handler_usr)
    while True:
        print "I am alive"
        sys.stdout.flush()
        time.sleep(1)

# This is the standard boilerplate that calls the main() function.
if __name__ == '__main__':
    main()
Following is the Dockerfile to convert this into a container:
FROM python:2.7
COPY ./signalexample.py ./signalexample.py
ENTRYPOINT ["python", "signalexample.py"]
Let's build the container:
docker build --no-cache -t smakam/signaltest:v1 .
Let's start the container:
docker run -d --name signaltest smakam/signaltest:v1
We can watch the logs from the container using "docker logs":
docker logs -f signaltest
The Python program above handles SIGINT, SIGTERM and SIGUSR1. We can pass these signals to the container using the Docker CLI.
Following command sends SIGINT to the container:
docker kill --signal=SIGINT signaltest
In the Docker logs, we can see the following to show that this signal is handled:
signal 2 , Handling Ctrl+C/SIGINT!
Following output shows the container exit status:
$ docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
c06266e79a43 smakam/signaltest:v1 "python signalexam…" 36 seconds ago Exited (2) 3 seconds ago signaltest
Following command sends SIGTERM to the container:
docker kill --signal=SIGTERM signaltest
In the Docker logs, we can see the following to show that this signal is handled:
signal 15 , Handling SIGTERM!
Following output shows the container exit status:
$ docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
0149708f42b2 smakam/signaltest:v1 "python signalexam…" 10 seconds ago Exited (15) 2 seconds ago signaltest
Following command sends SIGUSR1 to the container:
docker kill --signal=SIGUSR1 signaltest
In the Docker logs, we can see the following to show that this signal is handled:
signal 10 , Handling SIGUSR1!
Following output shows the container exit status:
$ docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
c92f7b4dd45b smakam/signaltest:v1 "python signalexam…" 12 seconds ago Exited (0) 2 seconds ago signaltest
When we execute "docker stop <container>", Docker first sends a SIGTERM signal to the container, waits for some time and then sends SIGKILL. This is done so that the program executing inside the container can use the SIGTERM signal to do a graceful shutdown.
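The wait time before SIGKILL defaults to 10 seconds and can be tuned per stop. A minimal sketch using the stop timeout option:
docker stop --time=30 signaltest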
Common mistake in Docker signal handling
In the above example, the Python program runs as PID 1 inside the container since we used the exec form of ENTRYPOINT in the Dockerfile. If we use the shell form of ENTRYPOINT, the shell process runs as PID 1 and the Python program runs as a child process. Following is a sample Dockerfile that starts the program using the shell form.
FROM python:2.7
COPY ./signalexample.py ./signalexample.py
ENTRYPOINT python signalexample.py
In this case, Docker delivers the signal to the shell process instead of the Python program, so the Python program never sees the signal sent to the container. If there are multiple processes running inside the container and we need to pass the signal along, one possible approach is to run the ENTRYPOINT as a script, handle the signal in the script and forward it to the correct process. One example using this approach is mentioned here; a wrapper along these lines is sketched below.
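Following is a minimal sketch of such a wrapper (the file name entrypoint.sh is just an example); it starts the Python program in the background, traps SIGTERM/SIGINT and forwards them to the child:
#!/bin/sh
# entrypoint.sh - start the real workload in the background and remember its PID
python signalexample.py &
child=$!
# forward SIGTERM and SIGINT to the child process
trap 'kill -TERM "$child"' TERM INT
# wait for the child so this script (PID 1) stays alive and picks up its exit status
wait "$child"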
Difference between “docker stop”, “docker rm” and “docker kill”
"docker stop" – Sends SIGTERM to the container, waits some time for the process to handle it and then sends SIGKILL. The container filesystem remains intact.
"docker kill" – Sends SIGKILL directly. The container filesystem remains intact.
"docker rm" – Removes the container filesystem. "docker rm -f" will send SIGKILL and then remove the container filesystem.
Using "docker run" with the "--rm" option automatically removes the container, including its filesystem, when the container exits.
When a container exits without its filesystem getting removed, we can still restart the container.
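For example, the exited "signaltest" container from earlier can simply be started again:
docker start signaltest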
Container restart policy
Container restart policy controls the restart actions when a container exits. Following are the supported restart options:
- no – This is the default. Containers do not get restarted when they exit.
- on-failure – Containers restart only when they exit with a failure code. Any exit code other than 0 is treated as a failure.
- unless-stopped – Containers restart as long as they were not manually stopped by the user.
- always – Always restart the container irrespective of the exit status.
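The restart policy of an already running container can also be changed without recreating it; a minimal sketch using "docker update":
docker update --restart=always signaltest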
Following is an example of starting “signaltest” container with restart policy of “on-failure” and retry count of 3. Retry count 3 is the number of restarts that will be done by Docker before giving up.
docker run -d --name=signaltest --restart=on-failure:3 smakam/signaltest:v1
To show the restart happening, we can manually send signals to the container. In the "signaltest" example, the signals SIGTERM, SIGINT and SIGKILL cause a non-zero exit code and SIGUSR1 causes a zero exit code. One thing to remember is that the restart policy does not kick in if we stop the container or send signals using "docker kill"; I think Docker has an explicit check to prevent restarts in these cases since the action is triggered by the user.
Let's send SIGINT to the container by signaling the process directly. We can find the process ID by running "ps -eaf | grep signalexample" on the host machine.
kill -s SIGINT <pid>
Let's check the "docker ps" output. We can see that the "created" time is 50 seconds ago, while uptime is less than a second because the container restarted.
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
b867543b110c smakam/signaltest:v1 "python signalexam…" 50 seconds ago Up Less than a second
Following command shows the restart policy and the restart count for the running container. In this example, container restart happened once.
$ docker inspect signaltest | grep -i -A 2 -B 2 restart
"Name": "/signaltest",
"RestartCount": 1,
"RestartPolicy": {
"Name": "on-failure",
"MaximumRetryCount": 3
To illustrate that restart does not happen on exit code 0, let's send SIGUSR1 to the container, which will cause exit code 0.
sudo kill -s SIGUSR1 <pid>
In this case, the container exits, but it does not get restarted.
Container restart also does not work with the "--rm" option, because "--rm" causes the container to be removed as soon as the container exits.
Container health check
It is possible that a container does not exit but is not performing as required. Health check probes can be used to identify such misbehaving containers and take action rather than waiting until the container dies. For a container like a webserver, the health check probe can be as simple as sending a curl request to the webserver port. Based on the container's health, we can restart the container if the health check fails.
To illustrate health check feature, I have used the container described here.
Following command starts the webserver container with health check capability enabled.
docker run -p 8080:8080 -d --rm --name health-check --health-interval=1s --health-timeout=3s --health-retries=3 --health-cmd "curl -f http://localhost:8080/health || exit 1" effectivetrainings/docker-health
Following are the main parameters related to health check: "--health-cmd" (command run inside the container to check health), "--health-interval" (time between checks), "--health-timeout" (maximum time a single check may take), "--health-retries" (number of consecutive failures needed to mark the container unhealthy) and "--health-start-period" (grace period for container startup).
Following “docker ps” output shows container health status:
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
947dad1c1412 effectivetrainings/docker-health "java -jar /app.jar" 28 seconds ago Up 26 seconds (healthy) 0.0.0.0:8080->8080/tcp health-check
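The detailed health state, including the output of the last few probes, can be queried directly from the container; a minimal sketch:
docker inspect --format '{{json .State.Health}}' health-check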
This container provides a backdoor to mark its health as unhealthy. Let's use that backdoor to mark the container as unhealthy:
curl “http://localhost:8080/environment/health?status=false”
Now, let's check the "docker ps" output. The container's health has become unhealthy.
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
947dad1c1412 effectivetrainings/docker-health "java -jar /app.jar" 3 minutes ago Up 3 minutes (unhealthy) 0.0.0.0:8080->8080/tcp health-check
Service restart with Swarm
Docker Swarm mode introduces a higher level of abstraction called a service, and containers are part of the service. When we create a service, we specify the number of containers that need to be part of the service using the "replicas" parameter. Swarm monitors the number of replicas, and if any container dies, it creates a new container to keep the replica count as requested by the user.
Below command can be used to create the "signaltest" service with 2 container replicas:
docker service create --name signaltest --replicas=2 smakam/signaltest:v1
Following command output shows the 2 containers that are part of “signaltest” service:
$ docker service ps signaltest
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR PORTS
vsgtopkkxi55 signaltest.1 smakam/signaltest:v1 ubuntu Running Running 36 seconds ago
dbbm05w91wv7 signaltest.2 smakam/signaltest:v1 ubuntu Running Running 36 seconds ago
Following parameters control the container restart policy in a service: "--restart-condition" (none, on-failure or any; the default is any), "--restart-delay" (delay between restart attempts), "--restart-max-attempts" (maximum number of restarts before giving up) and "--restart-window" (window used to evaluate the restart policy).
Let's start the "signaltest" service with a restart-condition of "on-failure":
docker service create --name signaltest --replicas=2 --restart-condition=on-failure --restart-delay=3s smakam/signaltest:v1
Remember that sending SIGTERM, SIGINT or SIGKILL causes a non-zero container exit code, while sending SIGUSR1 causes a zero exit code.
Let's first send SIGTERM to one of the two containers:
docker kill --signal=SIGTERM <container id>
Following is the "signaltest" service output that shows 3 tasks, including the one that exited with non-zero status and the new task that replaced it:
$ docker service ps signaltest
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR PORTS
35ndmu3jbpdb signaltest.1 smakam/signaltest:v1 ubuntu Running Running 4 seconds ago
ullnsqio5151 _ signaltest.1 smakam/signaltest:v1 ubuntu Shutdown Failed 11 seconds ago “task: non-zero exit (15)”
2rfwgq0388mt signaltest.2 smakam/signaltest:v1 ubuntu Running Running 49 seconds ago
Following command sends SIGUSR1 to one of the containers, which causes the container to exit with status 0.
docker kill --signal=SIGUSR1 <container id>
Following outputs show that the container did not restart since the exit code was 0; the service replica count has dropped to 1/2.
$ docker service ps signaltest
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR PORTS
35ndmu3jbpdb signaltest.1 smakam/signaltest:v1 ubuntu Running Running 52 seconds ago
ullnsqio5151 _ signaltest.1 smakam/signaltest:v1 ubuntu Shutdown Failed 59 seconds ago “task: non-zero exit (15)”
2rfwgq0388mt signaltest.2 smakam/signaltest:v1 ubuntu Shutdown Complete 3 seconds ago
$ docker service ls
ID NAME MODE REPLICAS IMAGE PORTS
xs8lzbqlr69n signaltest replicated 1/2 smakam/signaltest:v1
I don’t see a real need to change the default Swarm service restart policy from “any”.
Service health check
In the previous sections, we saw how to use the container health check with the "effectivetrainings/docker-health" container. Even though we could detect that the container was unhealthy, we could not restart it automatically. For standalone containers, Docker does not have native integration to restart a container on health check failure, though we can achieve the same using Docker events and a script (a sketch follows below). Health check is better integrated with Swarm: when a container in a service becomes unhealthy, Swarm automatically shuts down the unhealthy container and starts a new one to maintain the replica count specified for the service.
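A minimal sketch of such a script for standalone containers (the filter value is an assumption based on the "health_status: unhealthy" events emitted by the daemon; adjust it for your Docker version):
docker events --filter 'event=health_status: unhealthy' --format '{{.Actor.Attributes.name}}' |
while read name; do
  docker restart "$name"
done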
The "docker service" command provides the same health check options as "docker run": "--health-cmd", "--health-interval", "--health-timeout", "--health-retries", "--health-start-period" and "--no-healthcheck" to disable a health check defined in the image.
Let's create a "swarmhealth" service with 2 replicas of "docker-health" containers.
docker service create --name swarmhealth --replicas 2 -p 8080:8080 --health-interval=2s --health-timeout=10s --health-retries=10 --health-cmd "curl -f http://localhost:8080/health || exit 1" effectivetrainings/docker-health
Following outputs show the "swarmhealth" service tasks and the 2 healthy containers:
$ docker service ps swarmhealth
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR PORTS
jg8d78inw97n swarmhealth.1 effectivetrainings/docker-health:latest ubuntu Running Running 21 seconds ago
l3fdz5awv4u0 swarmhealth.2 effectivetrainings/docker-health:latest ubuntu Running Running 19 seconds ago
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
d9b1f1b0a9b0 effectivetrainings/docker-health:latest "java -jar /app.jar" About a minute ago Up About a minute (healthy) swarmhealth.1.jg8d78inw97nmmbdtjzrscg1q
bb15bfc6e588 effectivetrainings/docker-health:latest "java -jar /app.jar" About a minute ago Up About a minute (healthy) swarmhealth.2.l3fdz5awv4u045g2xiyrbpe2u
Let's mark one of the containers unhealthy using the backdoor command (substitute the IP of the node where the container runs):
curl "http://<node ip>:8080/environment/health?status=false"
Following output shows the unhealthy container that has been shut down and the 2 running replicas; one replica was started fresh to replace the container that became unhealthy.
$ docker service ps swarmhealth
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR PORTS
ixxvzyuyqmcq swarmhealth.1 effectivetrainings/docker-health:latest ubuntu Running Running 4 seconds ago
jg8d78inw97n _ swarmhealth.1 effectivetrainings/docker-health:latest ubuntu Shutdown Failed 23 seconds ago “task: non-zero exit (143): do…”
l3fdz5awv4u0 swarmhealth.2 effectivetrainings/docker-health:latest ubuntu Running Running 5 minutes ago
Service upgrade and rollback
When a new version of a service needs to be rolled out without taking service downtime, Docker provides many controls for upgrade and rollback. For example, we can control the number of tasks to upgrade at a time, the action on upgrade failure, the delay between task upgrades, etc. This helps us achieve release patterns like blue-green and canary deployments.
Following options are provided by Docker in the "docker service" commands to control rolling upgrade and rollback.
Rolling upgrade: "--update-parallelism" (number of tasks updated at a time), "--update-delay" (delay between task updates), "--update-failure-action" (pause, continue or rollback on failure), "--update-monitor" (duration to monitor each updated task for failure) and "--update-order" (start-first or stop-first).
Rollback: the "--rollback" flag of "docker service update" reverts to the previous service spec; newer Docker releases also provide "--rollback-parallelism", "--rollback-delay", "--rollback-failure-action" and "--rollback-monitor" to control how the rollback is rolled out.
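As a hedged example of how these knobs fit together (the service and image names are placeholders), a service can be updated one task at a time and rolled back automatically if the new tasks fail:
docker service update --update-parallelism=1 --update-delay=5s --update-failure-action=rollback --image=<new image> <service>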
To illustrate service upgrade, I have a simple Python webserver program running as a container.
Following is the Python program:
#!/usr/bin/python
import sys
from BaseHTTPServer import BaseHTTPRequestHandler, HTTPServer
import urlparse
import json

class GetHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        message = "You are using version 1\n"
        self.send_response(200)
        self.end_headers()
        self.wfile.write(message)
        return

def main():
    server = HTTPServer(('', 8000), GetHandler)
    print 'Starting server at http://localhost:8000'
    server.serve_forever()

# This is the standard boilerplate that calls the main() function.
if __name__ == '__main__':
    main()
This is the Dockerfile to create the container:
FROM python:2.7
COPY ./webserver.py ./webserver.py
ENTRYPOINT ["python", "webserver.py"]
I have 2 versions of Container, smakam/webserver:v1 and smakam/webserver:v2. The only difference is the message output that either shows “You are using version 1” or “You are using version 2”.
Let's create version 1 of the service with 2 replicas:
docker service create --name webserver --replicas=2 -p 8000:8000 smakam/webserver:v1
We can access the service using a simple script; the requests will get load balanced between the 2 replicas.
while true; do curl -s “localhost:8000”;sleep 1;done
Following is the service request output that shows we are using version 1 of the service:
You are using version 1
You are using version 1
You are using version 1
Let's upgrade to version 2 of the web service. Since we specify an update-delay of 3 seconds, there will be a 3 second gap between the upgrades of the 2 replicas. Since the "update-parallelism" default is 1, only 1 task will be upgraded at a time.
docker service update --update-delay=3s --image=smakam/webserver:v2 webserver
Following is the service request output that shows the requests slowly getting migrated to version 2 as the upgrade happens 1 replica at a time.
You are using version 1
You are using version 1
You are using version 2
You are using version 1
You are using version 2
You are using version 1
You are using version 2
You are using version 1
You are using version 2
You are using version 1
You are using version 2
You are using version 1
You are using version 2
You are using version 1
You are using version 2
You are using version 1
You are using version 2
You are using version 2
You are using version 2
Now, let's roll back to version 1 of the webserver:
docker service update --rollback webserver
Following is the service request output that shows the requests slowly getting moved back from version 2 to version 1.
You are using version 2
You are using version 2
You are using version 1
You are using version 2
You are using version 1
You are using version 2
You are using version 1
You are using version 2
You are using version 1
You are using version 2
You are using version 1
You are using version 2
You are using version 1
You are using version 1
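The progress of an update or rollback can also be checked on the service itself; a minimal sketch using the UpdateStatus section of "docker service inspect":
docker service inspect --format '{{json .UpdateStatus}}' webserver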
Please let me know your feedback and if you want to see more details on any specific topic related to this. I have put the code associated with this blog here. The containers used in this blog (smakam/signaltest, smakam/webserver) are in Docker Hub.