Signals and the "kubectl delete" command

October 18, 2021

Some colleagues and I were recently implementing a Chaos Monkey style test against a Kubernetes deployment. The goal was to forcibly kill an application to understand how it behaved. Specifically, we wanted to see whether the application’s atomic I/O operations were safe even if the process was terminated ungracefully while data was being processed. To do this, we needed to make sure the process (and, by extension, the container running the process) was forcibly terminated without an opportunity to gracefully run any shutdown routines.

For those who remember Ye Olden Days when we wrote and tested applications without wrapping them in a set of namespaces and cgroups that are created by a runtime controlled by a constantly-evolving API with an increasingly complex set of interfaces, you know this is generally an easy problem to solve: Just spin up many copies of the service and write a one-liner to kill -9 $PID && sleep 1 each of those processes. Or forcibly stop the VMs that the service is running on. Or walk over to a rack of servers and unplug it. It’s not perfect, but it’ll do the job if you’re just looking to violently terminate processes and see how a system behaves.

But we’re #webscale now, so nothing can be simple.

Short version of this article: There’s no way to accurately simulate a true failure in a Kubernetes environment unless you have access to the underlying nodes. Empirical evidence indicates that Kubernetes and/or kubectl don’t offer a way to immediately send a SIGKILL to a pod. The documentation is very unclear, which adds to the confusion.

As an aside: If you want to play a fun game, log into that one single point of failure server that your company has at 4:30PM on a Friday and give the ole’ SysAdmin Roulette a whirl: kill -9 $(ps -ef | tail -n +2 | awk '{ print $2 }' | shuf | head -n 1). This is also a great way to get your company’s Very Important Person to stop chasing blockchain, machine learning, AI, or whatever other startup snakeoil they think you need, and convince them to actually fix problems in your current environment. Anyway…

Some Fundamentals

If you’re unfamiliar with signals, here’s a crash course: signals are essentially a standardized message sent to a process. Processes can generally decide how they want to handle different signals (except SIGKILL and SIGSTOP), but there’s some standardization. The signals that I’m interested in are SIGTERM and SIGKILL. Sending a SIGTERM to a process gives it a chance to gracefully terminate. The process will usually execute some cleanup tasks and then exit. Sending a SIGKILL to a process will immediately terminate the process, giving it no opportunity to clean up after itself.
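To make the difference concrete, here is a minimal sketch that runs outside of Kubernetes entirely. The function name signal_demo is hypothetical, and the one-second pause is a crude assumption about how long the child needs to install its handler. It starts a child shell that traps SIGTERM, sends it either signal, and reports how the child ended:

```shell
#!/bin/sh
# Start a child that traps SIGTERM, send it the signal named in $1
# (TERM or KILL), and print how the child ended.
signal_demo() {
    # The child installs a SIGTERM handler and then idles in short sleeps.
    sh -c 'trap "echo child: cleaning up; exit 0" TERM; while :; do sleep 1; done' &
    child=$!
    sleep 1                  # crude wait so the child can install its trap
    kill -s "$1" "$child"
    wait "$child"            # exit status of wait is the child's status
    echo "exit status after SIG$1: $?"
}

signal_demo TERM
signal_demo KILL
```

With SIGTERM, the handler runs and the child exits with status 0. With SIGKILL, the handler never runs and the shell reports status 137 (128 plus signal number 9), which is the behavior I wanted to reproduce inside a pod.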

For my team’s experiment, we ideally wanted to send a SIGKILL to our Kubernetes pods to test how they behave in ungraceful shutdown scenarios. The kubectl delete command is used to delete resources, such as pods. It provides a --grace-period flag, ostensibly for allowing you to give a pod a certain amount of time to gracefully terminate (SIGTERM) before it’s forcibly killed (SIGKILL). If you review the help menu for kubectl delete, you’ll find the following relevant bits:

      --force=false: If true, immediately remove resources from API and bypass graceful deletion.
Note that immediate deletion of some resources may result in inconsistency or data loss and requires
confirmation.

      --grace-period=-1: Period of time in seconds given to the resource to terminate gracefully.
Ignored if negative. Set to 1 for immediate shutdown. Can only be set to 0 when --force is true
(force deletion).

This isn’t really clear. Does --grace-period=1 result in immediate shutdown via a SIGKILL, or does it give the pod a 1 second grace period? Does --grace-period=0 --force=true send an immediate SIGKILL, or does it just remove the resource from the Kubernetes API? It’s all entirely unclear from the docs, so I ran some experiments to find out more.

Test Setup

To figure out how this behavior works, I used the following setup:

minikube version 1.23.2
Kubernetes server version 1.22.2
kubectl version 1.22.2

$ minikube version --components
minikube version: v1.23.2
commit: 0a0ad764652082477c00d51d2475284b5d39ceed

buildctl:
buildctl github.com/moby/buildkit v0.9.0 c8bb937807d405d92be91f06ce2629e6202ac7a9

containerd:
containerd github.com/containerd/containerd v1.4.9 e25210fe30a0a703442421b0f60afac609f950a3

crictl:
crictl version v1.21.0

crio:
crio version 1.22.0

crun:
error

ctr:
ctr github.com/containerd/containerd v1.4.9

docker:
Docker version 20.10.8, build 3967b7d

dockerd:
Docker version 20.10.8, build 75249d8

podman:
podman version 2.2.1

runc:
runc version 1.0.1

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.2", GitCommit:"8b5a19147530eaac9476b0ab82980b4088bbc1b2", GitTreeState:"clean", BuildDate:"2021-09-15T21:38:50Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.2", GitCommit:"8b5a19147530eaac9476b0ab82980b4088bbc1b2", GitTreeState:"clean", BuildDate:"2021-09-15T21:32:41Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"linux/amd64"}

Testing with `--grace-period=1`

The documentation for kubectl specifically states that --grace-period should be “Set to 1 for immediate shutdown.” To me, this would indicate that a grace period of 1 results in…well, immediate shutdown, like the documentation says. In the world of *nix, this means that the process is sent a SIGKILL and not given a chance to gracefully terminate.

Let’s put that to the test. First, I’ll start a simple busybox pod that just sleeps forever (I output timestamps on everything so that I can follow the flow):

$ date -u +%R:%S && kubectl run --image=busybox busybox sleep infinity
23:58:45
pod/busybox created

Next, I’ll connect to my minikube host (via minikube ssh) and fire up an strace on the process ID of the sleeping container.

$ strace --absolute-timestamps -p $(docker ps | grep 'sleep infinity' | cut -f 1 -d ' ' | xargs docker inspect | jq .[0].State.Pid)
strace: Process 44880 attached
23:58:48 restart_syscall(<... resuming interrupted nanosleep ...>) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)

Finally, I’ll just send a kubectl delete with a grace-period of 1 and observe the strace output:

# kubectl delete command
$ date -u +%R:%S && kubectl delete pod busybox --grace-period=1
23:59:04
pod "busybox" deleted

# strace output from minikube
23:59:04 --- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=0, si_uid=0} ---
23:59:04 restart_syscall(<... resuming interrupted restart_syscall ...>) = ?
23:59:05 +++ killed by SIGKILL +++

Notice that a SIGTERM is received, followed by a SIGKILL one second later. This is unexpected behavior: the docs indicate that --grace-period=1 results in immediate shutdown, which clearly isn’t the case. The use of a SIGTERM gives the process a chance to gracefully exit, which is undesirable for my tests.

Testing with `--grace-period=0` and `--force=true`

The other option provided by the documentation is to use --grace-period=0 and --force=true. Again, the docs are unclear about what will actually happen here. They state that --grace-period can only be set to 0 “when --force is true (force deletion).” The docs further explain that a force deletion will “immediately remove resources from API and bypass graceful deletion.” Basically, this indicates that the resource will be removed from Kubernetes before it has received confirmation that the resource itself (e.g., a container) has actually been deleted.

Once again, the documentation is unclear about the behavior (does --grace-period=0 result in a SIGKILL?), so I tested it out with the same experiment:

# Create the pod
$ date -u +%R:%S && kubectl run --image=busybox busybox sleep infinity
00:01:59
pod/busybox created

# Trace the pid
$ strace -p $(docker ps | grep 'sleep infinity' | cut -f 1 -d ' ' | xargs docker inspect | jq .[0].State.Pid) --absolute-timestamps
strace: Process 45803 attached
00:02:08 restart_syscall(<... resuming interrupted nanosleep ...>) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)

# Force delete the pod with no grace period
$ date -u +%R:%S && kubectl delete pod busybox --grace-period=0 --force=true
00:02:18
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
pod "busybox" force deleted

# Observe the strace output
00:02:18 --- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=0, si_uid=0} ---
00:02:18 restart_syscall(<... resuming interrupted restart_syscall ...>) = ?
00:02:48 +++ killed by SIGKILL +++

Whoa. Thirty seconds between receiving a SIGTERM and finally terminating via SIGKILL? That doesn’t sound like --grace-period=0 to me. So it turns out that specifying --grace-period=0 and --force=true might actually provide more of a grace period than you would expect.
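If you repeat these experiments a few times, reading the gap off the strace output by eye gets tedious. The following is a rough sketch of a helper (signal_gap is a hypothetical name, not part of strace or kubectl) that reads saved strace output on stdin and prints the number of seconds between the SIGTERM line and the "killed by SIGKILL" line. It assumes HH:MM:SS timestamps as produced by --absolute-timestamps, and that the SIGTERM line appears first:

```shell
#!/bin/sh
# Print the seconds between the SIGTERM line and the "killed by SIGKILL"
# line in strace output read from stdin.
signal_gap() {
    awk '
        # Convert an HH:MM:SS timestamp to seconds since midnight.
        function secs(t,   p) { split(t, p, ":"); return p[1]*3600 + p[2]*60 + p[3] }
        /--- SIGTERM/       { term = secs($1) }
        /killed by SIGKILL/ {
            d = secs($1) - term
            if (d < 0) d += 86400   # the trace crossed midnight
            print d
        }
    '
}

# The strace output from the --grace-period=0 --force=true experiment.
signal_gap <<'EOF'
00:02:18 --- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=0, si_uid=0} ---
00:02:18 restart_syscall(<... resuming interrupted restart_syscall ...>) = ?
00:02:48 +++ killed by SIGKILL +++
EOF
```

Fed the output above, this prints 30. Fed the output from the --grace-period=1 experiment, it prints 1.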

But why?

I now know that neither --grace-period=1 nor --grace-period=0 --force=true behaves “correctly” based on the documentation. The weirdest thing about this behavior is that it’s totally unnecessary. Docker (and I imagine other runtimes, like containerd) supports sending SIGKILL to a container:

# Kill the container
$ date -u +%R:%S && docker kill $(docker ps | grep 'sleep infinity' | cut -f 1 -d ' ')
01:25:45
8c59ac684bf2

# Observe the strace output
$ strace --absolute-timestamps -p $(docker ps | grep 'sleep infinity' | cut -f 1 -d ' ' | xargs docker inspect | jq .[0].State.Pid)
strace: Process 9122 attached
01:25:32 restart_syscall(<... resuming interrupted nanosleep ...>) = ?
01:25:45 +++ killed by SIGKILL +++

Notice that the container is immediately terminated via a SIGKILL. This is expected behavior, and is what any person should expect for an “immediate shutdown.”

It turns out that I’m not the first one to come across this problem. An issue was opened almost 2 years ago pointing out that, at a minimum, the documentation should be corrected to accurately reflect the behavior of Kubernetes. The issue was largely ignored and then autoclosed.

Does it matter?

All of this sounds very academic: how often do administrators really care about the shutdown signals sent to their processes? And why can’t you just run the underlying docker kill commands (or the equivalents in other runtimes)? Does it really matter that Kubernetes improperly implements “immediate shutdown” and then doesn’t explain this in the documentation?

It matters to anyone looking to test their system to ensure it behaves properly in failure scenarios. If a Kubernetes node fails due to a hardware issue, it probably isn’t going to use its dying breaths to politely send SIGTERMs to every pod. It’s just going to fail, and you need to understand how your system will handle that failure. Without being able to actually simulate this behavior, you can’t be confident that your system will degrade in the way you expect.

It’s tempting to tell an administrator to just log into the underlying hosts and simulate a failure, either via a docker kill or by physically terminating the machine. Aside from this being silly (Kubernetes should just implement signals properly), it’s not always possible: many organizations pay for hosted Kubernetes and have no access to the underlying nodes.

More broadly, Kubernetes is often billed by supporters as a “distributed operating system.” Process management is an integral part of an operating system, and if you can’t reason about how Kubernetes handles process termination, then it’s not much of an operating system. These are the kinds of “small things” that always end up mattering, so it’s probably just a good idea to implement them correctly from the beginning.

But really I’m just armchair quarterbacking: I’m not trying to take shots at the Kubernetes project. The goal of this article is just to spread awareness about the fact that you can’t simulate failure scenarios using only Kubernetes tooling. You need access to the underlying nodes, or your failure simulations won’t be accurate.

Anthony Critelli