Built-In Locks for Kubernetes Workloads

July 31, 2023


I’ve been working on a project that interacts with an external web endpoint to query a list of available resources, select a free resource, and then mark it as reserved. The upstream system doesn’t have any type of built-in read locking, so this process is naturally prone to the following race condition:

  1. User A looks up a list of available resources.
  2. While User A is still reviewing the list, User B also looks up a list of available resources.
  3. User A selects the resource with ID 123.
  4. User B also selects the resource with ID 123.
  5. A race condition ensues, and the last writer will “win” the resource (but the first user will be none the wiser).

This is a pretty classic problem, and it could be solved in a variety of ways if I’m interacting directly with a database that supports transactions (and most should). However, it’s a bit more challenging when the upstream, web-based system has no concept of such transactional locking.

In my case, the script I’m writing will only ever run from a single system. Therefore, the easiest way to solve this is with a simple, file-based lock:

#!/bin/bash

LOCKFILE="/var/run/my-script.lock"

if [[ -f "$LOCKFILE" ]]; then
  echo "Cannot run as lockfile exists at $LOCKFILE." >&2
else
  touch "$LOCKFILE"
  # Do some work
  rm "$LOCKFILE"
fi
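
As an aside, the check-then-touch pattern above has its own (tiny) race window between the -f test and the touch. That has never mattered for my use case, but if it did, flock(1) is the usual fix. A minimal sketch, assuming the same lockfile path:

(
  # flock atomically takes an exclusive lock on file descriptor 9; -n fails
  # immediately instead of blocking if another process already holds it.
  # The kernel releases the lock when the descriptor closes, even on a crash.
  flock -n 9 || { echo "Cannot run as another instance holds the lock." >&2; exit 1; }
  # Do some work
) 9>/var/run/my-script.lock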

A file-based lock is absolutely fine for this use case: I have a script that runs infrequently, is very unlikely to be run by multiple callers at the same time, and (most importantly) runs from a single system. I’m a big fan of avoiding premature optimization, but it’s also good to think about the future, even if it’s just for the benefit of the thought experiment.

I could use a distributed locking mechanism, such as Apache Zookeeper or a database system with strong write guarantees. However, I already anticipated this workload would run in Kubernetes at some point, so I began to wonder if Kubernetes has a built-in primitive to handle this use case. And it turns out it does: leases!

From the documentation:

Distributed systems often have a need for leases, which provide a mechanism to lock shared resources and coordinate activity between members of a set. In Kubernetes, the lease concept is represented by Lease objects in the coordination.k8s.io API Group, which are used for system-critical capabilities such as node heartbeats and component-level leader election.

In this article, I’ll discuss how to define a Kubernetes lease and supporting role-based access control mechanisms to support a simple distributed lock.

Role-Based Access Control

First, I’ll start with security (this is probably the first time I’ve ever written that sentence). Role-based access control (RBAC) is the mechanism Kubernetes uses to control access to its API, and like many Kubernetes things, it takes some getting used to.

Workloads in Kubernetes can access the Kubernetes API using a service account token that is automatically mounted within a Pod. This isn’t mandatory: if a workload has no reason to access the Kubernetes API, then it can be configured without a service account token. However, the default behavior is generally to mount a token within the Pod at /var/run/secrets/kubernetes.io/serviceaccount/token.
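
As an illustration of opting out (the Pod name here is made up, and this Pod isn’t part of the actual demo), the automountServiceAccountToken field on a Pod spec disables the token mount:

# Purely illustrative: a Pod that opts out of token automounting
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: no-token-demo
spec:
  automountServiceAccountToken: false
  containers:
  - name: main
    image: ubuntu:latest
    command: ["sleep", "3600"]
EOF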

The basic flow for granting a workload permissions to the Kubernetes API consists of four steps:

  1. Define a ServiceAccount. This is just a friendly name for the service account assigned to a workload.
  2. Define a Role. This defines the actual API permissions, such as permitted endpoints and the HTTP verbs that can be executed against them.
  3. Define a RoleBinding. This ties the ServiceAccount to the Role.
  4. Assign the ServiceAccount to a workload, such as a Pod.

The result of this somewhat spaghetti-like (but ultimately elegant) configuration is a token that workloads inside of a Pod can use to access the Kubernetes API with whatever permissions are defined by the Role.

ServiceAccount, Role, and RoleBinding resources are namespaced. ClusterRole and ClusterRoleBinding are also available for non-namespaced resources that exist across the entire cluster. See the documentation for more information.

The creation order doesn’t matter for these resources, but I like to start with the ServiceAccount:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: lease-demo

The Role is a bit more complex. The workload must be able to create a new Lease, as well as delete the Lease it created. Unfortunately, the API endpoint to create a lease is /apis/coordination.k8s.io/v1/namespaces/{namespace}/leases, and it doesn’t support restricting the resourceName that can be created. This feels kind of silly, but there is a technical reason for it: RBAC authorizes a create request before the new object’s name is necessarily known, so resourceNames cannot be enforced on creation. As a result, the permissions for Lease creation must be overly permissive: they allow for the creation of any Lease. However, the delete permissions can be restricted.

The following Role allows for the creation of any Lease, but it only allows a Lease named demo-lock to be deleted:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: lease-demo
rules:
- apiGroups: ["coordination.k8s.io"]
  resources: ["leases"]
  verbs: ["create"]

- apiGroups: ["coordination.k8s.io"]
  resources: ["leases"]
  resourceNames: ["demo-lock"]
  verbs: ["delete"]

Finally, the RoleBinding ties the lease-demo ServiceAccount to the lease-demo Role. I’ve named everything the same here, but the names don’t fundamentally matter as long as they are listed correctly in the roleRef and the subjects section of the RoleBinding:

apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: lease-demo
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: lease-demo
subjects:
- kind: ServiceAccount
  name: lease-demo
  namespace: default # required when the subject is a ServiceAccount
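
As a quick sanity check (assuming everything lives in the default namespace, and that your own kubeconfig user is allowed to impersonate service accounts), something like the following should confirm the permissions line up before attaching them to a workload:

# Should print "yes": the service account may create Leases in the default namespace
$ kubectl auth can-i create leases --as=system:serviceaccount:default:lease-demo -n default

# Should also print "yes": deleting the specific demo-lock Lease is permitted
$ kubectl auth can-i delete leases/demo-lock --as=system:serviceaccount:default:lease-demo -n default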

I now have a lease-demo ServiceAccount with all of the permissions necessary to manage a simple lock. Next, it’s time to apply it to a workload.

The Workload

Now that the Kubernetes-specific configuration is in place to support Lease management, we can build a workload that leverages Leases for locking purposes. A simple web request using the service account token (from /var/run/secrets/kubernetes.io/serviceaccount/token) is enough to create and delete a Lease. My actual implementation uses a Python script, but a simple cURL example is sufficient for demonstration purposes:

curl \
  --fail \
  -X POST \
  --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt \
  -H "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" \
  -H "Content-Type: application/json" \
  --data '{"apiVersion":"coordination.k8s.io/v1","kind":"Lease","metadata":{"name":"demo-lock","namespace":"default"}}' \
  https://kubernetes.default.svc.cluster.local/apis/coordination.k8s.io/v1/namespaces/default/leases

To break down this cURL request for those who have never interacted directly with the Kubernetes API:

  • --fail - Causes curl to exit non-0 if it doesn’t receive a successful HTTP status code
  • -X POST - Send a POST request (this is a create operation)
  • --cacert - Verify the server’s certificate using the CA certificate that is automatically mounted in the Pod alongside the service account token
  • -H Authorization... - Pass an HTTP header called “Authorization” with a value of Bearer ${TOKEN}, where ${TOKEN} is the actual service account token from /var/run/secrets/kubernetes.io/serviceaccount/token
  • -H Content-Type: application/json - This is a JSON request
  • --data - The JSON data sent to the API. This contains the parameters to create the Lease, and it is the JSON representation of the same YAML that can be used to create a Lease using the kubectl command (see the equivalent manifest just after this list).
  • https://kubernetes.default.svc.cluster.local/apis/coordination.k8s.io/v1/namespaces/default/leases - The Lease API endpoint
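
For comparison, here is roughly what that --data payload looks like as a YAML manifest fed to kubectl instead of curl (a sketch of the equivalent create, using a heredoc for convenience):

# Create the same demo-lock Lease with kubectl instead of curl
kubectl create -f - <<'EOF'
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  name: demo-lock
  namespace: default
EOF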

Deleting a Lease is similar: just send a DELETE request to the lease’s endpoint:

curl \
  --fail \
  -X DELETE \
  --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt \
  -H "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" \
  -H "Content-Type: application/json" \
  https://kubernetes.default.svc.cluster.local/apis/coordination.k8s.io/v1/namespaces/default/leases/demo-lock

Using these fundamental concepts, I built a reasonably robust script to simulate acquiring a Lease, running a workload, and releasing the Lease. The create call is what makes this work as a lock: the API server refuses to create a Lease that already exists (it returns a 409 Conflict), so only one caller can successfully acquire the lock at any given time:

#!/bin/bash

# Install curl. Never actually do this in a production container.
if ! curl --version >/dev/null 2>&1
then
  apt update >/dev/null 2>&1
  apt install -y curl >/dev/null 2>&1
fi

delete_lease() {
  # If we don't currently hold the Lease, then don't do anything.
  # This check allows us to use this function in an exit trap also, since we only want to delete a lease if we currently hold it.
  if ! $LEASE_HELD
  then
    return 0
  fi
  
  if curl \
    --fail \
    -X DELETE \
    --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt \
    -H "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" \
    -H "Content-Type: application/json" \
    https://kubernetes.default.svc.cluster.local/apis/coordination.k8s.io/v1/namespaces/default/leases/demo-lock \
    >/dev/null 2>&1
  then
    echo "Lock deleted"
    return 0
  else
    echo "Unable delete lock"
    return 1
  fi
}

delete_lease_and_exit() {
  delete_lease
  exit $?
}

LEASE_HELD=false
TIMEOUT_SECONDS=120
TIME_WAITED=0

while [ "$TIME_WAITED" -lt "$TIMEOUT_SECONDS" ]
do
  # Acquire a lease
  if curl \
      --fail \
      -X POST \
      --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt \
      -H "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" \
      -H "Content-Type: application/json" \
      --data '{"apiVersion":"coordination.k8s.io/v1","kind":"Lease","metadata":{"name":"demo-lock","namespace":"default"}}' \
      https://kubernetes.default.svc.cluster.local/apis/coordination.k8s.io/v1/namespaces/default/leases \
      >/dev/null 2>&1
  then

    # Catch exits and ensure the lock is cleaned up (e.g., if the script is killed) to reduce the chance of deadlocks.
    trap delete_lease_and_exit EXIT

    # Simulate doing some work for a random number of seconds between 1 and 15
    LEASE_HELD=true
    SLEEP_TIME=$(shuf -i 1-15 -n 1)
    echo "Got lease, doing some work for $SLEEP_TIME seconds"
    sleep $SLEEP_TIME

    # Delete the lease, clear the exit trap (so it doesn't fire when we exit), and exit
    delete_lease
    EXIT_CODE=$?
    trap - EXIT
    exit $EXIT_CODE
  else
    # If we can't get a lease (someone else is holding the lock), then wait for a bit.
    echo "Unable to get lease. Sleeping for 5 seconds then retrying"
    sleep 5
    TIME_WAITED=$(( TIME_WAITED + 5 ))
  fi
done

echo "Unable to acquire lease after $TIMEOUT_SECONDS seconds."
exit 1

I put this entire script into a ConfigMap for demonstration purposes. In practice, I would build this into the actual container image, but mounting a ConfigMap is sufficient for this example:

$ kubectl create configmap lease-script --from-file=script.sh

I then define a simple Job to execute the script. The Job uses the ServiceAccount with the permissions needed to create and delete the lease:

---
apiVersion: batch/v1
kind: Job
metadata:
  name: lease-demo
spec:
  template:
    spec:
      serviceAccountName: lease-demo
      restartPolicy: OnFailure
      containers:
      - image: ubuntu:latest
        name: script-pod
        command:
          - bash
          - /script/script.sh
        volumeMounts:
          - name: lease-script
            mountPath: "/script"
            readOnly: true
      volumes:
      - name: lease-script
        configMap:
          name: lease-script
  backoffLimit: 4

I can try to run two instances of this Job simultaneously. However, the locking mechanism ensures that only one Job can run at a time, while the other Job simply waits for the Lease to be available:

# Create a job called lease-demo, then create a duplicate job called lease-demo2
$ kubectl apply -f Job.yaml && sed 's/  name: lease-demo/  name: lease-demo2/' Job.yaml | kubectl apply -f -

# Check on the status of the lease-demo2 Pod. It looks like it won the lease!
$ kubectl logs job/lease-demo2
Got lease, doing some work for 6 seconds
Lock deleted

# The lease-demo Pod lost the race, so it needs to wait. Finally, it acquires the lease and does some work.
$ kubectl logs job/lease-demo
Unable to get lease. Sleeping for 5 seconds then retrying
Unable to get lease. Sleeping for 5 seconds then retrying
Got lease, doing some work for 10 seconds
Lock deleted
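
As a side note, the lock itself is just an ordinary API object, so while one of the Pods is holding it, it can be inspected directly (output omitted here):

$ kubectl get lease demo-lock -n default -o yaml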

Wrapping Up

I’m generally quite wary of writing software that is locked to a platform, such as a cloud provider’s services or the Kubernetes API. However, in this case, the tradeoff likely makes sense: the use case is too simple to justify building and maintaining a separate distributed locking mechanism, the impact of vendor lock-in is minimal since this is only a basic script, and the provided Kubernetes API meets all of my needs. It’s always nice to find primitives that are included with your chosen platform, and I was pleasantly surprised to discover that Kubernetes includes this helpful Lease mechanism.