Getting your applications running on Kubernetes is one thing; keeping them up and running is another thing entirely. While the goal is to deploy applications that never fail, the reality is that applications often crash, terminate, or restart with little warning. Even before that point, applications can have less visible problems like memory leaks, network latency, and disconnections. To catch these problems before they take a service down, we need a way of continually monitoring our applications. That's where liveness probes come in.
In this blog, we'll explain what liveness probes are, how they work, and how Gremlin checks your entire Kubernetes environment for missing or incomplete liveness probe declarations.
What are liveness probes and why are they important?
A liveness probe is a periodic check that determines whether a container has failed and, if so, restarts it. It's essentially a health check that periodically sends an HTTP request to a container (or runs a command inside it) and waits for a response. If the response doesn't arrive, or the container returns a failure, the probe triggers a restart of the container.
The power of liveness probes is in their ability to detect container failures and automatically restart failed containers. This recovery mechanism is built into Kubernetes itself without the need for a third-party tool. Service owners can define liveness probes as part of their deployment manifests, and their containers will always be deployed with liveness probes. In theory, the only time a service owner should have to manually check their containers is if the liveness probe fails to restart a container (like the dreaded CrashLoopBackOff state).
How do I address missing liveness probes?
Defining a liveness probe for each container takes just a few lines of YAML, and you don't need to change anything about how your application or container works.
For example, let's add a liveness probe to an Nginx deployment. When the container is up and running, it exposes an HTTP endpoint on port 80. Since other applications will communicate with Nginx over port 80, it makes sense to create a liveness probe that checks this port's availability:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:latest
        livenessProbe:
          httpGet:
            path: /
            port: 80
          initialDelaySeconds: 60
          periodSeconds: 3
```
The section we're looking at in particular is:
```yaml
livenessProbe:
  httpGet:
    path: /
    port: 80
  initialDelaySeconds: 60
  periodSeconds: 3
```
If we break this down:
- httpGet indicates that this probe issues HTTP requests. There are also liveness probes that run commands, send TCP requests, and send gRPC requests (we'll see a command-based example shortly).
- path and port are the URL path and port number that we want to send the request to, respectively.
- initialDelaySeconds is the amount of time to wait between deploying the container and running the first probe. This gives the container time to start up so we avoid false positives.
- periodSeconds is how often to run the probe after the initial delay.
Put together, this means that after 60 seconds have elapsed, Kubernetes will send an HTTP request to port 80 every 3 seconds. As long as the container returns a status code of at least 200 and below 400, Kubernetes considers the container to be healthy. If it returns an error code or can't be contacted at all, Kubernetes restarts the container. Any settings we don't specify fall back to their defaults: the probe times out after 1 second (timeoutSeconds), and the container is only restarted after 3 consecutive failures (failureThreshold).
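Not every application exposes an HTTP endpoint, so Kubernetes also supports command-based (exec) probes. Here's a minimal sketch, assuming the application creates and maintains a /tmp/healthy file while it's healthy (the file path and timing values are illustrative); it also sets timeoutSeconds and failureThreshold explicitly instead of relying on the defaults:

```yaml
livenessProbe:
  exec:
    # Runs inside the container; a non-zero exit code counts as a failed probe.
    command:
    - cat
    - /tmp/healthy
  initialDelaySeconds: 10
  periodSeconds: 5
  timeoutSeconds: 1    # how long to wait for the command to finish
  failureThreshold: 3  # consecutive failures before the container is restarted
```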
How do I validate that I'm resilient?
After you've deployed your liveness probe, you can use Gremlin to ensure that it works as expected. Gremlin's Detected Risks feature automatically detects high-priority reliability issues like missing liveness probes. You can also use Gremlin's fault injection toolkit to run Chaos Engineering experiments and cause your liveness probes to report an error.
Imagine we've deployed Nginx with its liveness probe. First, we can check to make sure the liveness probe exists by querying the Pod and looking for the Liveness line. For example:
```bash
kubectl describe pod nginx-85d94dc89-x4lxk
```
```
Name:          nginx-85d94dc89-x4lxk
Namespace:     default
...
Status:        Running
IP:            10.42.0.96
Controlled By: ReplicaSet/nginx-85d94dc89
Containers:
  nginx:
    Liveness:  http-get http://:80/ delay=60s timeout=1s period=3s #success=1 #failure=3
```
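If you'd rather not scan through the full describe output, you can also pull just the probe definition out of the Pod spec with a jsonpath query (substitute your own Pod name):

```bash
kubectl get pod nginx-85d94dc89-x4lxk -o jsonpath='{.spec.containers[0].livenessProbe}'
```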
Now that we've confirmed the probe is part of our deployment, let's run a Chaos Engineering experiment to test what happens when the probe gets tripped.
Using fault injection to validate your fix
With Gremlin, we can add just enough latency to the container's network connection to trip the liveness probe. The container won't know that the latency is generated by Gremlin, and will treat it as real latency. If we add enough latency to trip the 1 second timeout, we should see the liveness probe fail and the container restart.
To test this:
- Log into the Gremlin web app at app.gremlin.com.
- Select Experiments in the left-hand menu and select New Experiment.
- Select Kubernetes, then select our Nginx Pod.
- Expand Choose a Gremlin, select the Network category, then select the Latency experiment.
- Increase MS to 1000. This is the amount of latency to add to each network packet in milliseconds. Since the probe is set to time out after one second, this guarantees that any response sent from Nginx to Kubernetes takes at least that long.
- Increase Length to 120 seconds or higher. Remember: the liveness probe will hold for 60 seconds while waiting for the pod to finish starting. We want to run our experiment long enough to exceed that delay.
- Click Run Experiment to start the experiment.
Now, let's keep an eye on our Nginx Pod. In just a few seconds, we'll see the pod restart automatically.
```bash
kubectl get pods
```
```
NAMESPACE   NAME                     READY   STATUS    RESTARTS     AGE
default     nginx-6df47656c8-vxq6z   1/1     Running   1 (5m ago)   32m
```
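If you want to confirm that the restart was caused by the probe and not something else, check the Pod's recent events. A tripped liveness probe typically shows up as an Unhealthy event followed by the container being killed and restarted, though the exact wording varies by Kubernetes version:

```bash
kubectl describe pod nginx-6df47656c8-vxq6z | grep -A 10 Events
```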
What similar risks should I be looking for?
Kubernetes has two additional types of probes: startup probes and readiness probes.
Startup probes grant containers extra startup time by letting you set both a probe period (in seconds) and a failure threshold, which is the number of times Kubernetes will run the probe before killing the container. For example, if you set a period of 10 seconds and a failure threshold of 30, the probe will run every 10 seconds up to 30 times, leaving 5 minutes (300 seconds) for the application to start. It's important to note that liveness probes won't run until after the startup probe succeeds.
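Here's what that example might look like in a container spec. This is a sketch based on the numbers above, and the /healthz path is just an assumed health endpoint:

```yaml
startupProbe:
  httpGet:
    path: /healthz
    port: 80
  periodSeconds: 10     # probe every 10 seconds...
  failureThreshold: 30  # ...up to 30 times, giving the app up to 300 seconds to start
```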
Readiness probes work similarly to liveness probes and handle applications that are running, but not yet ready to receive traffic. A failing readiness probe keeps Kubernetes from routing traffic to the Pod until the application is ready to process it. For example, our Nginx container might start up within 10 seconds, but what if it had to load a massive configuration file that took an additional 30 seconds? If we only set a startup probe that allows for 10 seconds, other applications might send requests to the Nginx container while it's still processing its configuration. Readiness probes prevent this from happening.
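Continuing the Nginx example, a readiness probe could look like the sketch below. Here we assume that requests to the root path only succeed once the configuration has finished loading:

```yaml
readinessProbe:
  httpGet:
    path: /
    port: 80
  initialDelaySeconds: 10  # give the container time to start
  periodSeconds: 5         # keep checking; traffic is only routed once the probe passes
```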
While you're free to use all three probe types for your containers, the Kubernetes docs explain when you might prefer one over the other. In short:
- Use a readiness probe when you want to avoid sending traffic to a Pod until it's ready to process traffic.
- Use a liveness probe to detect critical container errors that might not be detected by Kubernetes.
- Use a startup probe for containers that take a long time to start, so a slow startup doesn't trip the liveness probe.
We'll cover more Kubernetes Detected Risks in the future. In the meantime, if you're ready to scan your own Kubernetes environment for reliability risks like these, give Gremlin a try. You can sign up for a free 30-day trial, and after installing our Gremlin agent, get a complete report of your reliability risks.
For more on liveness probes, check out our tutorial: How to validate liveness probes using Gremlin.