A quick note about why my uptime check alerts were firing daily.
Background: My Infrastructure
I use a Kubernetes cluster to manage my personal infrastructure, and to keep costs down, I take advantage of preemptible machines from Google Cloud Platform. Preemptible machines are a way for Google to let developers use spare computational resources that would otherwise sit idle. These virtual machines last for a maximum of 24 hours and can be killed by Google at any time.
But why, Ryder, would you ever use such a machine?
There’s a simple reason – they’re extremely cheap, at around an 80% discount from regular (on-demand) virtual machines. I use them in my Kubernetes cluster, since Kubernetes handles migrating workloads very easily – that’s what it is for.
I organize my Kubernetes cluster into two main node pools:
- The Stateful Pool
  - Made up of on-demand machines running mission-critical services like databases and high-value sites
- The Preemptible Pool
  - Made up of preemptible machines running stuff I don’t really care about, like old projects or low-traffic sites
What This Means
Because I use preemptible machines, every day a portion of my production-facing infrastructure is killed off completely, and my cluster recovers.
Some of my deployments run only one replica (not the greatest idea for site reliability, but I don’t care about those side projects), so there will be about 30 seconds to 2 minutes of downtime while these pods are reassigned to new nodes (virtual machines).
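To put that downtime in perspective, here is a rough back-of-the-envelope sketch. It assumes each single-replica site is hit by exactly one preemption per day, which is my simplification and not something Google guarantees; the function name is just for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def daily_availability(downtime_seconds):
    """Return the fraction of a day a site is up, given its downtime in seconds."""
    return 1 - downtime_seconds / SECONDS_PER_DAY


# Assumption: one reschedule per day, lasting between 30 seconds and 2 minutes.
for downtime in (30, 120):
    availability = daily_availability(downtime)
    print(f"{downtime:>3}s down per day -> {availability:.4%} available")
```

Under that assumption the affected sites land somewhere between roughly 99.86% and 99.97% availability, which is perfectly fine for side projects.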
The cluster handles this automatically, but I have uptime checks with Stackdriver to monitor whether or not a site went down.
I set up Stackdriver uptime checks and alerting policies for every site deployed on my Kubernetes cluster – the problem is that I was getting Slack messages, emails, and phone notifications every day about things that would resolve a minute later.
Developers should only be notified if something requires their attention, and this most certainly does not require mine.
After checking the default alerting policies on my uptime checks, I noticed something. The default configuration for Condition Triggers If: was actually Any Time Series Violates, and my rolling window was set to 5 minutes. I originally interpreted this to mean, “if the site has been failing for at least 5 minutes, notify me”.
What this actually meant was “if the site went down at all within the past 5 minutes, send me a notification”.
The fix? Simple: set Condition Triggers If to All Time Series Violate, instead of Any Time Series Violates. Easy peasy.
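If the difference between the two trigger options is hard to picture, here is a toy model in Python. It is only a sketch of the any-versus-all idea, not the Stackdriver API: it assumes an uptime check produces several time series (for example, one per checker), and the checker names and sample values are made up.

```python
# Toy model only: this is not the Stackdriver API. The checker names and the
# sample values below are made up for illustration.

def series_violates(samples):
    """A series violates if any check in the window failed (False = failed)."""
    return not all(samples)


def should_alert(window, trigger):
    """Decide whether to alert for one window of per-series check results."""
    violations = [series_violates(samples) for samples in window.values()]
    if trigger == "any":
        return any(violations)
    if trigger == "all":
        # Guard against an empty window, where all([]) would be True.
        return bool(violations) and all(violations)
    raise ValueError(f"unknown trigger: {trigger!r}")


# A brief blip: only one hypothetical checker caught the failed check.
blip = {
    "checker-a": [True, True, False, True, True],
    "checker-b": [True, True, True, True, True],
    "checker-c": [True, True, True, True, True],
}

# A real outage: every checker saw failures.
outage = {
    "checker-a": [False, False, False, False, False],
    "checker-b": [True, False, False, False, False],
    "checker-c": [False, False, False, False, False],
}

print(should_alert(blip, "any"))    # True: one series is enough
print(should_alert(blip, "all"))    # False: the other series are healthy
print(should_alert(outage, "all"))  # True: every series is in violation
```

In this model, a short reschedule that only some series notice pages me under "any" but stays quiet under "all", while a real outage still alerts either way.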
Hopefully this helps someone who isn’t sure why their Stackdriver uptime checks are alerting all the time.