You know what’s worse than getting paged at 3am? Getting paged at 3am for something you can’t fix.
One of the best tools in my toolkit as the manager of an ops team for preventing these is what we call a “Page-Out Meeting.” This is a meeting where a few of my team members and I (usually the primary and secondary on-call folks, but it’s open to anyone) get together and review each and every page that went out since the last meeting.
We talk about whether the page was actionable, what the person had to do to fix it, and how we keep it from paging again. (An “actionable page” is one that led to the person getting paged taking action to correct it.)
- If the page was unactionable, we figure out how we can make it not page out again.
- If the page was actionable, we talk about how we can prevent the underlying problem from happening in the future.
I keep a page on our internal wiki that lists the time and date that the page occurred, a link to the actual alert in PagerDuty, what the page text was (useful for searching later!), notes on what we did to correct it, and links to any tasks in Jira that were opened as follow-up to correcting the problem.
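If it helps to picture that log as data, here is a minimal sketch in Python. The class, field names, and the `search_pages` helper are all made up for illustration; the actual log lives on a wiki page, not in code.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class PageLogEntry:
    """One row of the page log. Field names are illustrative, not a real schema."""
    occurred_at: datetime   # time and date the page went out
    alert_url: str          # link to the alert in PagerDuty
    page_text: str          # text of the page, kept verbatim so it can be searched
    actionable: bool        # did the person paged take action to correct it?
    notes: str = ""         # what was done to correct it
    followup_tasks: list[str] = field(default_factory=list)  # links to Jira tasks


def search_pages(log: list[PageLogEntry], phrase: str) -> list[PageLogEntry]:
    """Return the entries whose page text contains the phrase, ignoring case."""
    needle = phrase.lower()
    return [entry for entry in log if needle in entry.page_text.lower()]
```

Keeping the page text verbatim is what makes the "has this paged before?" search work when the same alert shows up again a few weeks later.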
We repeat this process weekly! Sometimes it takes a lot of time, but I find that it’s time well spent. Guarding the sleep of my team is super important to me!
Questions We Ask
For each page (actionable or not), we ask ourselves a bunch of questions.
Did it page with the right severity? If it’s a 24/7 alert, does it need to be? Could we make it business hours only?
We’ve found that many times alerts are set to 24/7 when they really don’t need to be.
Oftentimes, moving an alert to business hours only is sufficient for what we’re actually alerting on. (For example, if we have a cluster with 20 nodes in it, we can stand to drop a few without any impact to our customers, so let’s just wait and deal with it when we’re awake.)
If it’s a 24/7 alert, what could we do to make it business hours?
Machines are (usually) cheaper than people, and almost always cheaper than downtime. Can we throw hardware at the problem to reduce the severity of this page in the future? If so, let’s do that, and drop this alert down to business hours.
Most of the time this requires two thresholds in your monitoring: for example, page 24/7 if the metric passes 90%, and page during business hours only if it passes 50%.
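The two-threshold idea can be sketched in a few lines of Python. The 90% and 50% values come from the example above; the function names and the definition of business hours (Monday to Friday, 09:00 to 17:00 local time) are assumptions for illustration, and a real monitoring system would express this in its own configuration.

```python
from datetime import datetime

PAGE_ALWAYS_AT = 90.0          # percent: above this, page at any hour
PAGE_BUSINESS_HOURS_AT = 50.0  # percent: above this, page only during business hours


def is_business_hours(now: datetime) -> bool:
    """Assumed definition: Monday to Friday, 09:00 up to (not including) 17:00."""
    return now.weekday() < 5 and 9 <= now.hour < 17


def should_page(value_percent: float, now: datetime) -> bool:
    """Decide whether a reading should page someone right now."""
    if value_percent > PAGE_ALWAYS_AT:
        return True
    if value_percent > PAGE_BUSINESS_HOURS_AT:
        return is_business_hours(now)
    return False
```

Note that this only answers "should this page right now?" A reading of 60% at 3am returns `False` here, so the condition has to be re-evaluated once business hours start for the lower threshold to be of any use.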
Does this actually need to be a page-out? Is this alert the sole indicator of a problem, or just additional information?
Sometimes we find that just having a script drop a note in a Slack channel is good enough. (And maybe light up something on our monitoring dashboard.)
An example here is a ToR (top of rack switch) throwing interface errors. Yeah, we wanna know about that, and we’ll look at it, but if nothing else is alarming, it’s not worthy of waking someone up at 3am.
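As a rough sketch of "drop a note in Slack instead of paging," the Python below picks a delivery channel and builds the chat message as an HTTP request. Slack incoming webhooks accept a POST with a JSON body of the form `{"text": ...}`; everything else here (the function names, the `sole_indicator` flag, the placeholder URL) is hypothetical.

```python
import json
import urllib.request


def choose_delivery(sole_indicator: bool) -> str:
    """Page only when this alert is the sole indicator of a problem."""
    return "page" if sole_indicator else "chat"


def build_chat_request(webhook_url: str, message: str) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying {"text": message} as JSON."""
    body = json.dumps({"text": message}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Example: interface errors on a ToR switch are additional information,
# so they go to chat. The URL below is a placeholder, not a real webhook.
if choose_delivery(sole_indicator=False) == "chat":
    request = build_chat_request(
        "https://example.invalid/webhook",
        "ToR switch reporting interface errors",
    )
    # urllib.request.urlopen(request) would actually send it.
```

Building the request separately from sending it keeps the sketch safe to run anywhere and makes the message easy to inspect.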
Were we the right team to be paged for this alert? Should this have gone to a different team?
Sometimes the action for an alert is nothing more than “page another team.” If that’s the case, we’ll flip things around and fix the alert to go to the other team first, and they can page us if it’s an ops issue.
This is usually the case with alerts that are unactionable for us, but actionable by another team.
Could a change prevent this page in the future?
If we think a code or config change could prevent the page in the future, we’ll open a task in Jira with either ourselves or the correct team to try to get it in motion.
As the manager, it’s then my job to follow up on the freshly created task to make sure it gets prioritized and done. Since we have a paper trail of each page, I’ve had really good luck going to other teams and saying “Hey, this thing paged out four times last week. Here’s the log. It’s 50% of the pages that have been waking us up. Can it get looked at?” Other managers have been suuuper responsive when I present things in those terms. (A lot of times they aren’t even aware it’s a problem, and as soon as you’re able to express it in human cost, they get it.)
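Numbers like "four times last week" and "50% of the pages" fall straight out of the page log. Here is a small Python sketch of that arithmetic; the function name and the sample page texts are invented for illustration.

```python
from collections import Counter


def page_share(page_texts: list[str]) -> list[tuple[str, int, float]]:
    """Return (page text, count, percent of all pages), most frequent first."""
    counts = Counter(page_texts)
    total = len(page_texts)
    return [(text, n, 100.0 * n / total) for text, n in counts.most_common()]


# Hypothetical week: eight pages, four of them from the same alert.
last_week = ["queue depth high"] * 4 + [
    "disk full",
    "disk full",
    "cert expiring",
    "ToR interface errors",
]
top_text, top_count, top_percent = page_share(last_week)[0]
print(f"{top_text!r} paged {top_count} times, {top_percent:.0f}% of all pages")
# prints: 'queue depth high' paged 4 times, 50% of all pages
```

Grouping on the verbatim page text is the simplest option; if the text embeds hostnames or timestamps, it would need normalizing first so that repeats of the same alert count together.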
I’ve found page-out meetings to be a super good use of my team’s time. Some weeks they’re time consuming, but most weeks they aren’t. They’re the best tool I’ve found for actually reducing the number of pages that wake people up.
The documentation trail they leave is great. I’m able to tell my team members “I’m sorry last week was so rough, here’s what we’re doing about it,” and it gives me a tool I can use with other managers to help show why a task is worthy of their team’s time. (Yay for empathy!)
I hope you find them useful as well!