So you’ve had some downtime. Your customers were impacted for an extended period of time. You’re most likely are already writing a postmortem that’s circulated internally, but what about writing one for your customers, too?
A History Lesson
I work on a legacy product that’s been around for over a decade. When our product was new, we had a tendency to interact with our customers on a regular basis, both within the product, and on venues like web forums. Our customers felt like we were a part of the community, and that we cared about them.
But as the product aged, our involvement grew less and less. What was a steady flow of information slowly fell to just a trickle, and then eventually stopped. We went several years without any real form of community to our customers after outages, other of a posting on our status blog that things were broken, and then again when it was fixed.
This lack of communication made us seem like we didn’t care about our customers. This wasn’t actually the case, we cared about them very much, but they had no way to determine this. Contempt grows in a vacuum. Our product caters to creative individuals, whose imaginations were left to run wild.
This brings us to about a year ago. After an extended period of downtime, I decided the right thing to do was to tell our customers what happened. I took a risk and wrote a blog post explaining what happened (and saying sorry for the mess), got it approved from all the right people, and it was posted on our website.
The response from our customers was overwhelmingly positive.
Suddenly, the same people that loved to come up with conspiracy theories about what happened (often saying we’d done it on purpose), were happy! They welcomed the information with open arms. It got reposted on other blogs, and customers even wrote their own blog postings on what happened.
By taking just a few hours to write a postmortem and posting it for the world to see, we were able to generate a huge amount of goodwill towards not only the ops team, but the company in general.
We were suddenly seen as human again. We weren’t some huge corporation that didn’t care about our customers, we were a group of engineers that went to battle to keep our product going. They saw us as people. It was amazing.
This trend has continued after every major period of downtime. During the last round of it, I even saw someone mention on Twitter that they were looking forward to our write-up of what happened.
It’s a good feeling knowing that ops has been able to help make such a big difference in how our customers perceive our company. It’s all about goodwill, and saying we’re sorry and explaining what went wrong, has been a great source of it.
How to get started?
Writing customer postmortems is a bittersweet thing for ops. Writing one means that downtime occurred. We don’t like downtime. I’m always happy to write one, but hate that I have to do it.
The way I got started was to just simply write one after a major outage, and started passing it around. This worked well in the small company that I work for.
If you work for a larger organization, I’d start by having a conversation with your coworkers and manager about how you feel better communication is important. See if you can figure out where it would go on your website. Figure out what kind of buy-in you need post something publicly before there’s an outage, who would need to approve it, and who would be the person to actually get it posted.
But simply put, the best way I found was just to write one following an outage. Show people what’s possible and what you have in mind.
Timing
Timing is important. Obviously you shouldn’t write about it until after the situation is fully resolved, but you don’t want it to linger for weeks, either.
My goal is to write a customer postmortem the day of the outage, by the end of the day. This gives me a day or two to make changes and get approvals from all of the right people. Getting approvals often takes longer than writing the thing in the first place. Usually we post the customer postmortem within 24 hours of the situation being resolved.
Is it more important than your internal technical postmortem? That depends on your organization. Often a customer postmortem is easier to write than an internal technical one, since you don’t need a full list of everything that happened, but it’s easier to write the external one after the internal one has been written.
When to Write a Postmortem
So when should you write a postmortem? After every outage?
This is something you need to figure out as an ops team. If you write too much, you might give the impression that things are worse than they are, but if you don’t write enough, you appear to lack empathy.
I have a benchmark I use on when to write a customer postmortem: when more than half of customer base is impacted, for more than a few minutes.
Twitter (and your support team) is a great source of intelligence to gauge how upset your customers are. If there’s more than just a few tweets on an outage, I use that as a guide that we need to tell folks that happened.
Blameless
If you haven’t read John Allspaw’s piece on Blameless Postmortems, take a few minutes and do it now. John Allspaw is one of the greats in our industry.
In my opinion, it’s even more essential that customer postmortems are blameless. How something happened (or who) is completely unimportant to your customers, and it’s inappropriate to mention it. Your customers don’t care about the structure of your company – they care about the product they’re using.
Anything that’s posted on the Internet is going to be indexed by Google and saved for forever and ever. Never mention a person’s name. You could be messing with someone’s career if a future employer decides to search on a potential candidate’s name and find something damaging.
Just. Don’t. Do. It.
Blameless Extends to Vendors, too
As tempting as it might seem, don’t throw vendors under a bus. We don’t like downtime. When our product goes down, and it’s due to a fault in a vendor’s product or service, it can be pretty upsetting. (These are the types of outages I like the least, myself!)
Here’s a few things to keep in mind:
- Once the outage is over, that relationship with the vendor will continue. It’s in the best interest of your team, and ultimately your customers, to keep it a healthy one.
- As ops, it’s our job to engineer things such that the failure of a single vendor doesn’t cause downtime… and if that happens… yeah.
Write as if your vendors are going to be reading what you’re writing, because odds are, they’re going to.
Learn to Speak Marketing
This is hard for opsen.
Your organization most likely has a certain way they want to communicate with your customers. For example, some product names might always need to be written a certain way. Or, some terms are considered proper nouns and need to be capitalized. Remember that you’re speaking on behalf of the company.
Most organizations have a style guide, or maybe even people you can ask for help. My employer certainly does. They’re usually very willing and happy to work with you, so don’t be afraid to talk to them.
This was hard for me! The first time I wrote a customer-facing postmortem, our Communications Director had me make dozens up changes to my posting to match the corporate style. Every time I do it there’s been less and less corrections I need to make… to the point that on the most recent one, there weren’t any! Yay.
How Much Detail?
This can be tricky. I keep some things in mind when I write.
First, I’m speaking (largely) to a non-technical audience. They’re very smart, but they aren’t necessarily technical. So while it might be tempting to write something like “a slave MySQL host encountered an unexpected foreign key constraint conflict, halting replication, which in turn made reads to the cluster fall out of sync with the master,” bear in mind that the only people that are going to understand that are your co-workers. Instead, I’d say something like “there was an issue with the database.” Save the detailed explanation for your internal postmortem.
Second, and perhaps most importantly, remember that not everyone that’s going to read this has the best interests of your customers at heart. I try really hard not to tip our hand on how the problem was created, out of fear that someone may use this information against us.
You have to be really careful here. For example, if you lose a transit link, don’t tell people what capacity was lost. Circuits only come in so many sizes, and you might have accidentally just told someone that’s out to DDoS you how much bandwidth they need to knock you out. (Never, ever do this.)
There was one time when I specifically mentioned the CVE number that we had to interrupt services to patch against. This was because it was a hot issue (that was being mentioned on the news), and I wanted our customers to know “we got it, don’t worry about it.”
I try to explain what happened, while not giving specifics. Terms like “database cluster” are usually pretty safe. Just use your head and pretend you’re reading it with the intentions of doing harm. Ask your security team if you have any doubts!
Say How You’re Making it Better
Be sure and mention what your team is doing to keep this from happening in the future.
Just like before, a lot of level of detail isn’t needed. It could just be as simple as “we’re going to dig into the database logs and figure out what happened.”
You most likely list “next steps” on your internal technical postmortem. This might be a good place to start, if you can do so without compromising too much information.
During a string of outages at my company last year we had problems with the blog system where our support team communicates with customers going down. I made sure to mention this in my postmortem as a problem, even if it wasn’t directly tied to the outage itself. Customers being unable to get information during an outage is a big deal to them… maybe even more than the outage itself. (Vacuums breed contempt!)
Every postmortem I gave a followup on that issue, telling how it had (or hadn’t) gotten better. When it was finally resolved, I once again mentioned that we’d been listening to our customer’s feedback on it, and we were happy that things had improved.
It’s All About Empathy
This is really what it’s all about.
The student edition of the Merrian-Webster dictionary defines empathy as:
a being aware of and sharing another person’s feelings, experiences, and emotions; also : the ability for this
I don’t know of a single person that works in ops that enjoys downtime. We take pride in our work. We don’t like it when our customers are impacted. Downtime to us represents failure.
Part of being good at what we do is understanding that while we’re upset that things went down, our customers are impacted and upset, too. By publicly acknowledging to your customers that you understand they’re upset, and let them know that you’re upset too, you help build a bond between you. It’s that feeling of “we’re all in this together” that can drive empathy and help form a community.
Thanks for listening. Hopefully I’ve given you enough motivation to help make your ops team seem a little more human to your customers, too.
April <3