When one of your IT services is on fire there’s no time to waste. Especially if that fire is blocking your users from getting stuff done. Rapid resolution tends to eclipse all else during an incident, often causing your team to ignore or forget pieces of the incident response process – like keeping people in the loop.
It’s one of those little problems that compounds into a big one if not handled correctly. Pretty soon, you’re stuck in an endless loop of shoulder-taps and email threads, trying to explain to the CEO why things went wrong. While there’s no shortage of tools to help your team detect, alert, swarm on, and resolve incidents, even the best tools can’t replace clear communication to internal and external stakeholders.
And let’s be real: The stakes can be high. Very high. Reputation, customer attrition, time spent on damage control, just to name a few.
Luckily, downtime doesn’t have to turn into a customer service nightmare. Informed users are happy users. But first you need to know who to communicate to, how to reach them, and how to do it with the least friction and fewest resources possible.
Communication during times like this is like ripples from a rock tossed into a pond. The circles closest to the incident get the biggest, most frequent, and most immediate feedback. This is your core on-call team – AKA the folks who need to identify and fix the problem. It’s a small circle, but the ripples (communication) need to be big, immediate, and frequent. As you move further from the core circle – to adjacent IT teams, managers, the organization as a whole, end users, and the general public – the audience gets bigger, but the ripples get smaller and less frequent.
Pick your communication solutions, channels, and message templates ahead of time
Professional support teams and site reliability engineers don’t decide on the fly what channels to communicate over. They make a plan ahead of time.
There are six main communication channels for incident communication:
- A dedicated status page
- Embedded status
- Email
- Workplace chat tool
- Social media
- SMS
Dedicated status page
We recommend teams use a dedicated status page as their primary incident communication solution. Whether you build it yourself or go with a hosted solution like Statuspage, it’s important to give your customers and colleagues a clear source of truth during an incident. Statuspage also gives your users an option to subscribe to get updates the moment they’re posted. This takes the support burden off teams who should be heads-down fixing the problem.
Embedded status
Statuspage also makes it easy to embed status information directly onto any website customers operate. We know most visitors are likely to check a provider’s home page or support page before looking for a status page. The embedded widget (here’s an example) is an easy way of letting those visitors know if an incident is underway. Visitors can also click through on the widget to get to the status page.
Email
You can give your audience the option to subscribe to email updates with a product like Statuspage. Whether you’re sending directly from your email tool or using a status page to trigger email sends, email is a reliable channel for incident communication.
Chat tools
Reduce context switching and information gaps for employees and agents with Halp. Halp powers Jira Service Management to sync conversations in Slack or Microsoft Teams with your tickets. Seamless conversation between popular chat tools and support tooling provides robust context for a problem, leading to a faster resolution.
Social media
Many teams use social channels like Twitter as a means of communication during an incident. It’s good to use this as a piece of your strategy, but don’t rely on it as your only means of communication.
SMS
An SMS message, or text message, is often a more immediate way to reach someone, and many people prefer it for critical inbound alerts like a downtime announcement. It’s also a channel where people develop message fatigue quickly and will unsubscribe if they see too many messages that aren’t relevant to them.
None of these channels are a silver bullet for incident comms. They all have different strengths and the real power comes when you layer them together. For example, at Atlassian, we post incidents to a status page but also push those updates to Twitter. An announcement about the incident is also visible on our Jira Service Management portal. These messages then direct the user back to the status page for more details on the incident. Managing incidents in Jira Service Management allows for multiple points of communication without getting wires crossed or losing your customers’ trust in translation.
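The layering described above can be sketched in a few lines of Python. This is a minimal illustration, not a description of how Atlassian's tooling works: the channel names, the `STATUS_PAGE_URL` placeholder, and the `build_channel_messages` function are all hypothetical. The idea it shows is that the full update lives on the status page, while every other channel carries a short notice that links back to it.

```python
# Hypothetical sketch: one primary channel holds the full update,
# and the secondary channels get a short pointer back to it.
STATUS_PAGE_URL = "https://status.example.com"  # placeholder URL (assumption)

SECONDARY_CHANNELS = ["social", "service_portal", "chat"]  # illustrative names


def build_channel_messages(title: str, details: str) -> dict[str, str]:
    """Return the text to post on each channel for a single incident update."""
    # The status page is the source of truth, so it gets the full details.
    messages = {"status_page": f"{title}\n\n{details}"}
    # Every other channel gets the headline plus a link to the status page.
    pointer = f"{title}. Details and updates: {STATUS_PAGE_URL}"
    for channel in SECONDARY_CHANNELS:
        messages[channel] = pointer
    return messages


messages = build_channel_messages(
    "Investigating slow page loads",
    "Some users are seeing slow or unresponsive pages. We are investigating.",
)
for channel, text in messages.items():
    print(f"[{channel}] {text}")
```

Keeping the details in one place means there is only one message to correct if the facts change mid-incident.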
Tailor alerts and communications to the right audience
When an incident arises, you need to know who to communicate with, how to reach them, and how to do it with the least friction and fewest resources possible. That is how you avoid a customer service nightmare or a communication meltdown. It’s best to start internally with an immediate response team and work outward, curating messages for the appropriate audience.
While every organization is different, in general it helps to think of these audiences as 5 distinct groups that need to be communicated with:
- Core on-call team: The first to know something is wrong, almost immediately upon impact (usually from monitoring and alerting tools). Internal teams work behind the scenes to detect, swarm, contextualize, and resolve incidents with collaborative communication tools.
- Front-line support team: Those who will be directly answering questions and giving customers updates during the incident. It’s an incredibly important role, so this team must get the right information to pass along to end users.
- Managers and executive team: The core team needs to communicate with this group so they know what’s going on, the potential impact on the following two groups, and hopefully an estimate of how long it could last.
- General employee population: Employees need to be kept informed as services they rely on go down and come back up. Proactively communicating with these users means fewer “what’s the status of this” questions, fewer duplicate IT support tickets, and more focus on fixing the problem at hand.
- External customers: If the incident affects external customers, send a communication explaining the problem and when they can expect a fix – or at least commit to an update at a regular interval. For issues that are still affecting your customers’ ability to use your product, we recommend never going more than one hour without sending an update. You should also always indicate when to expect the next update. If the incident is severe enough – especially one involving security or data loss – expedite external comms and pull in the other necessary teams (legal, HR, security, etc.).
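The update cadence for external customers can be expressed as a small check. The sketch below is an assumption-laden illustration: the one-hour ceiling comes from the recommendation above, but the 30-minute default interval, the function names, and the choice to label times as UTC are hypothetical.

```python
from datetime import datetime, timedelta

MAX_UPDATE_GAP = timedelta(hours=1)  # the one-hour ceiling recommended above


def next_update_due(
    last_update: datetime, interval: timedelta = timedelta(minutes=30)
) -> datetime:
    """Time by which the next update should go out, capped at the maximum gap."""
    return last_update + min(interval, MAX_UPDATE_GAP)


def is_update_overdue(last_update: datetime, now: datetime) -> bool:
    """True if customers have gone longer than the maximum gap without an update."""
    return now - last_update > MAX_UPDATE_GAP


def with_next_update(message: str, due: datetime) -> str:
    """Append the promised time of the next update (assumes times are kept in UTC)."""
    return f"{message} Next update by {due:%H:%M} UTC."


last = datetime(2024, 1, 1, 9, 0)
due = next_update_due(last)
print(with_next_update("We are still investigating.", due))
# prints: We are still investigating. Next update by 09:30 UTC.
```

Stating the next update time in every message is what lets customers stop refreshing the page and get on with their day.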
Set up templates for incident and outage communication
In the heat of an incident, the last thing you want to worry about is how to wordsmith an incident announcement. Wording the incident the wrong way is a perfect target for non-technical managers who might be looking for any reason to criticize your team’s response process.
Decide on the common language ahead of time, get it approved by your managers, and save it in a template. This makes it easy to plug in the relevant details and fire off an incident announcement on the day.
Here are two of the incident templates we use for our own status page:
- The site is currently experiencing a higher than normal amount of load, and may be causing pages to be slow or unresponsive. We’re investigating the cause and will provide an update as soon as possible.
- Our storage provider for public metrics data is currently experiencing infrastructure issues. Updates will be made available as the situation develops or information is provided to us.
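A saved template can be as simple as a string with named placeholders. The following sketch uses Python's standard `string.Template`; the `TEMPLATES` dictionary, the `high_load` key, the placeholder names, and the `render_announcement` function are hypothetical, and the wording is adapted from the first example above rather than copied from any real tool.

```python
from string import Template

# Hypothetical template store; wording adapted from the first example above.
TEMPLATES = {
    "high_load": Template(
        "$service is currently experiencing a higher than normal amount of load, "
        "which may be causing pages to be slow or unresponsive. "
        "We're investigating the cause and will provide an update $next_update."
    ),
}


def render_announcement(template_name: str, **details: str) -> str:
    """Fill a pre-approved template; raises KeyError if a detail is missing."""
    return TEMPLATES[template_name].substitute(**details)


print(
    render_announcement(
        "high_load", service="The site", next_update="as soon as possible"
    )
)
```

Using `substitute` rather than `safe_substitute` is a deliberate choice here: a missing detail raises an error instead of sending customers a message with a raw placeholder in it.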
Managing communication like a pro
The lifecycle of an incident will likely include several points of contact. Done well, there’s a familiar three-act structure to an incident: First contact, updates during the incident, resolution and post-mortem.
Prologue: Centralized internal team communication
Before anything else, internal teams on the back end of an incident should have an established communication platform and be ready to swarm when an issue arises.
Centralizing and filtering alerts across monitoring, logging, and CI/CD tools ensures a fast response from your team. With a platform like Jira Service Management, teams can quickly swarm an incident, gain context, and stay in touch throughout the duration of an incident.
Part 1: First contact
The initial update is the most important. Everything from what you say, to how and when you say it sets the tone for how your response will be perceived. This is where it really helps to have a template set up ahead of time.
Your goal should be to quickly acknowledge the issue, briefly summarize the known impact, promise further updates and, if you’re able, alleviate any concerns about security or data loss. It’s important to acknowledge there’s an issue, even if you don’t know the exact details yet.
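One way to make sure a first update covers each of these points is to treat them as required fields. The sketch below is illustrative only: the `FirstContact` class and its field names are hypothetical, and the reassurance about security or data loss is optional so that it is included only when the team can truthfully state it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FirstContact:
    """Hypothetical container for the parts of an initial incident notice."""

    acknowledgement: str  # say plainly that there is an issue
    known_impact: str  # brief summary of what users are seeing
    next_update: str  # promise of further updates
    data_reassurance: Optional[str] = None  # include only if you can confirm it

    def render(self) -> str:
        """Join the parts in order, skipping the reassurance when it is unset."""
        parts = [self.acknowledgement, self.known_impact]
        if self.data_reassurance:
            parts.append(self.data_reassurance)
        parts.append(self.next_update)
        return " ".join(parts)


notice = FirstContact(
    acknowledgement="We are aware of an issue affecting logins.",
    known_impact="Some users are currently unable to sign in.",
    next_update="We will post another update within 30 minutes.",
)
print(notice.render())
```

Because the first three fields have no defaults, forgetting any of them fails at construction time rather than after the notice has gone out.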
Part 2: Regular updates during the incident
Mid-incident communication is critical.
The SRE teams at Google list Communication Lead as one of the key roles to fill during an incident.
From Google’s book “Site Reliability Engineering” on the role of communication lead:
“This person is the public face of the incident response task force. Their duties most definitely include issuing periodic updates to the incident response team and stakeholders (usually via email), and may extend to tasks such as keeping the incident document accurate and up to date.”
This person will also be in charge of continuing to update the status page or post updates to other channels as the situation evolves. Even an update saying “We’re still working on the problem, nothing new to report,” is better than saying nothing and leaving your audiences hanging. People left in the dark start to expect the worst.
Communication with affected users and other stakeholders is imperative. Use your pre-determined channel(s) to tell users what’s going on. On a homepage, this may be a Statuspage alert that shows customers your team is aware of the problem and saves agents from handling redundant requests. Keep customers in the loop using multiple notification channels, including SMS, email, and mobile push.
Whatever tool you choose to use, we recommend that you identify one as your primary communication vehicle and funnel everyone there from the other channels. Managing incident communications through Jira Service Management ensures the right messages get to the right people.
Part 3: Resolution, post-mortem, what comes next
In 2010, Facebook suffered its largest outage to date. For about 2.5 hours, the social network was unavailable for millions of its then-half-a-billion users.
The timing couldn’t have been worse for the burgeoning tech giant, which was still in the early days of its explosive user growth and still proving to the business world that the service was worth the hype.
When the dust settled, a Facebook engineer posted a 395-word summary to the company’s engineering blog about the incident.
From the blog:
Early today Facebook was down or unreachable for many of you for approximately 2.5 hours. This is the worst outage we’ve had in over four years, and we wanted to first of all apologize for it. We also wanted to provide much more technical detail on what happened and share one big lesson learned.
The outline of the post-mortem is simple:
- Acknowledge the problem, empathize with those affected and apologize
- Explain what went wrong and why
- Explain what was done to fix the incident and what was done to prevent repeat incidents
- Acknowledge, empathize, and apologize once again
There’s no need for flowery language or grandiose claims in communication like this. Keep it simple and direct. For example, from the Facebook blog:
We apologize again for the site outage, and we want you to know that we take the performance and reliability of Facebook very seriously.
Language like this makes it easy for your customers and colleagues to trust that you’re running a level-headed team and keeping your eye on the ball. Browse our own incident response postmortem template for more ideas.
The reality of running always-on services is that sometimes, things unexpectedly break. Effectively communicating during downtime can actually build trust with both colleagues and customers. Responding well can make all the difference. We’ve also created an incident template generator to help you quickly write effective communications during incidents.
