Yesterday, April 8th 2021, at around 22:00 UTC, Facebook experienced a major outage in which Facebook, Messenger, WhatsApp web and Instagram were down for as long as 3 hours.
This was reported on Facebook’s status page (https://developers.facebook.com/status/), which was a good example of how to communicate an incident.
So, what could have been improved?
We think a few aspects of how this incident was communicated could have been handled better.
We would suggest learning from the following:
1. Status page should not have been hosted within their own infrastructure
Their status page went down together with their services, as it seems to be hosted within the same infrastructure, or at least to share some resources with it.
We explain why this is a strategy to avoid in our blog post Learning from Facebook: Keep your Status Page Separately from your Infrastructure.
2. Lack of information within the incident report
There is no status update indicating when the incident was first identified, and no information about the investigation or the resolution either, only a short text marking it as resolved.
So there is no way to tell for sure when the incident started and ended. There is a start time and a last update time, but no end time or “duration” field.
Here is an example of an incident reported with appropriate start date and duration information.
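As a rough illustration of the fields such a report needs, here is a minimal Python sketch. The `Incident` class, its field names and the stage labels are our own assumptions for illustration, not the data model of Facebook or of any status page provider.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional, Tuple


@dataclass
class Incident:
    """Hypothetical incident record; all names here are illustrative."""
    title: str
    started_at: datetime
    resolved_at: Optional[datetime] = None
    # Each update is (timestamp, stage, message),
    # e.g. stage "identified", "investigating" or "resolved".
    updates: List[Tuple[datetime, str, str]] = field(default_factory=list)

    def add_update(self, at: datetime, stage: str, message: str) -> None:
        """Record a timestamped update; a "resolved" update sets the end time."""
        self.updates.append((at, stage, message))
        if stage == "resolved":
            self.resolved_at = at

    @property
    def duration(self) -> Optional[timedelta]:
        """Time from start to resolution, or None while still ongoing."""
        if self.resolved_at is None:
            return None
        return self.resolved_at - self.started_at
```

With a record like this, a reader can always see when the incident started, when it ended and how long it lasted: `duration` is `None` while the incident is open and a `timedelta` once a "resolved" update has been added.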
3. Status page did not reflect the current state of the services
At some point Facebook’s status page was reporting “Platform is Healthy” while it was clearly still undergoing an outage, at least for some users, as reported on Hacker News.
This is why it’s so important to automate incident reporting within your status page, connecting your monitoring services to it to ensure that it reflects the current state of your services.
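One way to picture this automation is a small function that derives the status shown on the page from monitoring results instead of from a manual toggle. The following Python sketch is only an assumption about how that could look: the `Status` levels, `service_status` and `page_status` are hypothetical names, and a real setup would feed them from your monitoring tool.

```python
from enum import IntEnum
from typing import Dict


class Status(IntEnum):
    """Hypothetical status levels, ordered from best to worst."""
    OPERATIONAL = 0
    DEGRADED = 1
    MAJOR_OUTAGE = 2


def service_status(failed_checks: int, total_checks: int) -> Status:
    """Derive one service's status from its monitoring checks."""
    if total_checks <= 0 or not 0 <= failed_checks <= total_checks:
        raise ValueError("invalid check counts")
    if failed_checks == 0:
        return Status.OPERATIONAL
    if failed_checks < total_checks:
        # Some checks (e.g. some regions) fail: some users are affected.
        return Status.DEGRADED
    return Status.MAJOR_OUTAGE


def page_status(services: Dict[str, Status]) -> Status:
    """The page shows the worst status among all monitored services."""
    return max(services.values(), default=Status.OPERATIONAL)
```

Under this rule the page can only report a healthy platform when every check of every service passes, so a partial outage affecting some users still shows up.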
4. No way to subscribe for notifications
There doesn’t seem to be a way to subscribe to notifications about the status of Facebook’s services. The Subscribe button takes you to their developer notification settings page, where there is no clear way to subscribe to service status updates.
Subscribing should be easy and clear, so that any relevant user can get notified when there are outages. Channels other than email, such as SMS or even Slack, are great options too.
Successful incident communication is key to keeping your customers’ trust during downtime, and more and more companies are opting for a status page as their primary tool for this. Keep these points in mind when choosing your status page provider, as well as when reporting outages.
If you are considering taking incident communication seriously, take a look at Statuspal. We cover all of the points mentioned above, and much more.