DevOps Operations Performance Platform

PagerDuty Blog


Initial Outage Report

Yesterday was a bad day for the cloud. PagerDuty, as well as many of our customers and colleagues, suffered significant outages as a result of multiple sophisticated DDoS attacks on a popular DNS provider.

We suffered a major outage yesterday, Friday, October 21, which lasted nearly 3 hours, from approximately 10 am to 1 pm Pacific time. During this time, we were completely unavailable for about 30 minutes, followed by a period of limited availability due to very high load as we cleared a large backlog of queued notifications and resolved additional DNS-related issues in our systems.

Our mission is to be your trusted and reliable incident response and resolution partner ALWAYS. This includes times when you have minor issues, times when you have a major outage, and times when half the internet is down. Yesterday, we did not meet the high expectations that we have set for ourselves. I am personally disappointed and regret our poor performance and downtime during this major incident. The entire team at PagerDuty and I are truly sorry.

All of our services have been restored and have been operating normally since yesterday, Friday, October 21 at 1 pm Pacific time. Since then, we have had all hands on deck conducting a thorough post-mortem. We will communicate regularly to keep you posted on what happened, what we did to address it, and what we are doing to prevent it from happening again. In the next few days, we will publish the following two follow-up posts:

  • On Monday, Oct 24: a complete timeline of events outlining what happened and what we did to fix the outage
  • On Tuesday, Oct 25: the root-cause resolution action plan which outlines the set of action items we will undertake to help prevent such issues in the future

Yesterday’s outages were caused by a major black swan event, an event that many of us in the industry were not prepared for. You count on us to be prepared and we should have been. It doesn’t matter how unique this event was — there are no excuses. We are not shifting the blame to any other parties and we are not saying “we didn’t see it coming”. Quite simply, we need to be prepared to handle these kinds of situations. We need to be up when you’re down — in fact you rely on us when your systems are down. All of us at PagerDuty are disappointed and sorry for this outage.

This was a wake-up call for us. We will make this right for you and for your company. We will work diligently to learn from this incident and we are committed to becoming a better, stronger, more resilient, and more reliable partner. The entire team at PagerDuty, from our developers and ops teams to customer support, sales, services, and our leadership team, is passionate about delivering the best service to you, and we will do everything we can to prove you can still count on us going forward.

Please don’t hesitate to contact me directly or contact our support team if you have any questions or concerns. Please also stay tuned for our next post on Monday, October 24, which will cover the full timeline of events.

Sincerely,

Alex Solomon
CTO & Co-founder

The post Initial Outage Report appeared first on PagerDuty.

More Stories By PagerDuty Blog

PagerDuty’s operations performance platform helps companies increase reliability. By connecting people, systems and data in a single view, PagerDuty delivers visibility and actionable intelligence across global operations for effective incident resolution management. PagerDuty has over 100 platform partners, and is trusted by Fortune 500 companies and startups alike, including Microsoft, National Instruments, Electronic Arts, Adobe, Rackspace, Etsy, Square and GitHub.