Timeouts when serving requests

Incident Report for Section

Postmortem

Upstream Availability Incident

Key data

Incident Publication: https://status.section.io/incidents/s4n9vcl6dkrk
Start: 2016-03-31 20:20 AEDT
Mitigated: 2016-04-01 12:00 AEDT
Root Cause: NSW-IX, a major connection point for Australia's Internet backbone, had misconfigured Internet network paths outside of section.io's direct control

Summary

The Internet network paths that section.io used to communicate with some hosting partners suffered a high degree of packet loss. This packet loss occurred on the backbone of Australia's Internet, outside the control of section.io.

This packet loss meant that section.io could not connect to the hosting partners effectively, which resulted in slow page times, and possibly broken page loads.

section.io routed traffic around the problematic provider until the problem at NSW-IX was resolved.

Technical Background

We were notified of the incident by a customer report that their site was not loading properly.

Early in the investigation, we believed the problem lay with the customer's hosting platform. We began troubleshooting and applying mitigation strategies for that specific customer.

After some time, we recognized that the problem was affecting multiple customers. We conducted a thorough review of our systems and found no problems.

We then collected a list of affected customers and worked to determine what those systems had in common. The affected customers all used hosting providers with data centers in NSW. Since our own review had found no problems and the issue spanned multiple data centers, neither section.io nor any individual hosting provider was likely to be at fault.
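The commonality check described above can be sketched in a few lines. This is a minimal illustration, not the tooling section.io used: the customer names, the attribute keys (`hosting_provider`, `region`) and the `shared_attributes` helper are all hypothetical.

```python
from collections import Counter


def shared_attributes(affected, min_share=1.0):
    """Return (attribute, value) pairs held by at least min_share of customers.

    `affected` maps a customer name to a dict of string attributes.
    """
    counts = Counter()
    for attrs in affected.values():
        # Count each (attribute, value) pair once per customer.
        counts.update(set(attrs.items()))
    threshold = min_share * len(affected)
    return sorted(pair for pair, n in counts.items() if n >= threshold)


# Hypothetical inventory of affected customers (illustrative values only).
affected = {
    "customer-a": {"hosting_provider": "provider-1", "region": "NSW"},
    "customer-b": {"hosting_provider": "provider-2", "region": "NSW"},
    "customer-c": {"hosting_provider": "provider-1", "region": "NSW"},
}

print(shared_attributes(affected))
```

In this made-up data set the only attribute shared by every affected customer is the region, which mirrors the reasoning in the report: the common factor was location, not a single provider.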

We then examined the network paths that connect section.io's infrastructure, which runs on Amazon Web Services (AWS), to the customer data centers. All of the affected customers had data centers that connected to the Internet via an Internet Exchange called NSW-IX.

During our troubleshooting, we found that any traffic passing from AWS to a data center via the NSW-IX exchange point had problems.

With this understood, we started to find alternative paths from section.io's network to the data centers. We implemented two strategies. First, we changed DNS records to bypass section.io while new paths were being established. Second, once the new paths were established, we made further DNS changes to route traffic onto them.
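The two-phase DNS strategy above amounts to a simple decision per customer. The sketch below is an assumption about how that decision could be expressed, not section.io's actual procedure; the function name, its parameters and the `example.com` hostnames are hypothetical.

```python
from typing import Optional


def choose_dns_target(
    origin: str,
    primary_edge: str,
    primary_path_healthy: bool,
    alternate_edge: Optional[str] = None,
) -> str:
    """Pick the hostname a customer's DNS record should point to."""
    if primary_path_healthy:
        # Normal operation: traffic flows through the usual edge.
        return primary_edge
    if alternate_edge is not None:
        # Phase two: an alternative path exists, so route traffic onto it.
        return alternate_edge
    # Phase one: no alternative path yet, so bypass the platform entirely.
    return origin


# Hypothetical hostnames for illustration.
print(choose_dns_target("origin.example.com", "edge-au.example.com", False))
print(choose_dns_target("origin.example.com", "edge-au.example.com", False,
                        "edge-alt.example.com"))
```

Keeping the decision in one place makes each DNS change easy to review before it is released.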

We then worked with NSW-IX and our hosting partners to validate that the problems were resolved. We are currently restoring normal service for all affected customers.

It is important to note that this was a low-level problem that originated in the backbone of the Internet in NSW. section.io is not the only system that relies on NSW-IX, so websites that were not hosted on section.io were also affected.

Next Steps

section.io has identified the following areas for improvement and has begun work on them:

Earlier event notification. We are always disappointed when customers tell us that their sites are having issues. Our monitoring was too centered around making sure our platform was working, and not about external systems like connectivity to customer data centers. We have already implemented new high priority alerts that follow our 24x7 escalation policy to detect these problems.
Systems to find alternative network paths took some time to establish. While we were able to establish these paths within hours, these systems could have been on standby. Keeping them on standby will reduce the time required to correct network paths in the future. We created these alternate paths during the incident, and they are now readily available.
One customer experienced a secondary problem created by our remediation work, where human error allowed a faulty DNS change to be released without proper testing. We have established new team communication systems to facilitate better coordination and verification during incident management.
Some affected customers were not properly notified. We are aware that not all customers subscribe to our status.section.io notification platform. Therefore, we attempted to raise support tickets (support@section.io) for all affected customers. Not all customers received these support tickets. Our updated incident management procedures should ensure that all users receive updates from our status.section.io system. Users will be automatically subscribed to this system in the coming week.
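The first item above concerns monitoring connectivity to customer data centers rather than only the platform itself. As a rough sketch of that idea, and not a description of the alerts section.io deployed, a probe could attempt several TCP connections to an origin and alert on the failure ratio. The function names, the default port, the attempt count and the 0.4 threshold are all assumptions chosen for illustration.

```python
import socket


def tcp_connect(host, port, timeout):
    """Return True if a TCP connection to host:port opens within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and DNS resolution failures.
        return False


def failure_ratio(host, port=443, attempts=5, timeout=2.0, connect=tcp_connect):
    """Fraction of connection attempts to the origin that failed."""
    failures = sum(1 for _ in range(attempts) if not connect(host, port, timeout))
    return failures / attempts


def should_alert(ratio, threshold=0.4):
    """Alert when the failure ratio reaches the (assumed) threshold."""
    return ratio >= threshold


# Example with a stubbed connector so no real network traffic is sent:
# every second attempt "fails", imitating heavy packet loss on the path.
outcomes = iter([True, False, True, False, False])
ratio = failure_ratio("origin.example.com", connect=lambda h, p, t: next(outcomes))
print(ratio, should_alert(ratio))
```

Passing the connector in as a parameter keeps the probe testable without touching the network; in real use the default `tcp_connect` would be run from the same network locations that serve customer traffic.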

Posted Apr 04, 2016 - 16:02 UTC

Resolved

The issue has been confirmed & resolved by our upstream network provider. Affected sites moved to North American clusters will be moved back to Australia. Affected customers will receive communication via support tickets. A technical outage report can be provided upon request.

Posted Apr 04, 2016 - 00:32 UTC

Identified

Issues are still ongoing and we are moving affected websites to our North America clusters.

Posted Apr 01, 2016 - 03:17 UTC

Monitoring

Our service provider has restored connectivity. We will continue to monitor upstream origin timeouts.

Posted Mar 31, 2016 - 21:19 UTC

Update

Mitigations for affected customers are being invoked. We will raise support tickets for each affected customer.

Posted Mar 31, 2016 - 20:01 UTC

Investigating

We are seeing a number of timeouts when trying to connect to a few client origins. This is resulting in errors being displayed for some requests. We are investigating the issue, which at this time appears to be an upstream network problem. We have contacted our service providers to attempt to resolve this issue.

Posted Mar 31, 2016 - 15:55 UTC