Cosmic Global – Network Outage – Incident details

Network Outage

Resolved
Partial outage (30%)
Started about 23 hours ago · Lasted 37 minutes

Affected

Network Points of Presence

Mixed statuses across locations: major outage, partial outage, and degraded performance from 5:32 AM to 5:41 AM; under maintenance and operational from 5:41 AM to 6:09 AM

Los Angeles, California

Major outage from 5:32 AM to 5:41 AM, Under maintenance from 5:41 AM to 6:09 AM

Dallas, Texas

Partial outage from 5:32 AM to 5:41 AM, Operational from 5:41 AM to 6:09 AM

Ashburn, Virginia

Degraded performance from 5:32 AM to 5:41 AM, Operational from 5:41 AM to 6:09 AM

London, United Kingdom

Major outage from 5:32 AM to 5:41 AM, Operational from 5:41 AM to 6:09 AM

Amsterdam, Netherlands

Major outage from 5:32 AM to 5:41 AM, Operational from 5:41 AM to 6:09 AM

Updates
  • Postmortem

    At approximately 1:25 AM EST, during a routine filter re-deployment (as a hotfix) to address issues surfaced by our monitoring, the deployment process for our Lua filter code failed unexpectedly. We do not believe the filter code itself was at fault, as we redeployed the same hotfix moments later without issue; rather, we believe an extremely unlucky software bug in the deployment process cascaded through all PoPs because of the way the syncing system propagates changes. This was not a full outage, but we would classify it as a major outage, with more than 80% of traffic being dropped at the time.

    The issue was 95% resolved within the first 10 minutes by bringing Ashburn, Dallas, London, and Amsterdam back up without issue. Los Angeles took roughly 12 additional minutes because the filtering appliances there had to be rebooted and traffic gradually shifted back online. Frankfurt was not affected during this time, and its traffic continued to flow through the network.

    We are still investigating why this specific filter deployment failed. We deployed code more than 30 times in the month of March without a single issue, so this situation is understandably a concern to our customers; we acknowledge that, and the peculiarity of this failure makes a thorough investigation all the more important.

    We also acknowledge that three outages within the last 30 days are a significant cause for concern. We do not want to make excuses, but we do want to reiterate that all three outages were caused by factors outside our direct control at the time, and we are now implementing ways to control them. We understand that, to you, our customers, these factors are within our control: you trust us to maintain the stability of the network you use. We therefore take full responsibility for the outages, even if they were not directly caused by us, and are making it our goal to implement increased redundancy and resiliency.

    We are currently shipping upgraded filtering hardware and software, as well as upgraded routers and router components, to critical locations, both to introduce further redundancy and, as mentioned above, to improve resiliency overall.

    We will have more to share on this in the coming weeks, but rest assured we are working on the issues, and we completely understand your frustrations as a customer. Do not hesitate to voice any concerns to us, and we will be glad to respond.

    Please also take a look at the RFO (reason for outage) for last week's Dallas outage and our plans for the future:

    https://status.as30456.net/cmn9oqp8405n49b3k562hrh6i

  • Resolved
    This incident has been resolved.
  • Update

    All PoPs have recovered except for LAX. We are working on restoring traffic to LAX.

  • Monitoring
    We implemented a fix and are currently monitoring the result.
  • Investigating

    We are currently investigating this incident.
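The failure mode described in the postmortem above, a single deployment syncing to every PoP at once, is the kind of cascade that staged rollouts are designed to contain. The sketch below is purely illustrative and hypothetical (it is not Cosmic Global's actual deployment or syncing system; the function names, health-check logic, and PoP codes stand in for the real thing): a build is pushed to one PoP at a time, and the rollout halts at the first failed health check instead of propagating network-wide.

```python
# Hypothetical sketch of a staged (canary-style) filter rollout.
# All names and logic are illustrative, not the real deployment system.

POPS = ["lax", "dfw", "iad", "lhr", "ams", "fra"]  # PoPs from the incident

def deploy_filter(pop: str, version: str, healthy: set) -> bool:
    """Stand-in deploy step: succeeds only if the PoP passes its
    post-deploy health check (modeled here as set membership)."""
    return pop in healthy

def staged_rollout(version: str, healthy: set) -> list:
    """Deploy to one PoP at a time, halting at the first failure so a
    bad build never syncs to every PoP simultaneously."""
    deployed = []
    for pop in POPS:
        if not deploy_filter(pop, version, healthy):
            break  # stop here; remaining PoPs keep the old filter
        deployed.append(pop)
    return deployed

# A build that fails its health check at the third PoP reaches only
# the first two instead of cascading to all six.
print(staged_rollout("v2-hotfix", {"lax", "dfw", "lhr", "ams", "fra"}))
# → ['lax', 'dfw']
```

The trade-off is rollout speed versus blast radius: a hotfix takes longer to reach every PoP, but a bad build can only ever take down the canary rather than more than 80% of traffic at once.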