At approximately 1:25 AM EST, during a routine filter re-deployment (a hotfix to address concerns surfaced by our monitoring), the deployment process for our Lua script code went haywire. This was not caused by logic errors or bugs in the code itself, as we redeployed the hotfix moments later without issue. We believe it was an extremely unlucky software bug that cascaded through all PoPs because of the way the syncing system works. This was not a full outage, but we classify it as a major outage, with more than 80% of traffic being dropped at the time.
The issue was 95% resolved within the first 10 minutes by bringing Ashburn, Dallas, London and Amsterdam back up without issue. Los Angeles took about 12 minutes longer because we needed to reboot the filtering appliances there and gradually shift traffic back online. Frankfurt was not affected, and its traffic continued flowing through the network throughout.
We are currently investigating why this specific filter deployment went haywire. We deployed code over 30 times in the month of March without a single issue, so we understand that this situation is a concern to our customers. We acknowledge that, and we are investigating because of how peculiar this situation was.
We also acknowledge that 3 outages within the last 30 days are a significant cause for concern. We do not want to make excuses, but we do want to reiterate that all 3 outages were caused by things outside our control at the time, and we are currently implementing ways to control them. We understand that to you, our customers, these are within our control, as you trust us to maintain the stability of the network you use. We therefore take full responsibility for the outages, even if they weren't directly caused by us, and are making it our goal to implement increased redundancy and resiliency.
We are currently in the process of shipping upgraded filtering hardware and software, upgraded routers, and upgraded router components to critical locations, not only to introduce further redundancy but, as previously mentioned, to improve resiliency overall.
We will have more to share on this in the coming weeks, but rest assured we are working on the issues, and we completely understand your frustrations as customers. Do not hesitate to voice any concerns to us, and we will be glad to respond.
Please also take a look at the RFO for the outage in Dallas last week and our plans for the future:
https://status.as30456.net/cmn9oqp8405n49b3k562hrh6i