4x Hour EC2 Outage for the Entire AWS Sydney Region


Not yet resolved, currently impacting:

RDS, WorkSpaces, ELB, AppStream 2.0, Lambda, ElastiCache, EC2

14 points | LilBytes posted a month ago | 8 Comments
gelatocar said a month ago:

Latest update from AWS:

[08:49 PM PST] We wanted to provide you with more details on the issue causing increased API error rates and latencies in the AP-SOUTHEAST-2 Region. A data store used by a subsystem responsible for the configuration of Virtual Private Cloud (VPC) networks is currently offline and the engineering team are working to restore it. While the investigation into the issue was started immediately, it took us longer to understand the full extent of the issue and determine a path to recovery. We determined that the data store needed to be restored to a point before the issue began. In order to do this restore, we needed to disable writes. Error rates and latencies for the networking-related APIs will continue until the restore has been completed and writes re-enabled. We are working through the recovery process now. With issues like this, it is always difficult to provide an accurate ETA, but we expect to complete the restore process within the next 2 hours and begin to allow API requests to proceed once again. We will continue to keep you updated if that ETA changes. Connectivity to existing instances is not impacted. Also, launch requests that refer to regional objects like subnets that already exist will succeed at this stage, as they do not depend on the affected subsystem. If you know the subnet ID, you can use that to launch instances within the region. We apologize for the impact and continue to work towards full resolution.
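A rough sketch of the workaround mentioned at the end of that update, assuming boto3 and working credentials. The helper names and the AMI/subnet IDs are placeholders made up for illustration; whether the launch actually succeeds during the outage depends on AWS's statement that existing subnets don't touch the affected subsystem.

```python
def build_launch_params(image_id, subnet_id, instance_type="t3.micro"):
    """Build keyword arguments for EC2 run_instances with an explicit subnet."""
    return {
        "ImageId": image_id,
        "InstanceType": instance_type,
        "MinCount": 1,
        "MaxCount": 1,
        # Pass the ID of a subnet that already exists, per the AWS update.
        "SubnetId": subnet_id,
    }


def launch_in_subnet(image_id, subnet_id, region="ap-southeast-2"):
    """Launch one instance into a known subnet (needs boto3 and credentials)."""
    import boto3  # imported here so build_launch_params works without boto3

    ec2 = boto3.client("ec2", region_name=region)
    return ec2.run_instances(**build_launch_params(image_id, subnet_id))


# Placeholder IDs, not real resources.
params = build_launch_params("ami-0123456789abcdef0", "subnet-0123456789abcdef0")
print(params["SubnetId"])
```

Calling `launch_in_subnet(...)` with real IDs would make the actual API request; the example only builds the parameters.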

LilBytes said a month ago:

Thank you for this. Our TAM confirmed the same a few hours ago.

Is it just me, or is AWS outage messaging always really evasive?

Error rates and API latency? I mean, 100% is technically a 'rate' between 1 and 100.

wryun said a month ago:

The frustrating thing is this is also taking out other services (ECS Fargate, CodeBuild) that presumably can't allocate capacity, and they don't admit it on the status page (hey, it's still a 200!).

My favourite error of today:

{'ResponseMetadata': {'HTTPHeaders': {'content-length': '135', 'content-type': 'application/x-amz-json-1.1', 'date': 'Thu, 23 Jan 2020 04:16:02 GMT', 'x-amzn-requestid': '...'}, 'HTTPStatusCode': 200, 'RequestId': '...', 'RetryAttempts': 0}, 'failures': [{'reason': 'Capacity is unavailable at this time. Please try ' 'again later or in a different availability zone'}], 'tasks': []}
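For anyone hitting the same thing, here is a minimal sketch of checking the `failures` list instead of trusting the HTTP status. It assumes the response is a dict shaped like the one above (it looks like what boto3 returns for an ECS `run_task` call); the helper name `task_launch_failures` is made up for illustration.

```python
def task_launch_failures(response):
    """Return the failure reasons from a run_task-style response dict.

    An HTTP 200 only means the API call itself was accepted; the
    'failures' list says whether any task was actually placed.
    """
    failures = response.get("failures", [])
    return [f.get("reason", "unknown") for f in failures]


# Trimmed copy of the response quoted above.
response = {
    "ResponseMetadata": {"HTTPStatusCode": 200, "RetryAttempts": 0},
    "failures": [
        {
            "reason": "Capacity is unavailable at this time. Please try "
            "again later or in a different availability zone"
        }
    ],
    "tasks": [],
}

reasons = task_launch_failures(response)
if reasons or not response.get("tasks"):
    print("launch failed despite HTTP 200:", reasons)
```

Treating an empty `tasks` list as a failure as well is a design choice here, so that a 200 with nothing started never passes silently.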

gelatocar said a month ago:

This is impacting us as our application relies heavily on Lambdas connecting to RDS via VPC. Has been down on and off since about 12pm AEST today. Doesn't seem like there's any easy way for us to resolve or work around it without just waiting for AWS to fix it.

quixquaxqux said a month ago:
LilBytes said a month ago:

Has zero detail. I thought the URL I provided had more, albeit you need to log in.

borplk said a month ago:

Obligatory reminder that if this kind of "X hours of downtime per year" is such a big problem for your business, you can go multi-AZ and multi-region.

If you choose not to, that's a valid and often the right choice, but then don't panic and scream when something fails. It's a trade-off decision that you have to make. You can't have your cake and eat it too.

Either you care enough to design, build, and test that kind of fault tolerance or you don't. If you do, you get its benefits during these times. If you don't, then just own it and put up with downtime a few times per year.


