Sunday, June 12, 2016

AWS blames 'latent bug' for prolonging Sydney EC2 Outage

Amazon Web Services has said the extended downtime some of its Sydney services suffered last weekend was down to a combination of power supply problems and a "latent bug in our instance management software."

Sydney recorded over 150 mm of rain over the weekend. On Sunday, June 5, alone the city copped 93 mm, plus wind gusts of up to 96 km/h.

Amazon said the bad weather meant that "At 10:25 PM PDT on June 4 [Sunday afternoon in Sydney - Ed], our utility provider suffered a loss of power at a regional substation as a result of severe weather in the area. This failure resulted in a total loss of utility power to multiple AWS facilities."

AWS has two layers of backup power, but on the night in question both backups failed for some instances.

The cloud giant's explanation says its backups rely on "diesel rotary uninterruptable power supplies (DRUPS), which integrate a diesel generator and a mechanical flywheel."

"In normal operation, the DRUPS uses utility power to spin a flywheel that stores energy. If utility power is interrupted, the DRUPS uses this stored energy to continue providing power to the datacenter while its integrated generator starts up to carry the load until utility power is restored."

Last weekend, however, "the set of breakers responsible for isolating the DRUPS from utility power failed to open quickly enough." That was bad because those breakers are supposed to "assure that the DRUPS reserve power is used to support the datacenter load during the transition to generator power."

"Instead, the DRUPS system's reserve power was quickly drained into the degraded grid."

That failure meant the diesels couldn't send juice to the data centre, which quickly went dark.
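The failure mode AWS describes comes down to an energy budget: the flywheel holds a fixed reserve that must bridge the gap until the generator comes online, and every second the breakers stay closed, the degraded grid drains that reserve too. The toy model below sketches this under wholly illustrative figures; none of the numbers are AWS's real values.

```python
# Toy model of the outage's failure mode: a DRUPS flywheel holds a fixed
# energy reserve that must bridge the gap until the diesel generator is
# online. If the breakers isolating the DRUPS from a degraded grid open
# late, the reserve also drains into the grid and can be exhausted before
# the generator is ready. All figures are illustrative, not AWS's numbers.

def bridge_succeeds(reserve_kj, dc_load_kw, grid_drain_kw,
                    breaker_delay_s, generator_start_s):
    """Return True if the flywheel reserve lasts until the generator starts."""
    # While the breakers remain closed, energy drains into both the
    # data-centre load and the degraded grid.
    drain_phase = min(breaker_delay_s, generator_start_s)
    used = drain_phase * (dc_load_kw + grid_drain_kw)
    # Once the breakers open, only the data-centre load draws power.
    remaining = max(generator_start_s - breaker_delay_s, 0)
    used += remaining * dc_load_kw
    return used <= reserve_kj

# Breakers open promptly: the reserve bridges a 10 s generator start.
print(bridge_succeeds(10_000, 500, 2_000, 0.5, 10))  # → True
# Breakers stick for 4 s: the grid drains the reserve first.
print(bridge_succeeds(10_000, 500, 2_000, 4.0, 10))  # → False
```

The point of the model is that the fix AWS proposes (breaking the grid connection faster) attacks the `breaker_delay_s` term, which is multiplied by the largest drain in the budget.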

AWS techies had things running again by 11:46 PM PDT, and by 1:00 PM PDT on June 5 "over 80% of the affected instances and volumes were back online and operational." Some workloads were slower to recover, thanks to what AWS calls "DNS resolution errors" on the internal DNS hosts for the affected Availability Zone as those hosts came back online and handled the recovery load.

However, some instances did not come back at all. AWS now says this was due to "a latent bug in our instance management software," which meant those instances had to be restored manually. AWS has not explained the nature of the bug.

Other instances were hit by dead drives, meaning their data was not immediately available. Restoring that data requires manual work.

As is always the case after such incidents, AWS has promised to toughen up the components that failed.

"While we have experienced excellent operational performance from the power configuration used in this facility," the mea culpa says, "it is apparent that we need to enhance this particular design to prevent a similar event from affecting our power delivery infrastructure."

More breakers are on the agenda, "to assure that we more quickly break connections to a degraded utility power supply and allow our generators to activate before the UPS systems are depleted."

Improvements are also promised in software, including "changes to assure our APIs are even more resilient to failure," so that those who use multiple AWS regions can rely on failing over between bit barns.
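Failing over between regions, at its simplest, means trying a primary region and falling back to others when calls error out. The sketch below is a minimal illustration of that pattern, not AWS's implementation: `fetch` stands in for any region-specific API call, and the region names are examples only.

```python
# Hedged sketch of multi-region failover: call a primary region first
# and fall back to the next region when the call fails. `fetch` is a
# stand-in for any region-specific API call.

def call_with_failover(regions, fetch):
    """Try fetch(region) for each region in order; return the first success."""
    last_error = None
    for region in regions:
        try:
            return fetch(region)
        except Exception as err:  # a real client would catch specific errors
            last_error = err
    raise RuntimeError(f"all regions failed: {last_error}")

# Usage: a fake fetcher in which the Sydney region is down.
def fake_fetch(region):
    if region == "ap-southeast-2":
        raise ConnectionError("region unavailable")
    return f"served from {region}"

print(call_with_failover(["ap-southeast-2", "us-west-2"], fake_fetch))
# → served from us-west-2
```

A production version would add timeouts, retries with backoff, and health checks rather than failing over on any exception, but the control flow is the same.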

These changes are expected to land in the Sydney region during July.

AWS is far from the only cloud to have suffered physical or software problems. Salesforce also recently had strife with circuit breakers, while Google broke its own cloud with a bad configuration change and lost data after a lightning strike.
