
ignatz : outages   8

Details of the Cloudflare outage on July 2, 2019
Great writeup from jgc. Worth noting some important lessons:

* config changes should be rolled out carefully and gradually, just like code;

* particularly regexps, which are effectively code anyway;

* emergency-use rollback systems need to work, of course;

* having emergency-only systems is a risk, too, since infrequently-used code paths are likely to atrophy and break without anyone noticing (as nsheridan said);

* /.*/ in a regexp is pretty much always bad news, and would have been worth a linter to catch before commit.
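
The linter idea in the last point can be sketched roughly as follows. This is a minimal illustration, not anything from the writeup itself: the function name `find_unbounded_wildcards` and the sample patterns are hypothetical, and the character-class handling is deliberately simplified (it does not treat a leading `]` inside a class as a literal).

```python
def find_unbounded_wildcards(pattern: str) -> list[int]:
    """Return the offsets of each unescaped '.*' or '.+' outside a character class.

    Hypothetical lint rule: it only flags candidates for human review and
    does not try to prove that a pattern backtracks badly.
    """
    hits = []
    in_class = False
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            i += 2  # skip the backslash and the character it escapes
            continue
        if in_class:
            if ch == "]":
                in_class = False
        elif ch == "[":
            in_class = True
        elif ch == "." and pattern[i + 1 : i + 2] in ("*", "+"):
            hits.append(i)
        i += 1
    return hits


def lint(pattern: str) -> bool:
    """Return True if the pattern passes, i.e. contains no unbounded wildcard."""
    return not find_unbounded_wildcards(pattern)
```

A pre-commit hook could call `lint` on every regexp in a changed config file and refuse the commit when it returns False. A check this crude will produce false positives, so it is better suited to forcing a second look than to blocking outright.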
cloudflare  outages  regex  postmortems  regexps  deployment  rollback  via:jgc  via:jm 
29 days ago by ignatz
An update on Sunday’s service disruption | Google Cloud Blog
Postmortem on the Google outage. Note that once again it's a config error, which is how outages happen now.
gcp  google  odd  outages  post-mortems  networking  config  sysadmin  ops  via:jm 
10 weeks ago by ignatz
OVH suffer 24-hour outage (The Register)
Choice quotes:

‘At 6:48pm, Thursday, June 29, in Room 3 of the P19 datacenter, due to a crack on a soft plastic pipe in our water-cooling system, a coolant leak causes fluid to enter the system’;
‘This process had been tested in principle but not at a 50,000-website scale’
postmortems  ovh  outages  liquid-cooling  datacenters  dr  disaster-recovery  ops  via:jm 
july 2017 by ignatz
