Archive for May, 2010

Google just launched a competitor to S3

Thursday, May 20th, 2010

Product Overview – Google Storage for Developers – Google Code.

Looks like the same basic premise, design, and pricing. This is awesome. Now you can solve the "what if S3 is down/slow" question with a redundant vendor instead of in-house infrastructure.
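
To make the redundant-vendor idea concrete, here is a minimal sketch of read failover across two storage vendors. It assumes the same objects have already been written to both, and the fetch functions (`s3_down`, `google_storage_ok`) are hypothetical stand-ins, not real S3 or Google Storage client calls.

```python
from typing import Callable, List, Sequence, Tuple

# A "fetcher" takes an object key and returns its bytes, raising on failure.
# These stand in for real S3 / Google Storage client calls (hypothetical here).
Fetcher = Callable[[str], bytes]


def fetch_with_failover(key: str, backends: Sequence[Tuple[str, Fetcher]]) -> Tuple[str, bytes]:
    """Try each vendor in order and return (vendor_name, data) from the first success."""
    errors: List[str] = []
    for name, fetch in backends:
        try:
            return name, fetch(key)
        except Exception as exc:  # outage, timeout, error response: try the next vendor
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all storage vendors failed: " + "; ".join(errors))


# Simulated vendors: the primary is "down", the secondary answers.
def s3_down(key: str) -> bytes:
    raise TimeoutError("S3 request timed out")


def google_storage_ok(key: str) -> bytes:
    return b"contents of " + key.encode()


if __name__ == "__main__":
    print(fetch_with_failover("logo.png", [("s3", s3_down), ("google-storage", google_storage_ok)]))
    # ('google-storage', b'contents of logo.png')
```

Covering "slow" as well as "down" means giving each real fetcher its own timeout so it raises instead of hanging; that part is left to whatever client library you actually use.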

Documentation failure part of Gulf Oil Spill

Thursday, May 13th, 2010

Oil spill: BP had wrong diagram to close blowout preventer | McClatchy.

Choice quotes:

In the days after an oil well spun out of control in the Gulf of Mexico, BP engineers tried to activate a huge piece of underwater safety equipment but failed because the device had been so altered that diagrams BP got from the equipment’s owner didn’t match the supposedly failsafe device’s configuration.

“When they investigated why their attempts failed to activate the bore ram,” Stupak said of BP engineers, “they learned that the device had been modified. A useless test ram – not the variable bore ram – had been connected to the socket that was supposed to activate the variable bore ram.”

“An entire day’s worth of precious time had been spent engaging rams that closed the wrong way.”

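Clearly not the root cause, but there’s a disaster recovery (DR) lesson here for sure: documentation that doesn’t match the deployed configuration costs you time exactly when you can least afford it.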

How to tune MySQL’s sort_buffer_size at Xaprb

Tuesday, May 11th, 2010

How to tune MySQL’s sort_buffer_size at Xaprb.

Baron’s down with the Church.

High Scalability – MocoSpace Architecture – 3 Billion Mobile Page Views a Month

Friday, May 7th, 2010

High Scalability – MocoSpace Architecture – 3 Billion Mobile Page Views a Month.

I love these posts. It’s super handy to see what others in what I’ve taken to calling the “middle class” of websites are doing.

A couple of thoughts. Their pageview-to-unique-visitor ratio is waaaaay wide. I guess that’s a byproduct of small micro-pages for mobile devices and the difficulty of tracking users on mobile. I like that they clearly weren’t afraid to build where they thought they should and buy where they thought they should. A mix of free/open-source stuff and ‘enterprise’-vendor stuff shows a clear results-over-religion vibe. They’ve also got the developers-pushing-live/you-break-it-you-own-it thing figured out, which is key to how they get away with one sysadmin for all that workload.

45-minute time-to-acknowledge for CloudFront outage

Thursday, May 6th, 2010

Came across this in my twitter stream:

Cloudfront looks down. I am scaling my downtime in the cloud.

That in and of itself is just amusing, but what got me was that when I went to the AWS status page there was no mention of it. It had already been 15 minutes since the tweet, but nothing. So I did a Twitter search and found this one:

Is the Amazon CloudFront down?

That one was posted nearly 45 minutes before Amazon updated its status page to say:

11:19 AM PDT We’re investigating reports of timeouts in California.

Now, I don’t think downtime, especially infrequent and geographically limited downtime, is any sort of “see, that’s why you shouldn’t use teh cloud!” argument. If anything, odds are that people who use CloudFront have more downtime in their own web apps than Amazon ever will. However, a 45-minute response time means 45 minutes of wasted troubleshooting, multiplied by the number of engineers looking into it. It’s that feeling of being blind, uninformed, and in the dark while your cloud provider gets around to acknowledging things that worries me.