I’ve spent a lot of time with mega datacenters (MDCs) around the world trying to understand their problems – and I really don’t care what area those problems are in, as long as they’re important to the datacenter. What is the #1 real problem for many large-scale mega datacenters? It’s something you’ve probably never heard about, and probably have not even thought about. It’s called false disk failure. Some mega datacenters have crafted their own solutions – but most have not.
Why is this important, you ask? Many large datacenters today have 1 million to 4 million hard disk drives (HDDs) in active operation. In anyone’s book that’s a lot. It’s also a very interesting statistical sample size of HDDs. MDCs get great pricing on HDDs. Probably better than OEMs get, and certainly better than the $79 for buying 1 HDD at your local Fry’s store. So you would imagine if a disk fails – no one cares – they’re cheap and easy to replace. But the burden of a failed disk is much more than the raw cost of the disk:
Disk rebuild and/or data replication of a 2TB or 3TB drive
The performance overhead of a RAID rebuild makes it difficult to justify, and the rebuild can take days
Disk capacity must be added somewhere to compensate: ~$40-$50
Redistribute replicated data across many servers
Infrastructure overhead to rebalance workloads to other distributed servers
Person to service disk: remove and replace
And then ensure the HDD data cannot be accessed – wipe it or shred it
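To get a feel for why a rebuild "can take days," here is a rough back-of-the-envelope sketch. The rebuild rates are my own illustrative assumptions, not numbers from any MDC: roughly 100 MB/s for an otherwise idle drive, and 20 MB/s for a rebuild throttled so the array can keep serving production traffic. The function name is made up for the example.

```python
def rebuild_hours(capacity_tb: float, rate_mb_per_s: float) -> float:
    """Hours needed to rewrite a whole drive at a sustained rate (decimal units)."""
    capacity_bytes = capacity_tb * 1e12
    seconds = capacity_bytes / (rate_mb_per_s * 1e6)
    return seconds / 3600


# Assumed rates (hypothetical): 100 MB/s idle, 20 MB/s throttled under load.
for capacity_tb in (2, 3):
    for rate in (100, 20):
        hours = rebuild_hours(capacity_tb, rate)
        print(f"{capacity_tb} TB at {rate} MB/s: {hours:.1f} h ({hours / 24:.1f} days)")
```

Under those assumptions a throttled rebuild of a 3TB drive runs well past a full day, and that is before any of the other costs in the list above are counted.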
Let’s put some scale to this problem, and you’ll begin to understand the issue. One modest-size MDC has been very generous in sharing its real numbers. (When I say modest, they are ~1/4 to 1/2 the size of many other MDCs, but they are still huge – more than 200k servers.) Other MDCs I have checked with say, “yep, that’s about right.” And one engineer I know at an HDD manufacturer said, “wow – I expected worse than that. That’s pretty good.” To be clear, these are very good HDDs they are using; it’s just that the numbers add up.
Full Story: What is false disk failure, and why is it a problem? – TechSpot.