Death and Resurrection of an SSD

Summary: SSDs live fast, die young, and pretend to be OK even while they’re dying. Don’t use one without awesome backups.

And sometimes, they come back from the dead.

※ ※ ※

On November 30 2010, I received my first SSD: a 240 GB OWC Mercury Extreme Pro.

On Thursday, November 10 2011, the drive “died”.

I claimed on twitter there was no warning: there weren’t any I/O errors logged to Console.app > All Messages (a standard technique to recognize a drive going bad). Looking back on it now, there were hints: Alfred corrupting its SQLite database, EyeTV losing its schedules and a recording, pbs (Pasteboard Server) crashing, mds throwing a hissy-fit (not uncommon) and finally a kernel panic (an uncommon event).

Last week I came back from lunch to discover my machine frozen. It was still pingable, but everything that touched the disk locked up. I held down the power key to force a hard reboot.

My machine bounced back, but with kernel BootCache warnings in the Console log. After a bit of googling, I decided to restart the machine in Safe Mode, which I understood would rebuild the BootCache. Turns out it also runs fsck, putting up a nice little progress bar. It was taking a long time, so I went to the gym. I came back an hour later, and the progress bar was where I left it: around 30%.

Uh-oh.

I booted off my nightly SuperDuper backup and launched Disk Utility.

My internal SSD fell off the bus: wouldn’t even appear on the device list. My SSD was gone.

I powered down my MacBook Pro and prepared to yank the drive for replacement from OWC. I didn’t expect any issues, they’ve happily replaced two traditional failed drives for me in the past. On a hunch, after I yanked the battery, I counted to down 10 and plugged the machine back in.

It successfully booted off the SSD.

So my SSD was “back”. My guess is the drive firmware simply turns off its SATA connection when it gets backed into an unrecoverable corner. Removing power seemed to “unlock” the drive.

This is kind of a worst-case scenario, since I didn’t trust the drive anymore but it seemed to be working and OWC may not want to replace it.

fsck came back with a couple of invalid inodes, and indicated successful repair. Still not trusting it, I tried a traditional way to force drive failure: a reformat with writing zeros. ~45 minutes later, the drive mounts sans any reported trouble.

If this was a traditional drive, I might have started to trust it again. However, I know about an extra trick some SSDs have up their sleeve: block-level de-duplication.

So I wrote a small C program that fills a file or device with random data. Note to Unix pedants: I know I could have done this with shell commands or your $FAVORITE_LANGUAGE, but I wanted to get close to the kernel on this one and reduce variables for ease of reproduction.

Random data defeats deduping, and I ran my program with parameters to fill my SSD. I went to bed.

This morning I discovered my SSD, which happily survived a complete filling with zeros, failed when I tried to fill it with random data. It fell off the bus again, and wouldn’t show up in Disk Utility.

I removed the drive from the MacBook Pro and plugged it into my Mac Pro via a drive toaster and reran the stress test.

It passed. I reran it again. It passed again.

My SSD had resurrected itself.

I have since reinstalled the SSD and am happily using it again. I’ve also rewritten and enhanced my initial C program into a better, faster one I’ve entitled stressdrive. It now passes my SSD with flying colors:

$ sudo ./stressdrive /dev/rdisk0
blockSize: 512
blockCount: 468862128
speedScale: 16x
scaled blockSize: 8192
scaled blockCount: 29303883
writing random data to /dev/rdisk0
writing 100% (block 29303002 of 29303883)
1779f30a231c1d07c578b0e4ee49fde159210d95 : SHA-1 of written data
verifying written data
reading 100% (block 29302306 of 29303883)
1779f30a231c1d07c578b0e4ee49fde159210d95 : SHA-1 of read data
SUCCESS

My current hypothesis is that my SSD wore out a flash block and attempted to mark the block as bad and recruit a fresh block from its overprovisioning reserve. This path has a bug, causing the controller to panic. Maybe the supposedly fresh block also had issues, maybe a few of them did. I’m thinking restarting the SSD by removing power helped it make progress in the recovery until it succeeded.

I did have the SSD mysteriously drop off my internal bus again today right before the stressdrive test, so I’m keeping an eye on it – I may not be out of the woods yet.

By the way, through all of this, my SSD’s SMART status has remained “Verified”. Ugh.

This week two friends of mine with a Sandforce controller also had SSD failures similar to mine, where the drive fell off the bus. At least one had the same experience as me where the drive was able to “resurrect” itself and pass “surface scans” (whatever those are). Anecdotes aren’t data, but there you go.

I should also mention OWC has been a champ, proactively finding my original order with them and emailing me when I originally mentioned my failure on twitter. They’ve offered to replace my drive, but I’m keeping it for now. What can I say, running a fast drive that may die at any moment makes me feel alive.

sysadmin ssd Dec 4 2011