After my poor experience with Software RAID on Mac OS X 10.7.3, I’m pleased to report that software RAID on OS X 10.8 seems improved over 10.7.
Enabling Software RAID does have some tradeoffs:
You lose OS X’s Recovery Partition. Internet Recovery should still work and you could always build your own Recovery Disk.
FileVault isn’t supported.
For my server purposes, I don’t care about either of these limitations, so I’m happy that OS X Software RAID is a realistic option again (the “Repair Disk” button actually works).
Here’s how I set up my new 2x1TB Mac mini Server:
SuperDuper the Mini’s boot drive (“Server HD”) onto an external drive.
Boot from the external drive.
Repartition both drives into three slices each: 50GB Boot, 50GB Boot Backup and the rest Data.
Create three new RAID 1 (Mirroring) sets encompassing those slices, respectively.
SuperDuper the external drive back onto the Boot RAID volume and boot off it.
Here’s my set-up as it appears in Disk Utility:

This set-up enables me to perform system software updates with a bootable backup safety-net in place.
I’m backing up that Data partition offsite, so I’m not backing it up locally on the Mini itself (I’m just SuperDuper’ing the Boot volume to the Boot Backup volume nightly).
Summary: SSDs live fast, die young, and pretend to be OK even while they’re dying. Don’t use one without awesome backups.
And sometimes, they come back from the dead.
※ ※ ※
On November 30 2010, I received my first SSD: a 240 GB OWC Mercury Extreme Pro.
On Thursday, November 10 2011, the drive “died”.
I claimed on twitter there was no warning: there weren’t any I/O errors logged to Console.app > All Messages (a standard technique to recognize a drive going bad). Looking back on it now, there were hints: Alfred corrupting its SQLite database, EyeTV losing its schedules and a recording, pbs (Pasteboard Server) crashing, mds throwing a hissy-fit (not uncommon) and finally a kernel panic (an uncommon event).
Last week I came back from lunch to discover my machine frozen. It was still pingable, but everything that touched the disk locked up. I held down the power key to force a hard reboot.
My machine bounced back, but with kernel BootCache warnings in the Console log. After a bit of googling, I decided to restart the machine in Safe Mode, which I understood would rebuild the BootCache. Turns out it also runs fsck, putting up a nice little progress bar. It was taking a long time, so I went to the gym. I came back an hour later, and the progress bar was where I left it: around 30%.
Uh-oh.
I booted off my nightly SuperDuper backup and launched Disk Utility.
My internal SSD fell off the bus: wouldn’t even appear on the device list. My SSD was gone.
I powered down my MacBook Pro and prepared to yank the drive for replacement from OWC. I didn’t expect any issues; they’ve happily replaced two failed traditional drives for me in the past. On a hunch, after I yanked the battery, I counted down from 10 and plugged the machine back in.
It successfully booted off the SSD.
So my SSD was “back”. My guess is the drive firmware simply turns off its SATA connection when it gets backed into an unrecoverable corner. Removing power seemed to “unlock” the drive.
This is kind of a worst-case scenario, since I didn’t trust the drive anymore but it seemed to be working and OWC may not want to replace it.
fsck came back with a couple of invalid inodes, and indicated successful repair. Still not trusting it, I tried a traditional way to force drive failure: a reformat with writing zeros. ~45 minutes later, the drive mounts sans any reported trouble.
If this were a traditional drive, I might have started to trust it again. However, I know about an extra trick some SSDs have up their sleeve: block-level de-duplication.
So I wrote a small C program that fills a file or device with random data. Note to Unix pedants: I know I could have done this with shell commands or your $FAVORITE_LANGUAGE, but I wanted to get close to the kernel on this one and reduce variables for ease of reproduction.
Random data defeats deduping, and I ran my program with parameters to fill my SSD. I went to bed.
This morning I discovered my SSD, which happily survived a complete filling with zeros, failed when I tried to fill it with random data. It fell off the bus again, and wouldn’t show up in Disk Utility.
I removed the drive from the MacBook Pro and plugged it into my Mac Pro via a drive toaster and reran the stress test.
It passed. I reran it again. It passed again.
My SSD had resurrected itself.
I have since reinstalled the SSD and am happily using it again. I’ve also rewritten and enhanced my initial C program into a better, faster one I’ve entitled stressdrive. It now passes my SSD with flying colors:
$ sudo ./stressdrive /dev/rdisk0
blockSize: 512
blockCount: 468862128
speedScale: 16x
scaled blockSize: 8192
scaled blockCount: 29303883
writing random data to /dev/rdisk0
writing 100% (block 29303002 of 29303883)
1779f30a231c1d07c578b0e4ee49fde159210d95 : SHA-1 of written data
verifying written data
reading 100% (block 29302306 of 29303883)
1779f30a231c1d07c578b0e4ee49fde159210d95 : SHA-1 of read data
SUCCESS
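For the curious, here is a minimal Python sketch of the same write-then-verify idea: write random chunks, hash them on the way out, then read everything back and compare hashes. It is not stressdrive itself. The function names are hypothetical, it targets an ordinary scratch file rather than a raw device, and reads from a regular file may be served from the OS cache instead of the drive, so treat it as an illustration of the technique only.

```python
import hashlib
import os
import tempfile

CHUNK_SIZE = 8192  # same size as the "scaled blockSize" in the output above


def write_random(path, chunk_count, chunk_size=CHUNK_SIZE):
    """Fill path with random chunks; return the SHA-1 hex digest of what was written."""
    digest = hashlib.sha1()
    with open(path, "wb") as f:
        for _ in range(chunk_count):
            chunk = os.urandom(chunk_size)  # random data defeats de-duplication
            f.write(chunk)
            digest.update(chunk)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data to the device
    return digest.hexdigest()


def read_back(path, chunk_size=CHUNK_SIZE):
    """Read path from start to end; return the SHA-1 hex digest of what was read."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)
    return digest.hexdigest()


def stress(path, chunk_count):
    """Write then verify; return (ok, written_digest, read_digest)."""
    written = write_random(path, chunk_count)
    read = read_back(path)
    return written == read, written, read


if __name__ == "__main__":
    # Demo against a throwaway temp file (assumption: never a real disk).
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        scratch = tmp.name
    try:
        ok, written, read = stress(scratch, chunk_count=128)
        print(written, ": SHA-1 of written data")
        print(read, ": SHA-1 of read data")
        print("SUCCESS" if ok else "FAILURE")
    finally:
        os.remove(scratch)
```

Comparing two digests instead of keeping a copy of the data means the check needs almost no memory, however large the target is.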
My current hypothesis is that my SSD wore out a flash block and attempted to mark the block as bad and recruit a fresh block from its overprovisioning reserve. This path has a bug, causing the controller to panic. Maybe the supposedly fresh block also had issues, maybe a few of them did. I’m thinking restarting the SSD by removing power helped it make progress in the recovery until it succeeded.
I did have the SSD mysteriously drop off my internal bus again today right before the stressdrive test, so I’m keeping an eye on it — I may not be out of the woods yet.
By the way, through all of this, my SSD’s SMART status has remained “Verified”. Ugh.
This week two friends of mine whose SSDs use a SandForce controller also had failures similar to mine, where the drive fell off the bus. At least one had the same experience as me where the drive was able to “resurrect” itself and pass “surface scans” (whatever those are). Anecdotes aren’t data, but there you go.
I should also mention OWC has been a champ, proactively finding my original order with them and emailing me when I originally mentioned my failure on twitter. They’ve offered to replace my drive, but I’m keeping it for now. What can I say, running a fast drive that may die at any moment makes me feel alive.
If you’re trying to locally clone a remote git repo via ssh and are getting this error, it’s probably because git-upload-pack isn’t actually in your PATH:
local wolf$ git clone wolf@example.com:/Users/wolf/myproject
Initialized empty Git repository in /Users/wolf/code/myproject/.git/
bash: git-upload-pack: command not found
fatal: The remote end hung up unexpectedly
You can inspect your remote shell’s PATH like so:
local wolf$ ssh wolf@example.com 'echo $PATH'
/usr/bin:/bin:/usr/sbin:/sbin
Too bad, my git-upload-pack lives in /usr/local/bin, so it can’t be found.
Fortunately, it’s easy to add it to my PATH via .bashrc:
local wolf$ ssh wolf@example.com
remote wolf$ echo 'export PATH="$PATH:/usr/local/bin"' >> ~/.bashrc
remote wolf$ exit
Check our handiwork:
local wolf$ ssh wolf@example.com 'echo $PATH'
/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin
Looks good, let’s try again:
local wolf$ git clone wolf@example.com:/Users/wolf/myproject
Initialized empty Git repository in /Users/wolf/code/myproject/.git/
remote: Counting objects: 3, done.
remote: Compressing objects: 100% (3/3), done.
remote: Total 3 (delta 0), reused 0 (delta 0)
Receiving objects: 100% (3/3), done.
Success.
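The PATH check above boils down to "is there an executable with this name in any of these directories?" Here is a rough Python sketch of that lookup, in case you want to test a PATH string you got back from ssh. The function name find_in_path is made up for this example.

```python
import os
import tempfile


def find_in_path(program, path_string):
    """Return the first directory in path_string that holds an executable
    file named program, or None if no directory does."""
    for directory in path_string.split(os.pathsep):
        candidate = os.path.join(directory, program)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return directory
    return None


if __name__ == "__main__":
    # Demo with a fake git-upload-pack in a temp directory (assumption:
    # stands in for /usr/local/bin on the remote machine).
    fake_bin = tempfile.mkdtemp()
    fake_tool = os.path.join(fake_bin, "git-upload-pack")
    open(fake_tool, "w").close()
    os.chmod(fake_tool, 0o755)

    before = os.pathsep.join(["/usr/bin", "/bin", "/usr/sbin", "/sbin"])
    after = before + os.pathsep + fake_bin
    print(find_in_path("git-upload-pack", after))
```

Python's standard library offers shutil.which, which does a similar search and accepts a path argument; the hand-rolled version is only here to show the steps.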
Once again I’ve forgotten how to set up Mac OS X’s built-in VNC server so I can start and stop it from an ssh session. So I’m documenting it here for my future-self.
First off, don’t even try to configure it via System Preferences > Sharing > Screen Sharing: turning VNC on via the GUI is apparently an entirely separate affair from turning on via kickstart.
For instance, it’s easy to get into a situation where you’re attempting to shut down AppleVNCServer via the command line, but it doesn’t work since SharingPref.prefPane has it enabled.
So disable Screen Sharing on the GUI side altogether and do it straight-up 3270-style:
First enable access for specified users. You’ll probably only need to do this once:
$ cd /System/Library/CoreServices/RemoteManagement/ARDAgent.app/Contents/Resources
$ sudo ./kickstart -configure -allowAccessFor -specifiedUsers
$ sudo ./kickstart -configure -access -on -privs -all -users wolf,victoria,dave
After that, you can start VNC like so:
$ sudo ./kickstart -activate
When done, you can disable AppleVNCServer listening to 5900 like so:
$ sudo ./kickstart -deactivate -stop
Those invocations grow tiresome, so I wrote a small shell script I named vncdc (for “VNC daemon control”):
#!/bin/bash
case "$1" in
  start)
    echo 'Starting VNC'
    sudo /System/Library/CoreServices/RemoteManagement/ARDAgent.app/Contents/Resources/kickstart -activate
    ;;
  stop)
    echo 'Stopping VNC'
    sudo /System/Library/CoreServices/RemoteManagement/ARDAgent.app/Contents/Resources/kickstart -deactivate -stop
    ;;
esac
Now toggling VNC is as simple as vncdc start and vncdc stop.
By the way, there are a lot of tips floating around that advise enabling VNC legacy mode and a VNC-wide password. As far as I can tell, there’s been no real need for that since Mac OS X 10.5.
Update: Wow, Ben Mitchell informs me of a much easier way. To enable VNC in 10.6:
$ sudo touch /etc/ScreenSharing.launchd
To disable it:
$ sudo rm /etc/ScreenSharing.launchd
Much nicer, and this method seems to work with SharingPref.prefPane.
Thanks, Ben.
I brought up a new Mac Pro (running 10.6.3) that will be plugged into the Net directly, so I ran a quick portscan to ensure there weren’t any externally-accessible services.
Turns out there were two: ftp (port 21) and kerberos (port 88).
ftp is weird: lsof -i -P|grep LISTEN doesn’t reveal anything listening to port 21, and it immediately disconnects when I attempt to telnet into the port. Right now my hypothesis is that it’s a launchd stub for the ftp system that I haven’t engaged (and don’t ever plan on engaging). I’m not stressing about it.
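If you don't have a port scanner handy, a quick-and-dirty check is to try opening a TCP connection yourself. Below is a minimal Python sketch; port_is_open is a name invented for this example, and it only covers TCP, so it says nothing about UDP services.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value on failure.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    # The two ports the scan above turned up; 127.0.0.1 is an assumption,
    # substitute the address you are actually probing.
    for port in (21, 88):
        state = "open" if port_is_open("127.0.0.1", port) else "closed"
        print(f"port {port}: {state}")
```

Run it from another machine to see what is reachable from outside; probing localhost only tells you what is listening locally.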
That leaves kerberos, or at least krb5kdc: the Kerberos v5 Key Distribution Center daemon.
I don’t remember krb5kdc running previously and I’m not sure what it does, so in the grand tradition of killing what you don’t understand, I sought to disable the beast.
First open its configuration file:
$ sudo pico /private/var/db/krb5kdc/kdc.conf
Inside I found a line that looked promising, kdc_tcp_ports. I commented it out and restarted the machine:
[kdcdefaults]
kdc_ports = 88
# kdc_tcp_ports = 88
Sure enough the beast failed to stir from its nest upon restart.
Nothing seems broken, and I’ve reduced the machine’s service surface-area, so I’m classifying this one as a win.
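The edit itself is a one-liner by hand, but if you ever need to script it across machines, here is a small Python sketch that comments out a key in config text. The helper name comment_out is hypothetical, and the sample text simply mirrors the three lines shown above.

```python
import re


def comment_out(config_text, key):
    """Prefix each active `key = ...` line with '# '; leave all other lines alone."""
    pattern = re.compile(
        r"^([ \t]*)(" + re.escape(key) + r"[ \t]*=.*)$",
        re.MULTILINE,
    )
    return pattern.sub(r"\1# \2", config_text)


if __name__ == "__main__":
    sample = "[kdcdefaults]\nkdc_ports = 88\nkdc_tcp_ports = 88\n"
    print(comment_out(sample, "kdc_tcp_ports"))
```

Lines that are already commented out don't match the pattern, so running it twice leaves the file unchanged.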
Update: Nicholas Riley helpfully tweets a link to Andre LaBranche’s “observations and theories” on Leopard’s local KDC:
Some auth stuff will break if you disable the local KDC: http://dreness.com/wikimedia/index.php?title=LKDC
Thanks for the tip+link, Nicholas.
It’s a shame that Andre had to disassemble Leopard to dredge up clues as to what parts of the system now involve kerberos. Looking through Andre’s list, I’m not using any of those services, so that’s how I’m getting away with disabling krb5kdc. Your mileage may vary.
Update: Dan Kuehling emailed me a link to this Kerberos on Leopard backgrounder. Thanks, Dan.
If you’re getting an abort: error: Broken pipe error when pushing a largish changeset:
$ hg push
pushing to http://user@example/hg/project
searching for changes
abort: error: Broken pipe
Then it’s probably due to issue 2716, which has already been fixed by coding rockstar Augie Fackler. Rock on, Augie.
Upgrade your Mercurial version — the just-released v1.5 does the trick for me.
However, that will probably lead you to another error: abort: HTTP Error 413: Request Entity Too Large.
$ hg push
pushing to http://user@example/hg/project
searching for changes
abort: HTTP Error 413: Request Entity Too Large
In my case, this was because my mercurial server was sitting behind an nginx reverse proxy whose client_max_body_size was set to a measly 10 MB.
Raise your client_max_body_size limit, restart nginx and you should be good to go.
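As a rough illustration, the directive can live in the http, server or location context of your nginx configuration. The 100m value below is a made-up example, not a recommendation; pick something larger than your biggest expected push.

```nginx
# Hypothetical fragment: the 100m limit is an assumed value for illustration.
http {
    client_max_body_size 100m;
}
```

Setting the value to 0 turns off the body-size check entirely, which may or may not be what you want on a public-facing proxy.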
There’s a bunch of different ways to publish Mercurial repositories, but I like hgwebdir. It handles multiple repositories with a nice-enough UI.
What’s slick is that http://example.com/hg/myproject will be the url for both humans to inspect (view history, tags, branches, etc) and the url you’ll hand to hg clone.
These are the best instructions I found for setting up hgwebdir, but they’re not Debian-specific. So here’s my recipe for Lenny:
Mercurial: Out of the box, apt-get install mercurial will give you an old version of mercurial (1.0.1 if I recall correctly). That’s fine, for there’s Debian Backports. Follow their instructions and you’ll get a current version of Mercurial:
$ echo 'deb http://www.backports.org/debian lenny-backports main contrib non-free' >> /etc/apt/sources.list
$ apt-get update
$ apt-get -t lenny-backports install mercurial
Mercurial should now be installed. hg --version should come back with 1.3.1 or later.
If you want or need a later version than what Backports offers, it’s not hard to build a later version from source.
etckeeper: Putting the /etc directory under version control just makes so much sense, and etckeeper 1) automates it for you, 2) tracks file permissions that stock vcs’s don’t record and 3) integrates with Debian’s package management system.
Install it like so:
$ apt-get install etckeeper
But etckeeper wants to use git by default, which we haven’t installed:
$ etckeeper init
/etc/etckeeper/init.d/40vcs-init: line 5: git: command not found
(apt-get didn’t install git because it saw we already had Mercurial installed. While etckeeper depends on Git, Mercurial or Bazaar, apt-get is smart enough to understand it didn’t need to install git to install etckeeper.)
Fortunately, it’s easy to configure etckeeper to use Mercurial by changing its VCS variable:
$ sed -i.bak -e 's/^# VCS="hg"/VCS="hg"/;s/^VCS="git"/# VCS="git"/' /etc/etckeeper/etckeeper.conf
Now etckeeper initialization+commit works:
$ etckeeper init
$ etckeeper commit 'initial commit'
No username found, using 'root@hg.localdomain' instead
Apache 2: Easy:
$ apt-get install apache2
Apache 2 should now be up and running.
If you want, run cd /etc && hg log -v -r tip to view the commit message etckeeper wrote for you. It’s really detailed, including all the packages that came along for the ride.
hgwebdir: We need a place to keep our hgwebdir.cgi file, its config files and the repos it offers. We’ll use /var/hg:
$ mkdir -p /var/hg/repos
$ nano /var/hg/hgweb.config
Populate hgweb.config:
[collections]
repos/ = repos/
Now let’s move the hgwebdir.cgi CGI program in place and make it executable:
$ cp /usr/share/doc/mercurial-common/examples/hgwebdir.cgi /var/hg
$ chmod ugo+x /var/hg/hgwebdir.cgi
Configure Apache: Our hgwebdir.cgi is in place and ready to be lit up. Let’s tell apache:
$ nano /etc/apache2/sites-enabled/000-default
And add the following inside the <VirtualHost> section, just before the closing </VirtualHost> tag:
ScriptAliasMatch ^/hg(.*) /var/hg/hgwebdir.cgi$1
<Directory /var/hg>
Options ExecCGI FollowSymLinks
AllowOverride None
</Directory>
Tell apache its configuration has changed:
$ apache2ctl configtest
Syntax OK
$ etckeeper commit '+hgwebdir'
$ apache2ctl graceful
At this point, http://myhgserver/hg should be working, but is empty (no repos listed).
Add a repository:
$ mkdir -p /var/hg/repos/myrepo
$ cd /var/hg/repos/myrepo
$ echo 'hello world' > test.txt
$ hg init && hg addremove && hg ci -m 'initial commit'
$ chown -R www-data:www-data /var/hg/
Re-load http://myhgserver/hg and myrepo should now be listed and its version history and files browsable.
In addition, you should be able to check out a local copy of the repo on your workstation:
$ hg clone http://myhgserver/hg/myrepo
Enabling Push: You won’t be able to push changes back (yet). Try this client-side:
$ cd myrepo
$ perl -pi -e 's/hello/goodbye/' test.txt
$ hg ci -m 's/hello/goodbye/'
$ hg push
pushing to http://myhgserver/hg/myrepo
searching for changes
ssl required
By default, pushes are only allowed over https. I have my hgwebdir installation behind an nginx reverse-proxy which does https itself, so I don’t want or need https for hgwebdir.
You can disable this requirement by editing hgrc, setting push_ssl to false:
$ nano /var/hg/repos/myrepo/.hg/hgrc
[web]
push_ssl = false
Let’s try again:
$ hg push
pushing to http://myhgserver/hg/myrepo
searching for changes
abort: authorization failed
Progress: a different error. This error reflects that by default hgwebdir doesn’t allow anybody to push to a repo. Sensible+secure default. Let’s open it up by setting allow_push in hgrc:
$ nano /var/hg/repos/myrepo/.hg/hgrc
[web]
push_ssl = false
allow_push = *
Success:
$ hg push
pushing to http://myhgserver/hg/myrepo
searching for changes
adding changesets
adding manifests
adding file changes
added 1 changesets with 1 changes to 1 files
User Accounts: Things are too open now, since anyone can check out the source and push back changes. There’s no protection against spoofing. Let’s have Apache require a login:
$ nano /etc/apache2/sites-enabled/000-default
Rewrite the Directory section to look like this:
ScriptAliasMatch ^/hg(.*) /var/hg/hgwebdir.cgi$1
<Directory /var/hg>
Options ExecCGI FollowSymLinks
AllowOverride None
AuthType Basic
AuthName hgwebdir
AuthUserFile /var/hg/htpasswd
Require valid-user
</Directory>
Reload apache:
$ apache2ctl configtest
Syntax OK
$ etckeeper commit '+hgwebdir auth'
$ apache2ctl graceful
Add a user:
$ htpasswd -c /var/hg/htpasswd myuser
$ chown -R www-data:www-data /var/hg/
Re-load http://myhgserver/hg, and now you should be prompted for a username and password. Likewise, you now need to specify a username and password to clone a repo:
$ hg clone http://192.168.73.130/hg/myrepo myrepo2
http authorization required
realm: hgwebdir
user: myuser
password:
requesting all changes
adding changesets
adding manifests
adding file changes
added 2 changesets with 2 changes to 1 files
updating working directory
1 files updated, 0 files merged, 0 files removed, 0 files unresolved
And to push your changes:
$ cd myrepo2
$ perl -pi -e 's/goodbye/bonjour/' test.txt
$ hg ci -m 's/goodbye/bonjour/'
$ hg push
http authorization required
realm: hgwebdir
user: myuser
password:
pushing to http://192.168.73.130/hg/myrepo
searching for changes
adding changesets
adding manifests
adding file changes
added 1 changesets with 1 changes to 1 files
Having to supply a username+password for each hg push gets really old, really fast. So I recommend installing mercurial_keyring.
Update: Augie Fackler writes:
I’d highly recommend mod_wsgi over cgi - the performance is worlds better (cgi bootstraps the whole python/hg world on each operation, wsgi lets that be a bit more persistent).
I tried to get wsgi working initially, but fell back to plain old CGI when I ran into some inscrutable error. My needs are lightweight, so the overhead of spawning a CGI process for each connection isn’t problematic for me, but you should check out wsgi if you’ll be leaning on your mercurial server heavily.
I have an Xserve that’s still on 10.4. It runs a web app with a typical MySQL database backend. Both on metal.
This morning the DB got so slow that the app server started returning “timed out” error pages.
Ungood.
So I dumped the database, created a new Debian VMware instance and loaded up the data. I ssh port-forwarded metal’s 3306 to the new VM and restarted the apps.
Yesterday it took around seven seconds to vend a page. Now it’s back under half a second.
I wonder what I’m doing wrong that MySQL on OSX metal is an order of magnitude slower than MySQL on virtualized Linux.*
Could it be the default settings on Mac OS X Server 10.4 are that much worse than the defaults on Debian 5 Lenny?
Mac OS X’s file system is a lot slower than Linux’s, but I don’t think it could explain this much of a difference.
*You may assume I did the obvious things before migrating the DB from metal to VMware: restart the server, dump+reload the DB, etc.
As Jeff Atwood learned the hard way, unfortunately you need to take additional steps in order to back up live virtual machines.
If you have a WinXP VM that you just boot up from time to time to check IE6 compatibility, Time Machine and SuperDuper have you covered there. That’s because your image spends most of its time suspended, with all its bits on-disk in a consistent, restorable state.
Sadly, such is not the case for long-running VMs (such as server VMs).
When Time Machine or SuperDuper copies your live VM’s files, it may or may not get a copy of the VM’s files in a consistent state. Inconsistent state == failed restoration.
That’s bad.
My theoretical solution is to take a “backup snapshot” before backing up a live VM, with the belief that doing so will force the on-disk representation to be consistent prior to backup.
I must emphasize this is only my theory: I don’t have VMware’s source code. While taking a snapshot must result in a forced virtual machine image serialization, for all I know that snapshot is stored on-disk in an inconsistent fashion until the virtual machine is shut down or suspended.
I’d be surprised if that were the case, but there’s your disclaimer.
※ ※ ※
My theory detailed above is only one piece of the puzzle: with backup, automation is king.
Fortunately VMware Fusion 2 and later come with a vmrun command that gives us a basis to automate the taking of snapshots for backup purposes.
Here’s a small Ruby script that dynamically discovers all running VMs, creating a new snapshot unimaginatively named “backup snapshot”, deleting any old snapshots if present:
#!/usr/bin/ruby

VERBOSE = true
BACKUP_SNAPSHOT_NAME = 'backup snapshot'

# Run a vmrun subcommand with the given args and return its output.
def vmrun(subcmd, *args)
  args.map! {|arg| "'#{arg}'"} # Single-quote all args.
  cmd = "'/Library/Application Support/VMware Fusion/vmrun' #{subcmd} #{args.join(' ')}"
  puts "$ #{cmd}" if VERBOSE
  cmdresult = %x[#{cmd}].chomp
  puts "=> #{$?} #{cmdresult}\n\n" if VERBOSE
  cmdresult
end

# Skip the first line of the list output; treat each remaining line as a .vmx path.
vmListStr = vmrun('list')
vmListStr.split("\n")[1..-1].each {|vmxPath|
  vmrun('deleteSnapshot', vmxPath, BACKUP_SNAPSHOT_NAME)
  vmrun('snapshot', vmxPath, BACKUP_SNAPSHOT_NAME)
}
I’ve hooked up my servers’ nightly SuperDuper Copy Script to use this script (Options… > Advanced > Before Copy > Run shell script before copy starts).
Now my SuperDuper backups should all have cleanly-restorable backups of all my active VMs, no further configuration required.
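For anyone who would rather not shell out through a quoted string, here is a Python sketch of the two core steps: pulling the .vmx paths out of the vmrun list output, and building the deleteSnapshot and snapshot commands as argument lists. It is not the script above. The function names are invented, and the sample output format (a summary line followed by one path per line) is an assumption inferred from the Ruby script skipping the first line.

```python
VMRUN = "/Library/Application Support/VMware Fusion/vmrun"
BACKUP_SNAPSHOT_NAME = "backup snapshot"


def running_vmx_paths(vmrun_list_output):
    """Return the .vmx paths from `vmrun list` output, skipping the first (summary) line."""
    lines = vmrun_list_output.splitlines()
    return [line.strip() for line in lines[1:] if line.strip()]


def snapshot_commands(vmx_path, snapshot_name=BACKUP_SNAPSHOT_NAME):
    """Build argument lists that delete the old backup snapshot, then take a new one."""
    return [
        [VMRUN, "deleteSnapshot", vmx_path, snapshot_name],
        [VMRUN, "snapshot", vmx_path, snapshot_name],
    ]


if __name__ == "__main__":
    # Assumed sample output; real output comes from running `vmrun list`.
    sample = (
        "Total running VMs: 2\n"
        "/Users/wolf/VMs/web.vmwarevm/web.vmx\n"
        "/Users/wolf/VMs/db server.vmwarevm/db server.vmx\n"
    )
    for vmx in running_vmx_paths(sample):
        for command in snapshot_commands(vmx):
            print(command)  # hand each list to subprocess.run to execute it
```

Passing argument lists to subprocess.run sidesteps shell quoting, so paths containing spaces or single quotes need no special handling.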