How (not to) upgrade your hard disks

This should have been easy.  But like everything “important” that I try to do, it ended up as a bit of a nightmare.  The problem was simple: 300GB disks too small, need more space.  The solution was also simple: Buy three shiny new 3TB disks and fit them, copy data over, job done.  Simple?  Of course not…

The first thing was: would my computer support them at all?  Some of you will know that Windows really, really hates drives bigger than 2TiB, and the only way to boot from one is to (a) use 64-bit Windows Vista Service Pack 1 or later, and (b) have a motherboard with UEFI firmware instead of a BIOS, a compatible SATA disk controller, and a GPT partition table (rather than the old-fashioned MBR style), since MBR partition tables cannot address anything beyond 2TiB.  I do not have a motherboard with a UEFI, but thankfully neither do I run Windows on this particular server — it runs Linux (Debian 6.0.1 Squeeze, to be precise).  This is a plus, because it means that with GRUB 2 I do not have to worry about any of that: it already supports GPT partition tables, 64-bit LBA, and everything else you need to actually boot from a 3TB disk.  Everything fine so far, then!
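
As an aside, if you want to check what sort of partition table a disk has, and whether the kernel can see its full size, parted will tell you (/dev/sdb here is just an example device):

    # prints the partition table type ("gpt" or "msdos") and the disk's total size
    parted /dev/sdb print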

So, I went ahead and bought myself three hard disks – one Seagate, and two WD Caviar Green disks, one of which was to be the backup disk.  After suitably “de-green”ing the WD drives (using the idle3ctl utility, available from SourceForge, which is a Linux version of Western Digital’s own wdidle3 utility) to turn the automatic head parking off, I got to work.  All four drives were fitted, and on each new disk I created an EFI System Partition (for future use, as I don’t have a UEFI motherboard yet), a BIOS Boot Partition to store GRUB 2 in (this is necessary with GPT partition tables, since there is no ‘spare’ space to embed the boot code, so I normally create a 1MiB partition for it), and a RAID partition taking up the rest of the disk to hold the LVM logical volumes.  All seemed reasonable so far; I took care to align the partitions to MiB boundaries, since the Western Digital drives have 4K physical sectors (though the Seagate one claims not to).
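
For anyone wanting to do something similar, it went roughly like this – the device name and partition sizes below are examples rather than exactly what I typed:

    # WD Caviar Green: check and then disable the idle3 (auto head parking) timer
    idle3ctl -g /dev/sdb
    idle3ctl -d /dev/sdb     # takes effect after the drive is powered off and on again

    # GPT layout, with everything starting on a MiB boundary
    parted -s /dev/sdb mklabel gpt
    parted -s /dev/sdb mkpart ESP fat32 1MiB 257MiB
    parted -s /dev/sdb set 1 boot on         # on GPT this marks the EFI System Partition
    parted -s /dev/sdb mkpart biosgrub 257MiB 258MiB
    parted -s /dev/sdb set 2 bios_grub on    # 1MiB home for GRUB 2's core image
    parted -s /dev/sdb mkpart raid 258MiB 100%
    parted -s /dev/sdb set 3 raid on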

So, backups done, I shut down the machine, pulled the power, and installed the new disks as SATA3 and SATA4 on the controller so that the existing Maxtor 300GB disks would boot up as normal, which they duly did.  Once the system was booted, it was time to configure the RAID-1 array on the new disks, and then to move the data across.  I’d been practising this the week before on a kvm virtual machine so that I’d know what to do if it went wrong.  Couldn’t go wrong twice, could it…?
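
Creating the mirror itself is the easy part; assuming the big RAID partitions ended up as /dev/sdb3 and /dev/sdc3 and the existing volume group is called vg0 (yours will be different), it’s something like this:

    # build the new RAID-1 array from the two big partitions
    mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdb3 /dev/sdc3

    # make it an LVM physical volume and add it to the existing volume group
    pvcreate /dev/md1
    vgextend vg0 /dev/md1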

Well, yes it could.  The first “mistake” I made was to have installed the old system in such a way that the / (root) partition was on a logical volume.  Normally this wouldn’t matter, but it tends to matter when you’re trying to move it to another physical volume.  When I tried this before, it seemed to work, but this time something went wrong.  I issued the fateful “lvm pvmove” command to move the logical volumes from one physical disk to the other, and everything stopped.  Oops.  I had started the process and then gone to bed – and when I got up the next morning it was still going and nothing had been printed on the screen (even though I had verbose on).  What should I do?  At worst, I could just restore the backups (even though it would take ages)…  I ended up pushing the reset button.  The machine then failed to boot, but it did make it into the initramfs (since it couldn’t find the RAID array).  This is the bit where I think you’re supposed to panic!
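
The fateful command itself is nothing exotic, by the way; assuming the old array is /dev/md0 and the new one is /dev/md1, it’s just:

    # move every allocated extent off the old physical volume onto the new one
    lvm pvmove -v /dev/md0 /dev/md1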

Thankfully, I managed to reassemble the RAID array from the initramfs and then ran lvm pvmove --abort to cancel any move that was already in progress.  Personally I don’t think it had even got started, so I took the risk and ran the lvm pvmove command again from inside the initramfs (which is probably a good place to do it, since no filesystems are mounted at that point).  It started.  It printed percentages.  It was going.  And then I had to go to work…
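
From the initramfs shell, the rescue went roughly like this (device names again just examples):

    # reassemble the arrays the initramfs couldn't find by itself,
    # and activate the volume group on top of them
    mdadm --assemble --scan
    lvm vgchange -ay

    # cancel the interrupted move, then start it again
    lvm pvmove --abort
    lvm pvmove -v /dev/md0 /dev/md1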

Got home to discover that it appeared to have worked.  All the LVs had been successfully moved, or so it said.  The final thing to do was to get the system booted properly, by mounting the root volume and exiting the initramfs, and then (to cut a long story short) checking that /etc/mdadm/mdadm.conf had the right information in it, running update-initramfs -u once booted so that the initramfs would detect the new RAID-1 array, and installing GRUB 2 on the new disks.  And then I rebooted, and held my breath, and ….
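
In other words, the tidy-up boiled down to something like this (device names are examples again):

    # check the new array is listed correctly in mdadm.conf
    mdadm --detail --scan        # compare against /etc/mdadm/mdadm.conf

    # rebuild the initramfs so it knows about the new array, then put GRUB 2 on both new disks
    update-initramfs -u
    grub-install /dev/sdb
    grub-install /dev/sdc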

Phew!  It booted!  Thank goodness for that!  Then I had to shut the machine down again to fit the hotplug SATA caddy that the backup drive lives in, and also the new SATA DVD writer.  Once that was done, the old disks were removed and the machine was booted up on the new ones.

There was one final challenge – some idiot had set all his kvm virtual machines up so that none of their partition tables were 4K-sector aligned.  So I ended up spending quite a few hours sorting that out – and that’s another (long) story.

But I got it done, and now it’s working, and hopefully I won’t have to do that again for another 5 years… (by which time I expect we’ll all be buying 30TB drives for £100!)  And next time, I’ll be using the Debian Rescue disk to boot into before doing the lvm pvmove…