ZFS, surprise resilvering, dumb tricks
Once upon a time I got a 5-bay QNAP, and all was well until it got sulky and one of its SATA connectors decided to give up on life. Drafted in its stead was a ZFS NAS setup that I've had for ages–going back to Fall of 2019. 8 x 4TB drives, a Norco ITX-S8 case, 32GB of RAM, a Ryzen 2400G, nothing too fancy. NixOS, ZFS with raidz2, great.
I ran a handy little tool I'd slopped up and, lo and behold, all the drives were nearly at five and a half years of online time; this is solidly in their golden years. I'd been meaning to upgrade the array to larger disks, since a good amount of space goes (cheerfully!) to error correction and parity, and so had begun collecting replacement disks in anticipation. Rebuilding the array would be time-consuming and annoying, and so I'd been putting it off.
Well, the week before Christmas, finally hit with inspiration, I started reorganizing all of my different Linux ISOs by source and genre and album and whatnot, and also checking through old backups. Out of curiosity, I ran my health script again, and this time SMART errors cropped up. Worse, ZFS had decided that one of the disks was degraded, and so the whole zpool had begun its swan song.
It appeared that it was now time to do something.
Interlude: ZFS? Is that some kinky BSD thing?
ZFS is one of the projects that escaped from the lab over at Sun Microsystems. It’s probably the last filesystem you’ll ever need unless you’re doing embedded stuff or networked filesystem stuff. While I am not an expert in all the things it can do, the things that it can do that I care about are:
- Data integrity and checksumming. Everything in the filesystem is monitored for bitrot, and even if you can’t fix the problem you at least can find out that it happened.
- Super-easy JBOD. By default, you can create a pool of storage out of whatever disks you have lying around. More advanced stuff requires forethought, but if you just want to make a JBOD this is great.
- Built-in RAID features. If you’re willing to do a tiny bit of work and sacrifice some storage (which I emphatically am, for reasons that will become obvious as this tale continues), you can easily configure disk mirroring and striping and parity, and use something called RAID-Z.
- Snapshotting. You can take easy snapshots of the filesystem almost instantly at any point, and fall back to them. Further, you can trivially explore those as read-only directories without any mounting shenanigans.
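For a taste of that last one, here's roughly what taking, browsing, and rolling back a snapshot looks like (the tank/media names here are just for illustration, not my actual layout):

# zfs snapshot tank/media@before-reorg
# ls /tank/media/.zfs/snapshot/before-reorg
# zfs rollback tank/media@before-reorg

The hidden .zfs/snapshot directory is the "no mounting shenanigans" part: you can cd in and read old versions directly, and rollback is the big hammer for when a reorganization goes sideways.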
The organization of ZFS is:
- You have some pile of physical disks.
- You combine one or more of the physical disks into a vdev (virtual device). This is where you do things like raid striping and mirroring.
- You then combine vdevs into pools.
- Pools can then be chopped up into one or more datasets or ZVOLs (ZFS volumes, which are block devices that can be mounted as though they were physical disks, with another FS layered over them…or used as the backing store for VMs). Either of those is where we apply encryption, compression, and so forth.
- Finally, datasets are mounted and act as normal directories.
So, you can think of it as disk -> vdev -> pool -> dataset (or zvol).
To add to the fun, you can further divide datasets into sub-datasets (and sub-sub-datasets, ad infinitum), which can have different properties (keys, compression, max sizes, etc.).
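To make that stack concrete, here's a rough sketch of how the layers get built. The by-id paths are placeholders (all eight would really go on that line), and the dataset names and options are illustrative rather than my actual setup:

# zpool create tank raidz2 /dev/disk/by-id/ata-DISK1 /dev/disk/by-id/ata-DISK2 /dev/disk/by-id/ata-DISK3 /dev/disk/by-id/ata-DISK4
# zfs create -o compression=lz4 tank/media
# zfs create tank/media/isos
# zfs create -o encryption=on -o keyformat=passphrase tank/private

The zpool create line is where the vdev happens: raidz2 plus its member disks makes one vdev, and that vdev makes up the pool. Everything after that is datasets hanging off tank, each with its own properties (and sub-datasets like tank/media/isos inherit from their parents unless you override them).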
The “Doing Something”
I had the drives (8 new 8TB drives, accumulated over a few years and waiting–boxed–for their moment). I had the NAS (only one disk failing!). I had Claude to spot-check me on some things, and manuals for the rest. It was time to go.
Step zero was to set up a constant watch of the array status via watch -n 1 zpool status.
The first step was to yank the failing drive, swap a new one into its caddy, and reinsert. Two choices by Past Chris here were very helpful:
- Deciding to purchase a case that used hard-drive caddies/sleds instead of needing to open the case and unscrew things (so, I got to leave the NAS in situ).
- Deciding–and this was a lifesaver–to print out little labels with the leading parts of the serial numbers and place them next to the drive slots. Without labels, figuring out which physical drive to pull would've been an annoying guessing game, with real potential for trouble.
The second step was to run zpool replace tank ata-INITECH_IV8000 /dev/disk/by-id/ata-INITECH_VI9876 (ignore the silly serials). Theoretically we could've/should've run zpool offline tank ata-INITECH_IV8000 first, but yanking the disk out and replacing it left ZFS going "oh, um, guess we're replacing disks!".
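For completeness, the polite, by-the-book version of that loop (same made-up serials) looks roughly like:

# zpool offline tank ata-INITECH_IV8000
# zpool replace tank ata-INITECH_IV8000 /dev/disk/by-id/ata-INITECH_VI9876
# zpool status -v tank

with the physical yank-and-swap happening between the offline and the replace, and zpool status (or the watch from step zero) reporting how the resilver is coming along.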
The thing to note about this whole process is that each disk, being 4TB, took a long time to replace (resilver, in ZFS terminology). Hours and hours–pop one in last thing in the wee hours of the night, and maybe it'd be done by the afternoon. This was annoying to schedule around (especially as my holiday trip drew near) but otherwise uneventful.
Uneventful until it’s not
Except. Except.
The resilvering process basically involves reading from all the other disks in the array and writing the reconstructed data to the new one, generating a tremendous amount of I/O traffic. Because of how hard disks work, all of that activity also generates a lot of heat. The case I'd chosen did not have good air circulation.
And so, about 97% of the way through my third disk swap, another disk started throwing errors.
I checked SMART, and sure enough that disk had registered a max temp of 91°C. Disk drives are not meant to run at 91°C–you really want them in the 30s, or worst-case the 40s. The poor disk was cooking, and thanks to that cooking it was racking up reallocated sectors (a number that kept climbing every time I checked, which was terrifying). That sector reallocation also slowed the whole process down, and so I watched as the array crawled across the finish line, sweating the entire time.
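The grim reading came from something along these lines (same made-up serial as before; exact SMART attribute names vary a bit from vendor to vendor):

# smartctl -A /dev/disk/by-id/ata-INITECH_VI9876 | grep -E 'Temperature_Celsius|Reallocated_Sector_Ct'

On many drives the Temperature_Celsius raw value includes the lifetime min/max, and a Reallocated_Sector_Ct raw value that keeps climbing is the drive politely telling you it's dying.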
Investigating the cooling problems, I discovered that both fans on the back had died and stopped turning at some point in the past, and so I swiped my partner’s office fan to blow on the array while doing all the work. I’d later swap the fans out for Noctuas and then pin them at max, keeping thermals reasonable.
At the very end, I had a single error reported on one of the new drives (according to ZFS; SMART thought it was fine), and cleared it out via zpool clear tank. It's been quiet ever since.
Expanding the pool
So, at this point, I'd gotten all the drives swapped over. Since the new drives are all 8TB instead of 4TB, I needed to expand the pool to use the new capacity. This is a relatively recent feature in ZFS, and I'd been waiting on it for years. This, finally, was the time to do it.
You might expect that this would be complicated. However, you see, ZFS is magic:
# zpool set autoexpand=on tank
Then a zpool scrub tank for good measure, and we're all set. The expanded pool shows 12.8TB used and 28.4TB available. Great success!
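For posterity, the sanity checks look something like:

# zpool list -v tank
# zfs list tank

zpool list shows the raw pool and vdev sizes, while zfs list shows what's actually usable after parity. If the extra capacity hadn't appeared on its own, zpool online -e tank <device>, run for each device, is the manual way of telling ZFS to claim the larger disks.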
Conclusion
ZFS is awesome, and the extra parity of raidz2 (or higher) is good for your stress levels even while your array is literally cooking itself.