Filesystems (ext4, xfs, zfs, etc.) are one of those things whose failure nobody really wants to think about. The difference between a hard disk failure and complete filesystem corruption is largely academic. However, a filesystem has many failure modes, and the scariest is silent corruption that goes undetected for a long time. In the worst case, the backups are rendered useless as well.
The long-standing solution for detecting and correcting minor filesystem issues is fsck. The tool has several limitations:
What seems to be standard practice is the following:
This has several obvious drawbacks:
Online fsck seems to be impossible, because the state of a mounted filesystem can change during the check in ways that make the result wrong.
Databases have a similar problem: how to do a backup while the system is in operation. The solution there is to use filesystem snapshots. This is how I stumbled upon e2croncheck. The original from Theodore Ts’o is here. I found a revised version on GitHub by Ion
The script creates a read-write snapshot of the filesystem. LVM uses a copy-on-write snapshot volume to track changes to the original filesystem. The script then runs e2fsck on the snapshot, which reports whether there is actual corruption on the filesystem that needs to be repaired offline.
This seems like a better solution than the standard practice of ignoring the problem, so I set up my next servers in the following way:
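The snapshot, check, remove cycle described above can be sketched roughly as follows. This is not the e2croncheck script itself, only a minimal illustration; the volume group `vg0`, the snapshot name `fsck-snap`, the 1G snapshot size and the `-f -y` flags for e2fsck are all assumptions chosen for the example.

```python
import subprocess


def snapshot_check_commands(vg, volume, snap_name="fsck-snap", snap_size="1G"):
    """Return the three commands for one check cycle, in order.

    All names and sizes are illustrative placeholders.
    """
    origin = f"{vg}/{volume}"
    snap_dev = f"/dev/{vg}/{snap_name}"
    return [
        # 1. Create a copy-on-write snapshot of the live volume.
        ["lvcreate", "--snapshot", "--size", snap_size, "--name", snap_name, origin],
        # 2. Force a full check of the snapshot. Any "repairs" land on the
        #    snapshot only and are thrown away with it.
        ["e2fsck", "-f", "-y", snap_dev],
        # 3. Drop the snapshot again.
        ["lvremove", "--force", f"{vg}/{snap_name}"],
    ]


def run_check(vg, volume, dry_run=True):
    """Run one cycle. Returns True if e2fsck found nothing to fix.

    With dry_run=True the commands are only printed and None is returned.
    """
    create, check, remove = snapshot_check_commands(vg, volume)
    if dry_run:
        for cmd in (create, check, remove):
            print(" ".join(cmd))
        return None
    subprocess.run(create, check=True)
    try:
        # e2fsck exits with 0 only when the filesystem had no errors.
        return subprocess.run(check).returncode == 0
    finally:
        # Always remove the snapshot, even if the check itself blew up.
        subprocess.run(remove, check=True)
```

Calling `run_check("vg0", "root")` prints the three commands without touching anything. A real run needs root and enough free space in the volume group for the snapshot; a `False` result means the original filesystem should be scheduled for an offline fsck.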
LVM snapshots are not without their problems. The big one is performance. The copy-on-write snapshot adds overhead to every write to the original volume, but thanks to the Internet I found some benchmarks comparing performance across chunk sizes. The default chunk size is 4kB, and in those benchmarks increasing it to 64kB improved performance by 10x!
I also run e2fsck under ionice at idle priority. So far, with these changes the background check does not interfere with programs that are running.
The final version of the script is located here inside a Puppet class that installs the file and the cron job.
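As a rough sketch of how the larger chunk size could be passed when the snapshot is created: the helper names below are made up for this example, and the validity check reflects my reading of the lvcreate man page (snapshot chunk sizes are powers of two between 4KiB and 512KiB), so treat it as an assumption rather than a guarantee.

```python
def is_valid_snapshot_chunk(kib):
    """True if kib is a power of two in the range 4..512 (KiB)."""
    return 4 <= kib <= 512 and (kib & (kib - 1)) == 0


def lvcreate_snapshot_cmd(origin, name, size, chunk_kib=64):
    """Build an lvcreate command for a snapshot with an explicit chunk size.

    origin, name and size are placeholders supplied by the caller.
    """
    if not is_valid_snapshot_chunk(chunk_kib):
        raise ValueError(f"unsupported snapshot chunk size: {chunk_kib}k")
    return [
        "lvcreate", "--snapshot",
        "--chunksize", f"{chunk_kib}k",
        "--size", size,
        "--name", name,
        origin,
    ]
```

For example, `lvcreate_snapshot_cmd("vg0/root", "fsck-snap", "1G")` yields a command that includes `--chunksize 64k`. Whether 64kB is the best value for a given workload is something to measure rather than assume.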
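Running the check at idle I/O priority amounts to prefixing the e2fsck command with `ionice -c 3`. A minimal sketch, with a hypothetical helper name and an example device path:

```python
def idle_io(cmd):
    """Prefix a command so it runs in the idle I/O scheduling class."""
    # Class 3 is "idle": the process only gets disk time when nobody
    # else is asking for it.
    return ["ionice", "-c", "3"] + list(cmd)


# Example: the snapshot device path here is a placeholder.
fsck_cmd = idle_io(["e2fsck", "-f", "-y", "/dev/vg0/fsck-snap"])
```

As far as I know, the idle class only has an effect when the kernel's I/O scheduler for the device implements priorities, so it is worth checking which scheduler is in use before relying on it.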