Is there a tool like diff -r that compares a directory tree against a manifest file using hashes?

Problem :

Here’s my situation. I have two cold-storage archive volumes that (should) contain identical sets of data. These volumes contain infrequently accessed backups. I am concerned that, eventually, bitrot will get to one or both of them and subtly corrupt the data contained within. I know I can diff -r the two volumes and find files that have changed or disappeared between the two, but I get no helpful indication about which volume has the “good” copy. These are USB disks, and converting them to something like ZFS seems… onerous.

What I’d like is a tool that will recursively walk the directory tree and write a manifest file containing the path and filename along with a hash of the file’s contents. I’d run this tool immediately after writing the data to each volume, and store the resulting manifest file on warm storage, perhaps under revision control of some sort.

From this file I’d like to be able to run something that works exactly like diff -r — it would tell me if files were added, removed, or their contents changed. Only instead of comparing one volume to the other, it would compare one volume to the known-good manifest file. Using this method, I should be able to tell if the data I’m reading off the disk months/years in the future is identical to the data I originally put on it.

I would have to think something like this exists already. I can get something approximating a manifest file using:

find /mnt/my-volume -type f -exec md5sum {} + > manifest.txt

but so far I haven’t come up with a good way to parse this file and check each hash recursively. Also, somewhat less importantly, this won’t tell me if an empty directory appeared or disappeared. (I can’t think of why it would matter, but it would be nice to know that it occurred.)

Am I on the right track with this, or is there a more appropriate tool that can do this type of thing?

Solution :

You’re right, such a tool does exist already. While I see that your post is tagged ‘linux,’ perhaps a BSD-oriented solution will be edifying.

FreeBSD’s mtree(8) utility can do exactly what you are asking.


Consider a small file hierarchy with three directories, a, b and c, each containing one file:

$ find .

To create a manifest of that file hierarchy, including a sha256 hash of every file, one would run:

$ mtree -c -K sha256 > /tmp/manifest.txt
$ cat /tmp/manifest.txt
#          user: diego
#       machine:
#          tree: /data/home/diego/foo
#          date: Wed Mar 28 10:31:17 2018

# .
/set type=file uid=1001 gid=1001 mode=0710 nlink=1 flags=uarch
.               type=dir nlink=5 time=1522257963.738221000

# ./a
/set type=file uid=1001 gid=1001 mode=0600 nlink=1 flags=uarch
a               type=dir mode=0710 nlink=2 time=1522257932.680802000
    file1       size=29 time=1522257932.682389000 
# ./a

# ./b
b               type=dir mode=0710 nlink=2 time=1522257937.929131000
    file2       size=29 time=1522257937.930666000 
# ./b

# ./c
c               type=dir mode=0710 nlink=2 time=1522257942.064315000
    file3       size=29 time=1522257942.065882000 
# ./c

One can then verify the file hierarchy against the manifest by piping the manifest into mtree:

$ mtree < /tmp/manifest.txt || echo fail

Added, deleted, renamed or modified files will cause the verification to fail:

$ touch foo
$ mtree < /tmp/manifest.txt || echo fail
.:      modification time (Wed Mar 28 10:34:56 2018, Wed Mar 28 10:37:01 2018)
extra: foo
$ rm foo; touch b/file2; mtree < /tmp/manifest.txt || echo fail
.:      modification time (Wed Mar 28 10:34:56 2018, Wed Mar 28 10:39:39 2018)
        modification time (Wed Mar 28 10:25:37 2018, Wed Mar 28 10:39:39 2018)
$ mv c/file3 c/FILE3; rm a/file1; date >> b/file2; mtree < /tmp/manifest.txt || echo fail
.:      modification time (Wed Mar 28 10:34:56 2018, Wed Mar 28 10:39:39 2018)
c:      modification time (Wed Mar 28 10:25:42 2018, Wed Mar 28 10:41:59 2018)
extra: c/FILE3
        size (29, 58)
        modification time (Wed Mar 28 10:25:37 2018, Wed Mar 28 10:47:31 2018)
        sha256digest (0x9f7a0a49475bb6f98e609a4e057f0bc702c5e4706be5bd656a676fd8d15da7ef, 0x569c17bd1a1ca2447fd8167f103531bf3a7b7b4268f0f68b18506e586e7eea94)
a:      modification time (Wed Mar 28 10:25:32 2018, Wed Mar 28 10:41:59 2018)
./a/file1 missing
./c/file3 missing
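
Much of the output above consists of modification-time mismatches, which may not matter when the goal is only to detect corrupted contents. As a rough sketch, and assuming the -k, -p and -f options behave as described in the FreeBSD mtree(8) manual page, the manifest can be limited to file type, size and SHA-256 digest, and the volume can be named explicitly instead of relying on the current directory. The wrapper names mtree_create and mtree_verify are made up for illustration.

```shell
#!/bin/sh
# Sketch only: assumes FreeBSD mtree(8). mtree_create and mtree_verify are
# illustrative wrapper names, not part of mtree.

# mtree_create ROOT MANIFEST
#   -c  print a specification for the hierarchy to standard output
#   -k  record only "type" plus the listed keywords (no times, owners, modes)
#   -p  use ROOT instead of the current directory
mtree_create() {
    mtree -c -k size,sha256 -p "$1" > "$2"
}

# mtree_verify ROOT MANIFEST
#   -f  read the specification from MANIFEST instead of standard input
mtree_verify() {
    mtree -p "$1" -f "$2"
}

# Example (the manifest path is a placeholder):
#   mtree_create /mnt/my-volume /tmp/manifest.txt
#   mtree_verify /mnt/my-volume /tmp/manifest.txt || echo fail
```

With the time keyword left out of the manifest, touching a file without changing its contents should no longer be reported, while added, missing and modified files still are.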

md5sum -c manifest.txt checks each file at the path stored in manifest.txt. find substitutes {} with the complete path to each file it finds, including the starting point given on the find command line. For example, for the file ./a/b/c/d/e the following command passes ./a/b/c/d/e to md5sum:

find ./a -type f -exec md5sum {} \;

The possible issue is absolute paths: a manifest created with find /mnt/my-volume only verifies while the volume is mounted at that same location. A more appropriate manifest creation command records relative paths, and writes the manifest outside the volume so that it does not list itself:

cd /mnt/my-volume && find . -type f -exec md5sum {} + > /tmp/manifest.txt

Run md5sum -c from /mnt/my-volume as well when verifying. Alternatively, you can always fix the paths inside manifest.txt with sed.
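
Note that md5sum -c only reports changed and missing files; it says nothing about files that were added. The two steps can be combined into a small script that covers that case too. This is a minimal sketch, assuming GNU coreutils and find are available. The function names make_manifest and check_manifest are made up for illustration, and sha256sum is swapped in for md5sum. File names containing newlines or backslashes are not handled.

```shell
#!/bin/sh
# Sketch only: assumes GNU coreutils (sha256sum, comm, mktemp) and find.
# make_manifest and check_manifest are illustrative names, not existing tools.

# make_manifest ROOT MANIFEST
# Hash every regular file under ROOT. Paths are recorded relative to ROOT,
# so the manifest stays valid if the volume is mounted somewhere else.
# MANIFEST should live outside ROOT, or it will end up listing itself.
make_manifest() {
    (cd "$1" && find . -type f -exec sha256sum {} + | LC_ALL=C sort -k 2) > "$2"
}

# check_manifest ROOT MANIFEST
# Report changed, missing and added files; return non-zero if any are found.
# The body runs in a subshell, so the caller's working directory is untouched.
check_manifest() (
    case $2 in
        /*) manifest=$2 ;;
        *)  manifest=$PWD/$2 ;;
    esac
    cd "$1" || exit 2
    LC_ALL=C; export LC_ALL   # sort and comm must agree on ordering
    status=0

    # Changed or missing files: --quiet suppresses the per-file "OK" lines.
    sha256sum --check --quiet "$manifest" || status=1

    # Added files: paths on disk that the manifest does not mention.
    listed=$(mktemp) || exit 2
    sed 's/^[0-9a-f]*  //' "$manifest" | sort > "$listed"
    extra=$(find . -type f | sort | comm -13 "$listed" -)
    rm -f "$listed"
    if [ -n "$extra" ]; then
        printf '%s\n' "$extra" | sed 's/^/extra: /'
        status=1
    fi
    exit "$status"
)

# Example (the manifest path is a placeholder):
#   make_manifest  /mnt/my-volume /tmp/manifest.txt
#   check_manifest /mnt/my-volume /tmp/manifest.txt || echo fail
```

Empty directories are not tracked here, since only regular files are hashed.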
