shasum of a directory in macos

QUESTION :

I’m writing a shell script that uses shasum to check whether the contents of a directory have changed.

On Linux and FreeBSD, shasum behaves the same way when I run shasum <directory>. On MacOS, however, shasum only gives me hashes for files.

FreeBSD

$ shasum CONTENTS/
7f986e5e5289c59db1bba48df92ffe4707830aaa  CONTENTS/

Linux

$ shasum CONTENTS/
7f986e5e5289c59db1bba48df92ffe4707830aaa  CONTENTS/

MacOS

$ shasum CONTENTS/
shasum: CONTENTS/: 

How can I calculate the hash of a directory on MacOS?

TRY 1: Using TAR with pipes

I tried this, but it seems this tar option doesn’t work on MacOS.

tar cO CONTENTS/ | shasum
tar: Option -O is not permitted in mode -c
da39a3ee5e6b4b0d3255bfef95601890afd80709  -

TRY 2: Using FIND/EXEC

The result was consistent between MacOS and FreeBSD, but Linux returned a different hash.

find CONTENTS -type f -exec shasum {} \; | sort -k 2 | shasum

Linux

c2ddb9bc5f543e956f5cdcc76750cb78cc5f26f3

FreeBSD

3ac2a9d4e2fc5d2d2ec3c7f612e680990cc35824

MacOS

3ac2a9d4e2fc5d2d2ec3c7f612e680990cc35824

OTHER FINDINGS ON TAR

tar would be excellent, as it “archives” a folder that I could then pass to shasum. However, the order in which tar “walks” the folder structure is not consistent across operating systems. Some helpers mentioned in the comments that I should use the same version of tar on all systems.

Just an example, on system 1 I have this order:

drwxr-xr-x  0 root   wheel       0 27 Jul 07:23 usr/
-rw-r--r--  0 root   wheel       0 27 Jul 07:25 usr/aaa
drwxr-xr-x  0 root   wheel       0 27 Jul 07:25 usr/f1/
-rw-r--r--  0 root   wheel       0 27 Jul 07:25 usr/f1/aaa
drwxr-xr-x  0 root   wheel       0 27 Jul 07:25 usr/f1/f0/
-rw-r--r--  0 root   wheel       0 27 Jul 07:25 usr/f1/f0/aaa
drwxr-xr-x  0 root   wheel       0 27 Jul 07:25 usr/f2/
-rw-r--r--  0 root   wheel       0 27 Jul 07:25 usr/f2/aaa
drwxr-xr-x  0 root   wheel       0 27 Jul 07:25 usr/f2/f1/
-rw-r--r--  0 root   wheel       0 27 Jul 07:25 usr/f2/f1/aaa
drwxr-xr-x  0 root   wheel       0 27 Jul 07:25 usr/f2/f1/f0/
-rw-r--r--  0 root   wheel       0 27 Jul 07:25 usr/f2/f1/f0/aaa
drwxr-xr-x  0 root   wheel       0 27 Jul 07:25 usr/f3/
-rw-r--r--  0 root   wheel       0 27 Jul 07:25 usr/f3/aaa
drwxr-xr-x  0 root   wheel       0 27 Jul 07:25 usr/f3/f2/
-rw-r--r--  0 root   wheel       0 27 Jul 07:25 usr/f3/f2/aaa
drwxr-xr-x  0 root   wheel       0 27 Jul 07:25 usr/f3/f2/f1/
-rw-r--r--  0 root   wheel       0 27 Jul 07:25 usr/f3/f2/f1/aaa

and on system 2 I have the following order:

drwxr-xr-x  0 root   wheel       0 27 Jul 07:23 usr/
-rw-r--r--  0 root   wheel       0 27 Jul 07:25 usr/aaa
drwxr-xr-x  0 root   wheel       0 27 Jul 07:25 usr/f1/
drwxr-xr-x  0 root   wheel       0 27 Jul 07:25 usr/f2/
drwxr-xr-x  0 root   wheel       0 27 Jul 07:25 usr/f3/
-rw-r--r--  0 root   wheel       0 27 Jul 07:25 usr/f3/aaa
drwxr-xr-x  0 root   wheel       0 27 Jul 07:25 usr/f3/f2/
-rw-r--r--  0 root   wheel       0 27 Jul 07:25 usr/f3/f2/aaa
drwxr-xr-x  0 root   wheel       0 27 Jul 07:25 usr/f3/f2/f1/
-rw-r--r--  0 root   wheel       0 27 Jul 07:25 usr/f3/f2/f1/aaa
-rw-r--r--  0 root   wheel       0 27 Jul 07:25 usr/f2/aaa
drwxr-xr-x  0 root   wheel       0 27 Jul 07:25 usr/f2/f1/
-rw-r--r--  0 root   wheel       0 27 Jul 07:25 usr/f2/f1/aaa
drwxr-xr-x  0 root   wheel       0 27 Jul 07:25 usr/f2/f1/f0/
-rw-r--r--  0 root   wheel       0 27 Jul 07:25 usr/f2/f1/f0/aaa
-rw-r--r--  0 root   wheel       0 27 Jul 07:25 usr/f1/aaa
drwxr-xr-x  0 root   wheel       0 27 Jul 07:25 usr/f1/f0/
-rw-r--r--  0 root   wheel       0 27 Jul 07:25 usr/f1/f0/aaa

From a tar standpoint this is all fine, but because of the different order, shasum produces a different hash.

CONCLUSION

shasum is consistent across Linux and the BSDs when hashing an individual file, but for directories I only got consistent results between MacOS and FreeBSD, perhaps because of how files are sorted.

If sorting is enforced with the find command, consistency is obtained only between FreeBSD and MacOS. This method is also prohibitively slow, as it has to hash every single file and then hash the resulting list.

Using tar to create a temporary file and then running shasum on it was also inconsistent between Linux and the BSDs, perhaps because of differences in the archiving method.

I think the only way forward is to redesign my solution.

ANSWER :

mtree is the tool you want.

Suppose:

$ mkdir foo
$ date > foo/date1; sleep 3
$ date > foo/date2; sleep 3
$ date > foo/date3
$ grep . foo/*
foo/date1:Wed Jul 24 16:11:32 PDT 2019
foo/date2:Wed Jul 24 16:11:35 PDT 2019
foo/date3:Wed Jul 24 16:11:38 PDT 2019
$ find . -ls
7318841   0 drwxr-xr-x    3 admin    staff     102 Jul 24 16:11 .
7318847   0 drwxr-xr-x    5 admin    staff     170 Jul 24 16:11 ./foo
7318849   8 -rw-r--r--    1 admin    staff      29 Jul 24 16:11 ./foo/date1
7318851   8 -rw-r--r--    1 admin    staff      29 Jul 24 16:11 ./foo/date2
7318853   8 -rw-r--r--    1 admin    staff      29 Jul 24 16:11 ./foo/date3

Create a reference manifest of directory foo and store it in foo.mtree:

$ mtree -c -K sha256digest -p foo > foo.mtree

Now go and mess with any file in that directory.

$ touch foo/date3

Run mtree again and pass it the manifest you created earlier, and mtree will tell you what changed:

$ mtree -p foo < foo.mtree || echo fail
date3 changed
        modification time expected Wed Jul 24 16:11:38 2019 found Wed Jul 24 16:14:00 2019
fail

$ date > foo/date2
$ mtree -p foo < foo.mtree || echo fail
date2 changed
        modification time expected Wed Jul 24 16:11:35 2019 found Wed Jul 24 16:19:40 2019
        SHA-256 expected c76a568f08d98c2830f2fdfb42415c3ec15341b8741450d4bbd863f1d5c4c691 found ddcf8d07785bfe4d031a989339835dc3b8b44653019568dcee612c44fc8e2f70
date3 changed
        modification time expected Wed Jul 24 16:11:38 2019 found Wed Jul 24 16:14:00 2019
fail

Any files missing from foo or added since the manifest was created will also be reported:

$ mv foo/date1 foo/date4
$ mtree -p foo < foo.mtree || echo fail
. changed
        modification time expected Wed Jul 24 16:11:38 2019 found Wed Jul 24 16:21:38 2019
date2 changed
        modification time expected Wed Jul 24 16:11:35 2019 found Wed Jul 24 16:19:40 2019
        SHA-256 expected c76a568f08d98c2830f2fdfb42415c3ec15341b8741450d4bbd863f1d5c4c691 found ddcf8d07785bfe4d031a989339835dc3b8b44653019568dcee612c44fc8e2f70
date3 changed
        modification time expected Wed Jul 24 16:11:38 2019 found Wed Jul 24 16:14:00 2019
date4 extra
./date1 missing
fail
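For the script in the question, the exit status of mtree is enough for a yes/no answer. The sketch below assumes a BSD-style mtree as found on FreeBSD and MacOS; the function names snapshot_dir and dir_unchanged are made up for illustration. It uses -k rather than -K so that, as far as the manual describes it, only the file type and SHA-256 are recorded and a plain touch should no longer be reported as a change.

```shell
# Hypothetical helper: record a manifest of directory $1 in file $2.
# -k replaces the default keyword set instead of adding to it (-K).
snapshot_dir() {
    mtree -c -k sha256digest -p "$1" > "$2"
}

# Hypothetical helper: succeed only if directory $1 still matches manifest $2.
# mtree exits non-zero when the tree differs from the specification.
dir_unchanged() {
    mtree -p "$1" -f "$2" > /dev/null
}

# Usage (keep the manifest outside the directory being checked):
#   snapshot_dir foo foo.mtree
#   dir_unchanged foo foo.mtree || echo "foo has changed"
```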

rmlint will do what I think you want.

Relevant points:

  • It doesn’t use SHA by default, but can be told to.
  • It can be installed on MacOS via homebrew.
  • By default it doesn’t calculate a checksum for a single specified directory. It can be told to calculate checksums for all directories below a given starting point, as a way of finding “duplicate” directories. As a side effect, that also does exactly what you seem to be asking for.
  • It may be overkill for what you’re looking for, and it may take a while to figure out the best option flags to use, but it is quite robust.
  • Getting directory checksums is easy enough; getting it to not do other things can be tricky. (To be clear, it doesn’t actually modify anything. At most, it generates a shell script that you can run manually later to modify things if desired. What you seem to need are the JSON and/or CSV output files, which will give you the directory checksum you’re looking for.)

I use rmlint in a bash script to find duplicate directories. Here is a command that will minimally do what you want, and as little else as possible:

rmlint "base/dir/to/start/from" --see-symlinks --hidden --algorithm=sha256 --types=none,duplicatedirs --no-backup -o csv:log.csv
