CLI tool, written in Rust, to analyse the difference between two directory snapshots
I periodically create snapshots of large directory trees for backup purposes. The snapshotting procedure is basically
cp -R, so each snapshot is a full copy of the directory tree in question.
I archive these snapshots either on my internal disk or on an external one. In any event, I end up with a whole bunch of snapshots – all representing the same directory tree, but at different points in time.
What I like about this backup process is that the copy mechanism itself is simple and reliable. The snapshots are functional and self-contained, and I can access the backup data independent of specific tooling (or even independent of the operating system).
However, one thing that slightly worries me is that I might accidentally lose or mess up data in between snapshot events. Since I cycle and eventually purge old snapshots to free up disk space for new ones, this could subtly lead to data loss in the long run without me noticing.
My goal was to get some insight into how a directory tree has evolved over time.1 For this purpose, I created a CLI tool that can analyse the difference between two snapshots (at file level) and summarise the changes in a concise report.
In this blog post, I demonstrate usage of the CLI tool, outline the underlying diffing algorithm, and discuss performance characteristics of the implementation.
The tool is called
snapdiff (short for “snapshot difference”), and is implemented in Rust. You can find its sources on GitHub (beware of rough edges, though!).
snapdiff compares two directory trees, and then summarises how many files are identical, and how many have been modified, moved, added, or deleted.
In terms of scale, the tool is built to support snapshots containing some 100,000 files of any size. These are roughly the dimensions of my own backups, which typically contain such a number of files, and a few hundred GB of data overall.
As an example, let’s say we wanted to compare two snapshots of a particular directory tree, which have been taken one month apart from each other. With the snapshot folders being named
2023-09-01/ and
2023-10-01/,2 the execution of
snapdiff would look like this:
$ snapdiff 2023-09-01/ 2023-10-01/
                                 G   M   K   B
TOTAL      Snap 1    87,243      98,407,188,994
           Snap 2    87,319      98,591,738,372
OF WHICH   Identical 87,134      97,551,550,976
           Moved         38         134,217,728
           Added         87         234,881,024
           Deleted       11          50,331,648
           Modified     147         671,088,644  (+282,172)
Via an optional flag,
snapdiff can also write a more detailed summary into a file (enumerating all files that don’t fall into the “identical” bucket).
The categories are defined as:
- Identical means both snapshots contain a file at the same path with the same contents.
- Moved means both snapshots contain a file with the same contents, but at different paths. (This is the same as if a file was renamed.)
- Modified means both snapshots contain a file at the same path, but with different contents.
- Added means the second snapshot contains a file that has no counterpart (neither by path nor by contents) in the first snapshot.
- Deleted means the first snapshot contains a file that has no counterpart (neither by path nor by contents) in the second snapshot.
snapdiff determines the snapshot diff without any historical information or file metadata, but solely on the basis of the file paths and contents, as they are on disk at the moment of execution.3
It performs a total of two directory traversals (one per snapshot), so each file is processed once. The algorithm goes like this:
- Traverse the first snapshot directory tree. For each file, compute the hash of its contents, and store it in a lookup table. (This initial lookup table allows lookup by path and lookup by contents hash).
- Traverse the second snapshot directory tree. For each file, compute the hash of its contents, and try to look up the file by its path in the initial lookup table (the one from step 1).
- If there is a file at the same path with the same hash, count the file as “identical”.
- If there is a file at the same path with a different hash, count the file as “modified”.
- In any of these two cases, remove the file from the initial lookup table.
- Otherwise, store the file in a remainder list.
- For each file in the remainder list, try to look up the file by its contents hash in the initial lookup table.
- If there is a file, count the file as “moved”. (Note: it’s impossible for the file to be “identical” at this point.)
- Otherwise, count the file as “added”.
- In any of these two cases, remove the file from the initial lookup table.
- For each file that is now still left in the initial lookup table, count it as “deleted”.
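The steps above can be sketched in Rust roughly as follows. This is a simplified, self-contained model, not snapdiff’s actual internals: a “snapshot” is just a list of (path, content hash) pairs, and all names are illustrative.

```rust
use std::collections::HashMap;

// A "snapshot" here is a list of (path, content hash) pairs; in the
// real tool, the hashes are computed from the files on disk.
type Snapshot = Vec<(&'static str, u64)>;

#[derive(Debug, Default)]
struct Report {
    identical: usize,
    modified: usize,
    moved: usize,
    added: usize,
    deleted: usize,
}

// Remove a file from both indexes of the initial lookup table.
fn remove(
    by_path: &mut HashMap<&'static str, u64>,
    by_hash: &mut HashMap<u64, Vec<&'static str>>,
    path: &'static str,
) {
    if let Some(hash) = by_path.remove(path) {
        if let Some(paths) = by_hash.get_mut(&hash) {
            paths.retain(|p| *p != path);
        }
    }
}

fn diff(snap1: &Snapshot, snap2: &Snapshot) -> Report {
    // Step 1: index the first snapshot by path and by content hash.
    let mut by_path: HashMap<&'static str, u64> = HashMap::new();
    let mut by_hash: HashMap<u64, Vec<&'static str>> = HashMap::new();
    for &(path, hash) in snap1 {
        by_path.insert(path, hash);
        by_hash.entry(hash).or_default().push(path);
    }

    let mut report = Report::default();
    let mut remainder: Vec<(&'static str, u64)> = Vec::new();

    // Step 2: walk the second snapshot; path lookups take priority.
    for &(path, hash) in snap2 {
        match by_path.get(path) {
            Some(&h1) if h1 == hash => {
                report.identical += 1;
                remove(&mut by_path, &mut by_hash, path);
            }
            Some(_) => {
                report.modified += 1;
                remove(&mut by_path, &mut by_hash, path);
            }
            None => remainder.push((path, hash)),
        }
    }

    // Step 3: resolve the remainder by content hash ("moved" vs. "added").
    for &(_, hash) in &remainder {
        let old = by_hash.get_mut(&hash).and_then(|paths| paths.pop());
        match old {
            Some(old_path) => {
                report.moved += 1;
                remove(&mut by_path, &mut by_hash, old_path);
            }
            None => report.added += 1,
        }
    }

    // Step 4: everything still left in the table was deleted.
    report.deleted = by_path.len();
    report
}

fn main() {
    // Snap 1 holds the same contents at /a and /b; /a was moved to /c.
    let snap1 = vec![("/a", 1), ("/b", 1)];
    let snap2 = vec![("/b", 1), ("/c", 1)];
    let r = diff(&snap1, &snap2);
    assert_eq!((r.identical, r.moved, r.added, r.deleted), (1, 1, 0, 0));
    println!("{r:?}");
}
```

Note how deferring the contents-hash lookup until after the full second traversal (the remainder list) is what lets path matches win over contents matches.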
By the way, considering the intended use-case of
snapdiff, we can heuristically expect most files to be identical.
There are a few edge cases which the above algorithm manages to avoid. These are mainly related to duplicated file contents, and to the order in which files are processed.
For example, say snapshot 1 contains a duplicate file (i.e., with the same contents) at locations
/a and
/b. In snapshot 2, however,
/a has been renamed (moved) to
/c, so snapshot 2 contains
/b and
/c. A naive comparison could incorrectly count two moved files, instead of one identical and one moved file.
SNAP 1 SNAP 2
Undesired: a → b and b → c
Expected: b = b and a → c
A similar thing could happen if snapshot 1 contains a file at location
/b, which is then duplicated to
/a in snapshot 2, so that snapshot 2 contains two files
/a and
/b with the same contents. In this case, the algorithm should report one file identical and one file added, instead of one file moved and one file added.
SNAP 1 SNAP 2
Undesired: b → a and +b
Expected: b = b and +a
Preface: I carried out the following performance measurements on a machine with a 3.5 GHz 10-core ARM CPU and a built-in SSD with ~4 GB/s read speed. The sample snapshot for testing comprises 100,000 files with an overall size of 100 GB.
For simplicity, all measurements below are based on these reference numbers.
snapdiff has an overall throughput rate of around 2.5 GB/s. Processing the test sample therefore takes around 75 seconds in total – 35 seconds per 100 GB snapshot, plus a few seconds of additional overhead.
The implementation is designed to spread the work evenly across all available CPU cores, and each core operates at nearly full capacity (~95%) during runtime. The overall throughput rate scales almost linearly over the cores, so you can roughly say that each core contributes ~250 MB/s throughput.
Due to the expected volume of files and bytes, it’s not possible to keep all files in memory. Therefore,
snapdiff computes hashes of the file contents as the basis for comparison, and stores the hashes in memory.
However, even for computing the hash, it’s not feasible to copy one file into memory in its entirety, since an individual file can easily be bigger than the available memory. Therefore, file contents are read chunk-wise, which means that the hashes are computed incrementally.
Memory consumption peaks at around 115 MB. The memory footprint consists of two aspects: the chunk size is 10 MB, so 100 MB of the process memory are due to 10 cores allocating a 10 MB chunk each. The rest of the memory consumption (~15 MB) is mainly for building up and holding the lookup tables.
Increasing the chunk size doesn’t have any measurable effect on performance.
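The chunk-wise, incremental hashing can be sketched like this. It’s a minimal model, not snapdiff’s actual code: a tiny FNV-1a hasher stands in for the CRC-64 that snapdiff uses, a Cursor stands in for a real file handle, and the 10 MB chunk size is shrunk for the demo.

```rust
use std::io::{Cursor, Read};

// Stand-in streaming hasher (FNV-1a, 64-bit). The real tool uses CRC-64;
// what matters for the sketch is that the hash is a pure byte stream.
struct Fnv64(u64);

impl Fnv64 {
    fn new() -> Self {
        Fnv64(0xcbf29ce484222325) // FNV-1a 64-bit offset basis
    }
    fn update(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.0 = (self.0 ^ b as u64).wrapping_mul(0x100000001b3);
        }
    }
    fn finish(&self) -> u64 {
        self.0
    }
}

// Read the input chunk by chunk, folding each chunk into the running
// hash, so the whole file never has to fit into memory at once.
fn hash_chunked<R: Read>(mut reader: R, chunk_size: usize) -> std::io::Result<u64> {
    let mut hasher = Fnv64::new();
    let mut buf = vec![0u8; chunk_size];
    loop {
        let n = reader.read(&mut buf)?;
        if n == 0 {
            break; // end of input
        }
        hasher.update(&buf[..n]);
    }
    Ok(hasher.finish())
}

fn main() -> std::io::Result<()> {
    let data: Vec<u8> = (0..5000).map(|i| (i % 251) as u8).collect();
    let chunked = hash_chunked(Cursor::new(&data), 64)?;
    // Because the hash consumes a pure byte stream, chunking doesn't
    // change the result.
    let mut whole = Fnv64::new();
    whole.update(&data);
    assert_eq!(chunked, whole.finish());
    println!("hash: {chunked:#018x}");
    Ok(())
}
```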
As far as the CPU is concerned, the number one speed bottleneck is – by far – the hash computation. Hence, the choice (and implementation) of the hashing algorithm is pivotal.
snapdiff uses a CRC algorithm,4 which is relatively fast, and also guarantees an even distribution of hash values.
CRC is available with various hash lengths, and choosing the right length is a balancing act between computational speed and the risk of hash collisions. With a length of 32 bits, for example, there would be a 1% chance of a hash collision in a sample of 10,000 files – which clearly wouldn’t be acceptable for this application.
snapdiff uses a length of 64 bits, which pushes the 1% collision chance out to a sample of 600,000,000 files. (For some extra safety margin against collisions, by the way,
snapdiff also takes the file size into consideration when comparing files.)
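These collision figures follow from the well-known birthday bound: with d possible hash values, the collision probability among n files reaches p at roughly n = sqrt(2 · d · p). A quick sanity check (the function name is made up for the sketch):

```rust
// Birthday bound: the number of files n at which the collision
// probability among hashes of `bits` width reaches p is roughly
// n = sqrt(2 * d * p), where d = 2^bits.
fn files_until_collision(bits: i32, p: f64) -> f64 {
    let d = 2f64.powi(bits);
    (2.0 * d * p).sqrt()
}

fn main() {
    // 32-bit hashes: a 1% collision chance at roughly ten thousand files…
    let n32 = files_until_collision(32, 0.01);
    assert!(n32 > 9_000.0 && n32 < 10_000.0);
    // …while 64-bit hashes only get there at around 600 million files.
    let n64 = files_until_collision(64, 0.01);
    assert!(n64 > 5.5e8 && n64 < 6.5e8);
    println!("32 bit: {n32:.0}, 64 bit: {n64:.0}");
}
```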
It doesn’t make sense to use cryptographic hashing algorithms such as SHA or MD5, since these would be significantly slower. Cryptographic hashing would only be relevant if a malicious third party might try to fabricate a file whose hash collides on purpose (e.g., to sneak in a file modification while making the respective file appear identical).
Since we read the files chunk-wise, it’s critical to compute the chunk hashes in sequential order, before combining them into the eventual file hash. This constraint implies that we cannot trivially parallelise the hashing work on chunk level, but only on file level.
The influence of disk I/O on performance varies widely: compared to the cost of hashing, disk latency is a rather minor worry for modern, built-in SSDs. However, read speed can be orders of magnitude lower on magnetic HDDs, in which case disk I/O becomes the main limiting factor.5
For example, when reading the snapshot data via USB 2 from an external HDD spinning at 7200 rpm, I see the overall throughput rate drop to ~50–150 MB/s. At that rate, the evaluation of one snapshot takes a whopping 30 minutes.
Moreover, and in contrast to SSDs, the throughput rate can also be quite erratic on HDDs: one issue is that performance can degrade if data isn’t read sequentially, because the disk’s head would have to jump around constantly.
In order to play somewhat nicely with HDDs,
snapdiff allows you to configure the number of utilised CPU cores (via the
--worker flag). For even more fine-grained control, it lets you specify the number of workers per snapshot side. So if one snapshot resides on an external HDD, whereas the other is on a built-in SSD, you can throttle
snapdiff down to one core for the HDD, while going full steam on the SSD.
snapdiff employs a few heuristics to optimise performance. One corner case is when the snapshots contain both small files and very large ones. In this scenario, it can happen that one of the large files is processed at the very end. Since we can only parallelise across files, not across file chunks, this creates a situation where one CPU core is left hashing the last (large) file, while all other cores idle helplessly (the file queue being empty at this point).
There are various approaches to avoid or alleviate this problem. What
snapdiff does is perform an initial scanning step, which allows it to prioritise the hashing work by file size in descending order. That way, the hashes of large files are computed at the very beginning, making sure that there is enough work towards the end to “fill the gaps”.
The performance overhead of the initial scanning step is barely noticeable (it only retrieves file metadata).
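The scheduling heuristic boils down to a sort over the scanned metadata. A minimal sketch, with made-up names and an in-memory file list standing in for the scan results:

```rust
// Queue the hashing work largest-first, so that a huge file never ends
// up as the last (and only) remaining job while other cores idle.
fn prioritise(mut files: Vec<(&'static str, u64)>) -> Vec<(&'static str, u64)> {
    files.sort_by(|a, b| b.1.cmp(&a.1)); // descending by file size
    files
}

fn main() {
    let queue = prioritise(vec![
        ("small.txt", 4_096),
        ("huge.iso", 8_000_000_000),
        ("mid.bin", 1_000_000),
    ]);
    assert_eq!(queue[0].0, "huge.iso");
    assert_eq!(queue[2].0, "small.txt");
    println!("{queue:?}");
}
```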
In my case, I’m primarily interested in diffing the latest snapshot with the second-to-latest one. ↩︎
When I create a snapshot, I usually name the snapshot folder (i.e., the copy of the original directory tree) like the date at which it was taken. ↩︎
The “stateless” comparison has limitations, of course: e.g., if you modified a file and then also renamed (or moved) it, then
snapdiff would report that as one file deleted and one added. ↩︎
CRC means “cyclic redundancy check”, and it’s most widely known for error detection in networking and storage systems. ↩︎
External SSDs are somewhere in between HDDs and built-in SSDs, and their read performance largely depends on the characteristics of the connector interface. ↩︎