Coping with Copies

I like to start every year with some interesting goals/projects to occupy my free time. Sometimes I blog a bit about it, but for the most part I just spend my time doing the thing I planned to be doing. This is one of those times I want to talk through what I'm thinking, because the problem I'm looking at for 2020 is one that has been hashed and rehashed over and over again from too many different angles to count: robust digital data storage.

You see, I not only like to have a yearly challenge, every decade or so I like to look at some larger issue that has bugged me over the last 10 years and think about a way to tackle it that will satisfy me for at least the next 10 years. One of my previous efforts was when I tackled email abuse (aka, spam), and my solution to that remains very effective. I'm hoping I'll now be able to devise a good way to manage my files.

It's not like I have vast amounts of data, either. What I have, though, is a veritable soup of roughly 8TB of files spread over 10 storage devices. Some of it is backups (like my 1TB Time Machine disk for my Macs), some of it is more ad-hoc incremental copies of various server setups, some of it is ancient media files I may never even need again (Remember .mod/.s3m files? I've still got a questionably-useful archive of 108 of them taking up 18.9MB of space.), some of it is version controlled repositories that extend back to archive software that no modern OS install could run, and some of it is one-off scripts that were useful to automate some small task for a client I haven't spoken to in over 20 years (but, hey, you never know when that code could become handy to have around again).

In truth, if you boil it all down, I probably only have about 500GB of unique data that needs to be managed. Having that duplicated 2 or 3 times to keep it safe from loss would still only account for 2TB of space. That other 6TB I have is essentially wasted. I mean, have you ever migrated an account to a newer version of macOS and/or looked at what gets stuffed into a Time Machine backup? There is so much outdated cruft, and so many useless cache files constantly being processed, that I have to take it upon myself to do some "refactoring" (to borrow a term from software development).

And that's how I started off 2020. For these first three months, I have essentially been doing a survey of all the projects I could find that were related to usefully managing a largish collection of files. Many looked promising, especially some things based on Git (like git-annex or bup), but for all their promises of being able to scale, they still choked with memory allocation errors on my Raspberry Pi target platform. The fundamental problem with all these pre-packaged solutions is that, in an attempt to be fool-proof, they try to do too much, and so they fail in very basic ways even when they're not presented with "big data" levels of difficulty.

So, as is the theme of this web site, I'm going to search for an impossibly stupid solution to this issue. Of note, some years ago I started work on a much more ambitious project called a Meta Object Manager (MOM). Instead of going that same "do too much" route again myself, this time I'm going to look for a Retro Avant-Garde way to get the job done. That means starting with some basic Unix shell scripting (that covers Mac and Linux; I leave Windows support as an exercise to the reader) and, in particular, the idea of doing set operations on file metadata.

Manifest Destiny

My goal, therefore, is to process all my existing data into a "unified" archive. Note that doesn't necessarily mean a centralized solution, but it does mean that the processing done should be relatively platform agnostic such that the metadata format(s) can be layered on top of just about any filesystem. Likewise, I want to dispel the old notion that "replication/redundancy is not a backup", and similar notions about the need for active/intrusive version control management. That is to say, ideally, a single archive should be able to serve as the source of all historically meaningful data, and it should be relatively straightforward to reconstruct as many different "views" on that data as are needed.

With that as our destination, we now look at our starting position. As I said, it's a mess of files scattered across many storage devices. The best thing to do for now is to start simple and only scale up the complexity when necessary. So we'll just examine the files on one device, and we'll stick with just looking at the data itself (i.e., ignoring ownership and other metadata issues). A common way to do this for many existing backup and version control systems is to do content-based identification via hashing. This is also a reasonable starting point for our purposes, so let's start by creating a catalog/index/manifest of a directory of files we have on our initial device:

% for x in * ; do find "$x" -type f -exec openssl sha256 -r '{}' \; | sort > "$x.manifest"; done

Selection is done with the common Unix find command. Abstractly, we just need some way to get a list of files to work with. A simple find on a directory gives us that, but we could also use it to be more focused if we wanted to, say, concentrate only on files greater than 10MB, or just MP3 audio files, or whatever. For convenience, I'm also going to sort those files in a uniform manner to ease future operations.
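
For instance, a couple of narrower selections might look like the following sketch; the Media directory and manifest filenames here are just placeholders, and the size/name tests are ordinary find predicates:

% find Media -type f -size +10M -exec openssl sha256 -r '{}' \; | sort > Media-large.manifest
% find Media -type f -iname '*.mp3' -exec openssl sha256 -r '{}' \; | sort > Media-mp3.manifest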

On those files I execute the same hash algorithm I use to identify my external content here. My work on the MOM was long enough ago that I used MD5 for essentially the same purpose. Many modern systems, like Git and some UUID variants, are based on the 160-bit SHA1 algorithm. While that's probably still fine for a while for the task of identifying items, I felt it was worth using the 256-bit algorithm, not only to be cryptographically secure, but to further minimize the chance of a collision on the off chance I end up tracking a dataset of many millions to trillions of files over the next 10+ years. Also, the aim is to only calculate this value once for any particular file, so even if it is slightly more computationally expensive, it should be worth it in the long run.
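
For reference, the -r flag asks openssl for coreutils-style output, one "<hash> *<path>" line per file, which is what the later manipulations lean on: the first 64 characters of every manifest line are the hex digest. Hashing an empty file (empty.dat is just an illustrative name) should produce something like this:

% openssl sha256 -r empty.dat
e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 *empty.dat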

Reduce, Reuse, Recycle

Now that we have some metadata files that identify a bunch of content data, let's think about what we can do with it. My natural desire is to start shoving it into a database (that's what I did when I implemented the MOM), but I'm going to put off adding that kind of complexity until I reach a point where I have enough data to justify it (e.g., 500K+ files). For now, I just want to derive a few specific lists from these manifest files.

# Output one representative line per distinct hash (first 64 characters of each entry).
function manifest_uniques() { uniq -w 64 "$1"; }
# Output every line whose hash occurs more than once, grouped by hash with blank-line separators.
function manifest_duplicates() { uniq -w 64 --all-repeated=separate "$1"; }

% for x in *.manifest; do manifest_uniques "$x" > "$x.uniq"; done

First, I want to determine which file content is truly unique, and the uniques function does that, giving one line per distinct hash. Along with the original manifest file, this is the set of items that will allow me to restore the entire contents of the full archive. Second, I may want to purge any "unnecessary" duplicates, and the duplicates function gives me that list (grouped by hash value). An alternative way to filter those duplicates is to take the list of unique items from the first function and run it, along with the original manifest, through a (possibly GUI) diff tool to see which lines were removed. Either way, if you choose to delete any duplicate files, the manifest needs to be updated (either by hand or by regenerating it), along with any lists derived from it.
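
As a rough sketch (using Archive.manifest as a stand-in name, and assuming the "<hash> *<path>" line format), the duplicates list can be boiled down to just the removable paths by dropping the first entry of each hash group and then stripping columns 1 through 66, i.e. the digest plus the " *" separator:

% manifest_duplicates Archive.manifest | awk 'BEGIN { RS = ""; FS = "\n" } { for (i = 2; i <= NF; i++) print $i }' | cut -c67-

Which copy in each group gets kept is arbitrary here, so sanity-check the resulting list before feeding it to rm.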

Very Unique

Having distilled each manifest list down to its unique items, we can now think about combining them in order to create larger archives.

# Combine two sorted manifests into one deduplicated list.
function manifest_join() { sort "$1" "$2" | uniq -w 64; }
# Show the entries (joined on the leading hash field) that appear in the second manifest but not the first.
function manifest_missing() { join -v2 "$1" "$2"; }
# Fold the second manifest into the first, in place.
function manifest_merge() { MERGE_TMP=$(mktemp -p "$(dirname "$1")") || { echo "$0: could not create merge file" >&2; return 1; }; manifest_join "$1" "$2" > "$MERGE_TMP" && mv "$MERGE_TMP" "$1"; }

The join function is much like the previous uniques function, but operates on two manifests to combine them. The missing function is useful to see what items in the second/newer archive aren't already in the first/larger archive; most helpful when you're looking to whittle down a number of mostly-identical backups. And merge, our first file-altering function, essentially adds the missing entries from the second file to the first file.

% manifest_join UCloud.manifest.uniq UPrivate.manifest.uniq > UProtected.uniq
% manifest_join Archive.uniq UProtected.uniq > UniversalManifest.uniq
% manifest_missing UniversalManifest.uniq Subsume.manifest.uniq | wc -l
11
% manifest_merge UniversalManifest.uniq Subsume.manifest.uniq 
% wc -l UniversalManifest.uniq 
567074 UniversalManifest.uniq

So after running these relatively simple commands across all my storage, I find that I do indeed have over half a million unique files (although I'm not yet sure how much underlying storage that will actually represent). While it wasn't particularly quick to index 8TB of data, my Raspberry Pi with 1GB of RAM was able to do this without allocation errors. All in all, this has been a reasonably successful experiment, and I'm going to further explore using a system like this to better manage my storage.

Next Steps

The first thing to note is that all I've done so far is gather up a bit of metadata. While I have certainly been able to find and manually remove some unneeded duplicates, I still need to write code that uses the manifest file(s) to either create an archive directory containing just the unique files or re-create a directory tree from said archive. To be really useful, the metadata and the file content data need to interact.
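
To give a flavor of what that interaction might look like, here is a minimal sketch (the manifest_pool name and its pool directory argument are placeholders of mine, not anything established above) that copies every entry in a manifest into a content-addressed pool, stored under its hash; it assumes the "<hash> *<path>" line format and filenames without embedded newlines:

function manifest_pool() {
    # Hypothetical sketch: copy every file listed in manifest $1 into pool directory $2,
    # naming each copy after its 64-character hash so duplicate content collapses to one file.
    local line hash path
    mkdir -p "$2" || return 1
    while IFS= read -r line; do
        hash=${line:0:64}
        path=${line:66}
        [ -e "$2/$hash" ] || cp -p -- "$path" "$2/$hash"
    done < "$1"
}

Re-creating a directory tree from that pool is just the reverse mapping: walk a manifest and copy each hash-named file back out to its listed path.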

Similarly, a must-have would be basic file operations (mv, rm, etc.) that also update the manifest entries for the item being modified. At some point, it might make sense to wrap this all up behind a FUSE interface.
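
As a hypothetical taste of that (manifest_rm is my own placeholder name), a manifest-aware rm only needs to drop the matching line before unlinking the file; the path starts at column 67 of each "<hash> *<path>" entry:

# Hypothetical sketch: remove file $2 and drop its line from manifest $1.
function manifest_rm() { awk -v p="$2" 'substr($0, 67) != p' "$1" > "$1.tmp" && mv "$1.tmp" "$1" && rm -- "$2"; }

A real version would want the same treatment for mv, cp, and friends, which is exactly where a FUSE layer starts to look attractive.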

As noted, 500K+ is an awful lot of items to be managing with just text files and shell scripts. I still don't quite feel the need to jump up to using a database again, or to reimplement things in Ruby or C or whatever, but I am going to keep that in the back of my mind as I build this out further. The important thing here is not the code itself, but the way in which we can better manage file data and metadata over the long term. Stay tuned to see how this progresses, and let me know if you find this approach useful or have pointers to similar systems that already exist.