beyond hashes and diffs

Ages ago, long lost to bitrot, I had a discussion over at Danny’s place about the logical direction of file/P2P storage towards a model that’s based upon file comparisons — hashes and diffs — and the underlying problems that might be encountered along the way.

We were talking about the distinction between what you might call ‘commodity data’ — downloaded media, shared documents, applications, core OS files — and personal data, and how that distinction is mediated through different licensing models, but also by a different sense of personal attachment, even if, at heart, data is data, and your most treasured photos can, for all their uniqueness, be diffed against somebody else’s.

Break down the contents of most hard drives, and you’ll find that it’s commodity data, for the most part, that has grown to fill the space now available, a new version of the push-pull that once existed between OS and hardware. A thousand hours of decently-compressed music fills around 100GB; the same in HD video takes you up to a terabyte.
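Those figures hold up as back-of-envelope arithmetic. A quick sketch, where the bitrates are my own assumptions rather than anything measured: 256 kbps for ‘decently-compressed’ music, and roughly 2.2 Mbps for modestly encoded HD video.

```python
def storage_gb(hours, kbps):
    """Decimal gigabytes needed for `hours` of media at a constant bitrate in kbps."""
    bits = hours * 3600 * kbps * 1000  # seconds times bits per second
    return bits / 8 / 1e9              # bits -> bytes -> gigabytes

# Both bitrates below are assumptions chosen for illustration.
music_gb = storage_gb(1000, 256)    # about 115 GB
video_gb = storage_gb(1000, 2200)   # about 990 GB, i.e. close to a terabyte

print(round(music_gb, 1), round(video_gb, 1))
```

Push the video bitrate up towards Blu-ray territory and the terabyte becomes several, but the order of magnitude is the point.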

Dropbox attempts to deal with the profusion of commodity data — whether obtained through purchase or P2P — through its segmented binary diffs, and it’s this model that makes most sense for generic cloud storage, even if it raises potential questions of legality and security.
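To make the idea concrete, here is a minimal sketch of hash-based block deduplication in general. It is not Dropbox’s actual implementation: the `BlockStore` class, the fixed four-byte block size and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib

BLOCK_SIZE = 4  # hypothetical and tiny, so the example stays readable


def split_blocks(data, size=BLOCK_SIZE):
    """Cut a byte string into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


class BlockStore:
    """Toy content-addressed store: each unique block is kept once, keyed by its hash."""

    def __init__(self):
        self.blocks = {}  # hex digest -> block bytes

    def put(self, data):
        """Store a file; return its manifest and how many new blocks were needed."""
        manifest, uploaded = [], 0
        for block in split_blocks(data):
            digest = hashlib.sha256(block).hexdigest()
            if digest not in self.blocks:  # only unseen blocks cost storage
                self.blocks[digest] = block
                uploaded += 1
            manifest.append(digest)
        return manifest, uploaded

    def get(self, manifest):
        """Rebuild a file from its list of block hashes."""
        return b"".join(self.blocks[d] for d in manifest)


store = BlockStore()
_, first = store.put(b"aaaabbbbcccc")          # three new blocks
manifest, second = store.put(b"aaaabbbbdddd")  # only the last block is new
print(first, second, store.get(manifest))
```

The second file shares two of its three blocks with the first, so only one block needs storing. It doesn’t matter whose file the shared blocks first came from, which is where the questions of legality and security creep in.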

Apple has decided to do things a little differently. What makes iTunes Match clever isn’t simply that it finally has the corporate clout to deliver an authorised version of Michael Robertson’s ; it’s that while charging a tacit $25 amnesty fee (call it ‘iLaunder’), Apple can surreptitiously sidestep the question of how to deal with fifteen years of MP3 hoarding. While upload-based music storage services presumably have to deal with a multitude of digital versions in different bitrates, laden with the quirks of optical drives and encoding software, Apple offers a trade-in for a canonical, cloud-based version deliverable to all authorised devices.

However much Apple paid out to the record labels, it’s a model that’s intended to save the company a huge amount in storage and bandwidth costs, and it sets a new precedent in how to approach commodity data online.

Still, I’m not getting rid of my FLAC archives.