Content from 2011-10

On Data Loss

posted on 2011-10-27 21:02:24

I've been meaning to blog more and have a more long-form, personal post in the works. Hopefully I'll get that out tonight. I don't reflect as much through writing as I used to and I miss that. Anyway, something interesting just happened at work. We have a beloved IRC bot at my office named olga. One of our favorite features of olga is that she'll write a haiku for us on demand. More precisely, we can give her a phrase that is five or seven syllables and she'll remember it. When we ask her to construct a haiku she picks two random 5-syllable phrases and a random 7-syllable phrase. The most impressive invocation of olga haiku I've seen to date is: "What we do in life / Is there a step I'm missing? / Inexorably". I wish *I'd* fucking written that. It's gorgeous.
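For the curious, the haiku trick is simple enough to sketch in a few lines of Python. This is not olga's actual code: `make_haiku` and its arguments are names invented for illustration, and picking two *distinct* five-syllable phrases is my assumption (olga may well allow repeats).

```python
import random


def make_haiku(fives, sevens, rng=random):
    """Build a haiku from stored phrases: a five, a seven, then another five.

    `fives` and `sevens` are plain lists of phrases that people have already
    taught the bot. Nothing here counts syllables; the bot trusts its users.
    """
    if len(fives) < 2 or not sevens:
        raise ValueError("need at least two fives and one seven")
    # Assumption: the two five-syllable lines are distinct phrases.
    first, last = rng.sample(fives, 2)
    middle = rng.choice(sevens)
    return " / ".join([first, middle, last])
```

Calling `make_haiku(fives, sevens)` with our stored phrases would return one line in the same "five / seven / five" shape as the haiku quoted above.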

So...I inadvertently deleted all of olga's sevens. Someone was asking for an example of how to *remove* an entry and I posted an example that apparently matched a Perl regex to nuke the universe. So I trust Perl even less now and I've never even actually used it. The fiasco resulted in this delightful GitHub commit fixing the vulnerability. We modify olga's haiku database a few dozen times a day, conservatively. It's like a collective cultural store for our exceedingly delightful and nerdy hackers. After the chaos and laughter subsided, a coworker and I grepped through our IRC logs of the past few months searching for matches to the "haiku add.*sevens" pattern. We'll probably be able to restore things decently enough. There are server admins with still more extensive logs we may be able to get access to...but that's not what's interesting about this to me. What's interesting to me is the emotional response we had to the whole event, which made me think that data loss with computers in modern times can be somewhat akin to a phantom limb sensation.
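As an aside, the log-grepping recovery could be scripted roughly like this. It is only a sketch: the "haiku add.*sevens" pattern is the one we actually searched for, but the log line layout, the assumption that the phrase is whatever follows "sevens", and the names `ADD_SEVEN` and `recover_sevens` are all guesses made up for illustration.

```python
import re

# Assumed (hypothetical) log line shape:
#   "21:00 <nick> olga: haiku add sevens Some seven syllable phrase"
# The non-greedy ".*?" stops at the first "sevens" so that a phrase which
# itself contains the word "sevens" is still captured whole.
ADD_SEVEN = re.compile(r"haiku add.*?sevens\s+(?P<phrase>.+)$")


def recover_sevens(log_lines):
    """Return the unique phrases found in matching lines, in first-seen order."""
    seen = {}  # dicts keep insertion order, so this doubles as an ordered set
    for line in log_lines:
        match = ADD_SEVEN.search(line.rstrip("\n"))
        if match:
            seen.setdefault(match.group("phrase").strip(), None)
    return list(seen)
```

Feeding it an open log file (`recover_sevens(open(path))`, with `path` being wherever your logs live) yields the candidate phrases. It makes no attempt to honor removals that people made on purpose, so the result is a superset to review by hand rather than a faithful restore.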

Think about it. You have *no* idea how much data you have and, if you're a quirky statistical outlier archivist-type like me, only a vague idea of what the most important and recent elements in that dataset are. You probably don't even know *where* your data is. It's on Google's servers, Amazon's servers, your phone, your PC (possibly several) and maybe even your personal server. Even if you're not a hacker/computer type, you've got data coming out of your ears. In modern society, effectively *EVERYONE* is an information packrat. And the question is, what cognitive and emotional burden does that sort of behavior result in?

The most interesting part to me is that I suspect many cases of data loss aren't troublesome unless you're aware of them. We have *so much* data that unless you're sure an operation accidentally deleted data you didn't intend to lose, you might never miss it. How big is olga's 7-phrase dataset? I have no idea. But I honestly expect it's in the high hundreds if not thousands of phrases. If we lost 100 of them...would we ever notice? Doubtful. But *knowing* that a bunch of data was accidentally lost feels like losing property except for the fact that it's hard to assess just what value that property had, what measures can or should be taken to ameliorate the event, what meaning the loss really has.

This has funny implications for SciFi authors as well. The idea of memory diamond or some sort of storage medium for someone to record their entire life (lifelogging) is predicated on the assumption that no data would ever be lost...because at that scale, you simply don't know what is important and what can be lost, since the *value* of the data is context-sensitive, especially temporally. Moments of nostalgia make it impossible to say, "Scrap this, lose that", and if you don't know what data has been lost then the whole thing is suspect. And don't laugh *too* hard: between Steve Mann, the MIT Media Lab and folks at Yale and elsewhere, there's a decent amount of research towards making lifelogging possible. Anyway, I just wanted to get some of these thoughts down. It's definitely been another fun day at the office. :)

Editor's Note: Discussions with a coworker have reminded me of two things:
1) We really are likely statistical outliers much more than this post suggests. I tried to hint at this possibility with the archivist bit but it's nevertheless important to reinforce.
2) People obviously know the difference between important data and unimportant data. This interesting phantom limb-like effect really seems to come out with tons of miscellaneous, less important data (IRC logs, browsing history, etc) that you keep because you can rather than essential information or emotionally substantive data (photos, videos, MP3s, certain emails, etc).

Unless otherwise credited all material Creative Commons License by Brit Butler