Busted: the anonymous dataset myth

happygeek

In an interview with The Observer newspaper, Dr Ian Brown from the Oxford Internet Institute, who is writing a report on anonymous datasets for the European Commission, warns that "criminals could identify individuals through mobile phone data and use the information to track people's movements and find out when they are away from home". His concerns have been piqued, it would seem, by the problem of statistical de-anonymisation.

Statistical what? Well, there have been great advances (although that's perhaps not the right word) in the last couple of years when it comes to the re-identification of individuals whose anonymity is supposedly guaranteed through the use of anonymous datasets. The concept is simple enough: take a load of data, strip out the personally identifying information, and you are left with great source material for statistical research without compromising the privacy of the individuals whose data appears within it.

Except it would seem that it is now quite possible to do just that, compromising the privacy of those individuals by piecing together the information like a jigsaw using some frankly rather frightening de-anonymisation algorithms.

The notion of anonymous datasets would appear to have been well and truly smashed to pieces. The statistical de-anonymisation process used by one US-based research team, for example, enabled them to take a publicly available and supposedly anonymous list of the movie ratings of some half a million Netflix subscribers and match movie preferences to individuals they were able to identify from the available data.

OK, no big deal, you might say. After all, who cares if the world knows that I like Brazil and My Own Private Idaho but am no great fan of Close Encounters of the Third Kind? The truth is that nobody, apart from maybe Spielberg (and even that's a long shot), gives a hoot about my movie viewing preferences. However, it is the principle of supposedly anonymous datasets being nothing of the sort that is at stake here, and plenty of people care about that.

What if the data that had been extracted came from a 'sanitised' medical records database being used to provide government health statistics, for example? Would you be a little bit more concerned then? Even when it comes to that Netflix dataset, the researchers claimed to have "successfully identified the Netflix records of known users, uncovering their apparent political preferences and other potentially sensitive information".

Now how about if I tell you that this is no new revelation, no just-discovered flaw in the system, no breaking news story? How about if I tell you that the academic paper being referred to here was actually published way back in 2008? In the paper "Robust De-anonymization of Large Sparse Datasets", authors Arvind Narayanan and Vitaly Shmatikov present what they called at the time "a new class of statistical de-anonymisation attacks against high-dimensional micro-data, such as individual preferences, recommendations, transaction records and so on".
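To give a feel for how such an attack can work, here is a minimal Python sketch of a linkage attack on a tiny, made-up ratings table. It is not the authors' algorithm, only a simplified illustration of the general idea: score every "anonymous" record against a few ratings the attacker already knows about a target, give rarely rated titles more weight, and accept the top record only if it stands out clearly from the runner-up. The record IDs, ratings, function names, `tolerance` and the `phi` threshold are all assumptions invented for this example.

```python
import math
from statistics import pstdev

# Hypothetical "anonymised" dataset: record ID -> {title: rating}.
dataset = {
    "user_0001": {"Brazil": 5, "My Own Private Idaho": 4, "Close Encounters": 2},
    "user_0002": {"Brazil": 3, "Jaws": 5, "Close Encounters": 5},
    "user_0003": {"Jaws": 4, "E.T.": 5, "Close Encounters": 4},
    "user_0004": {"Jaws": 2, "E.T.": 3},
}

def support(records):
    """Count how many records rate each title."""
    counts = {}
    for ratings in records.values():
        for title in ratings:
            counts[title] = counts.get(title, 0) + 1
    return counts

def score(aux, ratings, counts, tolerance=1):
    """Sum the weights of titles whose rating roughly matches the
    attacker's auxiliary knowledge; rare titles weigh more."""
    total = 0.0
    for title, known_rating in aux.items():
        if title in ratings and abs(ratings[title] - known_rating) <= tolerance:
            total += 1.0 / math.log(1 + counts[title])
    return total

def best_match(aux, records, phi=1.5):
    """Return the record ID that best fits aux, or None if the top
    score does not stand out from the second-best by at least phi
    standard deviations."""
    counts = support(records)
    scores = {rid: score(aux, r, counts) for rid, r in records.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    if len(ranked) < 2:
        return None
    spread = pstdev(list(scores.values()))
    if spread == 0:
        return None
    gap = scores[ranked[0]] - scores[ranked[1]]
    return ranked[0] if gap / spread >= phi else None

# Two ratings gleaned from, say, a public review are enough here.
print(best_match({"Brazil": 5, "My Own Private Idaho": 4}, dataset))
# A single rating of a commonly rated title is ambiguous.
print(best_match({"Close Encounters": 4}, dataset))
```

In this toy table the first query singles out one record, while the second ties between two records and is rejected. The point of the sketch is that no name or email address is needed in the dataset itself: a handful of approximate ratings from another source can be enough to pick out one row.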

DeAnonym

I recently saw an even scarier attack where researchers used the browsing history together with information from a social network to de-anonymize users: http://honeyblog.org/archives/51-A-Practical-Attack-to-De-Anonymize-Social-Network-Users.html and http://www.iseclab.org/papers/sonda-TR.pdf
