In an interview with The Observer newspaper, Dr Ian Brown of the Oxford Internet Institute, who is writing a report on anonymous datasets for the European Commission, warns that "criminals could identify individuals through mobile phone data and use the information to track people's movements and find out when they are away from home". His concerns have been piqued, it would seem, by the problem of statistical de-anonymisation.

Statistical what? Well, there have been great advances (although that's not perhaps the right word) in the last couple of years when it comes to the re-identification of individuals whose anonymity is supposedly guaranteed through the use of anonymous datasets. The concept is a simple enough one: take a load of data, strip out the personally identifying information, and you are left with great source material for statistical research without compromising the privacy of the individuals whose data appears within it.

Except it would seem that it is now quite possible to do just that: to compromise the privacy of those individuals by piecing the information together like a jigsaw, using some frankly rather frightening de-anonymisation algorithms.

It's true to say that the notion of anonymous datasets would appear to have been well and truly smashed to pieces. The statistical de-anonymisation process used by one US-based research team, for example, enabled them to take a publicly available and supposedly anonymous list of the movie ratings of some half a million Netflix subscribers and match movie preferences with individuals they also identified from the available data.
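To make the jigsaw idea a little more concrete, here is a minimal sketch of how such matching can work in principle. It is a toy illustration and not the researchers' actual algorithm: the records, the names (`released`, `aux`, `film_weights`, `score`, `best_match`) and the thresholds are all hypothetical. The assumption is that an attacker knows a few approximate ratings for a named person and looks for the one pseudonymous record that agrees with them, giving more weight to films that few people have rated.

```python
import math
from collections import Counter

# Hypothetical "anonymised" release: pseudonymous IDs mapped to {film: rating}.
released = {
    "user_001": {"Brazil": 5, "My Own Private Idaho": 4,
                 "Close Encounters": 2, "Jaws": 4},
    "user_002": {"Jaws": 5, "Close Encounters": 5, "E.T.": 4},
    "user_003": {"Brazil": 2, "Jaws": 4, "E.T.": 3},
}

# Hypothetical auxiliary knowledge about one named person,
# for example gleaned from reviews posted under their real name.
aux = {"Brazil": 5, "My Own Private Idaho": 5, "Close Encounters": 1}


def film_weights(records):
    """Weight each film by rarity: the fewer raters, the more identifying."""
    counts = Counter(film for ratings in records.values() for film in ratings)
    return {film: 1.0 / math.log(1 + n) for film, n in counts.items()}


def score(known, ratings, weights, tolerance=1):
    """Sum the weights of films whose ratings agree within the tolerance."""
    return sum(
        weights.get(film, 0.0)
        for film, rating in known.items()
        if film in ratings and abs(ratings[film] - rating) <= tolerance
    )


def best_match(known, records, min_gap=0.5):
    """Return the top-scoring pseudonym, or None if the runner-up is too close.

    Assumes at least two records, so that there is a runner-up to compare with.
    """
    weights = film_weights(records)
    scores = {uid: score(known, ratings, weights)
              for uid, ratings in records.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    top, runner_up = ranked[0], ranked[1]
    return top if scores[top] - scores[runner_up] >= min_gap else None


print(best_match(aux, released))
```

With this made-up data the sketch singles out `user_001`, even though none of the known ratings is required to match exactly. Give it only a rating for a film everybody has seen and it declines to name anyone, which is rather the point: it is the unusual preferences that give you away.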

OK, no big deal, you might say. After all, who cares if the world knows that I like Brazil and My Own Private Idaho but am no great fan of Close Encounters of the Third Kind? The truth is that nobody, apart from maybe Spielberg (and even that's a long shot), gives a hoot about my movie viewing preferences. However, it is the principle of supposedly anonymous datasets being nothing of the sort that is at stake here, and plenty of people care about that.

What if the data that had been extracted came from a 'sanitised' medical records database being used to provide government health statistics, for example? Would you be a little bit more concerned then? Even when it comes to that Netflix dataset, the researchers claimed to have "successfully identified the Netflix records of known users, uncovering their apparent political preferences and other potentially sensitive information".

Now how about if I tell you that this is no new revelation, no just-discovered flaw in the system, no breaking news story? How about if I tell you that the academic paper being referred to here was actually published way back in 2008? In their paper 'Robust De-anonymization of Large Sparse Datasets', authors Arvind Narayanan and Vitaly Shmatikov presented what they called at the time "a new class of statistical de-anonymisation attacks against high-dimensional micro-data, such as individual preferences, recommendations, transaction records and so on".

About the Author

As Editorial Director and Managing Analyst with IT Security Thing I am putting more than two decades of consulting experience into providing opinionated insight regarding the security threat landscape for IT security professionals. As an Editorial Fellow with Dennis Publishing, I bring more than two decades of writing experience across the technology industry into publications such as Alphr, IT Pro and (in good old-fashioned print) PC Pro. I also write for SC Magazine UK and Infosecurity, as well as The Times and Sunday Times newspapers. Along the way I have been honoured with a Technology Journalist of the Year award, and three Information Security Journalist of the Year awards. Most humbling, though, was the Enigma Award for 'lifetime contribution to IT security journalism' bestowed on me in 2011.