Researchers at two European universities have reinforced earlier work showing just how easy it is to turn anonymized data back into personally identifiable data. This demonstrates, once again, that people must be given control over their data and that, even then, released data should be converted into synthetic data; at the moment, too much detailed consumer data is being released, putting many people at risk.
“Researchers from two universities in Europe have published a method they say is able to correctly re-identify 99.98% of individuals in anonymized data sets with just 15 demographic attributes.
Their model suggests complex data sets of personal information cannot be protected against re-identification by current methods of “anonymizing” data — such as releasing samples (subsets) of the information.
Indeed, the suggestion is that no “anonymized” and released big data set can be considered safe from re-identification — not without strict access controls.
‘Our results suggest that even heavily sampled anonymized datasets are unlikely to satisfy the modern standards for anonymization set forth by GDPR [Europe’s General Data Protection Regulation] and seriously challenge the technical and legal adequacy of the de-identification release-and-forget model,’ the researchers from Imperial College London and Belgium’s Université Catholique de Louvain write in the abstract to their paper, which has been published in the journal Nature Communications.
It’s of course by no means the first time data anonymization has been shown to be reversible. One of the researchers behind the paper, Imperial College’s Yves-Alexandre de Montjoye, has demonstrated in previous studies looking at credit card metadata that just four random pieces of information were enough to re-identify 90% of the shoppers as unique individuals, for example.
In another study, which de Montjoye co-authored, that investigated the privacy erosion of smartphone location data, researchers were able to uniquely identify 95% of the individuals in a data set with just four spatio-temporal points.
At the same time, despite such studies that show how easy it can be to pick individuals out of a data soup, “anonymized” consumer data sets such as those traded by brokers for marketing purposes can contain orders of magnitude more attributes per person.
The researchers cite data broker Experian selling Alteryx access to a de-identified data set containing 248 attributes per household for 120 million Americans, for example.
By their model’s measure, essentially none of those households are safe from being re-identified. Yet massive data sets continue being traded, greased with the emollient claim of ‘anonymity.’”
Read the full TechCrunch article here.
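To make the quoted findings more concrete, here is a minimal Python sketch (not the researchers' statistical model) that counts how many records in a made-up demographic table are unique on a growing combination of attributes. Every attribute name and value below is invented for illustration; the point is only that uniqueness climbs quickly as attributes are combined, which is why a few dozen attributes per person leave essentially no one hidden in the crowd.

```python
# Naive uniqueness check on a toy demographic table (illustrative only; this is
# NOT the re-identification model from the Nature Communications paper).
import random
from collections import Counter

random.seed(0)

# Hypothetical attribute domains.
ZIP_CODES = [f"{z:05d}" for z in range(100, 160)]   # 60 ZIP codes
BIRTH_YEARS = list(range(1940, 2005))               # 65 birth years
GENDERS = ["F", "M", "X"]
MARITAL = ["single", "married", "divorced", "widowed"]

def random_record():
    """One made-up person: (zip, birth year, gender, marital status)."""
    return (
        random.choice(ZIP_CODES),
        random.choice(BIRTH_YEARS),
        random.choice(GENDERS),
        random.choice(MARITAL),
    )

population = [random_record() for _ in range(50_000)]

def fraction_unique(records, num_attributes):
    """Share of records whose first num_attributes values occur exactly once."""
    key = lambda r: r[:num_attributes]
    counts = Counter(key(r) for r in records)
    return sum(1 for r in records if counts[key(r)] == 1) / len(records)

for k in range(1, 5):
    print(f"{k} attribute(s): {fraction_unique(population, k):.1%} unique")
```

Even on this crude, uniformly random toy data the unique share jumps from effectively zero with one or two attributes to roughly a third of all records with four; real demographic attributes are far more skewed and therefore far more identifying.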
Distributed digital IDs combined with self-sovereign identity principles benefit all participants: the user, the organization that wants to authenticate the user, and the organizations that verify the various claims a user makes about themselves. This is a long-term effort, but the idea has already been embraced by IBM, Microsoft, Mastercard, and others, and has been implemented by the province of British Columbia in its Verifiable Organizations Network (VON).
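For readers unfamiliar with the pattern, the sketch below shows the core of the verifiable-claims idea in rough form: an issuer signs a claim about a user, the user holds and presents it, and any relying party can check the signature against the issuer's public key without contacting the issuer. This is a simplified illustration, not the VON or any particular DID standard; it assumes the widely used third-party `cryptography` package, and the identifiers and claim fields are hypothetical.

```python
# Simplified verifiable-claim flow (illustrative; not a real DID/VC implementation).
# Requires the third-party "cryptography" package: pip install cryptography
import json

from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

# 1. Issuer (e.g. a government registry) creates a signing key pair.
issuer_key = Ed25519PrivateKey.generate()
issuer_public_key = issuer_key.public_key()   # published, e.g. in a DID document

# 2. Issuer signs a claim about the subject; the subject holds the signed claim.
claim = json.dumps({
    "subject": "did:example:alice",           # hypothetical identifier
    "claim": "age_over_18",
    "value": True,
}, sort_keys=True).encode()
signature = issuer_key.sign(claim)

# 3. A relying party verifies the claim offline, using only the issuer's public key.
try:
    issuer_public_key.verify(signature, claim)
    print("Claim accepted: signature matches the issuer's public key.")
except InvalidSignature:
    print("Claim rejected: signature does not verify.")
```

The design point is that the verifier never needs to query the issuer or see anything beyond the claim itself, which is what lets the user disclose only the attributes a transaction actually requires.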
However, even when an individual agrees to release data for analysis, that data should be synthesized, as described in this MIT News article. That is, any real personal data in a large dataset should be replaced with counterfeit data generated by a machine learning process that ensures the counterfeit data remains statistically valid for the research being conducted. At that point, all of the actual personal data can be scrubbed. Companies have already turned this science into practical products.
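As a rough illustration of the synthesis idea, the sketch below fits simple per-column distributions to a toy table of "real" records and then emits counterfeit rows that preserve aggregate statistics while sharing no actual row with the original. It is a minimal sketch that assumes independent columns; production tools, including the work described in the MIT News article, model joint structure and privacy guarantees far more rigorously, and every column name and value here is invented.

```python
# Toy synthetic-data generator (illustrative only): fit marginal distributions to
# "real" records, then release counterfeit rows drawn from those fits.
import random
import statistics
from collections import Counter

random.seed(1)

# Stand-in for sensitive source data: (age, subscription plan) per customer.
real_rows = [
    (random.gauss(42, 12), random.choice(["basic", "plus", "plus", "premium"]))
    for _ in range(10_000)
]

# Fit simple marginal models (a normal fit for age, frequencies for plan).
ages = [age for age, _ in real_rows]
age_mean, age_stdev = statistics.fmean(ages), statistics.stdev(ages)
plan_counts = Counter(plan for _, plan in real_rows)
plans, weights = zip(*plan_counts.items())

def synthetic_row():
    """Draw one counterfeit record from the fitted marginals (columns independent)."""
    return (random.gauss(age_mean, age_stdev), random.choices(plans, weights=weights)[0])

synthetic_rows = [synthetic_row() for _ in range(len(real_rows))]

# Aggregates survive even though no real person's row is ever released.
print("real mean age:      ", round(statistics.fmean(a for a, _ in real_rows), 1))
print("synthetic mean age: ", round(statistics.fmean(a for a, _ in synthetic_rows), 1))
print("real plan mix:      ", {p: round(c / len(real_rows), 2) for p, c in plan_counts.items()})
print("synthetic plan mix: ", {p: round(c / len(synthetic_rows), 2)
                               for p, c in Counter(p for _, p in synthetic_rows).items()})
```

Once the synthetic table has been validated against the statistics the research actually needs, the real rows can be dropped from the release entirely.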
Overview by Tim Sloane, VP, Payments Innovation at Mercator Advisory Group