Privacy by design: a framework for design in the age of big data

‘How does big data affect personal privacy?’ and ‘in what specific way are privacy and big data connected?’ are two questions we are exploring in our research on big data and privacy. A third question concerns a possible way out: how can we organize privacy in the age of big data?

One part of the privacy issue related to big data is re-identification. As more data from more sources accumulates around a single individual, it becomes easier to re-identify that individual despite de-identification efforts. Traditional methods for de-identification, such as anonymization, pseudonymization, encryption, key-coding and data sharding, are becoming less and less effective. “Re-identification science disrupts the privacy policy landscape by undermining the faith that we have placed in anonymization”, according to Paul Ohm, privacy expert at the University of Colorado.
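To make the re-identification risk concrete, here is a minimal Python sketch of a linkage attack. It is only an illustration: the two datasets, the names in them and the choice of quasi-identifiers (zip code, birth date, sex) are made up for this example and do not come from any real source.

```python
# Hypothetical toy data: names were stripped, but quasi-identifiers remain.
deidentified_visits = [
    {"zip": "80302", "birth_date": "1975-03-14", "sex": "F", "diagnosis": "asthma"},
    {"zip": "80302", "birth_date": "1982-11-02", "sex": "M", "diagnosis": "diabetes"},
    {"zip": "80401", "birth_date": "1975-03-14", "sex": "F", "diagnosis": "migraine"},
]

# Hypothetical second source that still carries names (e.g. a public register).
public_register = [
    {"name": "Alice Example", "zip": "80302", "birth_date": "1975-03-14", "sex": "F"},
    {"name": "Bob Example", "zip": "80302", "birth_date": "1982-11-02", "sex": "M"},
    {"name": "Carol Example", "zip": "80401", "birth_date": "1990-06-30", "sex": "F"},
]

# Assumption: these three fields together are close to unique per person.
QUASI_IDENTIFIERS = ("zip", "birth_date", "sex")


def quasi_key(record):
    """Build the combination of quasi-identifiers used for linking."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def reidentify(deidentified, identified):
    """Link de-identified rows to names when the quasi-key match is unique."""
    names_by_key = {}
    for person in identified:
        names_by_key.setdefault(quasi_key(person), []).append(person["name"])

    matches = []
    for row in deidentified:
        names = names_by_key.get(quasi_key(row), [])
        if len(names) == 1:  # only an unambiguous match counts as re-identified
            matches.append((names[0], row["diagnosis"]))
    return matches


linked = reidentify(deidentified_visits, public_register)
```

In this toy example two of the three "anonymous" rows can be tied back to a name. Every extra source that shares these fields increases the chance of a unique match, which is the effect described above.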

One interesting take on this matter is presented by Jeff Jonas, Chief Scientist of the Entity Analytic Solutions group and an IBM Fellow, in a paper called Privacy By Design (in collaboration with Ann Cavoukian). He presents an ‘anonymous resolution’ approach that decreases the risk of re-identification, based on seven design principles:

FULL ATTRIBUTION: Every observation (record) needs to know from where it came and when. There cannot be merge/purge data survivorship processing whereby some observations or fields are discarded.

DATA TETHERING: Adds, changes and deletes occurring in systems of record must be accounted for, in real time, in sub-seconds.

ANALYTICS ON ANONYMIZED DATA: The ability to perform advanced analytics (including some fuzzy matching) over cryptographically altered data means organizations can anonymize more data before information sharing.

TAMPER-RESISTANT AUDIT LOGS: Every user search should be logged in a tamper-resistant manner — even the database administrator should not be able to alter the evidence contained in this audit log.

FALSE NEGATIVE FAVORING METHODS: The capability to more strongly favor false negatives is of critical importance in systems that could be used to affect someone’s civil liberties.

SELF-CORRECTING FALSE POSITIVES: With every new data point presented, prior assertions are re-evaluated to ensure they are still correct, and if no longer correct, these earlier assertions can often be repaired, in real time.

INFORMATION TRANSFER ACCOUNTING: Every secondary transfer of data, whether to human eyeball or a tertiary system, can be recorded to allow stakeholders (e.g., data custodians or the consumers themselves) to understand how their data is flowing.
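Two of these principles, analytics on anonymized data and tamper-resistant audit logs, can be sketched in a few lines of Python. This is a rough illustration under my own assumptions, not the method from the paper: `SHARED_KEY`, `anonymize` and `AuditLog` are hypothetical names, keyed hashing (HMAC) stands in for "cryptographically altered data", the only "fuzzy" matching is case and whitespace normalization, and a hash chain makes tampering detectable rather than impossible.

```python
import hashlib
import hmac
import json

# Assumption: the parties sharing data agreed on this secret out of band.
SHARED_KEY = b"hypothetical-shared-secret"


def anonymize(value, key=SHARED_KEY):
    """Replace a value with a keyed hash so it can be matched but not read."""
    normalized = " ".join(value.lower().split())  # crude normalization only
    return hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


class AuditLog:
    """Append-only log where each entry commits to the hash of the previous one."""

    GENESIS = "0" * 64

    def __init__(self):
        self.entries = []

    def _digest(self, prev_hash, payload):
        body = json.dumps(payload, sort_keys=True)
        return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()

    def append(self, user, action, when):
        prev_hash = self.entries[-1]["hash"] if self.entries else self.GENESIS
        payload = {"user": user, "action": action, "when": when}
        self.entries.append(
            {"payload": payload, "prev": prev_hash, "hash": self._digest(prev_hash, payload)}
        )

    def verify(self):
        """Recompute the chain; any edited entry breaks it from that point on."""
        prev_hash = self.GENESIS
        for entry in self.entries:
            expected = self._digest(prev_hash, entry["payload"])
            if entry["prev"] != prev_hash or entry["hash"] != expected:
                return False
            prev_hash = entry["hash"]
        return True


# Matching works on the tokens, without exchanging the underlying names.
token_a = anonymize("Jane  DOE")
token_b = anonymize("jane doe")
same_person = hmac.compare_digest(token_a, token_b)

# Every search is logged; a later edit to an entry is detected by verify().
log = AuditLog()
log.append("analyst1", "search:" + token_a, "2012-06-01T10:00:00Z")
log.append("dba", "export", "2012-06-01T10:05:00Z")
intact_before = log.verify()
log.entries[0]["payload"]["user"] = "someone_else"  # simulated tampering
intact_after = log.verify()
```

In a real system the latest hash would also have to be stored somewhere the database administrator cannot reach, otherwise the whole chain could simply be recomputed after an edit.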

While this framework reduces the risks rather than completely solving the issue (which is impossible), I think privacy needs to be addressed from the start. Addressing it at the design/architecture stage of systems is a proactive approach. Building in privacy-enhancing elements by design can minimize privacy harm and in some cases might prevent harm in the first place. How do you feel about organizing privacy in the age of big data? And what about organizing privacy by design?

Read the paper.

