Researchers Search for New Ways to Balance Big Data and Privacy

Imagine you have a database containing personal medical records of thousands — if not hundreds of thousands or even millions — of individuals. How do you go about extracting useful information from this cornucopia of data to develop life-saving treatments while making sure patient records stay private and protected?

Matt Fredrikson, a graduate student in WID’s Optimization group and the Department of Computer Sciences at UW-Madison, is part of a team working on solving just this conundrum.

Fredrikson and colleagues worked with a publicly available dataset containing genetic information that could prove to be invaluable in tailoring medical treatment and drug doses for patients. This dataset was used by the International Warfarin Pharmacogenetics Consortium to “train” a machine learning model to predict accurate doses of the drug Warfarin based on demographic and genetic data from patients.

“Machine learning sorts through the training dataset and distills it into a useful and efficient predictive algorithm,” says Fredrikson. Once a machine learning model is trained on a dataset containing personal information, that dataset is no longer needed by anyone using the machine learning system to make predictions.
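
To make the idea concrete, here is a minimal, hypothetical sketch of how such a dose-prediction model might be trained. The handful of demographic and genetic features, the made-up dose values and the choice of an off-the-shelf linear regression are all illustrative assumptions, not the consortium’s actual data or method.

```python
# Hypothetical sketch: training a dose-prediction model on demographic and
# genetic features. All data below is fabricated for illustration; it is not
# the International Warfarin Pharmacogenetics Consortium dataset.
import numpy as np
from sklearn.linear_model import LinearRegression

# Each row: [age_decade, weight_kg, VKORC1 variant (0-2), CYP2C9 variant (0-2)]
X_train = np.array([
    [6, 70, 0, 0],
    [5, 82, 1, 0],
    [7, 65, 2, 1],
    [4, 90, 0, 1],
    [8, 60, 1, 2],
])
# Weekly Warfarin dose in mg (made-up values)
y_train = np.array([35.0, 28.0, 18.0, 42.0, 12.0])

model = LinearRegression().fit(X_train, y_train)

# Once trained, the model can predict doses for new patients without ever
# touching the original records again.
new_patient = np.array([[6, 75, 1, 0]])
print(f"Predicted weekly dose: {model.predict(new_patient)[0]:.1f} mg")
```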

Machine learning models are used for many different purposes, from figuring out which emails should be sent to your spam folder (think Gmail) to what movies or books you might find interesting (think Netflix or Amazon) to whom you might like to add to your professional network (think LinkedIn).

But questions remain as to whether machine learning models themselves can be vulnerable to malicious attempts aimed at extracting private, personal information, especially in the healthcare domain.

“Could someone predict something sensitive, like genotype, or even whether or not the patient’s medical record says he or she smoked or has diabetes or cancer or any other ailment that might have been part of the original dataset?” Fredrikson asks.

This issue of genetic privacy has been gaining prominence with the many advances in medical and data technologies. Many of the questions surrounding genetic privacy are complicated and extend even beyond the individuals who are part of public datasets.

“For example, if I want to make my genetic data public and I have a sibling who shares probably more than 99 percent of my genetic material, should I also get consent from him (or her)? How about my future unborn kids?” asks Simon Lin, a co-author who was at the Marshfield Clinic at the time of the study.

The dataset used by Fredrikson, Lin and colleagues focuses on individuals prescribed Warfarin, a drug that, when used at the correct doses, reduces the risk of blood clotting and complications such as stroke. But not all patients respond uniformly to the same dose, and depending on their genetic background, the differences can be deadly.

To increase the effectiveness and safety of the medication for each patient, health care providers gather additional genetic and personal information, which can be sorted through by machine learning programs to tailor Warfarin dosage for each individual patient.

Fredrikson wanted to know whether attackers who had access to a patient’s Warfarin dose and some demographic information, such as race or age, could use the machine learning model to make predictions about the patient’s genotype even though the machine learning model was no longer accessing the original patient database.

Turns out, they could, often with startling accuracy. And if Fredrikson and his colleagues tried to increase the privacy of the patient data, they ran into a different problem: Warfarin doses predicted by the machine learning program started flying wide of the mark.

“That was the big result,” says Fredrikson. “We couldn’t strike a balance. Once we got to the point where the attacker couldn’t make accurate predictions of the [private] genotypes, we were increasing the likelihood of negative health outcomes.”
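
In outline, such an inversion attack simply runs the model forward over every possible genotype: the attacker combines the demographics and dose they already know with each candidate genotype and keeps the candidate whose predicted dose lands closest to the real one. The sketch below is a hypothetical illustration of that idea, reusing the made-up feature layout from the earlier training sketch; it is not the authors’ actual attack code.

```python
import numpy as np

# Hypothetical model-inversion sketch: recover a hidden genotype by testing
# which candidate makes the model's predicted dose closest to the dose the
# attacker already knows. `model` stands for a trained dose predictor such as
# the illustrative one above, not the study's real model.
GENOTYPE_CANDIDATES = [(v, c) for v in (0, 1, 2) for c in (0, 1, 2)]

def invert_genotype(model, demographics, known_dose):
    """Return the (VKORC1, CYP2C9) pair whose predicted dose best matches known_dose."""
    best_guess, best_error = None, float("inf")
    for vkorc1, cyp2c9 in GENOTYPE_CANDIDATES:
        features = np.array([list(demographics) + [vkorc1, cyp2c9]])
        error = abs(model.predict(features)[0] - known_dose)
        if error < best_error:
            best_guess, best_error = (vkorc1, cyp2c9), error
    return best_guess

# Example (with the model from the earlier sketch):
#   invert_genotype(model, demographics=[5, 82], known_dose=28.0)
```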

Fredrikson and colleagues had been using a technique called differential privacy to try to protect the machine learning model from malicious attacks. Differential privacy adds a random amount of ‘noise’ to the machine learning model’s outputs, which makes it difficult for attackers to use the model to predict private information (such as genotype in this case). Of course, adding random noise to the outputs also affects the accuracy of the machine learning model’s predictions.
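
A minimal sketch of that kind of output perturbation is shown below: Laplace-distributed noise is added to the model’s predicted dose, with a privacy parameter (commonly written as epsilon) controlling the noise scale. Smaller epsilon means more noise and more privacy, but also a less accurate released dose. The sensitivity and epsilon values here are illustrative assumptions, not the settings used in the study.

```python
# Illustrative sketch of adding differential-privacy-style Laplace noise to a
# model's output. The sensitivity and epsilon values are assumptions chosen
# for illustration only.
import numpy as np

rng = np.random.default_rng(0)

def private_prediction(predicted_dose, sensitivity=1.0, epsilon=0.5):
    # Smaller epsilon -> larger noise scale -> stronger privacy, worse accuracy.
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return predicted_dose + noise

true_prediction = 28.0  # mg per week, from the (hypothetical) model
for eps in (5.0, 1.0, 0.1):
    released = private_prediction(true_prediction, epsilon=eps)
    print(f"epsilon={eps:>4}: released dose = {released:.1f} mg")
```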

“Ideally, what you would want is no increase in the likelihood of adverse health outcomes but a significant reduction in the ability of the attacker to guess genotype,” explains Fredrikson. But as the researchers increased the amount of random noise the machine learning model was adding to predicted Warfarin doses to better protect privacy, the accuracy of the predictions suffered terribly.

Tom Ristenpart, a Discovery Fellow in the WID Optimization research area and another co-author, says the study is one of the first to truly delve into the repercussions of using differential privacy with machine learning models in a clinical setting.

This research is important, Ristenpart says, because it allows the good guys to stay ahead of the game and plan how to eliminate security threats such as privacy attacks.

“A lot of what we security researchers are concerned with is getting ahead of attacks in the real world and understanding new threats with new technology trends,” he says.

These machine learning models are not currently in use in a clinical setting, and the findings of this paper allow scientists to plan ahead.

“I think we are far enough on the leading edge that we are in a good position that the public doesn’t need to worry,” explains Ristenpart.

Researchers continue to work on new privacy algorithms that will strike a sweet spot between utility and privacy. But Ristenpart points out that raising awareness about privacy issues, regulating who has access to critical machine learning models and even developing ways to detect malicious attacks could all help pave the way for more stringent privacy protections without sacrificing the accuracy of the machine learning models.

“There are many different potential countermeasures, and there’s a lot of work to be done to see which are the more viable ones,” he says.

— Adityarup Chakravorty