New tools for winnowing and learning
While the application-oriented side of big data is a common theme in the College of Engineering, UW-Madison engineers also want to make transformative contributions to the mathematical and theoretical underpinnings of analyzing massive data sets. For Electrical and Computer Engineering Associate Professor Rebecca Willett and McFarland-Bascom Professor of Electrical and Computer Engineering Rob Nowak, both Discovery Fellows, that effort centers on the intersection of human expertise and computing power. Their colleagues in this work include Industrial and Systems Engineering and Computer Sciences Professor Stephen Wright, also a Discovery Fellow. The researchers focus primarily on developing algorithms to winnow down big data sets to their most important elements for human analysis, and on creating algorithms and theory that enable machines to learn from human experts with a minimum of human interaction. This research means confronting the limitations of current data-analysis tools from a computing angle, as well as the bottlenecks that human experts create when carrying out their role in the data-analysis process. “There are certain big-data problems that we don’t yet know how to solve, where we need new tools,” Willett says.
While Willett and Nowak focus on advancing the fundamental underpinnings of big data research, they also consider the many opportunities for applications in data sets that emerge in settings ranging from astronomers’ satellites to neuroimaging studies. Some of the fundamental questions they tackle are informed by collaborations, including one with UW-Madison astronomy professor Christy Tremonti, and by their roles with the newly established UW-Madison Center for Predictive Computational Phenotyping, which aims to turn massive amounts of patient data into useful information that could inform treatments and health risk assessments.
On the importance of human input: “Methods that Google and Facebook use to automatically analyze images do not always translate to scientific data,” Willett says. “The features that an astronomer may use to distinguish between two galaxies may be totally different from the features that allow me to distinguish between a cat and a dog. We need easy-to-use methods that incorporate that astronomer’s expertise without it being a burden.”
On the promise and challenges of big data: “Any time you’re trying to do data analysis, there are costs,” Willett says. “But computational costs are decreasing, so we can do more computation with fewer resources—this is not the bottleneck. Rather, the bottleneck is the human expert that can only examine a small fraction of the available data.”
On what the “value of information” means for patient data: “In biomedical research, there may be many experiments you could run, or tons of data for a human analyst to look at,” Nowak says. “The ‘value of information’ is a mathematical framework that can predict what experiment is going to provide the most new information, or help us to decide which data we should ask a scientific expert to look at and interpret.”
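To make the idea concrete, here is a minimal sketch of one common way to score candidate experiments: the expected reduction in uncertainty (entropy) about a set of competing hypotheses. This is only an illustration under stated assumptions, not the specific framework the researchers use. The two hypotheses, the experiment names, and all probabilities below are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def expected_information_gain(prior, likelihood):
    """Expected drop in entropy about the hypotheses after one experiment.

    prior[h]         = P(hypothesis h) before the experiment
    likelihood[h][o] = P(outcome o | hypothesis h) for this experiment
    """
    n_hyp = len(prior)
    n_out = len(likelihood[0])
    gain = entropy(prior)
    for o in range(n_out):
        # Probability of seeing outcome o, averaged over the hypotheses.
        p_o = sum(prior[h] * likelihood[h][o] for h in range(n_hyp))
        if p_o == 0:
            continue
        # Bayes' rule: updated belief in each hypothesis given outcome o.
        posterior = [prior[h] * likelihood[h][o] / p_o for h in range(n_hyp)]
        gain -= p_o * entropy(posterior)
    return gain


# Hypothetical setup: two equally likely hypotheses, two candidate experiments.
prior = [0.5, 0.5]
experiments = {
    "informative_test": [[0.9, 0.1], [0.1, 0.9]],    # outcome tracks the hypothesis
    "uninformative_test": [[0.5, 0.5], [0.5, 0.5]],  # outcome is a coin flip
}

gains = {name: expected_information_gain(prior, lik)
         for name, lik in experiments.items()}
best = max(gains, key=gains.get)
print(best, gains)
```

In this toy setup, the experiment whose outcome depends on which hypothesis is true scores higher than the one whose outcome is a coin flip, so it would be the one to run first. The same kind of scoring could, in principle, rank which data to hand to a human expert.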
On informing theoretical research with practical applications: “Certain things we might want to do mathematically are physically impossible or prohibitively expensive computationally or otherwise, so working with real-world applications grounds the work and also might expose problems, weaknesses, or challenges that haven’t been addressed by our current theoretical understanding of big data problems,” Nowak says.
On the nuances of sifting through patient data: “If you’re trying to analyze images, the data are highly structured,” Willett says. “But an electronic health record is more complex, containing the times at which different lab tests are conducted, pharmaceutical records, images corresponding to different scans, and physicians’ cryptic notes. Unfortunately, these data often contain significant errors, making it very difficult to identify new risk factors or best practices.”
On where “big data” begins and ends: “I like the term ‘found data,’ which is a good term for a lot of the big data we hear about in the news,” Nowak says. “It’s a byproduct of all our online commercial and social activities. In engineering and science, we usually purposely gather data to help us answer specific questions. So big data research in science and engineering naturally involves the data gathering process in addition to data analytics.”