Building Better Tools for the Big Data Boom

A homicide in south Chicago sparks retaliatory criminal activity in west Chicago. A firing neuron lights up a web of neurons in another brain region. A tweet from a Russian bot sets a storm of conversations rippling through the Twitterverse.

These types of interconnected events have been a subject of study by mathematicians for decades, but in the era of Big Data, researchers like electrical and computer engineer Rebecca Willett and statistics professor Garvesh Raskutti, both part of the Institute for Foundations of Data Science at the University of Wisconsin-Madison, are developing new methods and tools to make sense of how discrete events can influence the occurrence of other events over time.

Such systems, studied by mathematicians since the 1970s, are called self-exciting point processes – self-exciting because events can stimulate or inhibit future events. Interactions between different events can often be modeled using networks. For instance, different neurons in a brain would be represented by network nodes, and two nodes would be connected if activity at one node influences activity at the other. Researchers don’t observe the network itself, but they can draw conclusions about the network based on the activity levels of different nodes. The challenge, according to Willett, is making meaningful inferences about these influences for increasingly large networks.
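
To make the idea concrete, here is a minimal sketch in Python of events rippling through a tiny network. It is an illustration only, not the researchers’ model: it assumes time moves in discrete steps, that each node either has an event or not in a given step, and that an event at one node shifts the log-odds of an event at another node in the next step. The function name simulate_events, the three-node network, and every number in it are hypothetical.

```python
import math
import random


def simulate_events(influence, baseline, steps, seed=0):
    """Simulate binary events on a small network in discrete time.

    influence[i][j] is the assumed effect of an event at node j on the
    log-odds of an event at node i in the next time step (positive values
    excite, negative values inhibit). baseline[i] is node i's log-odds of
    an event when no other node has just fired.
    """
    rng = random.Random(seed)
    n = len(baseline)
    previous = [0] * n  # no events before the first step
    history = []
    for _ in range(steps):
        current = []
        for i in range(n):
            log_odds = baseline[i] + sum(
                influence[i][j] * previous[j] for j in range(n)
            )
            prob = 1.0 / (1.0 + math.exp(-log_odds))
            current.append(1 if rng.random() < prob else 0)
        history.append(current)
        previous = current
    return history


# Toy three-node network with made-up values.
influence = [
    [0.0, 0.0, 0.0],
    [3.0, 0.0, 0.0],   # an event at node 0 excites node 1
    [0.0, -2.0, 0.0],  # an event at node 1 inhibits node 2
]
baseline = [-1.0, -3.0, -1.0]

history = simulate_events(influence, baseline, steps=2000)
print("events per node:", [sum(step[i] for step in history) for i in range(3)])
```

In practice a researcher sees only something like history, the record of which nodes were active and when. The influence table is the hidden network that has to be inferred.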

Events associated with each node (i.e., person) in this network influence the likelihood of events for each other node. The underlying influences among the nodes are unknown — estimating them is the focus of Willett and Raskutti’s work.

“A lot of the prior work has been focused on situations where you’re only looking at very small networks – maybe three or four different interacting entities – but when you’re talking about Twitter or Facebook or neurons in a brain, it’s much, much bigger than that,” says Willett. “The classical methods used to analyze these data don’t hold up.”

Willett and Raskutti were recently awarded grants from the National Geospatial-Intelligence Agency and the Defense Department’s Army Research Office to work on the foundational mathematics, algorithms, and statistics of point processes. “These agencies are interested in developing new tools to better understand mechanisms underlying much of their data,” says Raskutti. The tools they develop, test, and refine will have broad applications beyond the interests of intelligence and security agencies, as demonstrated by the real-world data they are using to create and test their models: social and biological networks including everything from crime data to brain data to social media.

Even though the data used to generate this map of estimated criminal influences among neighborhoods in Chicago contain no geospatial information, geographical patterns emerge, suggesting the estimates are valid.

“One application that we’re currently working on is how crime spreads,” says Raskutti. He and Willett have obtained a large dataset about crime in Chicago that includes information about the time, location, and nature of thousands of past gun crimes that occurred throughout the city, many of which are linked in some way. If the network of crimes can be understood, says Raskutti, that information can be used to improve the efficacy of law enforcement efforts. “Given a lot of data about interrelated crimes in the past, can you predict how crime will spread in the future by having a better idea of the network?”

Willett and Raskutti are focused not only on developing new algorithms but also on providing guarantees on the accuracy of the proposed algorithms. For instance, practitioners often need to know how many events they must observe to be able to get an accurate estimate of the underlying network or how the complexity of the underlying network impacts algorithm performance. Willett, Raskutti, and graduate student Benjamin Mark are developing novel mathematical and statistical theories that provide practitioners with these insights.
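
A toy example gives a feel for why the number of observed events matters. The Python sketch below is an assumption-laden illustration, not the team’s algorithm or theory: it uses just two nodes, made-up probabilities, and a simple counting estimate of how much an event at one node raises the chance of an event at the other in the next time step. The function name estimate_influence and all values are hypothetical.

```python
import random


def estimate_influence(steps, seed=0):
    """Estimate how much an event at node 0 raises the chance of an event at node 1.

    Returns (estimated lift, true lift). All probabilities are toy values
    chosen for illustration.
    """
    p_source = 0.3         # chance node 0 has an event in any step
    p_after_event = 0.6    # chance node 1 has an event after node 0 fired
    p_after_silence = 0.1  # chance node 1 has an event after node 0 was silent
    rng = random.Random(seed)
    # counts[s] = [steps seen, node 1 events seen] when node 0's previous state was s
    counts = {0: [0, 0], 1: [0, 0]}
    previous = 0
    for _ in range(steps):
        p_target = p_after_event if previous else p_after_silence
        target = 1 if rng.random() < p_target else 0
        counts[previous][0] += 1
        counts[previous][1] += target
        previous = 1 if rng.random() < p_source else 0
    # max(..., 1) avoids dividing by zero if a state was never observed
    rates = [counts[s][1] / max(counts[s][0], 1) for s in (0, 1)]
    return rates[1] - rates[0], p_after_event - p_after_silence


for steps in (50, 500, 50000):
    estimate, truth = estimate_influence(steps)
    print(f"{steps:>6} steps: estimated lift {estimate:.3f} (true {truth:.3f})")
```

With only a few dozen steps the estimate can land far from the true value, while long records tend to pin it down. Real networks have many more nodes and unknown connections, which is where guarantees of the kind described above become important.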

For example, neuroscientists may wish to understand how neuronal activity changes as a mouse engages in different activities: sleeping, eating, exploring a maze, etc. But first, they must ensure that the duration of each of these different activities is sufficiently long – with enough observations of neurons firing – to ensure accurate estimation of the underlying network is possible.

Willett and Raskutti are confident that the tools they develop will be portable across all kinds of complex point process networks. Social networks and financial markets, for example, can be characterized the same way as crime networks or neurons. One person tweeting or trading influences connected nodes (other Twitter users or traders), producing systems that can be described by Willett and Raskutti’s algorithms.

By engaging in fundamental mathematical and statistical research to support the development, testing, and fine-tuning of tools for the future, Willett and Raskutti are finding new ways to make sense of the mountains of data that are available in the 21st century and bringing into view important applications on the horizon.

This work is supported by NGA grant #HM0210-14-BAA-0001 and ARO grant #W911NF-12-R-0012.

— Nolan Lendved