A handful of middle-aged baseball scouts sit around a conference table to discuss possible recruits for the Oakland Athletics. The room is quiet, their breathing barely audible. One scout spits tobacco into a cup. They’re patiently waiting for General Manager Billy Beane to pick a player. Finally, he offers up three names, but none of the scouts is impressed by the players’ reputations. The gut feeling isn’t there.
But the numbers are. An analyst in the corner pipes up that three ballplayers with a .364 on-base average will improve the team’s chances of winning the most consecutive games. Most of the scouts throw their hands up in frustration.
Each year, Shannon McKeen shows this scene from the film “Moneyball” during a UNC business class he teaches. The exchange between Beane (Brad Pitt), the scouts, and this analyst (Jonah Hill), according to McKeen, is representative of what’s happening in companies across the United States: data is driving business decisions. “There are people who have been in their jobs for a long time,” he says. “They’re successful and smart and have great instincts and a lot of experience. But you have analysts using data, sometimes in support of the gut feelings and sometimes not, and when to go with the gut versus the data is a big challenge.”
McKeen and David Knowles lead the Renaissance Computing Institute’s (RENCI) National Consortium for Data Science (NCDS), a collaboration of leaders in academia, industry, and government that addresses 21st-century data challenges. “It’s a thought leadership forum,” Knowles shares, “intended to engage private sector companies, research universities, and government agencies — all of whom are struggling with this emerging field of data science. It’s really designed to create a safe space for them to come together in a non-competitive atmosphere and either work on issues or support one another in terms of sharing ideas, concepts, events, and challenges across these sectors.”
Data is “the new oil,” according to Knowles, who leads NCDS’ program development and operations activities. It’s a fuel that drives economic development and scientific discovery. It’s everywhere and it has tremendous value locked inside it. And organizations from small businesses to large companies to universities are wondering how they can manage it. “Data is pervasive in everything that businesses do,” Knowles says, “from customer demographics, to relevant markets, to profit margins, to inventory management.”
RENCI views the consortium as an extension of its role as a research accelerator for communities of scientists across institutional boundaries. Today, NCDS brings together researchers and businesspeople from a variety of institutions including IBM, General Electric, Drexel University, NC State, the Environmental Protection Agency, and RTI International, to name a few. Throughout the year, the consortium hosts a variety of conferences, a short course series, and the Data Fellows Program, an early-career seed grant program for junior faculty conducting industry-informed research.
One of the 2016-17 Data Fellows, computer science professor Shahriar Nirjon, focuses his research on a hot topic in the data world: the Internet of Things, the ability of everyday objects to send and receive information. An example of this is Google’s Nest Thermostat. This smart device can learn the routine of the people near it, adjusting the heating and cooling based on things like how many times the front door is opened daily. A major concern with such technology is that weak security measures could expose personal data. “All of a sudden, whether you’re home or not becomes a technical security issue,” Knowles says. “There’s a real, potential risk that this data could be accessed by others for purposes that are not in your interest as the homeowner.”
While security is a big issue for some, the economic value of such devices is a draw for others. One local startup, for example, has developed a grain sensor that tracks the moisture within a silo. The data collected tells farmers when the grain will spoil and helps them decide which silos to empty first.
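To make the silo example concrete, here is a minimal sketch of how moisture readings might be turned into an emptying order. It is not the startup's method: the `SiloReading` type, the `emptying_order` function, the sample readings, and the 14 percent spoilage threshold are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class SiloReading:
    silo_id: str
    moisture_pct: float  # hypothetical sensor value, percent moisture


def emptying_order(readings, spoilage_threshold_pct=14.0):
    """Rank silos wettest first and flag those at or above the threshold.

    The threshold is an illustrative assumption, not an agronomic standard.
    """
    ranked = sorted(readings, key=lambda r: r.moisture_pct, reverse=True)
    return [(r.silo_id, r.moisture_pct >= spoilage_threshold_pct) for r in ranked]


# Invented sample readings for three silos.
readings = [
    SiloReading("A", 12.5),
    SiloReading("B", 15.2),
    SiloReading("C", 13.8),
]
order = emptying_order(readings)
for silo_id, at_risk in order:
    print(silo_id, "at risk" if at_risk else "ok")
```

Sorting by moisture alone is the simplest possible rule; a real system would likely also weigh temperature, grain type, and time in storage.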
Agriculture — another traditionally gut-driven industry — could benefit greatly from data science, according to McKeen. “At a conference last summer,” he says, “one presenter touched on using data and GPS to run tractors, to decide where to put fertilizer, and how to get more yield from crops to help lower the price of food. He wants to solve world hunger using data to make the land more efficient — not just in the United States but globally.” Every industry uses data, but the big players are financial services, health care, utilities, and manufacturing.
“We spend a lot of time in academia, industry, and government, and there’s a lot of white space there to make connections,” McKeen says. “Those are the wins for NCDS — the people we’ve brought together, whether a student at a career panel who ends up working at a media company, or a data fellows researcher who ends up partnering with a research enterprise.”
Storing the power
There’s power in the accessibility of data, but it’s lost without some type of system to organize it all. “Humans are good at seeing patterns. We’re good at analyzing 10 pieces of data or 100 pieces of data,” McKeen says. “Because of all these sensors and the lowering cost of infrastructure, now it’s millions of pieces of data. We just can’t process it.”
RENCI already has a solution for that: iRODS. Short for Integrated Rule-Oriented Data System, iRODS is open-source data management software that allows research organizations and government agencies to store their data, attach metadata to it, automate policies for managing it, and make it discoverable.
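The four ideas in that description (store, attach metadata, automate policies, discover) can be sketched with a toy in-memory catalog. This is not the iRODS API or its rule language: `ToyCatalog`, `DataObject`, `tag_genomics`, and the file paths are hypothetical names invented purely to illustrate the concepts.

```python
from dataclasses import dataclass, field


@dataclass
class DataObject:
    path: str
    metadata: dict = field(default_factory=dict)


class ToyCatalog:
    """Illustrative stand-in for a rule-oriented data catalog."""

    def __init__(self):
        self._objects = {}
        self._rules = []

    def add_rule(self, rule):
        # A rule is any callable that inspects or annotates a DataObject.
        self._rules.append(rule)

    def put(self, path, **metadata):
        # Store: register the object, then let every policy run on it.
        obj = DataObject(path, dict(metadata))
        for rule in self._rules:
            rule(obj)
        self._objects[path] = obj
        return obj

    def find(self, key, value):
        # Discover: look objects up by metadata rather than by location.
        return [o.path for o in self._objects.values()
                if o.metadata.get(key) == value]


def tag_genomics(obj):
    # Hypothetical policy: label sequencing files automatically on ingest.
    if obj.path.endswith(".fastq"):
        obj.metadata.setdefault("domain", "genomics")


catalog = ToyCatalog()
catalog.add_rule(tag_genomics)
catalog.put("/lab/run1.fastq", instrument="sequencer-1")
catalog.put("/lab/notes.txt", instrument="none")
hits = catalog.find("domain", "genomics")
print(hits)
```

The point of the design is that the policy runs at ingest time, so applications querying by metadata never need to know where or how a file is physically stored.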
“It allows infrastructure to continue to shift underneath without having to rewrite all of the applications above,” Jason Coposky explains. “And because it’s open-source, iRODS is free to use — that removes a large burden financially for many institutions that probably wouldn’t interact together.”
Coposky leads the iRODS Consortium, a collaboration similar to NCDS that offers support to users. Some of the software’s biggest clients include life science and genomics institutes like the Wellcome Trust Sanger Institute, Complete Genomics, Bayer, and even NASA. Researchers in these organizations can actually transfer the data observed by scientific instruments directly into iRODS.
In truth, the software is for any organization that has a large amount of data to manage, according to Coposky. So far, the consortium’s other members include universities, information technology businesses, and data storage companies.
Stoking the fire
In 2015, the National Science Foundation (NSF) established four regional hubs to accelerate the emerging field of “big data” science. Today, UNC and the Georgia Institute of Technology manage the southern region of that initiative, the South Big Data Hub (South BD Hub). Each hub focuses on different “spokes” to address regional challenges. South BD Hub has five: health care, coastal hazards, industrial big data, materials and manufacturing, and habitat planning.
Like NCDS, the goal is to establish partnerships. “Data is a warming fire where people gather and converse,” Stanley Ahalt, South BD Hub principal investigator and RENCI director, says. “We want to create a data observatory — collections of data communities to carry out scientific endeavors. There is an immediate awakening to the fact that, increasingly, all of our scientific progress will depend on scientists and researchers utilizing data effectively and efficiently in order to carry out research.”
Industry collaboration will play a large role in the communities Ahalt references. In fact, NSF has publicly stated that it would like all its data hubs to be a place where public-private partnerships can form. Such relationships began developing this summer with the creation of the Southern Startup Internship Program in Data Science (DataStart) — the brainchild of RENCI’s Knowles. Funded by the Computing Community Consortium, the program paired six graduate students from universities throughout the South with new and growing technology companies. For three months, each student helped their assigned startup address data-related business problems.
“We want to help students understand what a data science career is,” Knowles says. “It doesn’t mean they have to become statisticians. We want them to understand the types of opportunities that are available to them.”
Fueling the future
When groups share and combine their data, they unlock new knowledge, according to Ahalt. The North Carolina Translational & Clinical Sciences Institute’s (NC TraCS) Carolinas Collaborative, for example, gives a group of North Carolina and South Carolina health care systems access to each other’s data. This allows them to do things like compare the number of patients with a particular disease and the medications they’re taking, information that can advance research and improve the quality of health care.
“We can also take data from sensors on the ground to record how fast it’s raining in Eastern North Carolina and combine that with data from satellites and airborne vehicles to identify how significant storm surges would be in coastal regions,” Ahalt explains. This information could then be integrated into UNC-developed ADCIRC, a powerful computer model for predicting the response of the coastal ocean to tides and storms.
“These consortia are a novel mechanism here at UNC,” Knowles points out. “There are not a lot of these public-private partnerships — and we know this because we looked. There’s no playbook on the shelf for us to follow.”
Data itself is not new, Ahalt stresses, but it is abundant. “Its abundance now provides society with an opportunity to use data as a fuel to drive economic development and scientific discovery,” he explains. “It’s a very unusual world we live in right now. We have lived in periods throughout history in which material was what made wealth and, now, immaterial things like digital information create wealth and discovery. The possibilities are endless.”