Emphasis on Data Wrangling Brings Materials Science Up to Speed
April 26, 2016 Nicole Hemsoth
In 2011, the United States launched a multi-agency effort to discover, develop, and produce advanced materials under the Materials Genome Initiative as part of an overall push to get out from under the 20-year process typically involved with researching a new material and bringing it to market.
At roughly the same time, the government was investing in other technology-driven initiatives to bolster competitiveness, with particular emphasis on manufacturing. While key areas in research were developing incredibly rapidly, it appeared that manufacturing, materials, and other more concrete physical problems were waiting on better solutions while genomics, nanotechnology, and other areas were booming.
Initial funding went to the Department of Energy to build the computational tools required ($12 million) and to the Department of Defense as an R&D arm for the prediction and optimization of materials properties, as well as to numerous universities and institutions. Five years later, the computational side of these efforts is bearing fruit, leading to new materials for airplanes and a large range of consumer and industrial products. But the road has not been free of barriers, particularly around the availability of the large-scale data that tooling needs in order to develop.
The arrival of this “fourth paradigm” of science, which is data-driven discovery, came later to materials science than to other areas of research, particularly bioinformatics, says Dr. Ankit Agrawal, a research professor at Northwestern University focused on blending high performance computing and data mining tools. Materials science had catching up to do. Although the supercomputing and simulation sides of the problem were well understood, particularly in molecular dynamics and other areas, the barrier for materials science was more basic: the data needed to be accessible before the field could progress.
A great deal of what kept materials science from developing at the same rate as bioinformatics, particularly five years ago, was the lack of access to open data, Agrawal says. Whereas the National Institutes of Health and other agencies released data for scientific research, no such wellspring existed for materials science. The high performance computing-based simulation side of materials progressed along the lines of available applications and hardware, but the data from the all-important experimental phase, hard-won as it was to collect and work with, remained under lock and key.
Compared to bioinformatics, which had vast volumes of open data several years ago, materials science was far behind in creating domain-specific tools for the experimental phase of materials science discovery workloads. It is only relatively recently that a new offshoot of materials science, called materials informatics, has sprung up to solve those problems.
“The experimental part of materials science is much more difficult and takes a great deal of time,” Agrawal explains. Applying data mining tools to that work, so that experiments can be tailored, can reduce time to result significantly, he says, pointing to how teams at Northwestern worked to develop a new steel alloy by applying data mining approaches new to the field, using both forward models (property prediction) and inverse models (materials discovery).
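As a rough illustration of the forward and inverse distinction, the sketch below fits a forward model that maps a single composition variable to a property, then runs an inverse search over candidate compositions for the one whose predicted property is closest to a target. This is not the Northwestern team's method: the data are synthetic, the single-variable linear fit is a stand-in for whatever models were actually used, and the names `fit_forward_model` and `inverse_search` are hypothetical.

```python
# Toy illustration only: the numbers below are synthetic, not real alloy data.

def fit_forward_model(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def inverse_search(predict, candidates, target):
    """Return the candidate whose predicted property is closest to target."""
    return min(candidates, key=lambda x: abs(predict(x) - target))


# Hypothetical measurements: one composition variable vs. one measured property.
compositions = [0.1, 0.2, 0.3, 0.4]
properties = [12.0, 14.0, 16.0, 18.0]

# Forward model: composition -> predicted property.
predict = fit_forward_model(compositions, properties)

# Inverse model: search the composition space for a target property value.
candidates = [i / 100 for i in range(0, 101)]
best = inverse_search(predict, candidates, target=21.0)

print(round(predict(0.25), 2), best)
```

The forward step answers "what property would this material have?", while the inverse step answers "which material would give this property?" by querying the forward model many times, which is cheaper than running each candidate as a physical experiment.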
The efforts in materials informatics include “combining experimental and simulation data into a searchable materials data infrastructure and encouraging researchers to make their data available to the community,” Agrawal says. “Thanks to such efforts, it is fair to say that the sheer complexity and variety in materials science data becoming available nowadays requires the development of new big data approaches in materials.”
The data problem in materials informatics is not necessarily rooted in size. Most experiments have relatively small data sets. Rather, it is the complexity and value of the data that necessitates a tailored materials analytics platform.
Agrawal says he expects to see more robust toolsets emerging that feed off the momentum of data analytics in other areas. Hadoop and MapReduce, Spark, and other data handling frameworks might be a bit late to the materials science party, but the game of catch-up they play will be one to watch, especially as the analytics side gets a boost from hardware usually reserved for HPC, including GPUs and Xeon Phi, he says.