White House seeks to get a handle on “big data”
DOI: 10.1063/PT.3.1555
Five federal science and technology agencies announced plans to spend more than $200 million in total to develop new tools and techniques to process and analyze huge volumes of digital data. The initial cadre of “big data” R&D participants comprises the Department of Energy, NSF, the Department of Defense and its Defense Advanced Research Projects Agency (DARPA), the National Institutes of Health, and the US Geological Survey (USGS).
Presidential science adviser John Holdren said the initiative, announced on 29 March, responds to criticism from the President’s Council of Advisors on Science and Technology that the government has been underinvesting in technologies needed to collect, store, preserve, manage, analyze, and share large quantities of data. The world is now generating 10²¹ bytes of data each year, and the volume is growing rapidly, Holdren said. The data are generated from such diverse sources as remote sensors, online retail transactions, text messages, email, video messages, computers running large-scale simulations, and scientific instruments, including particle accelerators and telescopes. Big data, Holdren said, “are critical to accelerating the pace of discovery in many different domains of science and engineering.”
William Brinkman, director of DOE’s Office of Science, said experiments at the Large Hadron Collider generate terabytes of data each second, and a climate-model simulation produces 10 terabytes a day. As part of the data initiative, DOE announced a $25 million, four-year award to a national laboratory–university consortium led by Lawrence Berkeley National Laboratory. The goal is to establish a scalable data management, analysis, and visualization institute to assist researchers in using the latest software tools to analyze the data generated by the labs’ high-performance computers.
A joint solicitation announced by NSF and NIH is aimed at advancing the core scientific and technological means for managing, analyzing, visualizing, and extracting useful information from large and diverse data sets. Grants will be awarded for research on new algorithms, statistical methods, technologies and tools for improved data collection and management, data analysis, and e-science collaboration environments. NSF also announced a $10 million grant to researchers at the University of California, Berkeley, who are developing novel data-center programming models, improved computational infrastructure, and new scalable machine-learning algorithms and data management tools for handling large-scale heterogeneous data sets. The NSF-funded National Center for Supercomputing Applications, at the University of Illinois at Urbana-Champaign, is home to one of the most powerful supercomputers in the world, a machine that researchers gained access to just a few weeks ago, noted NSF director Subra Suresh.
The NIH announced that it is making its 1000 Genomes Project, the world’s largest collection of data on human genetic variation, available for free on a cloud operated by Amazon.com. At 200 terabytes, that data printed out would fill 16 million file cabinets, said NIH director Francis Collins. The project has collected a data set so large that few researchers have the computing power needed to make use of it.
The Pentagon “is placing a big bet on big data,” said Zachary Lemnios, assistant secretary of defense for research and engineering. The $60 million in new research funding just announced brings DOD spending on big data R&D to $250 million annually. Some of the funding will be devoted to open prize competitions (see PHYSICS TODAY, November 2010, page 21).
“We are drowning in data but starving for understanding,” said USGS director Marcia McNutt. The agency’s John Wesley Powell Center for Analysis and Synthesis announced the award of eight new research projects aimed at transforming big data sets and big ideas about Earth science into scientific discoveries.
Holdren promised that other federal agencies would have additional announcements about big data in the months ahead. He invited industry and universities to participate and said the effort is “not something the government can or wants to do by itself.”
A White House fact sheet (http://www.whitehouse.gov/sites/default/files/microsites/ostp/big_data_fact_sheet_final_1.pdf) provides further details on the participating agencies’ big data programs.

The evolution of Hurricane Katrina. This simulation was generated by researchers in the Advanced Visualization Laboratory at the NSF-funded National Center for Supercomputing Applications. The AVL team transformed terabytes of data into an animation of the 36-hour period when the storm gained energy over the warm waters of the Gulf of Mexico and headed toward New Orleans. (Image courtesy of NSF.)

More about the Authors
David Kramer. dkramer@aip.org