Commentary: Scientific communication in the digital age

JUN 01, 2016
Konrad Hinsen

Computers have irreversibly changed the way we do research across all scientific disciplines, and physics is no exception. Those profound changes affect not only the specific subdiscipline of computational physics but traditional experimental and theoretical approaches as well. All modern science employs what I like to call computer-aided research, even when computers are only secondary tools.

Computational physics—the use of numerical methods to solve problems that are too complex or too large for an analytical solution—has grown tremendously since its beginnings in the 1950s. Its starting point was the application of numerical analysis to the well-known foundational equations of classical mechanics and thermodynamics; that application led to techniques such as molecular dynamics simulations. Later, new physical approximations were developed that made sense only in view of numerical solutions. A good example is density functional theory, now a cornerstone of both chemistry and materials science.

In experimental physics, and in astronomy and other neighboring fields, computers have permitted the acquisition and processing of very large data sets. The quantities of raw data produced by experimental setups such as the Large Hadron Collider and the Kepler space observatory could not be handled without the help of computers. Moreover, the use of Bayesian and other advanced methods to infer model parameters from experimental data only became practical with the advent of computers. Today the border between experimental and computational methods is blurring. A protein structure obtained by crystallography, for example, relies as much on accurate theoretical models for the basic chemical structure of amino acids as it does on crystallographic measurements, and the inference procedure used to obtain such a structure borrows simulation methods from computational physics.
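
As a minimal illustration of the kind of Bayesian computation in question, here is a toy sketch of my own (not the workflow of any particular experiment): estimating the posterior distribution of a single model parameter by evaluating it on a grid.

```python
import numpy as np

# A minimal sketch of Bayesian parameter inference, the kind of
# computation that is impractical without computers: estimate the
# posterior of a model parameter (here the mean of noisy data).
rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=50)    # synthetic "measurements"

mu_grid = np.linspace(0.0, 4.0, 401)              # candidate parameter values
log_prior = np.zeros_like(mu_grid)                # flat prior
# Gaussian likelihood with known noise (sigma = 1), summed over data points
log_like = np.array([-0.5 * np.sum((data - mu)**2) for mu in mu_grid])

log_post = log_prior + log_like
posterior = np.exp(log_post - log_post.max())     # avoid numerical underflow
posterior /= np.trapz(posterior, mu_grid)         # normalize

print("posterior mean:", np.trapz(mu_grid * posterior, mu_grid))
```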

Theoretical physics has also been a part of the computational revolution from its early days. Computer algebra systems, which manipulate mathematical expressions and equations rather than numbers, were influenced as much by the needs of theoretical physics as by early research in artificial intelligence. Schoonschip, one of the first computer algebra systems, was developed by Nobel Prize recipient Martinus Veltman for research in particle physics (see Physics Today, December 1999, page 17). Such systems make it possible to manipulate incredibly long and complex equations.
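
As a small taste of what such systems do, here is a sketch using SymPy, a modern open-source computer algebra system (Schoonschip itself used its own notation, which I do not reproduce here): expanding the relativistic energy of a particle in powers of its momentum.

```python
# Symbolic manipulation with SymPy: recover the Newtonian kinetic
# energy plus relativistic corrections from the exact expression.
import sympy as sp

m, c = sp.symbols('m c', positive=True)
p = sp.symbols('p', real=True)

# Relativistic energy-momentum relation: E = sqrt((pc)^2 + (mc^2)^2)
E = sp.sqrt((p*c)**2 + (m*c**2)**2)

# Series expansion in p around p = 0: rest energy, then the
# Newtonian kinetic term, then the first relativistic correction.
print(sp.series(E, p, 0, 5))
# -> m*c**2 + p**2/(2*m) - p**4/(8*c**2*m**3) + O(p**5)
```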

What has not kept up with the progress in computing is scientific communication. Its principal vehicle, the journal article, has hardly changed since the first scientific journals appeared in the 17th century. The transition to electronic publishing has merely replaced printed paper with computer screens as the support medium. An article is still a narrative with nothing but tables and figures to communicate a limited amount of data. It can accommodate neither the large data sets from modern experiments nor the complex models implemented in scientific software.

As a consequence, a journal article today generally provides only a summary, rather than a detailed account, of a scientific study. That is a serious problem, because journal articles no longer fulfill their original role in the research process. An article initially enabled other scientists to examine the work critically and build on it in their own future research. Now that an article is generally nothing but a summary, the critical examination is reduced to plausibility checks, and building on the work is limited to using its basic ideas for inspiration.

Comprehensibility

Partial solutions that have developed over time include depositing data sets in public databases and making software source code available for inspection and reuse. Those practices are not yet common in most domains of computer-aided research, but they are gaining momentum. However, making the raw material of research available is only the first step. Publishing is about making our work comprehensible to our peers. Electronic data sets and software source code are not good means of expression for that purpose. What we need is a proper scientific notation for the digital age and software tools that help us explore and understand digital scientific information.

Let me cite an example from my own work. A central technique of computational biophysics is molecular dynamics simulation of biomolecules such as proteins and DNA. Such a simulation is the numerical solution of Newton's equations of motion for the atoms of a molecular system. The resulting trajectory of atomic positions over time is then used to compute various properties of the system: observable quantities that permit validation, and unobservable or difficult-to-observe quantities that help in understanding molecular processes. Trajectory analysis makes heavy use of statistical physics, for example in linking computed correlation functions to spectroscopic observables.
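
To make that concrete, here is a toy sketch of such a numerical solution: the velocity Verlet scheme, a standard integrator in molecular dynamics codes, applied to a single particle in a harmonic well. A real simulation differs in scale and in the complexity of the force calculation, not in kind.

```python
import numpy as np

def velocity_verlet(positions, velocities, forces_fn, masses, dt, n_steps):
    """Integrate Newton's equations of motion with the velocity Verlet
    scheme, a standard integrator in molecular dynamics codes.
    Returns the trajectory of positions at every step."""
    x = positions.copy()
    v = velocities.copy()
    f = forces_fn(x)
    trajectory = [x.copy()]
    for _ in range(n_steps):
        v += 0.5 * dt * f / masses    # first half kick
        x += dt * v                   # drift
        f = forces_fn(x)              # forces at the new positions
        v += 0.5 * dt * f / masses    # second half kick
        trajectory.append(x.copy())
    return np.array(trajectory)

# Toy system: one particle in a harmonic well, standing in for the
# vastly more complex potential energy function of a real biomolecule.
k = 1.0
forces = lambda x: -k * x             # F = -dV/dx for V = k x^2 / 2
traj = velocity_verlet(np.array([1.0]), np.array([0.0]),
                       forces, np.array([1.0]), dt=0.01, n_steps=1000)
```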

Molecular dynamics simulates conservative classical systems, which are fully defined by a potential energy function. One would thus expect a journal article to provide that potential energy function as a starting point. But for a protein, that function has thousands of terms. Printing them as a standard mathematical formula is of no practical use.
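
For concreteness, the functional form used by typical biomolecular force fields is compact enough to write down; what explodes is the number of terms, because the sums run over every bond, angle, dihedral, and atom pair in the molecule. Schematically (an AMBER-style form; details vary between force fields):

$$
V = \sum_{\text{bonds}} k_b (r - r_0)^2
+ \sum_{\text{angles}} k_\theta (\theta - \theta_0)^2
+ \sum_{\text{dihedrals}} \tfrac{V_n}{2}\bigl[1 + \cos(n\phi - \delta)\bigr]
+ \sum_{i<j} \left[\frac{A_{ij}}{r_{ij}^{12}} - \frac{B_{ij}}{r_{ij}^{6}} + \frac{q_i q_j}{4\pi\varepsilon_0 r_{ij}}\right]
$$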

The precise list of energy terms for a given system is compiled from the molecular structure by a highly complex algorithm somewhat confusingly called a force field. A journal article could provide the algorithm and the molecular structure, or it could refer to publications that do. Superficially, that’s indeed what is done. But anyone trying to compute a potential energy from a journal article describing a force field will soon notice that the information in the article is woefully incomplete. A force field doesn’t lend itself well to a description in plain English with mathematical formulas. The only complete definition is the source code of simulation software.
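
To illustrate what that compilation step involves, here is a deliberately simplified, hypothetical sketch; the parameter values, lookup rules, and function names are inventions for illustration, not any published force field.

```python
# Hypothetical sketch of what a force field does: compile a molecular
# structure (atom types plus bonds) into an explicit list of energy
# terms. Real force fields add angle, dihedral, and nonbonded terms,
# exclusion rules, and many special cases that journal articles
# rarely spell out completely.

BOND_PARAMETERS = {                 # illustrative numbers only, not a
    ('C', 'C'): (310.0, 1.526),     # real force field's parameter set
    ('C', 'H'): (340.0, 1.090),
}

def compile_bond_terms(atom_types, bonds):
    """Turn a bonded structure into explicit harmonic energy terms
    V = k (r - r0)^2, one term per bond."""
    terms = []
    for i, j in bonds:
        key = tuple(sorted((atom_types[i], atom_types[j])))
        k, r0 = BOND_PARAMETERS[key]
        terms.append((i, j, k, r0))
    return terms

# Two bonded carbons, the backbone of an ethane fragment.
print(compile_bond_terms(['C', 'C'], bonds=[(0, 1)]))
# -> [(0, 1, 310.0, 1.526)]
```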

Software source code, however, is a bad notation for communication between scientists. The source code of a simulation program mixes the force-field algorithm with the numerical solver for Newton’s equations and with lots of modern computing technicalities: memory management, input/output of large data sets, data exchange between multiple processors, and so on. Moreover, developers value good performance over clarity of the code. Identifying a force field in source code written by someone else is a hopeless endeavor. Many secrets of the trade are therefore known only to the small minority of scientists involved in software development.

The unfortunate consequence is that if two scientists set out to compute the same quantity and obtain different results, finding where the difference comes from is an essentially impossible task. Each scientist has effectively trusted a black-box software tool to do the right thing, without their even knowing what, exactly, that right thing is. In today’s computational biophysics, most scientists have lost control over their models and approximations.

A new notation

A scientific notation for the digital age should make it possible to share, explain, and discuss information items such as biomolecular force fields, which may contain algorithms and large amounts of data. Most scientific models for complex systems fall into that category. Since the volume of information requires computers for processing, the appropriate notation is what computer scientists call a "formal language"—a convention for encoding the information unambiguously as bit sequences. For a more detailed discussion of digital scientific notations, see my recent essay (http://sjscience.org/article?id=527).
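
What such a formal notation might look like for one small building block of a model is sketched below; the schema and field names are hypothetical, invented purely for illustration.

```python
import json

# A hypothetical sketch of a formal, machine-readable notation for one
# piece of a scientific model: a harmonic bond term of a force field.
# The point is not this particular schema but that every symbol is
# defined unambiguously enough for a computer to process it.
harmonic_bond = {
    "type": "energy_term",
    "functional_form": "k * (r - r0)**2",   # expression in a defined grammar
    "variables": {"r": {"meaning": "interatomic distance", "unit": "nm"}},
    "parameters": {
        "k":  {"value": 259408.0, "unit": "kJ/mol/nm**2"},
        "r0": {"value": 0.1526,   "unit": "nm"},
    },
}
print(json.dumps(harmonic_bond, indent=2))
```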

Such notation is of no use without software tools that let scientists examine and explore the information published by their peers. Experience has shown that understanding a nontrivial algorithm requires both inspecting it and observing its execution. The scientific paper for the digital age therefore would present itself to the reader as an interactive user interface in which data can be examined and plotted, parameters can be changed, and algorithms can be run on different inputs.

Scientists from many disciplines are currently exploring such ideas in projects far too numerous for me to mention here. I have compiled an incomplete list (http://computation-in-science.khinsen.net/notes-chapter-6.html#scientific-communication) of interesting examples that illustrate the potential of such interactive papers. One important obstacle is that producing those documents remains a very difficult task. One attempt to simplify the process is the Everpub project (http://github.com/everpub), in which everyone is welcome to participate. The discussions around this open-source project are public and provide a good overview of the many technical and social issues that must be addressed. I do not know whether Everpub will be a success, but I am certain that something like it will become the standard form of scientific communication in the not-too-distant future.

More about the Authors

Konrad Hinsen (konrad.hinsen@cnrs.fr), Centre de Biophysique Moléculaire, CNRS, Orléans, France.

This article appeared in Physics Today, Volume 69, Number 6.