Computational compound screening of biomolecules and soft materials by molecular simulations

Decades of hardware, methodological, and algorithmic development have propelled molecular dynamics (MD) simulations to the forefront of materials-modeling techniques, bridging the gap between electronic-structure theory and continuum methods. The physics-based approach makes MD appropriate to study emergent phenomena, but simultaneously incurs significant computational investment. This topical review explores the use of MD outside the scope of individual systems, but rather considering many compounds. Such an in silico screening approach makes MD amenable to establishing coveted structure--property relationships. We specifically focus on biomolecules and soft materials, characterized by the significant role of entropic contributions and heterogeneous systems and scales. An account of the state of the art for the implementation of an MD-based screening paradigm is described, including automated force-field parametrization, system preparation, and efficient sampling across both conformation and composition. Emphasis is placed on machine-learning methods to enable MD-based screening. The resulting framework enables the generation of compound--property databases and the use of advanced statistical modeling to gather insight. The review further summarizes a number of relevant applications.


Introduction
Ceder and Persson's Scientific American article The Stuff of Dreams refers to the "golden age of materials design," a new era where computational methods-a mix of hardware and software implementation of physical laws and equations-assist scientists in designing new functional materials [1]. Designing better materials means selecting a chemical composition that yields superior materials properties. Traditional avenues have followed an Edisonian, trial-and-error approach, by experimentally screening as many compounds as possible-an approach that is typically both time-consuming and costly, due in no small part to synthesis, processing, and characterization. Computation offers a parallel route to search for compounds with desired characteristics, where the numerical solution of fundamental equations (e.g., the Schrödinger equation) can make predictions before going to the laboratory. The effort has gained momentum thanks to the development of computational hardware, software, and database tools, demonstrating exceptional potential to accelerate materials discovery in various fields [2,3,4,5,6,7].
There are good reasons to expand compound screening beyond the experimental realm. While high-throughput screening can probe impressive numbers of candidates, the requirements to synthesize, process, and/or characterize large libraries of compounds typically restricts the approach to particular systems and properties [8,9,10,11,12,13]. The computational route certainly also holds its share of system and property limitations, but are alleviated by the variety in resolutions, methods, and algorithms. Limitations may also arise from the set of compounds accessible: synthesized drugs form a minuscule subset of the chemical space of small organic molecules [14]. While not all compounds are expected to be necessary to satisfyingly interpolate the space, the level of subsampling unfortunately leads to a lack of uniformity: a database bias [15,16]. Screening on the computer, on the other hand, needs no synthesis-though its virtual analog, model parametrization, often remains a challenge. More flexibility in choosing compounds enables avenues to exhaustively enumerate small subsets [17], find efficient ways to build up combinatorics [18], and select compounds using more sophisticated strategies, for instance active learning [19].
To remain robust across chemical space, computational methods must rely on fundamental, broadly applicable physical laws and equations. These physics-based methods-including the Schrödinger and Kohn-Sham equations at the electronicstructure level and Newton's classical equations of motion at the classical levelcan make predictions that are grounded in the corresponding physics. Even classical simulations typically give rise to significant computational costs, which had until recently limited their penetration into the field of compound screening. Turning to density functional theory (DFT), the recent development yet rapid adoption of high-throughput schemes for various materials applications testifies to the escalating role of computation in materials screening and discovery [3,20,21,22].
While some fields have already benefitted strongly from computational screening, others lag behind-such is the case for soft condensed matter. Marked by weak characteristic interaction energies on par with thermal energy, k B T , soft-matter systems embody a large class of materials, including not only polymers, liquid crystals, surfactants, colloids, but also biomolecules. When coupled to thermal fluctuations, soft matter display fascinating phenomena, such as spontaneous self assembly and mesoscopic architectures, simply navigating a rich free-energy landscape [23].
Fluctuations de facto require a careful consideration of entropic effects, and adequate computational methods to sample the accessible conformational space. Furthermore, soft-matter systems also typically display poor scale separation, challenging multiscalemodeling approaches [24]. The challenges of modeling biomolecules and soft matter have largely kept the field in a "craftsmanship era." Scientific studies focus on one or a handful of compounds, due to difficulties in parametrizing, preparing, sampling, and analyzing the system. These aspects all stand orthogonal to a screening strategy-automation reigns over the highthroughput paradigm. It is thus no surprise that machine learning and other data-driven techniques are rapidly penetrating the field of soft materials [25,26,27]. The rapid rise of high-throughput molecular simulations is the topic of this review.

Scope
Compound screening is a vast, quickly evolving area that connects to physics and chemistry, materials science, and even branches out to a plethora of applications, from organic photovoltaics to electrocatalysis to drug discovery to biomaterials [28,29,30,31,11]. Despite its focus on biomolecular systems and soft matter, this compoundscreening review will exclude studies originating from experimental data-arguably its largest subset. A large body of work has been devoted to the utilization of experimental compound databases, notably from quantitative structure-activity relationship (QSAR) methods in drug discovery [32,33,34]. Instead this review will focus not only on computational (in silico) screening, but those generated from physics-based methods. Physics-based methods consist of a hierarchy of multiscale-modeling methods, from quantum chemistry, to empirical force-field-based molecular dynamics (MD), to particlebased coarse-grained (CG) simulations, to continuum modeling [24,35,36]. They prevail in some key aspects essential to biomolecular materials and soft matter, specifically the modeling of emergent phenomena and entropy. Further, this hierarchy offers a conceptual bridge to the funnel-like nature of compound screening: quickly screen with fast methods and refine with more accurate models.
Current computational limitations strongly limit a purely quantum-chemical approach to a limited range of problems: primarily isolated molecules or relatively small and homogeneous environments [37]. Classical MD simulations prevail for biomolecules and soft matter, because of their ability to efficiently sample the vast conformational space. For a history and overview of MD simulations, we refer the reader to excellent books and reviews [38,39,40,41,42]. Though MD-based screening studies are dominated by an atomistic resolution, CG models take an increasingly large role, thanks to their more favorable computational load and ongoing improvements in linking the lower resolution to the underlying chemistry. This review will mostly revolve around spatial CG, while coarse-graining in time will largely be left out due to (so far) limited impact on compound screening [43,44,45].

Inverse problems in soft matter
A material, entirely determined by its chemical composition-but also often its processing-will yield specific properties. Making measurements, either by experimental techniques or analytical/numerical calculations, boils down to establishing a mapping between the material composition and its properties. This is commonly denoted the forward problem, and is illustrated in Fig. 1a [46]. Materials design, on the other hand, aims at establishing the backward-or inverse-mapping: identifying the adequate structure given properties of interest. While the forward route is straightforward, there is no experiment or equations of motion to directly probe the backward problem. It instead typically requires solving an inverse problem: from a (small) number of forward measurements, infer the function that links chemistry to materials property. The notorious difficulty to solve inverse problems also applies in materials discovery, and leads to strenuous requirements on the number of measurements compared to the size of the interpolation space [47]. Though commonly referred to as structure-property relationships, this terminology hides that the structure itself is entirely determined by the material's chemical constituents. The review by Sherman et al. clearly differentiates four different stages in the design of soft materials: (i) chemical synthesis or preparation leads to (ii) building blocks with effective, coarse-grained interactions, which drive their assembly into (iii) structures or morphologies, and imprint (iv) properties on the macroscopic scale [48]. This chemistry-building-block-structure-property framework does justice to the complexity, heterogeneity, and large scale separation that characterizes soft matter.
The chemistry to building-block step, (i → ii), is essential to reduce the overwhelming vastness of chemical space [14,49] into a low-dimensional set of effective components with coarse-grained interactions. This requires a thorough understanding of the dominant driving forces: supramolecular interactions such as van der Waals, electrostatics, or hydrogen bonds [50]. Modeling has greatly taken advantage of building blocks by means of top-down coarse-graining, which parametrize simple models based on key phenomenological interactions, while staying close to the chemistry [51,52]. The building-block to structure step, (ii → iii), has likely received the most attention. Relevant work largely consists of improving our understanding or finding practical routes at linking coarse-grained interactions to self assembly. Notable examples include the directed self assembly of diblock copolymer thin films using self-consistent field theory [53]; The "materials design engine," using statistical mechanics as an automatic optimizer, with applications including the folding of a polymer and the directed self assembly of block copolymers [54]; Design principles for colloidal self assembly with short-range interactions, establishing tight restrictions on the relative strength of the favorable and unfavorable interactions, as well as the number of components and energies [55]; A "digital alchemy" framework to control self assembly by optimizing building blocks for a given target bulk structure [56]. The structure to property step, (iii → iv), has largely involved finite-element methods to optimize material microstructures for specific design specifications, such as acoustic, elastic, and photovoltaic properties [57].
At equilibrium an additional consideration may prove useful in approaching inverse problems: the free-energy landscape. Central to any soft-matter system, the free-energy landscape shapes the self-assembly route, navigating down between conformational basins toward a (local) minimum. The free-energy landscape also conditions all observables, by its statistical weights over the conformational space. In the context of solving the inverse problem, the free-energy landscape thus stands as a powerful, physically meaningful intermediary between chemistry and building-block constituents on the one hand and structure/morphology and macroscopic properties on the other.
How does changing the chemistry affect the free-energy landscape? Various studies are tackling this question. Meng et al. reported the free-energy landscape of clusters of attractive hard spheres, including a detailed characterization of the rotational entropy [58]. Scaling up, the field of protein folding has led to great insight into how the shape of the free-energy landscape impacts a protein's properties-the famous funnel-like shape is characteristic of many efficient folders [59,60,61]. These developments further enabled the design of new proteins, whose sequence and structure differ significantly from naturally occurring proteins [62]. Unfortunately not all free-energy landscapes display straightforward shapes; self assembly often results from a competition between conformational basins. Jankowski and Glotzer carefully studied the assembly pathway of patchy particles to grasp the diversity of possible final structures [63].
Coarse-graining likely has a strong role to play in the context of screening. As described below in Section 3.8, a high-throughput study of drug-membrane thermodynamics linked coarse-grained features of small molecules with their potential of mean force of insertion in a lipid membrane [64]. The results suggest that exploring the diversity of top-down CG building blocks (step ii) fittingly simplified the structureproperty relationship, making it easier to identify. CG models evidently coarsen the underlying free-energy landscape, and what could be criticized as a loss in accuracy or resolution can also be seen as a decisive advantage to tackle the inverse problem.
The system-size limitations associated with MD simulations naturally hinder the prospects of scaling up to genuine macroscopic properties. The systems remain instead micro-to mesoscopic and focus on basic structural, thermodynamic, and sometimes dynamical aspects. Their particle-based nature also naturally lend themselves to starting from the (i) chemistry or (ii) building-block steps.

Data-scales
One landmark property of most-if not all-materials is the large dynamic range of relevant length-and time-scales. Microscopic interactions lead to mesoscale architectures and morphologies, but also conformational transitions and aging behavior. It is not uncommon to observe phenomena spanning 10 or more orders of magnitude for either scale: from sub-nanometer to meter, and from femtosecond to seconds or more. Interestingly, these scales are relevant not only to understand the intrinsic properties of the system, but also to probe it: both experimental techniques and computational methods typically specialize in probing a (possibly small) subset of these scales [24,35,65]. For instance, quantum-chemical methods reign at small length-and time-scales, but fall short much beyond the nanometer-and picosecond-marks.
In this review we apply a similar conceptual framework to the number of screened compounds, N compounds . This data-scale, unlike its other two counterparts, is not an intrinsic variable-it is merely a practical consideration to help guide both the forwardmeasurement and backward-inference processes. We refer the reader to Figure 1 for an illustration: establishing structure-property relationships (panel a) hinges upon the number of compounds screened (panel b). As will be described in Section 3, MD studies typically work in the range 1 ≤ N compounds 10 6 , though steady progress will likely rapidly push the upper bound. Working in higher regimes of the data scale will on the one hand strongly impact requirements on the forward-measurement protocol (Figure 1c), but on the other hand permit more sophisticated statistical-analysis techniques ( Figure 1d). The data scale thereby forms an essential pillar to guide a compound-screening study, both to generate a database and garner insight from it.

Computational high-throughput paradigm
Before moving onto applications (Section 3), we first describe the forward-measurement requirements and backward-analysis possibilities that a computational high-throughput paradigm both impose and enable, sketched in Figure 2. The forward-measurement steps necessary to build the compound database-the blue boxes in Figure 2-embody the computational analog of a laboratory's high-throughput screening experiment. The framework demands a strict and homogeneous protocol across compounds for two reasons: (i) it yields a consistent database amenable to extracting structure-property relationships; and (ii) it is practically convenient for automation purposes. The present section describes the various aspects of running MD simulations under these constraints. When possible, the examples will be borrowed from the biomolecular and softmatter fields. In other cases however, examples from other fields-in particular chemistry and hard condensed matter-may prove insightful of where developments may be headed.

Force-field parametrization
The scope and level of refinement of a number of biomolecular force fields attest to the remarkable developments in the molecular-simulation field: some of them are decades in the making, amounting to thousands of finely tuned parameters, and have endured relentless evaluations [66,67,68,69,70,71]. Unlike more empirical methods (e.g., statistical scoring in drug discovery), the physics-based nature of force fields grounds the model in the physics considered. It relies on specific potentials that encode relevant interactions [72,73,74]. Unfortunately force fields are difficult beasts to tame: their complexity makes any (re)parametrization for new compounds laborious and do not always offer systematic strategies.
Automated force-field parametrization is an old idea that is difficult to practically implement. Why is that? Quantum mechanics ought to provide us with a sure-fire way to derive classical potentials. Unfortunately the physics encoded in force fields is rather limited: for instance, most force fields are not explicitly polarizable. The limited physics of the model clouds the relationship to quantum mechanics and instead warrants a parametrization based on experimental properties. Major biomolecular force fields, such as CHARMM and OPLS, typically use a combination of reference information to parametrize across the chemical compound space (CCS; more on that in Section 2.3.2) of drug-like small molecules: like others the CHARMM general force field (CGenFF) uses quantum mechanics to optimize charges and bonded interactions, while Lennard-Jones parameters rely on experimentally determined liquid density and heat of vaporization [75]. The gradual incorporation of model compounds allows CGenFF to broadly interpolate across a large subset of CCS, while retaining high fidelity of structural and thermodynamic properties. A similar strategy has been applied by OPLS [76,77], GROMOS [78], and AMBER [79].
Arguably the incorporation of experimental data in a computational-screening pipeline is unfortunate: experimental data is limited to a minuscule subset of CCS, and it might well defeat the purpose of a virtual compound-discovery study. Despite their broad coverage of CCS, the above-mentioned biomolecular force fields largely avoid this issue by sharing and reusing information between molecules. The piece of information that is typically shared is the atom type. Beyond the chemical element itself, it represents the atom in a molecule given a local environment, for instance an sp 2 carbon in an alkene. The more chemically specific, the better-in other words the larger incorporation of neighboring atoms will more precisely characterize the local environment, and offer all the more resolution. The above-mentioned automated force-field strategies primarily aim at selecting the right atom types, and extract the corresponding parameters from a database. While these atom types have historically been handcrafted by chemical intuition, ongoing efforts aim at generalizing its concept using more robust annotators. For instance, the Open Force Field Initiative is applying so-called direct chemical perception by the use of SMIRKS patterns-linear notations encoding atoms and bonds [80].
The tendency to encode increasingly many atom types begs the question: is there a continuum limit? In effect this is precisely what is probed by machine learning (ML) models that span (subsets of) CCS. While we defer a broader discussion on the topic to Section 2.5, we note that kernel-based methods, such as Gaussian process regression (GPR), assume and enforce smoothness of the input space by the kernel function [81].
It leads to a continuum description of a so-called atom-in-molecule representation, a concept strongly utilized in hard condensed matter [73]. ML models learn a smooth interpolation between many-body atom-in-molecule representations and a target property of interest. ML has rapidly demonstrated impressive capabilities to interpolate increasingly large subsets of the CCS to complex electronic properties. Examples include atomization energies [82], dipole polarizability tensor [83], and multipole electrostatic coefficients [84].
How do we incorporate ML models into force fields? One straightforward approach is to work simultaneously with both: physics-based force fields encode the functional forms and asymptotes that we know, while ML models predict composition-and conformation-specific environments. This approach can lead to excellent accuracy and transferability, reproducing highly accurate coupled-cluster calculations across several molecular datasets, and without the need for any reparametrization [85]. Li et al. have used ML models to predict quantum-mechanical properties, used as input for a polarizable force field, and match liquid-state observables [86]. In both cases the highresolution of the physics-based models-they are both explicitly polarizable-enable a purely ab initio parametrization.
The more ML-centric alternative is to let go of functional forms entirely. Several applications show that this can lead to excellent many-body ML potentials for a variety of molecules and materials [87,88,89]. Moving beyond single systems and toward subsets of CCS is still a subject of ongoing research: most of these approaches have so far focused on a careful interpolation of the conformational space, and the compounded interpolation of composition requires significant adaptations (Section 2.3.2). We point out the ML neural network potential ANI as a notable example in this direction [90]. We also note the challenge of accurately modeling long-range interactions, for instance by appropriate physically inspired kernels [91].
Going down in resolution, developing CG models takes the simulator down either one of two main tracks: top-down or bottom-up [51]. The top-down approach, which builds from phenomenological considerations, may turn out easier to automate in the case that there is a straightforward link between the reference information and the interaction potential. Consider the popular CG Martini force field for biomolecular systems [92]. The automated CG Martini parametrization scheme can read in any small organic molecule, optimize a mapping using a set of heuristics, and predict a chemical fragment water/octanol partitioning coefficient from a neural network for each bead type [93]. Bead types of CG models can be further redefined to best accommodate for the diversity of compounds in the CCS [94]. On the other hand, the bottom-up route starts from microscopic information of a higher-resolution simulation. Systematic parametrization schemes exist, such as iterative Boltzmann inversion or force matching, accompanied by convenient software platforms [95,96]. Aside from the CG potentials, bottom-up strategies can strongly benefit from a more systematic optimization of the mapping itself [97,98]. Combinations of structure-based CG and ML have recently sparked interest and are quickly enabling new avenues, see below Section 2.5.3.

System preparation
System preparation for an MD study has two main tenants: (i) the initial configurations and (ii) the procedure to run the simulation and compute observables (e.g., structural parameter or free energy). Controlling the latter is typically relatively easy, as it often boils down to applying the same simulation pipeline. Building initial configurations in an automated and consistent way, on the other hand, can require more sophisticated approaches: A screening study that focuses on protein-ligand binding must first dock every single compound in the protein pocket. Beyond the proper geometric alignment of the ligand, the condensed phase of a liquid calls for packing of the molecules involved, and thus a delicate placement to avoid steric clashes. This has led to a variety of tools to initialize condensed-phase, soft-matter systems: Martínez et al. designed PACKMOL to create simple liquids, mixtures, and more complex architectures, such as micelles and lipid bilayers [99]; Polymer Modeler is a polymer chain builder [100]; CHARMM-GUI is a sophisticated web server to facilitate the initial configuration of biomolecular systems, such as solvated proteins, and phospholipid membranes [101]; the INSANE script sets up complex phospholipid-membrane mixtures for the CG Martini force field [102]; MemProtMD elegantly prepares CG configurations of membrane proteins by self-assembling the phospholipid membrane around the experimentally resolved protein structure (Section 3.6) [103]; both the Python-based MoSDeF and Hoobas frameworks offer extensible molecular-building capabilities (e.g., patchy DNA-grafted colloids in Hoobas), and the use of Python allows for deeper integration of system initialization and simulation/analysis [104,105].

Sampling
Sampling lies at the heart of molecular simulations: both molecular dynamics and Monte Carlo simulations implement efficient importance-sampling algorithms to navigate a representative subset of the conformational space [39]. But sampling takes on a whole new dimension in the context of this review: not only does a simulation aim at sampling conformational space, compound screening is also a sampling problem-this one in compositional space. Here we limit our overview to recent methods that aim at sampling either space. The use of similar techniques to tackle both spaces is no coincidence, it highlights their resemblance and the associated sampling challenges.
2.3.1. Conformational sampling. The conformational space represents the structural distribution function of the system. A collection of N particles will give rise to a continuous 3N -dimensional space of microstates. The statistical ensemble used to probe the system biases the weighting of the states, e.g., the Boltzmann distribution in the canonical ensemble. This bias means that not all microstates contribute equally, and instead an efficient conformational-sampling strategy should focus only on the more important ones.
More conformational sampling is almost always desired: simulating larger and more complex systems potentially opens up new insight unattainable before, but also helps testing for convergence issues [106,107]. Limited computational resources limit how long the simulations can be, and instead offset many efforts in sampling more efficiently. Several excellent reviews cover the vast and rich area of enhanced-sampling techniques [108,109,110,111]. ML, and in particular deep learning, has opened up a number of new avenues in terms of facilitating conformational sampling [25]. For instance, autoencoders display an architecture that is prone to enhanced sampling: its symmetric bow-tie network, while simply aiming at reconstructing the input sample, forces an information bottleneck in the so-called latent space. Describing a system through this reduced dimensional latent space brdiges naturally to the use of collective variables in enhanced sampling. A famous variant to autoencoders, the variational autoencoder, uses a variational approach to learn the latent representation, resulting in both a generative model and a smooth latent space that enables interpolation [112]. Various studies have leveraged the architecture of a (variational) autoencoder to learn a low-dimensional latent representation of the input conformational space [113,114] or extract the long-time kinetics [115]. The added accuracy one can gain by using ML often comes at the cost of interpretability: how do we express the latent-space dimensions-the collective variables-in terms of simple, physically meaningful coordinates? Ribeiro et al. proposed to iteratively refine a set of proxy reaction coordinates that best emulates the latent-space distribution [116].
Other approaches do away with collective variables, and instead use unsupervised learning as a way to chart a low-dimensional free-energy surface. Chiavazzo et al. have devised a method that iteratively proceeds between MD and nonlinear manifold learning techniques to expand the system away from regions already explored [117]. Expanding conformational space using dimensionality reduction was also proposed by Kukharenko et al. [118]. They used the multidimensional-scaling scheme sketchmap [119] to project the points and initiate swarms of simulations from sparsely (but existing) sampled regions. The generation of molecular configurations that have not been previously sampled was subsequently proposed by means of a loss function that combined an autoencoder reconstruction loss and the sketch-map cost function [120]. The combination of the two approaches effectively appears to achieve features in line with the variational autoencoder: the data-driven learning of a smooth latent-space distribution, coupled to a generative model.
Beyond techniques aiming at enhancing the conformational space sampled, others have tried to blend in qualitative external knowledge-a prior of sorts-to drive the molecular dynamics. Perez et al. employed Bayesian inference to guide protein-folding MD from coarse physical knowledge, such as "form a hydrophobic core" [121]. Folding times were reduced by several orders of magnitude, illustrating that the body of insight about protein folding can be leveraged to speed up protein simulations. Strategies to blend numerical methods or algorithms with heuristic prior knowledge is bound to be useful in other areas.

Compositional sampling.
The chemical compound space (CCS)-the space of all possible molecules or compounds-differs from the conformational space in at least two major ways: First, it is discrete. While microstates can be continuously transformed into neighboring ones, different molecules cannot be arbitrarily close, because of basic chemical rules (e.g., valency). In other words, very few spatial arrangements of atoms will lead to chemically stable compounds. Although there are computational treatments to continuously transform molecules (vide infra), the common setting is to dedicate different simulations for different molecules.
The second defining feature of CCS is its size: the dimensionality of the space is not a simple function of the number of particles. Natural proteins can be built by combinations of 20 amino acids, meaning that there are 20 n unique sequences of chain length n. For very short peptides of length n = 10-barely long enough to stabilize any secondary structure-this already leads us to a space of 10 13 compounds. The increased variety of chemical groups in synthetic polymers will evidently yield a much larger CCS. Now consider small-drug like molecules that obey Lipinski's "rule of five"-restricting the molecular weight, hydrophobicity, and number of hydrogen bonds-which capture the physicochemical properties of most orally active drugs [122], its space is estimated at 10 60 chemically stable molecules [14]. There are not enough carbon atoms in the universe to synthesize all of them! What can we do, then? Just like microstates, not all molecules are made equal-most will yield uninteresting properties. Focusing on the ones with desired properties is precisely the answer to solving the inverse problem (Section 1.2).
While overwhelmingly large, important steps in better grasping the size and scope of the CCS of drugs have been made. Reymond and coworkers have sidestepped the minuscule, inconsistent collection of synthesized drug-like molecules by instead constructing them algorithmically [49,18]. Graph-based methods combined with valency rules offer a systematic way to enumerate large subsets of CCS-most of which have never been synthesized. The so-called "generated database" (GDB) enumerates a dense coverage of molecules containing a set of elements up to a threshold in number of heavy atoms: the GDB-17 contains 10 11 molecules up to 17 heavy atoms of C, N, O, S, and halogens [123]. Beyond their identity, computing properties of these dense subsets has subsequently been subject to much activity, because they enable the training of ML models (Section 2.5). The GDB has been used for the calculation of electronic properties, typically from density-functional theory (DFT), of increasingly many compounds: Rupp et al. When tackling the exploration of CCS, coarse-graining can offer significant advantages. Top-down, phenomenological CG models focus the modeling on the essential ingredients or driving forces at play [51]. This minimalistic approach can lead to generic-if not universal-behavior that broadly applies to many systems. One famous example is the Kremer-Grest polymer model [126,127]. Zhang et al. demonstrated that a melt of this phenomenological model can broadly be backmapped to many different types of homopolymers [128]. Everaers et al. recently matched the generic large-scale behavior of Kremer-Grest simulations to chemistry-specific experiments via the Kuhn length [129]. . Transferable coarse-grained models can reduce the size of chemical compound space: fewer coarse-grained (CG) compounds are required to probe a subset of chemical space. They make use of a finite set of bead types to introduce a degeneracy in the CG representations of chemical compounds [93].
While the link between top-down CG models and the underlying CCS often remains qualitative, there can be approaches to establish it. Many of these top-down models are transferable, in that they define a limited set of interactions of bead types to encode the variety of chemical groups. In case of the popular Martini model the bead types roughly span the hydrophobicity scale [92]. This limited chemical resolution means that molecules alike will often map to the same CG mapping. This critically introduces a degeneracy in CG representation of small molecules, and effectively a reduction in the size of CCS. Figure 3 illustrates the use of Martini for small molecules: it can lead to a reduction in chemical space by roughly 3 orders of magnitude. The mapping from molecules to CG representations is straightforward to establish using automated parametrization schemes of GDB-type libraries [93,130]. This reduction of the size of CCS can be applied to significantly boost the compound screening of thermodynamic properties-one example will be covered in the context of drug-membrane interactions, Section 3.8.
Beyond mere enumeration or serendipitous picks, there are more efficient ways to explore CCS. Virshup et al. devised an algorithm to stochastically grow an initial set of compounds to maximally diversify it, restricted to specific properties (e.g., drug-likeness) [131]. They reported a library of 10 4 compounds representative of the GDB-13, yielding a 10 4 reduction factor while retaining its diversity. Such an approach is likely to go hand in hand with the training of ML models, which require a good balance between chemical similarity and a representative coverage of the interpolation space. At the other end of the spectrum, Hoksza et al. presented the Molpher framework, which provides a (discrete) path in chemical space between a pair of compounds [132]. It performs a series of simple structural molecular changes, such as atom addition or removal, from start to target molecule.
Other approaches at sampling CCS emphasize the (bio)chemistry or physics of navigating across molecules. Taking inspiration from nature has led to the adaptation of Darwinian-type directed evolution [133]. Computational directed evolution has so far mostly been applied to protein design, and more specifically to enzymes [134]. Leveraging the aptness of computational physics to perform importance sampling, a Markov Chain Monte Carlo scheme can efficiently sample across CCS [135]. Closer to reproducing a laboratory experiment, Wang et al. implemented an ab initio nanoreactor, leading to spontaneous chemical reactions and the formation of molecules through a variety of pathways [136]. Such a computational setting holds great promise in studying in more detail the origins of life [137].
While most of these approaches tackle CCS in its discrete form, continuous explorations may well prove extremely strategic. However, connecting compounds in a continuous manner requires some craft. One notable example is the alchemical transformation, a powerful tool in statistical mechanics to compute free-energy differences [138]. It relies on a crucial property: state functions do not depend on the path taken, and instead permit non-physical-alchemical-interpolations between two compounds (more on this in Sections 3.2 and 3.3). A corresponding framework can be used to compute ab initio energy gradients and other changes in properties upon local changes in CCS [139,140]. Aside from the relevant materials properties, the inclusion of derivatives may help in more efficiently mapping structure-property relationships [141].
Another strategy to circumvent the discreteness of CCS consists of imposing a continuous proxy. Such a proxy will enable continuous-optimization schemes, thereby facilitating molecular design. Wang et al. employed a linear combination of atomic potentials to establish a continuous property landscape [142]. In a similar vein, von Lilienfeld et al. relied on an energy functional based on the nuclear and electronic chemical potential [143]. With the advent of deep learning, new solutions have been proposed: Gómez-Bombarelli et al. used a variational autoencoder (covered in Section 2.3.1) to not only reduce the CCS, but more importantly to smoothen it [144]. Built in the variational autoencoder, the representation of the latent space allows a continuous exploration of the CCS. The architecture was connected to a surrogate model, whose objective was to predict a target property in the reduced latent space, enabling continuous optimization. This active-learning, Bayesian-optimization approach has lately been applied in the context of soft-matter systems by Shmilovich et al., as described in Section 3.7 [145].

Data infrastructure
Assuming all technical requirements permit MD simulations at high throughput, the question arises: what to do with the data? Handling large collections of MD simulations can easily require extensive storage solutions. More importantly, it poses the problem of data sharing-not only between group members and collaborators, but across the community at large. Recent cultural shifts in science are increasingly encouraging the dissemination of research data. A collaborative and open-source approach to scientific endeavors can strongly accelerate the pace of research [146]. Databases of experimentally determined materials properties, for instance for polymers, can prove invaluable to extract structure-property relationships and assist in designing better materials [147,148,149].
What to do, then, to publish large collections of MD simulations? An increasing number of online repositories dedicated to hosting scientific data have come about, Zenodo [150], figshare [151], or the Open Science Framework [152], to name but a few. These databases are generic in that they are agnostic to the type of scientific data, unlike, say, the Protein Data Bank (PDB), which specializes in biomacromolecular structures [153]. The next question is the data format. One straightforward solution is to simply compress all the input and output files of a set of MD trajectories and upload them as is-a strategy our group adopted to publish hundreds of umbrellasampling MD trajectories [154]. This lets anyone freely access the data, but presents caveats. Notably, (i) it does not facilitate automated strategies to search and collect information about the data, and (ii) the input/output formats are tied to MD software used to generate the simulation trajectories. This is more formally denoted by a lack of data labeling-or metadata-and data normalization, respectively. The convenient access, retrieval, and categorization of heterogeneously generated data is key to assemble large databases, amenable to training ML models (more on that in Section 2.5). Such a framework has been formalized by the FAIR principles: data that is Findable, Accessible, Interoperable, and Reusable [155]. The new era of computational materials design mentioned in the Introduction is in no small part made possible by a robust data infrastructure in materials science [156]. Publishing large FAIR datasets is becoming increasingly widespread, thanks to solutions like the Materials Data Facility [157]. The development of a number of data-infrastructure platforms, such as NOMAD and the Materials Project, strive to label electronic-structure calculations by detailed metadata, parse many codes and normalize the input and output information, and offer access via a webpage or a programmatic interface [158,159]. Several consortia are working their way toward more robust data infrastructures for molecular simulations, including OpenKIM [160,161], MOLSSI [162], and FAIR-DI [163]. Recent examples show that the interconnection of specialized databases can help automate the metadata annotation process, as will be described in Section 3.6.

Data analysis
Once the difficult task of generating MD-based compound databases is over, a second one starts: the data analysis. Here we will rely on the concept of data-scale, already introduced in Section 1.3. Figure 1 illustrates that the number of compounds largely determines the type of statistical modeling. This constraint stems from the expressivity of a statistical model, which depends largely on the number of parameters of the architecture and dimensionality of the representation, which themselves require larger training set sizes. We structure what follows in terms of the data-scale by means of the variable N compounds , from the traditional setting of craftmansship, to data mining in the low-data regime, to kernel-based ML methods, to deep learning.
2.5.1. Craftsmanship. Working in a regime N compounds ∼ 1 leaves little room for data-driven analysis methods. It instead embodies the traditional setting of gathering insight driven by physical theories, experiments, prior computer simulations, or simply intuition.

Data mining.
Moving up to N compounds 10 can offer enough information to systematically search for simple structure-property relationships. The low number of samples puts a strong limit on the dimensionality of the sample information-the descriptors. Relating low-dimensional descriptors to materials property has enjoyed great attention for decades, embodied for instance by so-called quantitative structureproperty relationships (QSPR) [34,164]. QSPR is a well-established, powerful method to functionally relate chemical structure to property. It relies on a set of descriptors, typically combined using a (multivariate) linear fit. More recent applications have turned to using the kernel trick to convert a non-linear problem into a linear one, support vector machines can then highlight the most important descriptors, and we further note the increasing use of artificial neural networks [165,166]. While much attention of QSPR has been devoted to drug discovery, we also note other soft-matter applications, such as the self assembly of conjugated oligopeptides (more on that in Section 3.7) [167] and the tribology of functionalized, lubricating monolayer films [105].
A more recent take on the functional discovery of structure-property relationships brings us to learning more complex equations. Compressed-sensing methods extend QSPR to expand the complexity of the functional relationships tested. They rely on a large combinatorial consideration of trial candidate equations, and a greedy l 1norm optimization scheme to minimize the number of non-zero coefficients. Examples include the symbolic regression of nonlinear dynamical systems [168] and equations from the Feynman Lectures on Physics [169]. Ghiringhelli et al. used least absolute shrinkage and selection operator (LASSO) to extract functional relationships between descriptors that can accurately classify between zinc blende and rocksalt semiconductors [170]. Ouyang et al. refined the approach using the sure independence screening and sparsifying operator (SISSO), which hierarchically searches for combinations of descriptors [171]. Rather than building a single surrogate model aimed at explaining the entire dataset, another method called subgroup discovery focuses on coherent homogeneous subsets. Goldsmith et al. revisited the zinc-blende/rocksalt semiconductor problem and identified separate regions with strict constraints [172]. These models are of particular interest at a time where ML models are increasingly criticized for their lack of interpretability: identifying the explicit role of the input variable in the structureproperty mapping.
By and large, these approaches aim at capturing the essential variables or descriptors that dictate the target property. This dimensionality reduction aims at garnering insight into the problem at hand, ideally by visualizing how the minimal set of descriptors link to the property. The systematic construction of reduced dimensional representations is a vast field, one that naturally connects to unsupervised-learning techniques [173].

2.5.3.
Kernel-based supervised learning. The regime N compounds 10 3 is amenable to the optimization of much more expressive models. These are often called surrogate models, because a prediction bypasses the need for further computer simulations. We refer the reader to several excellent reviews on the use of (kernel-based) ML for molecular systems [4,88,174,25,6]. Compared to QSPR methods, ML methods are free of fixed functional forms, and instead offer flexible interpolation between training points in a high-dimensional feature space [81,175]. ML models exploit similarity in several ways: they first impose a metric, allowing us to measure distances in CCS, a critical ingredient to both explore and sample from that space (Section 2.3). Similarity is explicitly assumed by enforcing smoothness between input space and target propertyan aspect that helps interpolate between training points.
The increased expressivity of ML relies on the use of higher-dimensional input information, representations, rather than mere descriptors. Representations offer a more detailed-many-body-description of the system, such as a molecule or an atom in its local environment [82,176,177,178]. The need to account for physical symmetries was recognized early on [179]. The Noether theorem states that symmetries in a physical system lead to conservation laws and invariants. Empirically learning these invariants often requires significant amount of training data-encoding them in the representation or the ML architecture can lead to significant learning improvement [180]. As a result, translation, rotation, or (when applicable) permutation invariance often form the basic requirements for ML representations. Symmetries can be added to the kernel itself, notable examples include the learning of vectors by covariant kernels [181] or energy-conserving force fields via the Hessian [89,182,183]. Additional constraints can be added as well, for instance a decomposition ansatz when the target property lumps several terms, useful to decompose reference forces [87], atomic dipole moments [184], or free energies [185]. Kernels turn out to be extremely convenient to encode physical constraints because they work within the realm of linear algebra. Extending these properties to neural networks and deep learning is more challenging, though the improved expressivity has motivated active developments (vide infra).
The lessons learned to build ML models in chemistry and materials science largely transfer to soft matter and biomolecules, where similar constraints on the representation prevail [186]. Screening studies that make use of kernel-based ML have become prominent, for instance in protein-ligand binding, but many typically use experimental data [176]. Using MD, the relevant data-scale regimes typically require a CG approach. For instance in drug-membrane thermodynamics, CG simulations of ∼ 10 3 systems led to predictions for 1.3 · 10 6 molecules, thanks to the CG model's reduction of CCS [135]. The predictions satisfied thermodynamic relations observed on smaller data sets, strongly suggesting robust generalization. While this study was based on a top-down CG model, systematic approaches like the variational force-matching method bode elegantly well with the loss function of an ML model. This has resulted in several studies, and in particular efforts at addressing the challenging question of mapping many atomistic configurations to a single CG geometry [187,188,189,183].
Several challenges still lie ahead for a more robust description of condensed liquidstate systems. For instance, a (macro)molecule is never isolated, but embedded in its environment, such that a representation may benefit by incorporating the neighboring solvent's degrees of freedom [190]. The nature of the systems naturally calls for the development of ML-based force fields that incorporate long-range interactions [191], as well as more particle types. We also point out the critical role of the configurational aspect: a single geometry is not representative, but rather should incorporate information about the underlying Boltzmann distribution [185]. More than anything else, high-quality ML models require extensive training data. Soft matter needs large, homogeneous databases analogous to what has been developed from DFT calculations for electronic properties, e.g., the QM9 database [124].

Deep learning.
The extraordinary results achieved with deep learning in so many scientific and technological fields have to do with the added expressivity of these models. Using a neural-network architecture that connects several layers of nodes, input and output can be mapped to generalize surprisingly well [192]. Compared to the above-mentioned regimes, the added expressivity of deep learning comes at a price: many more training data points are necessary to parametrize a model, typically in the range N compounds 10 6 . The benefits of deep learning are far reaching: notably for drug discovery-though so far with data generated from experiments [193,194], we also outlined some of the distinct conceptual advantages a deep-learning approach offers for sampling both across conformations and compositions (Section 2.3.1). In terms of representing molecules, the inclusion of symmetries is also an essential aspect, requiring extensive methodological work [195,196]. They open the door to so-called physics-informed neural networks, which aim at a synergistic combination of the two approaches to reduce the training data, effectively regularizing in small datascale regimes [197]. Deep learning offers exciting opportunities: for instance graph convolutional neural networks (CNNs) offer a physically intuitive representation for molecules, where nodes and edges represent atoms and bonds. Graph CNNs offer appealing features: differentiable, more easily interpretable, and better performing than commonly used molecular fingerprints [198].
Harnessing the full potential of deep-learning models puts stringent requirement on the number of compounds, which severely restricts what can be achieved in terms of screening studies. Few MD studies have reached data-scale regimes amenable to deep learning, but impressive first steps show much promise, such as the prediction of transfer free energies in lipid membranes [199]. It offers a glance at the use of MD-based studies to train deep-learning models across the CCS of biomolecular and soft materials.

Screening applications
The following describes a number of MD-based screening applications for various softmatter and biomolecular systems. We order the applications roughly in the number of compounds screened, from low to high, and grouped by topics when deemed fitting. Beyond the range of screening sizes, some of these applications result from intense and long-standing scientific activities. For those, the present review cannot do justice to the breadth of these research topics, but will hopefully stimulate the reader in diving into complementary readings.

Exploring conformational space with swarms of trajectories
Far from a screening at high throughput, this first application focuses on the study of individual (macro)molecules. While slightly deviating from the greater objective to screen across compounds, the conceptual approach and implementation undertaken here is relevant for our topic, as it provides innovative solutions to exploring conformational space.
The problem at heart involves the determination of kinetic properties for systems exhibiting relevant processes at long time scales-long compared to what would be considered reasonably achievable by a single trajectory on a supercomputer. Supercomputers tackle ambitious simulations by means of CPU or GPU parallelization. Unfortunately, not everything is easy to parallelize: While one can easily segment a simulation box to treat smaller cells concurrently, MD numerically integrates the equations of motion in a serial fashion-it is difficult to parallelize time. Folding@Home tackled the problem by introducing two complementary aspects: a conceptual approach to circumvent the long-time-scale sampling problem, and a platform to implement it [200].
The dynamics of complex systems is typically dominated by free-energy barriers: thermal fluctuations will lead a system to dwell in a conformational basin (i.e., a local minimum), before being spontaneously pushed over a barrier. Assuming singleexponential kinetics with (unknown) rate k, the probability for the system to cross the barrier by time t is given by P 1 (t) = k exp(−kt). Rather than wait for a single trajectory to cross over once, let many copies attempt it over a short time. In the case of M simulations, the probability for the first simulation to cross within the same time t is now P M (t) = M k exp(−M kt), exhibiting an effective rate that is M times faster. The pioneering work of Pande and coworkers demonstrated the value of the approach: running multiple instances of a short simulation boosts the chances of seeing early crossing events, and sufficiently many occurrences allow them to estimate the rate k, as illustrated on the folding of small peptides and polymers [201].
The second breakthrough of the Folding@Home consortium was to establish a distributed-computing platform, powered by idle CPU power contributed by anonymous users over the internet [200]. Running many short, uncoupled simulations meant that they did not need to run on the same supercomputer. All simulation instances need no communication, since they independently sample the same conformational space. Practically this was simply realized by M copies of the same initial configuration (typically with different seeds and velocities), since the stochasticity of the dynamical process will quickly lead to diverging trajectories.
One of the early examples of the Folding@Home project aimed at the folding kinetics of two mutants of the designed, 23-residue-long mini-protein BBA5 [202]. With a mean folding time on the order of 10 µs, it is considered a fast-folding protein, yet very much a challenging time-scale for an all-atom MD simulation-especially at the time the research was conducted. Following the above-mentioned reasoning for singleexponential kinetics, they estimated that for such a folding timescale, roughly 10 out of 10,000 individual trajectories should fold after 10 ns. Using an implicit-solvent unitedatom model, they showed that an impressively large number of short simulations yielded excellent agreement with laser temperature-jump experiments.
Folding@Home has made significant contributions in elucidating the protein-folding problem in silico [61,203]. Early applications were then superseded with Markov state models, a more robust memoryless master-equation treatment of the kinetics, pioneered by Noé, Pande, Chodera, Bowman, and others [204,205,206,207,208].
Moving away from protein folding, a more recent application of distributedcomputing platforms focused on protein-ligand binding. Using their distributedcomputing platform GPUGRID, De Fabritiis and coworkers demonstrated the value of the approach for PMF calculations for standard binding free energies [209]. Buch et al. reported an impressive study of the enzyme-inhibitor complex trypsin-benzamidine: they performed 495 unbiased MD simulations of the unbound ligand for 100 ns each [210,211]. They sampled a variety of binding events, but also several pathways, allowing them to robustly estimate both the binding free energy, as well as the on and off binding rates. Extensions to the modeling of protein-protein association kinetics form to date one of the most impressive developments in this area [212].
Distributed-computing platforms have had a conceptual impact as to how the community increasingly approaches MD simulations: from handcrafted, individual instances to swarms of trajectories. The associated need for automation paves the way for different kinds of high-throughput MD simulations. Spawning MD trajectories has since been extended to exploring uncharted regions of the free-energy landscape using machine learning [117].

Protein-ligand binding
The ever-growing penetration of computational chemistry in drug discovery has experienced its shares of challenges [213]. Like any complex engineering problem, the design of a drug entails many considerations and complementary problems to solve. From membrane penetration, to toxicity, to pharmacokinetic and pharmacodynamic considerations, we focus here solely on the determination of protein-ligand binding.
Basic structure-based drug-design methods typically assume rigid drug-target structures: starting from a crystal structure or homology modeling, a ligand is docked near the receptor's active site; the molecular configuration is then used to estimate binding, often using empirical scoring functions as a proxy. While this type of virtual screening accommodates a large number of compounds, it models the complex as mostly rigid. The lack of flexibility is an issue, given the recognized role of the conformational ensemble in biomolecular activity [214]. The field moved from a static lock-and-key binding paradigm to more dynamic pictures, such as induced fit or conformational selection. This emphasizes the need for physics-based methods that model not only structural flexibility, but more broadly the relevant emergent phenomena following binding [215].
Beyond flexibility, an accurate account of the binding free energy is desired. Free energies are ensemble properties, making the scoring of any individual configuration a conceptually peculiar exercise. Several methods have been developed and tested over the years-the drug-design field having explored many methodologies to strike the right balance between accuracy and throughput: from end-point methods to rigorous calculations derived from statistical mechanics.
One prominent example of an end-point method combines MD simulations on the bound and unbound configurations, using an implicit solvent and a Poisson-Boltzmann surface area solvation term (MM-PBSA). Brown and Muchmore applied MM-PBSA to a set of 308 ligands bound to one of three protein receptors [216]. The breadth and scope of the study is laudable: moving toward a high-throughput MD scheme to extract free energies of binding. The moderate correlation coefficients (Pearson coefficient R 2 = 0.5 − 0.7) are unfortunately a testament to the difficulties end-point methods display in reliably directing drug discovery [217,218].
Alchemical transformations provide a rigorous framework to compute binding free energies [138]. Though many methodologies exist [219,220], we mention one equilibrium techniques that aims at calculating the free energy upon transforming from state A to B: Free-energy perturbation, introduced by Zwanzig [221], relies on exponential averaging where r denotes the system's particle coordinates, H A is the Hamiltonian of state A, and · A is an ensemble average at state point A. Three decades ago, the pioneering study of Wong and McCammon presented an alchemical transformation between benzamidine bound to the enzyme trypsin [222]. A fascinating review by Jorgensen describes some of the successes of MD coupled with alchemical transformations to advance the drug-discovery pipeline [217]. While the generation of new scaffolds (i.e., entirely different structures) is naturally sought, so-called hit-to-lead optimization-refinement of the binding of a promising starting compound-is where alchemical transformations really shine. There are two reasons for this: (i) the computational expense of each alchemical transformation limits the screening to relatively few compounds, thereby limiting the chances of finding new scaffolds; and (ii) the interpolative nature of an alchemical transformation (i.e., overlap in the conformational spaces, see 1) leads to better convergence for similar molecules.
Alchemical transformations took a more systematic turn with the study of Wang et al. [223]. They reported relative free-energy calculations at an all-atom level with explicit solvent for an impressive 200 ligands. This feat was aided by the deployment of MD simulations on graphics processing units (GPU), as well as a streamlined procedure to prepare and run alchemical transformations. Critically, they optimized a "perturbation graph," which measures the maximum common substructure between any pair of compounds [224]. The algorithm minimizes the number of alchemical transformations, while accommodating for both multiple pathways to estimate statistical error and the presence of closed cycles (which ought to yield no free-energy difference). With a total of 330 perturbations, they reported a rootmean-squared error against experiments of only 1.1 kcal/mol. More recent work has reported alchemical transformations for up to several thousands of ligands [225]. Forcefield improvements, from OPLS2.1 to OPLS3 and OPLS3e have yielded systematic improvements in binding free energies [76,77].
Three decades of MD-based computational drug design have shown impressive developments: not only in the sheer number of compounds (from 1 to thousands reported in a single study), but more importantly in the convergence of the calculations via significantly longer simulation trajectories, and an overall improvement of the force fields. The significant contributions of industrial actors is a testament to both the pressing needs of the pharmaceutical industry and the opportunities offered by physicsbased MD methods.

Solvation of small molecules
The free energy of solvation of small molecules is in many ways an antechamber to protein-ligand binding: it consists of the free-energy difference of transferring a small molecule from the gas into a condensed-phase environment. Rather than a protein pocket, solvation is performed in a bulk liquid. The homogeneity of the medium makes the calculations easier to converge, typically allowing for broader studies that may accommodate significantly more compounds.
The study of Jorgensen and Ravimohan pioneered alchemical transformations by converting methanol into ethane [226]. They applied free-energy perturbation (covered in section 3.2) to compute the relative free-energy difference in hydration-solvation in water-of the two compounds. An alchemical transformation between these two similar molecules helps the calculation: it only requires decoupling the hydroxyl group and coupling a methyl in its stead.
Modeling solvation has had significant impact as a proxy for more complex phenomena-a prime example being protein folding (some of which was covered in section 3.1). The protein-folding problem was always strongly pushed by computer simulations [61]. Huang et al. reported an insightful study on hydrophobic solvation, they calculated the free energy of solvation for hard-sphere solutes of various sizes [227]. These solutes, though not directly linked to any particular chemistry, aimed at a better phenomenological understanding of possibly large hydrophobic regions exposed to water, such as in protein folding. Of particular interest was the systematic change in the solute size and comparison of the asymptotics against theory. In the same vein, the early 2000s witnessed intense activities in accurate calculations of hydration and transfer free energies of (neutral) amino-acid side-chain analogs [228,229,230,231].
Mobley et al. reported hydration free energies for a set of 44 small, neutral molecules [232]. A larger set of 239 small neutral organic molecules was later tested against various force-field parameters and charge models [233,234,235]. In parallel, Mobley et al. released the FreeSolv database, a set of 504 neutral small organic molecules, with comparison against experiments [236]. Such studies have led to the more routine incorporation of hydration free energies in validating force fields [76,77]. Scaling up, Bennett et al. recently reported an impressive 15 · 10 3 water-cyclohexane transfer freeenergy calculations from all-atom molecular dynamics [199].
Experimental free-energy datasets such as FreeSolv are useful because they cover much of the diversity of small drug-like molecules, although the small number of compounds necessarily limits how representative they are.
ML models of in silico hydration free energies trained on different datasets-both experimental and combinatorially generated-did not appropriately generalize across each other, highlighting biases in the chemical space covered [185]. Still, the increased size and breadth of the spanned chemical space allow researchers to identify systematic problems with force-field parameters for classes of compounds. The same holds true at the CG level: the automated Martini parametrization scheme for small molecules facilitates the calculation of partitioning free energies for several hundred molecules [93]. It helped identify systematic issues with certain chemical groups, such as rings or halogens, which new versions of the force field aim at correcting [237].
With a growing number of computational techniques to compute free energies, how can one compare their predictive accuracy in a fair way? Nicholls et al. set up an informal blind-test study, comparing different methodologies for 17 small molecules [238]. This was later formalized through the SAMPL challenge [239,240]. The blind tests consisted of teams applying their method to compounds for which solvation free energies are known but unpublished or relatively inaccessible. It avoids the risks of tuning model parameters that would skew results to seem artificially more favorable. SAMPL2 introduced an explanatory section to gain insight in (disclosed) unexpected experimental results [241]. Later challenges have since occurred and keep helping benchmark and refine computational methods [242].

Ionic liquids
Ionic liquids (ILs) are salts. They exhibit a melting point or glass-transition temperature below 100 • , while so-called "room-temperature" ILs remain liquid below 0 • . ILs typically exhibit good thermal stability, low vapor pressures, and are able to dissolve many compounds. This makes ILs interesting solvents in sustainable chemistry, with technological applications such as solvent for biomolecules or catalysis [243]. Critically, ILs are also conductive, which makes them candidates for use in electrochemical applications. In parallel, the combinatorics of association of cation-anion pairs leads to an extraordinary number of possible ILs. The combination of the breadth of chemical structures available and the variety of properties of interest has motivated a number of quantitative structure-property relationship modeling, albeit so far mostly exclusively from experimental data [34].
Computer simulations have played a significant role in better understanding ILs. Maginn pointed out that interests in ILs rose coincidentally with the advent of computer simulations, which have proven increasingly capable of shedding light on complex fluids [244]. The complex structural, thermodynamic, and dynamical aspects, including behavior at interfaces, viscosity, and dynamical heterogeneity motivated computational studies at various scales, from quantum-mechanical calculations to classical atomistic to coarse-grained modeling [245,246,247,244].
Turning to computational screening, Osti et al. reported an insightful study aimed at probing ion interactions and transport in solvated ILs [248]. They fixed the IL cationanion pair (1-butyl-3-methyl-imidazolium bis(trifluoromethylsulfonyl)), but screened across four organic solvents: acetonitrile (CH 3 CN), methanol (CH 3 OH), tetrahydrofuran (C 4 H 8 O), and dichloromethane (CH 2 Cl 2 ). The potential of mean force of separating a cation-anion pair suggested clear correlations between the energetics of the interaction and solvent polarity: a larger dipole moment is better able to screen ion-ion interactions, thereby decreasing the free energy of solvation. This clear trend was mirrored in the dynamics: ion diffusivity showed a linear increase against the solvent dipole moment. The results were corroborated by quasi-elastic neutron scattering experiments, overall offering clear structure-property relationships.
A larger, follow-up screening yielded surprising results [249]. Thompson et al. extended the set of systems they studied, both in terms of IL-solvent mixtures (18 increments in the range 0.1-0.95 mass fraction) and solvent chemistry (22 solvents including nitriles, alcohols, halocarbons, carbonyls, and glymes) for a total of 396 state points. This study both further confirmed a previously observed trend-IL mass fraction against IL diffusivity-and uncovered a new one-solvent diffusivity against IL diffusivity. Critically, they revisited the previously observed trend by Osti et al. between IL diffusivity and solvent dipole moment [248]: the incorporation of more compounds indicated no strong correlation across the entire data set. The authors hinted at the role of complementary solvent order parameters to recover clear trends. Combined, the two studies by Osti et al. and later Thompson et al. illustrate a decisive aspect: the inference of structure-property relationships hinges on a representative set of chemical compounds.

Silicate glasses
Glasses-materials that have been cooled significantly but without crystallizing-are known as structurally similar to but dynamically very different from liquids [250]. Glassy materials play a key role in many technological areas, motivating the optimization of their mechanical properties, from hardness to fracture strength to elastic properties [251]. Glasses embody an overwhelming class of materials, when considering not only the compositional aspects-potentially including a large number of elements of the periodic table-but also its strong out-of-equilibrium nature, meaning that the processing of the material can easily lead to kinetic traps.
Yang et al. recently presented a high-throughput MD study of silicate glasses, in an effort to predict their Young's modulus [252]. They covered the ternary diagram of calcium aluminosilicate (CAS), CaO-Al 2 O 3 -SiO 2 , by use of 231 compositions over the domain in 5%-mol regular increments. The authors ran MD simulations with tailored force fields [253] using a melt-quench procedure to prepare the configurations. All efforts were made at providing a consistent system-preparation and simulation protocol throughout the compositional space studied, but some limiting regimes required specific treatments: (i) Higher initial melting temperature for samples with high SiO 2 concentrations, due to their higher glass-transition temperatures; and (ii) Faster cooling rate for samples with high CaO concentrations, as they otherwise tend to crystallize. These aspects illustrate the challenges faced by the need for consistent protocols across large regions of chemical/compositional space.
From the simulation data, they predicted the Young's modulus across the compositional space using different statistical models. All their approaches-from polynomial regression to various flavors of machine learning-led to excellent results, indicative of both a dense sampling of the compositional domain and a smooth mapping to the target property. Interestingly, they showed that fitting models to available experimental data (∼ 100 points) led to severe biases: (i) Clustering of the available data leaves large domains without any training points; and (ii) Significant uncertainty and systematic errors between experiments can lead to large variations. While the latter aspect can be alleviated by means of adequate regularization, the former recalls the ever-present dangers of extrapolation.

Membrane proteins
Building up on the modeling of soluble proteins (see Section 3.1), membrane proteins form an important subset due to their biochemical impact: they form roughly 25 % of all human proteins [254] and half of current drug targets [255]. Membrane proteins typically exert significantly more complexity than their soluble counterparts. Transmembrane proteins in particular-those that span the membrane bilayer-evolve in a highly complex environment at the interface between the membrane and the aqueous environment. This complex environment is compounded by the large sizes that membrane proteins typically exhibit, often made of numerous α helices or a prominent β barrel. As a result, the size and heterogeneity of membrane proteins have made them challenging, not only for structure determination [256,257], but also for computer simulations [258,259,260,261].
The computational modeling of membrane proteins has benefitted heavily from particle-based coarse-grained models. An all-atom treatment of a protein and its surrounding lipid membrane remains to date a heroic effort: protein folding happens over much longer time scales in the membrane, due to the much larger correlation times exerted in the bilayer. Peptide folding and insertion in a lipid membrane has been reported at an atomistic level, although using an implicit-membrane description, thereby speeding up the peptide dynamics in the membrane environment [262]. Alternatively, coarse-grained models offer an appealing way to study peptide folding and insertion in explicit membranes, thereby offering the means to monitor how the peptide perturbs membrane structure [263,264].
A coarse-grained description of membrane proteins does not only allow to study folding and insertion for one of them, it can also be used to study a larger number of systems. Sansom et al. presented more than a decade ago an impressive protocol to automate the preparation of transmembrane proteins [265]. Starting from experimentally determined protein structures-typically deposited in the Protein Data Bank (PDB) [153,266]-these macromolecules typically lack structural information about the aqueous and membrane environments. Running MD simulations of a membrane protein requires first to solvate it in both a lipid membrane and an aqueous environment. Atomistic protocols typically start from equilibrated lipid bilayers and place a hole to incorporate the macromolecule [267]. Instead, the CG protocol of Sansom et al. did not order the lipids in any way, but rather incorporated them as an unstructured "soup." The soup spontaneously rearranged into a bilayer, thanks to self assembly and the speedy molecular diffusion at the CG level. Other CG based schemes have been developed to ease and automate the generation of complex lipid bilayers [268] and the assembly of membrane-protein multimers [269]. We note that the Martini-like CG model does not allow for secondary or tertiary structure reorganization, and is instead restrained around the crystal structure [270].
The pioneering database of Sansom et al. contained 91 membrane proteins and was made available together with a web server to easily visualize structural information [265].
Though no longer available today, the Sansom group later released an expanded database of membrane proteins: MemProtMD [271]. Based on a more sophisticated pipeline, the CG-based preparation protocol was amended by a backmapping to atomistic resolution [272]. They also more systematically imported structures from the PDB. The shear size and incomplete data annotation of the PDB led them to design structural descriptors to detect α-helical and β-barrel membrane proteins. An ensemble analysis across structures allowed them to gain insight in the probabilities of occurrence of amino acid side chains with respect to the depth in the bilayer. The MemProtMD database and associated web server contains more than 3,500 PDB entries [273]. A systematic connection with other databases brings in additional metadata to group structures according to their constituent proteins and family. The network of protein databases helps automatically annotate these structures with valuable information.
Beyond the screening of membrane proteins themselves, cell membranes embed these biomolecules in complex plasma membranes, made of a wide diversity of compounds. Corradi et al. studied the protein-lipid interactions for 10 membrane proteins embedded in a model plasma membrane made of 60 lipid species [274]. The authors identified clear "lipid fingerprints:" preferential association of certain lipid species to parts of the protein. This study highlights the combinatorial challenge involved, not only through the shear sampling of each system, but the extreme compositional diversity at hand.

Oligopeptide self assembly
The use of oligopeptides, consisting of a small number of residues, to self assemble nanostructures offers the promise of tunable supramolecular functionalities, yet with ease of preparation, biocompatibility, and degradability [275,276]. They are proving viable contenders for applications in biomedicine and nanotechnology [277,278]. Various types of nanostructures can be achieved, including fibers, tubes, and sheets [279,280]. This diversity stems from the vast combination of 20 natural amino acids into sequences.
In a series of studies, Frederix et al. have set up a systematic MD-based virtual screening protocol to establish clear structure-property relationships between the aminoacid sequence of short peptides and self assembly under aqueous conditions. Using the CG Martini force field, they first probed the ability to reproduce structural features of the well-characterized diphenylalanine (FF) peptide [17]. The aggregation of 1,600 dipeptides for 1.5 µs of simulation time (approximately accounting for the acceleration due to coarse-graining) generated a tubular nanostructure whose dimensions are in agreement with X-ray diffraction analysis of crystallized FF nanotubes [281]. This indicated that despite structural limitations of the Martini force field to model protein secondary structure, it could yield reasonable self-assembling features. Beyond the final structure, the simulations also helped understand the mechanism of formation: from an initial random placement to quick ordering into sheet-like aggregates, to vesicle formation, and finally long hollow tubes.
Scaling up, Frederix et al. screened exhaustively the space of all possible 20 2 = 400 dipeptide combinations [17]. Although coarse-graining significantly speeds up the simulations, the scope of the study led the researchers to rapidly probe early determinants of aggregation. They followed the self assembly of 300 dipeptides for 400 ns. They scored the peptides' aggregation propensity by means of the solventaccessible surface area, relative to the initial well-mixed configuration. The score was in good qualitative agreement with experimentally resolved structures, for the few sequences available. Though in need of atomistic refinement, the study highlights how CG simulations can sketch the mapping between sequence and self-assembled nanostructure.
A follow-up study aimed at the broader exploration of all tripeptides: 20 3 = 8, 000 in total [282]. They sought compounds that simultaneously favored aggregation propensity and hydrophilicity. While a priori contradicting requirements, their results testify to the broad diversity of possible systems, including subtle intermediates capable of displaying surprising properties. Extending the dipeptide study, their aggregationpropensity score was combined with the water-octanol partitioning coefficient to measure hydrophobicity. They identified a significant number of peptides that were not strongly hydrophobic, yet exhibit aggregation. The screening confirmed and extended design rules for the placement of specific amino acids in a particular position [283,284]. This includes steric effects in the placement of aromatic residues close to the N-terminus, but also charged amino acids on positions 1 and 3 as an architecture for intermolecular salt-bridge formation. Critically, their virtual screening procedure led for the first time to the subsequent synthesis and experimental characterization of tripeptides able to form hydrogels at neutral pH.
More complex oligopeptides were considered more recently by Thurston and Ferguson: a synthetic peptide-Π-peptide symmetric triblock architecture of the form NXXX-Π-XXXN, where X are amino acids and Π is a conjugated aromatic core [167]. To limit the space of candidates, they initially restricted their study to one of two aromatic cores, naphthalenediimide and perylenediimide, and the five amino acids A, F, G, I, and V were motivated by prior work. Aiming at optoelectronic functionality, their design objective targeted the stabilization of π-π stacking between neighboring oligopeptides, measuring the distance between aromatic cores as a proxy for electronic delocalization. They relied on an atomistic resolution with an implicit-solvent model to more efficiently sample the conformational space. Both free energies of dimerization and trimerization were calculated using enhanced-sampling MD on 26 peptides. Intermediate values of the dimerization and trimerization free energies led to the most favorable properties, as a tradeoff between sufficient interaction strength to drive assembly, yet little enough to avoid kinetic traps. A quantitative structure-property relationship (QSPR) model was then trained on these select peptides and a large set of 247 molecular descriptors, based on the PaDEL software package [285]. The authors motivated their choice over more sophisticated machine learning approaches both for its interpretability, as well as the dataset's high-dimensional, low-sample size regime. Further MD validation confirmed the predictability of the QSPR for largely apolar sequences-similar to the 26 training peptides-and proposed a new sequence unstudied by experiment. While the QSPR lacked transferability to strongly polar residues, the results indicate that adding a limited set of MD simulations should be straightforward and effective.
A wider study, also aiming at optimizing optoelectronic properties, was recently reported by Shmilovich et al. [145]. The synthetic architecture DXXX-OPV3-XXXD used a three-repeat oligophenylenevinylene π core, for its ability to assemble into optically and electronically active nanoaggregates [286]. Compared to the study of Thurston and Ferguson, the wider space of 20 3 = 8, 000 peptides was tackled by two complementary strategies: (i) CG simulations using the Martini force field, and (ii) a deep representational active learning approach. Following the pioneering work of Gómez-Bombarelli et al. [144], they projected the discrete sequence space into a lowdimensional continuous representation. A variational autoencoder was used to train a latent-space embedding [112], based on basic topological features of the CG beads of the Martini model. They trained a Gaussian process regression (GPR) on the latent-space embedding to predict the propensity of self assembly, and used a Bayesian optimization to select the "next best" candidates to be simulated. Iterating over several generations of this loop, they were able to converge the GPR model by only simulating 2.3 % of the space of sequences. This computational design platform, which aptly combines molecular simulations for compound measurement and data-driven methods to efficiently sample the sequence space, holds many promises for the virtual screening of biomolecular and soft materials.

Drug-membrane permeabilities
One beloved application of biomolecular simulations is the cell membrane. Though composed of a large variety of molecules, many are phospholipids. These amphiphiles can spontaneously self assemble to form large mesoscale structures, such as vesicles. This compartmentalization of the cell can still allow for exchange of (macro)molecules-either via active transport (biology), or passively by simple diffusion (thermodynamics). This latter aspect can be considered by the concentration gradient of a solute molecule, such as a drug, across a soft interface between two aqueous environments. Expressing this as a one-dimensional Smoluchowski equation along the normal to the membrane, z, leads to the inhomogeneous solubility-diffusion model [287,288]. The resulting quantity is the permeability coefficient, P , a flux that accounts for the heterogeneity of the environment by integration over z the energetics of crossing together with the local diffusivity In this equation, β = 1/k B T is the inverse temperature, D(z) is the local diffusivity, and G(z) is the potential of mean force (PMF)-it is the free energy required to cross the interface as a function of the order parameter z. Interestingly this quantity is not readily accessible from current experimental techniques, leaving computer simulations as the gold standard. The use of enhanced-sampling techniques, such as umbrella sampling, offer the means to compute the PMF at an atomistic resolution and gather unprecedented insight [289,165]. Unsurprisingly the calculation of G(z) is tremendously difficult to converge: approximately 10 5 CPU-hours is required for a small rigid molecule crossing a singlecomponent lipid membrane using explicit-solvent atomistic models. This unfortunately limits an atomistic throughput to ∼ 10 molecules per study [290,291,292,293].
Here again, CG models allow for a significant step up in the number of compounds that can be screened. Beyond the reduced representation speeding up convergence of each simulation, the mapping to a Martini representation easily leads to large numbers of compounds (Section 2.3). Menichetti et al. reported the PMFs of 4.6·10 5 small molecules in a one-component 1,2-dioleoyl-sn-glycero-3-phosphocholine (DOPC) membrane [294]. This collection of compounds resulted from the exhaustive screening of all CG small molecules made of one and two neutral Martini beads, 14 and 105, respectively. The resulting set of PMFs showed a strict variety, which could be accurately correlated to the water/octanol partitioning of the solute-a bulk quantity to relate to structural features at the membrane interface. The mapping between chemistry and CG representations was established by coarse-graining subsets of the GDB [18], keeping compounds that mapped to one-and two-bead representations.
A follow-up study extended the screening from PMFs to the permeability coefficient (2) [64]. The CG simulations did not inform the diffusivity term (problematic due to inconsistent accelerations of the CG dynamics [295]), but were instead taken from atomistic simulations, indicating weak dependence on the solute's chemistry [290]. The results showed excellent agreement with atomistic simulations and correlation with experiments, despite the minimalistic modeling approach. Permeability coefficients were predicted for 5.1 · 10 5 small organic molecules. Projecting the permeability surface onto two physically motivated descriptors (hydrophobicity and acidity, i.e., pK a ) highlighted the localization of key chemical groups, and their influence on the target property. It also challenged earlier phenomenological models of solute permeation [296].
A further scale up in the number of compounds "simply" comes down to a broader screening toward larger CG representations: from one-and two-bead constructs to more. The combinatorics of the Martini bead types, while more favorable than atomicallydetailed chemistry, still grow exponentially: 14, 105, 1470, and 19 306 for one-to four-bead constructs-only considering linear chains. Instead of an exhaustive account, Hoffmann et al. presented an importance-sampling scheme to navigate the space of compounds [135]. A Metropolis-chain Monte Carlo scheme was devised by daisychaining compounds via alchemical transformations, and using the relative free energy in the Metropolis criterion. This led to a large network of compounds sampled, and the use of closed thermodynamic cycles allowed for small corrections to the free energies. The space of compounds that was not sampled was subsequently predicted using a simple kernel-based ML model. Some of the predictions were explicitly validated, but all followed simple linear relationships between transfer free energies that had been identified for the smaller compounds [294]-the thermodynamics of the system acted as an ML physical constraint global to the compound dataset. Overall it boosted the prediction of transfer free energies to 1.3 · 10 6 small organic molecules.
Extending the high-throughput CG framework, compound screening can be used to better understand differential stabilization between lipid domains, as a proxy for small molecules modulating complex multi-component lipid membranes [297]. The difference in PMF minima between the relevant environments stands as a computationally appealing proxy for large-scale simulations of membrane reorganization. The results could identify families of compounds that could induce membrane mixing or demixing. Compound screening and their effect on membrane thermodynamics may help us better understand the mechanism of action of certain anesthetics [298].

Outlook
The path toward in silico compound screening of biomaterials and soft materials seems clear, but still contains a number of important hurdles before reaching large datascale regimes. Automating the preparation, parametrization, and analysis of molecular dynamics (MD) simulations is necessary to reach a high throughput, and has largely embodied the scope of this review. The other critical aspect is our capacity to run enough MD simulations, clearly the main bottleneck. In this sense, coarse-grained (CG) modeling has an important role to play: its ability to emulate a complex systems with fewer degrees of freedom offers a significant scale-up in the context of screening. The added capability to reduce the size of chemical space seems to be a promising way to ease the analysis and extraction of structure-property relationships.
Beyond statics, in silico compound screening will likely hold essential to target dynamical properties, such as mean-first passage times, folding and nucleation rates, or even aging dynamics. To achieve this, force-field methods need to improve the modeling of dynamics-a statement that holds at all scales, though in particular at the CG level. The perspective to move toward non-equilibrium systems will require the means to incorporate processing effects in materials, leading to structure-processproperty relationships. Getting there will be challenging: non-equilibrium systems have no well-defined free-energy surface, and they critically depend on how the system is prepared [23].
Last, compound screening needs tighter integration with experiments. This is not only in light of verifying the in silico predictions, but a collaborative procedure between simulations and experiment that is poised to further accelerate soft-materials discovery.
Acknowledgments I thank Andrew L. Ferguson and Joseph F. Rudzinski