The SAMPL6 SAMPLing challenge: Assessing the reliability and efficiency of binding free energy calculations

Abstract

Approaches for computing small molecule binding free energies are now regularly employed by academic and industry practitioners to study receptor-ligand systems and to prioritize the synthesis of small molecules for ligand design. Given the variety of methods and implementations available, it is natural to ask how these methods compare to each other in terms of final predictions and convergence rates. Here, we describe the concept and results of the first SAMPLing challenge from the SAMPL series, which focuses on assessing the convergence properties and reproducibility of binding free energy methodologies. We provided parameter files and multiple initial geometries for two octa-acid (OA) and one cucurbit[8]uril (CB8) host-guest systems, for which it is computationally feasible to obtain converged binding affinity estimates in a matter of hours or a few days. Participants submitted binding free energy predictions as a function of the computational effort for six different alchemical- and physical-pathway (e.g., molecular dynamics and potential of mean force) methodologies based on GROMACS, AMBER, and OpenMM implementations. For the two small OA binders, the free energy estimates computed with alchemical and potential of mean force approaches show relatively similar variance and bias as a function of the number of energy/force evaluations, with the attach-pull-release (APR) and GROMACS expanded ensemble methodologies performing particularly well. The differences between the methods widen when analyzing the CB8-quinine system, where both the guest size and correlation times are greater. For this system, coupled topologies non-equilibrium switching (CP-NS) obtained the overall highest efficiency, followed by Hamiltonian replica exchange (HREX).
Among the conclusions emerging from the data, we found that CP-NS convergence can be enhanced by increasing the length of the non-equilibrium protocol; that HREX, while displaying very small variance, can incur substantial bias that depends on the initial population of the replicas; and that the Berendsen barostat introduces non-negligible artifacts in expanded ensemble simulations. Surprisingly, the results suggest that specifying the force field parameters and charges is insufficient to ensure reproducibility to better than ~0.5 kcal/mol. Further work will be required to identify the exact source of these discrepancies.
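
To make the notions of bias and variance as a function of computational effort concrete, the following is a minimal Python sketch, not the analysis code used in the challenge. It assumes that several replicate calculations report a free energy estimate at the same cost checkpoints and that a reference value (for instance, a long-run estimate) is available. The function name `convergence_stats`, the reference value, and the toy numbers are hypothetical and chosen only for illustration.

```python
from math import sqrt
from statistics import mean, stdev


def convergence_stats(replicates, reference):
    """Per-checkpoint bias, standard deviation, and RMSE across replicates.

    replicates: list of trajectories; each trajectory is a list of free
        energy estimates (kcal/mol) at increasing computational cost.
    reference: assumed reference free energy (kcal/mol).
    """
    stats = []
    # zip(*replicates) groups the estimates of all replicates by checkpoint.
    for estimates in zip(*replicates):
        bias = mean(estimates) - reference
        std = stdev(estimates)  # sample standard deviation across replicates
        rmse = sqrt(mean([(x - reference) ** 2 for x in estimates]))
        stats.append((bias, std, rmse))
    return stats


# Toy, made-up data: three replicates, three cost checkpoints each.
replicates = [
    [-6.0, -6.4, -6.6],
    [-7.2, -6.8, -6.7],
    [-6.6, -6.7, -6.8],
]
stats = convergence_stats(replicates, reference=-6.7)

for checkpoint, (bias, std, rmse) in enumerate(stats):
    print(f"checkpoint {checkpoint}: bias={bias:+.3f} std={std:.3f} rmse={rmse:.3f}")
```

In this toy example the RMSE shrinks as more computational effort is spent. A method with small variance but persistent bias would instead show a standard deviation near zero while the bias stays away from zero.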
