In addition to dramatically extending timescales and parallelizing dynamics simulations to very high levels of scalability (as outlined in Section 1.2.3), MSMs are a powerful means of analyzing trajectories. However, an MSM with hundreds of thousands of states offers little human insight, even when such a model is quantitatively useful for computing specific properties of interest. To address this issue, we will coarse-grain an MSM down to a cognitively manageable number of states (e.g., 10). Unfortunately, these simplified models usually violate the Markov property, since the inter-state jumps are too large. We can, however, use recent advances in statistical methods to go beyond first-order Markov models and generate valid coarse-grained MSMs [19, 28]. Specifically, we will use recent advances in higher-order Markov chain modeling to build hierarchical, multi-resolution models that maintain the Markov property [26]. The parameter space of a higher-order Markov chain grows very quickly: an N-state chain of order r has on the order of N^r parameters. We have therefore defined mixed-order Markov models as a parsimonious alternative to a full higher-order specification; in these models, different states exhibit different degrees of memory [3].
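To make the parameter growth concrete, the sketch below counts the free parameters of an order-r chain and fits one by simple transition counting. This is an illustrative toy under stated assumptions (a single discrete trajectory, maximum-likelihood counts), not the mixed-order method of [3] or the hierarchical models of [26]; the function names are our own.

```python
import numpy as np

def n_parameters(n_states, order):
    # An order-r chain conditions on the last r states: N**r histories,
    # each with a row of N probabilities (N - 1 free parameters per row).
    return n_states**order * (n_states - 1)

def fit_higher_order(traj, n_states, order):
    """Maximum-likelihood estimate of an order-r transition table from
    one discrete trajectory (a toy sketch, not the cited estimators)."""
    counts = np.zeros((n_states,) * order + (n_states,))
    for t in range(order, len(traj)):
        history = tuple(traj[t - order:t])
        counts[history + (traj[t],)] += 1
    rows = counts.sum(axis=-1, keepdims=True)
    return np.divide(counts, rows, out=np.zeros_like(counts), where=rows > 0)
```

For a 10-state coarse model, `n_parameters(10, r)` gives 90, 900, and 9000 free parameters at orders 1, 2, and 3, which is exactly why letting only some states carry memory (the mixed-order idea) is attractive.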
1.5.3 Topological Mode Analysis
Unsupervised learning, or clustering, is an important tool for understanding and interpreting trajectory data. In general, obtaining the most natural clustering is an ill-posed problem, and it is particularly difficult with the massive, high-dimensional data sets arising in simulations, where visualization techniques often fail. Clustering is often posed as estimating a density function f from the samples. A vast literature exists on clustering algorithms, including single-linkage, k-means, spectral clustering, and others [29]. The mode-seeking approach consists of detecting the local peaks of f, using them as cluster centers, and partitioning the data set according to their basins of attraction. The precise notion of the basin of attraction Bp of a peak p may vary, but generally Bp corresponds to the subset of the data points that eventually reach p by some criterion [30].
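A minimal sketch of graph-based mode seeking under simple assumptions: each sample carries a precomputed density estimate, `neighbors` is an adjacency list (e.g., a k-nearest-neighbor graph), and each point climbs to its highest-density neighbor until it reaches a local peak. The names and the toy data are illustrative, not from the cited work.

```python
def hill_climb(density, neighbors):
    """Assign every sample to the peak reached by repeatedly stepping
    to its highest-density neighbor (a discrete gradient ascent)."""
    n = len(density)
    parent = list(range(n))
    for i in range(n):
        best = max(neighbors[i], key=lambda j: density[j], default=i)
        if density[best] > density[i]:
            parent[i] = best

    def root(i):                     # follow pointers up to a local peak
        while parent[i] != i:
            i = parent[i]
        return i

    return [root(i) for i in range(n)]

# Toy two-bump density sampled on a path graph over samples 0..7.
density = [1, 2, 3, 2, 1, 2, 3, 2]
neighbors = [[1]] + [[i - 1, i + 1] for i in range(1, 7)] + [[6]]
```

On this toy input the two peaks sit at samples 2 and 6, and every point flows into one of those two basins of attraction.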
A common issue faced by these methods is that gradients of a high-dimensional density function f are notoriously unstable, and approximating them from a density estimator f' can lead to unpredictable results. We propose an approach that detects and merges unstable clusters, guaranteeing stability and providing a principled way to determine the number of clusters. Our method depends on estimates of the prominence of the density peaks and builds a hierarchy on these peaks. The prominence of a peak is defined as the difference between its height and the level at which its basin of attraction, Bp, meets that of a higher peak (its parent in the hierarchy). With this definition, a small bump occurring at high density has a large absolute height but small prominence [31]. The prominences of the density peaks can be used to create a persistence diagram (PD) of f, illustrated in Figure 1.7(c). The key insight is that the PD reveals the key features of the unknown density function and highlights the most prominent peaks. Thus, with only limited knowledge of the underlying space and a finite sampling of the function, we can provably approximate the PD of the underlying function.
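To make the definition of prominence concrete, the sketch below computes it for a sampled 1-D function by sweeping values downward with a union-find structure: a peak is "born" at its height, and it "dies" when its basin meets that of a higher peak at level f[i], giving prominence = birth - death. This is a toy illustration of the idea behind the persistence diagram, not the construction of [31].

```python
def peak_prominences(f):
    """Prominence of each local peak of a sampled 1-D function f,
    computed persistence-style by processing samples in decreasing
    order of value and merging adjacent basins."""
    f = list(f)
    order = sorted(range(len(f)), key=lambda i: -f[i])
    parent, height, prom = {}, {}, {}

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in order:
        roots = {find(j) for j in (i - 1, i + 1) if j in parent}
        if not roots:                     # a new local peak is born
            parent[i], height[i] = i, f[i]
            continue
        keep = max(roots, key=lambda r: height[r])
        parent[i] = keep
        for r in roots - {keep}:          # lower basins die at level f[i]
            prom[r] = height[r] - f[i]
            parent[r] = keep
    top = find(order[0])
    prom[top] = height[top]               # the global peak never dies
    return prom
```

Plotting the (birth, death) pairs behind these prominences is exactly what a persistence diagram like Figure 1.7(c) does: prominent peaks sit far from the diagonal, small bumps sit near it.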
We will implement the proposed method, ToMATo (Topological Mode Analysis Tool), combining graph-based hill-climbing with a cluster-merging step guided by persistence. The method uses only pairwise distances between samples and requires no other coordinates. As illustrated in Figure 1.7(b), hill-climbing alone is very sensitive to perturbations of the density estimate, which produce spurious modes. Computing the PD of f' enables us to separate the prominent peaks of f' from the inconsequential ones. In Figure 1.7(c), for instance, we can see two points that are clearly farther from the diagonal than the others: these correspond to the two prominent peaks of f'. To obtain the final clusters, we merge every cluster of prominence less than a threshold parameter τ into its parent cluster in the persistence hierarchy. The PD gives us a precise understanding of the relationship between the prominence cutoff and the number of clusters, as illustrated in Figures 1.7(c) and 1.7(d). ToMATo is highly generic, agnostic to the choice of distance, underlying graph, and density estimator, and highly efficient (near-linear time) [18].
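The sketch below shows the structure of a ToMATo-style merge under stated assumptions: density values and a neighborhood graph are precomputed, and τ is the prominence threshold. It follows the shape of the published algorithm [18] (process points by decreasing density; attach each point to the adjacent cluster with the highest peak; merge any other adjacent cluster whose prominence falls below τ), but it is a simplified illustration, not the reference implementation.

```python
def tomato(density, neighbors, tau):
    """Graph-based hill-climbing plus persistence-guided merging.
    A cluster rooted at a lower peak is merged into a higher adjacent
    cluster when (peak height - current level) < tau."""
    order = sorted(range(len(density)), key=lambda i: -density[i])
    parent, peak = {}, {}

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in order:
        roots = {find(j) for j in neighbors[i] if j in parent}
        if not roots:                  # i is a local peak: new cluster
            parent[i], peak[i] = i, i
            continue
        best = max(roots, key=lambda r: density[peak[r]])
        parent[i] = best
        for r in roots - {best}:       # merge low-prominence clusters
            if density[peak[r]] - density[i] < tau:
                parent[r] = best
    return [peak[find(i)] for i in range(len(density))]

# Toy two-bump density on a path graph over samples 0..7.
density = [1, 2, 3, 2, 1, 2, 3, 2]
neighbors = [[1]] + [[i - 1, i + 1] for i in range(1, 7)] + [[6]]
```

On this toy input, a small τ preserves both peaks (two clusters), while a τ larger than either prominence merges everything into one cluster, which is exactly the τ-versus-cluster-count trade-off the PD makes visible.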
As a test of ToMATo on molecular simulation data, we used it to cluster conformations of an alanine-dipeptide molecule. The data consist of short trajectories of conformations from atomistic simulations of this small peptide; the input was 1000 configurations with 0.1 ps spacing between samples [32]. For our experiments, we pooled the trajectories and treated them as 192,000 samples. The distance metric was root-mean-squared deviation (RMSD); the point samples and the distance function were the only two inputs to the clustering algorithm.
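For illustration only (not the production pipeline), the sketch below shows RMSD as the pairwise distance. Real trajectory RMSD first removes rigid-body motion via an optimal superposition (e.g., the Kabsch algorithm); here we assume the conformations are already superposed.

```python
import numpy as np

def rmsd(a, b):
    """Root-mean-squared deviation between two conformations given as
    (n_atoms, 3) coordinate arrays, assumed already superposed."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(np.sqrt(np.mean(np.sum((a - b) ** 2, axis=1))))

def pairwise_rmsd(conformations):
    """Dense distance matrix over a list of conformations; with RMSD
    as the metric, this matrix plus the samples themselves is all the
    clustering step needs."""
    n = len(conformations)
    d = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d[i, j] = d[j, i] = rmsd(conformations[i], conformations[j])
    return d
```

At 192,000 samples a dense matrix is impractical, so in practice one would restrict distances to a k-nearest-neighbor graph, which is all the graph-based algorithm requires.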