Dynamical Fragments - Methods
We generated fragment libraries from 2 ns snapshots of 298 K simulations from the Dynameomics database. Each snapshot was fragmented using a sliding window along the entire length of the protein. Seven libraries were generated, one for each fragment length from 3 to 9 residues, designated F3-F9; each library consists of fragments of a single length. The total number of fragments and other summary statistics for each library are listed in Table 1.
Table 1. Summary Statistics for Individual Fragment Libraries

| Length | Fragments | Bin 1 | Bin 2 | Bins | Avg Bin Size | Max Bin Size | Max Bin |
|---|---|---|---|---|---|---|---|
| 3 | 1,349,622 | Cβ 1 - C 5 | O 1 - N 5 | 1,659 | 814 | 19,515 | 6.9, 3.3 |
| 4 | 1,358,141 | Cβ 1 - Cβ 5 | O 1 - N 5 | 4,651 | 292 | 6,481 | 5.6, 3.2 |
| 5 | 1,347,280 | Cβ 1 - Cβ 5 | O 1 - N 5 | 7,856 | 171 | 11,333 | 6.1, 2.8 |
| 6 | 1,317,474 | Cβ 1 - Cβ 5 | O 1 - N 5 | 11,103 | 119 | 9,675 | 9.6, 4.9 |
| 7 | 1,306,883 | Cβ 1 - Cβ 5 | O 1 - N 5 | 14,236 | 92 | 5,882 | 10.8, 6.8 |
| 8 | 1,296,562 | Cβ 1 - Cβ 5 | O 1 - N 5 | 17,429 | 74 | 5,021 | 10.7, 7.7 |
| 9 | 1,286,106 | Cβ 1 - Cβ 5 | O 1 - N 5 | 20,658 | 62 | 4,079 | 12.4, 8.8 |
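The sliding-window fragmentation described above can be sketched as follows. This is a minimal illustration, not the pipeline actually used: the function name `sliding_fragments` and the array layout (residues × atoms × xyz) are assumptions made for the example.

```python
import numpy as np

def sliding_fragments(coords, length):
    """Return every contiguous window of `length` residues.

    coords: array of shape (n_residues, n_atoms, 3); this layout is an
    assumption for illustration, not the Dynameomics storage format.
    """
    n_residues = coords.shape[0]
    # One window per starting residue; empty list if the chain is too short.
    return [coords[i:i + length] for i in range(n_residues - length + 1)]

# Toy 10-residue chain with 5 backbone/Cβ atoms per residue.
protein = np.zeros((10, 5, 3))
frags = sliding_fragments(protein, 5)
print(len(frags))  # 6 windows: 10 - 5 + 1
```

A chain of n residues yields n - k + 1 fragments of length k per snapshot, which is why the fragment counts in Table 1 change only slowly with fragment length.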
For our libraries, the fragments contain coordinates for the N, Cα, Cβ, C, and O atoms of each residue. Fragments clustered into certain structural motifs (particularly α-helix and β-sheet), so we sought to reduce the number of fragments without losing diversity. For this, we used a two-tiered clustering technique.
Internal Distance Binning
First, we calculated inter-atomic distances for each atom pair within a fragment. We then ran principal component analysis on all distances to find those that were most descriptive of the fragments. For all fragment lengths above 3, the top two distances were always N- to C-terminal atom pairs: Cβ-Cβ and O-N (Cβ-C and O-N for length 3). We used these two distances to define a two-dimensional grid of bins (0.1 Å on each axis) that gives an initial separation of the fragments. Statistics about the bin populations are included in Table 1. The populations for length 5 fragments are shown in Figure 1, and those for all fragment lengths are described in Methods: Binning.
Figure 1a. Bin Populations for Library F5 (5-Residue Fragments).
Figure 1b. Average In-Bin RMSD for Library F5 (5-Residue Fragments).
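As a rough sketch of the binning step, the two terminal distances can be floored to a 0.1 Å grid and used as a dictionary key. The fragment representation (a dict from atom name to per-residue coordinates) and the names `bin_index` and `bin_fragments` are hypothetical, chosen only for this example.

```python
import numpy as np
from collections import defaultdict

BIN_WIDTH = 0.1  # bin edge length in Å, taken from the text

def bin_index(frag, width=BIN_WIDTH):
    """Map a fragment to its 2D bin.

    frag is a hypothetical dict: atom name -> (n_residues, 3) coordinates.
    """
    d_cb = np.linalg.norm(frag["CB"][-1] - frag["CB"][0])  # first Cβ to last Cβ
    d_on = np.linalg.norm(frag["N"][-1] - frag["O"][0])    # first O to last N
    return int(np.floor(d_cb / width)), int(np.floor(d_on / width))

def bin_fragments(frags, width=BIN_WIDTH):
    """Group fragment indices by bin."""
    bins = defaultdict(list)
    for i, frag in enumerate(frags):
        bins[bin_index(frag, width)].append(i)
    return dict(bins)

# Toy 5-residue fragment: only the terminal atoms matter for binning.
frag = {
    "CB": np.array([[0.0, 0.0, 0.0]] * 4 + [[6.15, 0.0, 0.0]]),
    "O":  np.array([[0.0, 0.0, 0.0]] * 5),
    "N":  np.array([[0.0, 0.0, 0.0]] * 4 + [[0.0, 3.25, 0.0]]),
}
bins = bin_fragments([frag, frag])
```

Here both copies of the toy fragment share one bin, since their Cβ-Cβ distance (6.15 Å) and O-N distance (3.25 Å) are identical.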
Clustering Algorithm
The helical and extended regions contained redundant fragments (up to 100 times more fragments than the average bin), which increased the search time through those bins without improving quality. We reduced these sets using a modified K-Means clustering algorithm. First, pairwise RMSDs over the N, Cα, Cβ, C, and O atoms were measured for all fragments within a given bin. The algorithm was then seeded with a predetermined number of fragments. The first seed was the fragment with the lowest average RMSD to all other fragments; each subsequent seed was the fragment with the highest minimum RMSD to the seeds already chosen. That is, seeds are chosen to be far from every other seed, which maximizes diversity and captures rare fragments. The following two steps are then repeated until convergence: every remaining fragment is assigned to the cluster of its nearest center, and within each cluster the fragment with the lowest average RMSD to the rest of the cluster is chosen as the new center. The final cluster centers are inserted into the condensed library.
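The seeding and iteration described above resemble a k-medoids procedure on a precomputed distance matrix. The sketch below is one possible reading of the text, not the authors' implementation; the function names, the `max_iter` cap, and the handling of empty clusters are assumptions.

```python
import numpy as np

def seed_centers(dist, k):
    """Pick k seeds from a symmetric (n, n) RMSD matrix."""
    seeds = [int(np.argmin(dist.mean(axis=1)))]   # most central fragment
    min_to_seeds = dist[seeds[0]].copy()          # distance to nearest seed
    while len(seeds) < min(k, len(dist)):
        nxt = int(np.argmax(min_to_seeds))        # farthest from all seeds
        seeds.append(nxt)
        min_to_seeds = np.minimum(min_to_seeds, dist[nxt])
    return seeds

def cluster(dist, k, max_iter=100):
    """Alternate assignment and center update until the centers stop moving."""
    centers = seed_centers(dist, k)
    for _ in range(max_iter):                     # max_iter is an assumed safety cap
        labels = np.argmin(dist[:, centers], axis=1)
        new_centers = []
        for c, old in enumerate(centers):
            members = np.flatnonzero(labels == c)
            if members.size == 0:                 # assumed: keep the old center
                new_centers.append(old)
                continue
            sub = dist[np.ix_(members, members)]
            new_centers.append(int(members[np.argmin(sub.mean(axis=1))]))
        if new_centers == centers:
            break
        centers = new_centers
    labels = np.argmin(dist[:, centers], axis=1)  # labels match final centers
    return centers, labels

# Toy example: 1D "fragments" with absolute difference standing in for RMSD.
pts = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 20.0])
dist = np.abs(pts[:, None] - pts[None, :])
centers, labels = cluster(dist, 3)
```

In this toy case the isolated point at 20.0 is picked as the second seed and keeps its own cluster, which illustrates how farthest-point seeding retains rare fragments that a random initialization could miss.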
Clustering of Sample Bins
To select the target bin size for each library, we selected points in high-population regions of each library and ran a stress test. For several target bin sizes (between 2 and 5,000 fragments), we clustered the fragments at each point to that target and measured the average RMSD within the clusters.
For F6-F9, most bins contained fewer than 200 fragments. The exceptions were the α-helical populations, which fell below our stress threshold of 0.25 Å at 200 fragments. For F3-F5, we created libraries at both 200 and 500 fragments per bin. In benchmarks (described below), we found little performance advantage at 500 fragments, so we selected 200 fragments per bin for our condensed libraries. Results are shown in Figure 2 and in Methods: Stress.
Figure 2a. Intra-Cluster Stress for Length 5 Fragments.
Figure 2b. Bin Populations for Condensed Library F5 (5-Residue Fragments).
Figure 2c. Average In-Bin RMSD for Condensed Library F5 (5-Residue Fragments).
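The text does not define the stress measure precisely. One plausible reading, sketched here, is the mean RMSD between each non-center fragment and its cluster center; both this definition and the function name `intra_cluster_stress` are assumptions.

```python
import numpy as np

def intra_cluster_stress(dist, centers, labels):
    """Mean distance from each non-center member to its cluster center.

    dist:    symmetric (n, n) RMSD matrix
    centers: fragment index of each cluster center
    labels:  cluster index assigned to each fragment
    This definition of "stress" is an assumption for illustration.
    """
    centers = np.asarray(centers)
    labels = np.asarray(labels)
    idx = np.arange(len(labels))
    own_center = centers[labels]                  # center index per fragment
    d = dist[idx, own_center]
    d = d[idx != own_center]                      # drop the centers themselves
    return float(d.mean()) if d.size else 0.0

# Toy data: 1D points, three clusters centered on indices 4, 6, and 1.
pts = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 20.0])
dist = np.abs(pts[:, None] - pts[None, :])
stress = intra_cluster_stress(dist, [4, 6, 1], [2, 2, 2, 0, 0, 0, 1])
```

Under this reading, a target bin size would be accepted once the stress falls below the 0.25 Å threshold quoted above.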
Benchmarking Cluster Performance
Finally, to ensure that our reduced library maintained diversity, we benchmarked the clusters against an external data set. We took 1,000 fragments from proteins in Astral 40 and matched them against our original and condensed libraries. Performance was measured by the mean RMSD of the best match. Benchmarks for all fragments are available in Methods: Benchmarks.
Figure 3a. Bin Populations for Library F5 (5-Residue Fragments).
Figure 3b. Average In-Bin RMSD for Library F5 (5-Residue Fragments).
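The benchmark metric, mean RMSD of the best match, can be sketched as below. The text does not say how fragments were superposed before measuring RMSD; this example assumes optimal rigid-body superposition via the Kabsch algorithm, and the function names are hypothetical. The brute-force search over the whole library is for clarity only; in practice the distance bins would restrict the candidates.

```python
import numpy as np

def superposed_rmsd(a, b):
    """Minimum RMSD between two (n_atoms, 3) arrays after optimal
    rotation and translation (Kabsch algorithm, assumed here)."""
    a = a - a.mean(axis=0)                        # remove translation
    b = b - b.mean(axis=0)
    u, s, vt = np.linalg.svd(a.T @ b)             # SVD of the covariance matrix
    if np.linalg.det(u @ vt) < 0:                 # forbid mirror-image solutions
        s[-1] = -s[-1]
    msd = (np.sum(a ** 2) + np.sum(b ** 2) - 2.0 * s.sum()) / len(a)
    return float(np.sqrt(max(msd, 0.0)))          # clamp tiny negative round-off

def mean_best_match(queries, library):
    """Average, over query fragments, of the lowest RMSD to any library fragment."""
    best = [min(superposed_rmsd(q, f) for f in library) for q in queries]
    return float(np.mean(best))

# Toy check: a rotated and translated copy should match its original exactly.
rng = np.random.default_rng(0)
frag = rng.normal(size=(25, 3))                   # 5 residues x 5 atoms
theta = 0.7
rot = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                [np.sin(theta),  np.cos(theta), 0.0],
                [0.0,            0.0,           1.0]])
moved = frag @ rot.T + np.array([1.0, -2.0, 3.0])
score = mean_best_match([moved], [rng.normal(size=(25, 3)), frag])
```

Comparing this score between the original and condensed libraries gives the kind of diversity check described above: if condensing removed needed conformations, the mean best-match RMSD would rise.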