According to Table 15, the total computation time can be estimated with Equation 23. The computation times tMC and tFC scale with N and P, while the computation times t3D-IFFT and t3D-FFT scale with K.
Equation 23 - Total Computation Time (Single-QMM RSCE)
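As a rough illustration of this scaling, the single-QMM timing model can be sketched in Python. The cycle-count coefficients below (the factor of 2 for the MC read-modify-write, the per-row transform latency T_ROW_TRANS, and a single read per grid point in the FC step) are assumptions consistent with the formulas given later in this section, not the exact terms of Table 15.

```python
# Hypothetical cycle-count model for the single-QMM RSCE.
# The coefficients are assumptions based on the scaling described in the
# text (tMC and tFC grow with N*P^3; t3D-FFT grows with K^3); the exact
# terms come from Table 15 and may differ.

def t_mc(n, p):
    # Mesh composition: read-modify-write of P^3 grid points per particle.
    return n * p**3 * 2

def t_3dfft(k, t_row_trans=16):
    # Three passes over K*K rows, each row taking (t_row_trans + K) cycles.
    return 3 * k * k * (t_row_trans + k)

def t_fc(n, p):
    # Force computation: one read of P^3 grid points per particle.
    return n * p**3

def t_total(n, p, k):
    # EC is overlapped with the 3D-FFT, so it adds no cycles here.
    return t_mc(n, p) + t_3dfft(k) + t_fc(n, p)
```

At a 100 MHz clock, t_total(n, p, k) cycles would correspond to roughly t_total/1e8 seconds per timestep under this model.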
6.3.1. Speedup with respect to a 2.4 GHz Intel P4 Computer
This section provides an estimate of the degree of speedup the RSCE can provide in the SPME reciprocal sum computations. Since the RSCE is expected to be used to accelerate MD simulations of molecular systems containing at most tens of thousands of particles, the estimate provided in this section is limited to a maximum of N=20000. To provide the estimate, the single-timestep computation time of the implemented RSCE (assuming it is running at 100 MHz*) is compared with that of the SPME software implementation running on a 2.4 GHz P4 computer with 512 MB of DDR RAM. Table 16 shows the RSCE speedup for MD simulations with N=2000 and N=20000 particles.
*Note: The current RSCE implementation (without the MicroBlaze softcore) in the Xilinx XCV2000-4 has a maximum clock frequency of ~75 MHz.
Table 16 - Speedup Estimation (RSCE vs. P4 SPME)
As shown in Table 16, the RSCE speedup ranges from a minimum of 3x to a maximum of 14x. Furthermore, the RSCE provides a different degree of speedup depending on the simulation settings (K, P, and N). One point worth noting is that when N=2000, P=4, and K=128, the speedup for the MC step reaches 42x. One reason for this high degree of speedup is that when the grid size K=128, the number of elements in the Q charge array is 128×128×128=2097152; such a large array limits the performance advantage of the processor cache.
The speedup increases with increasing grid size K. The reason is that the RSCE fixed-point 3D-FFT block provides a substantial speedup (>8x) over the double-precision floating-point 3D-FFT subroutine in the software. Furthermore, since the EC step in the RSCE is overlapped with the 3D-FFT step, it theoretically costs zero computation time. Therefore, the EC step, which scales with K, also contributes significantly to the speedup when the grid size K is large. On the other hand, the speedup decreases with an increasing number of particles N and interpolation order P. The reason is that when N and P are large, the workload of the MC step starts to dominate, and the speedup obtained in the 3D-FFT step and the EC step is averaged downward by the speedup of the MC step (~2x when N is large). That is, the MC workload is dominant when the imbalance described in Equation 24 occurs.
Equation 24 - Imbalance of Workload
Hence, it can be concluded that when N×P×P×P is large relative to K×K×K, the speedup bottleneck is in the MC step. More calculation pipelines in the MC step could mitigate this bottleneck; unfortunately, the QMM bandwidth limitation defeats the purpose of using multiple MC calculation pipelines. More discussion on the characteristics of the RSCE speedup and the relationship among K, P, and N is provided in Section 4.4.
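The imbalance condition of Equation 24 can be expressed as a simple check: the MC workload grows as N×P×P×P while the 3D-FFT and EC workloads grow as K×K×K. The threshold of 1.0 below is an assumed crossover point for illustration, not a value taken from the text.

```python
def mc_dominates(n, p, k, threshold=1.0):
    # Equation 24 (imbalance of workload): the MC step becomes the
    # bottleneck when N*P^3 is large relative to K^3.  The threshold
    # is an assumed crossover, chosen here only for illustration.
    return (n * p**3) / k**3 > threshold
```

For instance, under this model N=20000 with P=8 on a K=64 mesh is MC-bound, while N=2000 with P=4 on a K=128 mesh is not.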
Compared with other hardware accelerators [23, 26], the speedup of the RSCE is not very significant. Two main factors limit the RSCE from achieving a higher degree of speedup: the limited access bandwidth of the QMM memories and the sequential nature of the SPME algorithm. In all the time-consuming steps (e.g., MC, 3D-FFT, and FC) of the reciprocal sum calculations, the QMM memories are accessed frequently. Therefore, to mitigate the QMM bandwidth limitation and improve the degree of speedup, higher-data-rate memory (e.g., Double Data Rate RAM) and more sets of QMM memories can be used.
6.3.2. Speedup Enhancement with Multiple Sets of QMM Memories
Using multiple sets of QMM memories to increase the RSCE-to-QMM memory bandwidth requires design modifications in the MC, 3D-FFT, and FC design blocks. To help explain the necessary modifications, assume that the interpolation order P is 4, the grid size K is 16, and that four QMMR and four QMMI memories are used (i.e., NQ = 4).
First, during the MC step, for each particle, the RSCE performs Read-Modify-Write (RMW) operations on all four QMMR memories simultaneously. Therefore, four MC calculation pipelines are necessary to take advantage of the increased QMM bandwidth. For example, for P=4, there are 4×4×4 = 64 grid points to be updated, so each QMMR memory is used to update 16 grid points. After all particles have been processed, each QMMR memory holds a partial sum of the mesh. Hence, a global summation operation is performed so that all four QMMR memories hold the final updated mesh. With four QMMR memories, the mesh composition takes approximately (N×P×P×P×2)/4 + 2×K×K×K clock cycles to finish; the second term is for the global summation. That is, it takes K×K×K clock cycles to read all grid point values from all the QMMs and another K×K×K clock cycles to write the sum back to all QMM memories. For a large number of particles N, an additional speedup close to 4x can be achieved. The maximum number of QMM memories is limited by the number of FPGA I/O pins and the system cost.
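The mesh-composition estimate above can be sketched as a small helper; the integer division by the number of QMM banks is an assumption about how evenly the per-particle RMW work divides across the banks.

```python
def t_mc_multi(n, p, k, nq=4):
    # Multi-QMM mesh composition: the per-particle read-modify-write work
    # (N*P^3*2 cycles) is split across nq QMMR banks, followed by a global
    # summation that reads and writes the full K^3 mesh once each.
    return (n * p**3 * 2) // nq + 2 * k**3

def t_mc_single(n, p):
    # Single-QMM baseline for comparison (same RMW cost, no summation).
    return n * p**3 * 2
```

For the example setting (N=2000, P=4, K=16, NQ=4), the summation overhead of 2×K×K×K cycles is small next to the divided RMW work, so the step runs close to four times faster than the single-QMM baseline.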
In addition to speeding up the MC step, using multiple QMM memories can speed up the 3D-FFT step as well. In the multiple-QMM case, each set of QMMR and QMMI is responsible for (K×K)/4 rows of the 1D-FFT. Hence, to match the increased QMM memory access bandwidth, four K-point FFT LogiCores should be instantiated in the 3D-FFT design block. After each pass of the three 2D-FFT transformations is finished, a global data communication takes place so that all QMM memories receive the complete 2D-FFT transformed mesh. This process repeats until the whole mesh is 3D-FFT transformed. By using four sets of QMMR and QMMI, the 3D-FFT step takes 3×(K×K×(TROW_TRANS + K)/4 + 2×K×K×K) clock cycles to complete. The second term represents the clock cycles the RSCE takes to perform the global communication.
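Likewise, the multi-QMM 3D-FFT cycle count can be sketched directly from the formula above. The per-row transform latency TROW_TRANS is left as a parameter because its value depends on the FFT LogiCore configuration; the default of 16 below is an assumption for illustration only.

```python
def t_3dfft_multi(k, nq=4, t_row_trans=16):
    # Each of the three passes transforms K*K rows, split across nq FFT
    # cores; the 2*K^3 term is the global communication (read + write of
    # the full mesh) performed after each pass.
    return 3 * (k * k * (t_row_trans + k) // nq + 2 * k**3)
```

Note that the global-communication term grows as K^3 while the divided transform work grows as roughly K^3/nq, so for large K the communication overhead erodes much of the gain from the extra FFT cores.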
Lastly, the force computation step can also be made faster using multiple sets of QMM memories. After the 3D-FFT step, each of the four QMMR memories holds the updated grid point values for calculating the directional forces. For each particle, the RSCE distributes the read accesses for the grid point values among all four QMMR memories; this effectively speeds up the force computations by a factor of four. Table 17 and Equation 25 show the estimated computation time for the RSCE calculation when NQ sets of QMM memories are used.
Table 17 - Estimated Computation Time (with NQ-QMM)
Equation 25 - Total Computation Time (Multi-QMM RSCE)
Based on Table 17 and Equation 25, the usage of multiple QMMs should shorten the RSCE computation time the most when the grid size K is small, the number of particles N is large, and the interpolation order P is high. The speedup plots of using four QMMs versus one QMM are shown in Figures 49, 50, and 51. Figures 49 and 50 show that a smaller grid size K favors the multi-QMM speedup. As an example, when K=32, a speedup of 2x is realized at N=2.0×10^3, while for K=128, the same speedup is realized at a much larger N=1.3×10^5. The reason for this behavior is that the required global summation starts to take its toll when the grid size K is large.
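The qualitative shape of these plots can be reproduced by combining the per-step models into a single speedup ratio. The coefficients below are assumptions based on the scaling described earlier in this section, so the crossover points will not match Table 17 or Figures 49-51 exactly; the model does, however, show the speedup rising with N toward its asymptote of 4x.

```python
def speedup_4qmm(n, p, k, t_row_trans=16):
    # Hypothetical 1-QMM vs. 4-QMM cycle-count ratio.  MC and FC split
    # their per-particle work across 4 banks; the 3D-FFT adds a 2*K^3
    # global-communication term per pass, and MC adds one 2*K^3 summation.
    # All coefficients are assumptions, not the exact Table 17 terms.
    single = n * p**3 * 2 + 3 * k * k * (t_row_trans + k) + n * p**3
    multi = ((n * p**3 * 2) // 4 + 2 * k**3
             + 3 * (k * k * (t_row_trans + k) // 4 + 2 * k**3)
             + (n * p**3) // 4)
    return single / multi
```

Under this model the speedup is modest for small N, where the fixed K^3 communication terms dominate, and approaches 4x as N grows, matching the trend of the figures.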
Figure 49 - Speedup with Four Sets of QMM Memories (P=4).
Figure 50 - Speedup with Four Sets of QMM Memories (P=8).
On the other hand, as shown in Figure 51, a higher interpolation order favors the multi-QMM speedup. For a grid size K=64, when P=4, the usage of multiple QMMs provides a 3x speedup at N=1×10^5, while for P=8, the same speedup happens at N=1.0×10^4.
Figure 51 - Speedup with Four Sets of QMM Memories (P=8, K=32).