Rather than accelerating the whole SPME algorithm in hardware, an alternative is to speed up only the 3DFFT operation in a dedicated high-performance FFT coprocessor. This alternative architecture is shown in Figure 53.
Figure 53  CPU with FFT Coprocessor
To investigate the speedup potential of the FFT coprocessor architecture, the SPME computational complexity estimate in Table 2 is used; it is summarized in Equation 26 for easy reference.
Equation 26  SPME Computational Complexity (Based on Table 2)
To determine the maximum achievable speedup of the FFT coprocessor architecture, Amdahl's Law [40] is applied by assuming that the FFT coprocessor can compute the 3DFFT and the 3DIFFT in zero time. Based on Equation 26, Table 20 shows the number of computation cycles involved in the SPME computation with and without the 3DFFT operation; it also shows the degree of speedup of the FFT coprocessor architecture for different grid sizes K. In generating the speedup data, it is assumed that the number of particles N is one half of the total number of grid points K×K×K and that the interpolation order P equals 4. As the table shows, even if the FFT operation is accelerated to take zero computation cycles, the speedup of the FFT coprocessor architecture is still insignificant (less than 2×). The main reason is that the B-Spline calculation, the mesh composition, the energy calculation, and the force calculation cannot be overlapped in a CPU, and together they constitute a large portion of the SPME computation time. Therefore, to accelerate the SPME computation, all major computation steps should be accelerated.
Table 20  Speedup Potential of FFT Coprocessor Architecture
K     N         P   Total SPME Cycles   3DFFT Cycles   Total SPME Cycles (Without 3DFFT)   Speedup
8     256       4   84480               9216           75264                               1.122449
16    2048      4   700416              98304          602112                              1.163265
32    16384     4   5799936             983040         4816896                             1.204082
64    131072    4   47972352            9437184        38535168                            1.244898
128   1048576   4   396361728           88080384       308281344                           1.285714
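As a sanity check, the Amdahl's Law bound in Table 20 can be reproduced with a short script. The cycle model below is inferred by fitting the table's five rows: roughly N×(4P³ + 9P) cycles for the per-particle steps, plus K³ cycles for the energy pass over the grid and 2×K³×log₂(K³) cycles for the 3DFFT/3DIFFT pair. The authoritative per-step breakdown is the one in Equation 26 and Table 2, so treat this as an illustrative sketch rather than the definitive model.

```python
from math import log2

def spme_cycles(K, N, P):
    """Illustrative SPME cycle model (constants inferred to match Table 20)."""
    per_particle = N * (4 * P**3 + 9 * P)    # B-Spline, mesh composition, force
    energy = K**3                            # energy pass over the K*K*K grid
    fft = 2 * K**3 * int(log2(K**3))         # forward 3DFFT + inverse 3DIFFT
    return per_particle + energy + fft, fft

def amdahl_speedup(K, N, P):
    """Speedup when the 3DFFT/3DIFFT pair is assumed to take zero time."""
    total, fft = spme_cycles(K, N, P)
    return total / (total - fft)

for K in (8, 16, 32, 64, 128):
    N = K**3 // 2                            # assumption from the text: N = (K*K*K)/2
    total, fft = spme_cycles(K, N, P=4)
    print(K, N, total, fft, round(amdahl_speedup(K, N, 4), 6))
```

Running the loop regenerates every row of Table 20, including the sub-2× speedup ceiling that motivates accelerating all major computation steps.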

With a lower-bound speedup of 3×, the speedup capability of the single-QMM RSCE is not significant. This is due both to the lack of parallelism in the SPME algorithm and to the limited QMM memory access bandwidth. Although the impact of the limited QMM bandwidth is mitigated by using more QMM memories, the lack of parallelism in the SPME algorithm still bottlenecks the RSCE speedup. Furthermore, the Standard Ewald Summation [28, 33] is both easier to implement and easier to parallelize. This raises the question of why the SPME algorithm, rather than the Standard Ewald Summation, should be implemented in hardware. The question can be answered with Table 21 and the plot in Figure 54.
In the plot in Figure 54, a negative Log(Speedup) value means that using the RSCE to perform the force and energy calculations results in a slowdown, while a positive value means a speedup. The plot uses the RSCE computation clock cycle estimate described in Equation 25 and simply takes N² as the clock cycle estimate for the Standard Ewald Summation. The plot shows that once the number of particles N increases past a threshold N_threshold (the zero-crossing point in the graph), the intrinsically O(N×Log(N)) RSCE implementation starts to provide significant speedup over the O(N²) Ewald Summation algorithm. The threshold N_threshold increases as the grid size K and the interpolation order P increase. The relationship between N_threshold and the interpolation order P is illustrated in Figure 55. Table 21 shows the threshold points for different simulation settings; as the table shows, the RSCE starts to provide a speedup over the Ewald Summation when the number of particles N is still fairly small.
Table 21  RSCE Speedup against Ewald Summation
Simulation Setting   Minimum N to Observe a Speedup (N_threshold)
P=4, K=32            700
P=4, K=64            3500
P=8, K=64            5000
P=4, K=128           7000
P=4, K=256           13000
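The zero-crossing search behind Table 21 can be sketched as a scan for the smallest N at which the RSCE cycle count drops below the N² cycle estimate of the Standard Ewald Summation. Since Equation 25 is not reproduced here, the sketch below reuses the illustrative cycle model fitted to Table 20 as a stand-in; its absolute thresholds therefore differ from the Table 21 values, but it exhibits the same trend of N_threshold growing with the grid size K.

```python
from math import log2

def rsce_cycles(K, N, P):
    """Stand-in for the Equation 25 RSCE cycle estimate (assumption:
    reuses the illustrative cycle model fitted to Table 20)."""
    return N * (4 * P**3 + 9 * P) + K**3 + 2 * K**3 * int(log2(K**3))

def n_threshold(K, P):
    """Smallest N where the RSCE beats the N^2 Standard Ewald Summation."""
    N = 1
    while rsce_cycles(K, N, P) >= N * N:     # Ewald Summation modelled as N^2 cycles
        N += 1
    return N

for K in (32, 64, 128):
    print(K, n_threshold(K, P=4))
```

As in Table 21, doubling K pushes the crossover point to a larger N, because the K³-dependent grid and FFT work must be amortized over more particles before the O(N×Log(N)) behaviour dominates.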

Figure 54  Single-QMM RSCE Speedup against N² Standard Ewald
Figure 55  Effect of P on Single-QMM RSCE Speedup
Figure 56 plots the RSCE speedup against the Ewald Summation when the number of particles N is of the same order as the number of grid points K×K×K. This eliminates the dependency of the speedup on independently varying N and K. As observed in the plot, N_threshold increases with increasing interpolation order P. Table 22 summarizes the results.
Table 22  RSCE Speedup against Ewald Summation (When K×K×K = ~N)
Interpolation Order P   N_threshold
4                       200
8                       1584
12                      5012
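The effect of P in Table 22 can be sketched the same way by tying the grid to the particle count, K×K×K = 2N, matching the assumption that N is half the number of grid points. Again, this reuses the illustrative cycle model fitted to Table 20 rather than Equation 25, so the absolute thresholds differ from the Table 22 values, but the growth of N_threshold with the interpolation order P (roughly as P³, since the per-particle work scales with P³) is the same.

```python
from math import log2

def rsce_cycles_matched(N, P):
    """Stand-in RSCE cycle estimate with the grid tied to the particle
    count, K*K*K = 2*N (illustrative model fitted to Table 20)."""
    K3 = 2 * N                                # total number of grid points K*K*K
    return N * (4 * P**3 + 9 * P) + K3 + 2 * K3 * log2(K3)

def n_threshold(P):
    """Smallest N where the RSCE beats the N^2 Standard Ewald Summation."""
    N = 2
    while rsce_cycles_matched(N, P) >= N * N:
        N += 1
    return N

for P in (4, 8, 12):
    print(P, n_threshold(P))
```

With the grid size no longer a free parameter, the crossover point is governed almost entirely by the per-particle term N×(4P³ + 9P), which is why higher interpolation orders delay the RSCE's advantage.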

Figure 56  RSCE Speedup against the Ewald Summation
(When K×K×K is of the same order as N)