Rather than speeding up the whole SPME algorithm in hardware, an alternative is to accelerate only the 3D-FFT operation in a dedicated high-performance FFT co-processor. This alternative architecture is shown in Figure 53.
Figure 53 - CPU with FFT Co-processor
To investigate the speedup potential of the FFT co-processor architecture, the SPME computational complexity estimate in Table 2 is used; it is summarized in Equation 26 for easy reference.
Equation 26 - SPME Computational Complexity (Based on Table 2)
To determine the maximum achievable speedup of the FFT co-processor architecture, the well-known Amdahl's Law is applied, assuming the FFT co-processor can compute the 3D-FFT and the 3D-IFFT in zero time. Based on Equation 26, Table 20 shows the number of computation cycles involved in the SPME computation with and without the 3D-FFT operation. The table also shows the degree of speedup of the FFT co-processor architecture for different grid sizes K. In generating the speedup data, it is assumed that the number of particles N is one half of the total number of grid points K×K×K and that the interpolation order P equals 4. As shown in the table, even if the FFT operation is accelerated to take zero computation cycles, the speedup of the FFT co-processor architecture is still insignificant (< 2x). The main reason is that the B-Spline calculation, the mesh composition, the energy calculation, and the force calculation cannot be overlapped on a CPU, and together they constitute a large portion of the SPME computation time. Therefore, to accelerate the SPME computation, all major computation steps should be accelerated.
Table 20 - Speedup Potential of FFT Co-processor Architecture
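The Amdahl's Law bound above can be sketched numerically. The per-step cycle counts below are hypothetical stand-ins that follow the asymptotic terms discussed in the text (B-Spline, mesh composition, 3D-FFT/IFFT, energy, and force), not the exact coefficients of Equation 26; the qualitative conclusion (speedup below 2x when only the FFT is removed) is what the sketch illustrates.

```python
import math

def spme_cycles(N, K, P):
    """Hypothetical per-step cycle counts following the asymptotic
    terms discussed in the text (not the exact Equation 26 values)."""
    bspline = N * P                        # B-Spline coefficient calculation
    mesh    = N * P**3                     # mesh composition
    fft     = 2 * K**3 * math.log2(K**3)   # 3D-FFT plus 3D-IFFT
    energy  = K**3                         # reciprocal energy calculation
    force   = N * P**3                     # force calculation
    total   = bspline + mesh + fft + energy + force
    return total, fft

def amdahl_speedup(N, K, P):
    """Speedup if the co-processor computes the FFT terms in zero time."""
    total, fft = spme_cycles(N, K, P)
    return total / (total - fft)

# N is half the number of grid points, P = 4, as assumed in the text.
for K in (16, 32, 64, 128):
    print(f"K={K:3d}  speedup={amdahl_speedup(K**3 // 2, K, 4):.2f}x")
```

Under these assumed cycle counts, the non-FFT steps dominate, so eliminating the FFT entirely still yields well under a 2x overall speedup.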
With a lower-bound speedup of 3x, the speedup ability of the single-QMM RSCE is not significant. This is due to the lack of parallelism in the SPME algorithm and to the limited QMM memory access bandwidth. Although the impact of the limited QMM bandwidth is mitigated by using more QMM memories, the lack of parallelism in the SPME algorithm still bottlenecks the RSCE speedup. Furthermore, the Standard Ewald Summation is both easier to implement and easier to parallelize [28, 33]. This raises the question of why the SPME algorithm should be implemented in hardware instead of the Standard Ewald Summation. The question is answered by the data in Table 21 and the plot in Figure 54.
In the plot in Figure 54, a negative Log(Speedup) value means that using the RSCE to perform the energy and force calculations results in a slowdown, while a positive value means a speedup. The plot uses the RSCE computation clock cycle estimate described in Equation 25 and simply takes N² clock cycles as the estimate for the Standard Ewald Summation. The plot shows that, as the number of particles N increases past a threshold point Nthreshold (the zero-crossing point in the graph), the intrinsically O(N×Log(N)) RSCE implementation starts to provide significant speedup over the O(N²) Ewald Summation. The threshold Nthreshold increases as the grid size K and the interpolation order P increase. The relationship between Nthreshold and the interpolation order P is illustrated in Figure 55. Table 21 shows the threshold points for different simulation settings; as the table shows, the RSCE starts to provide speedup over the Ewald Summation when the number of particles N is still fairly small.
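The zero-crossing point can be located with a simple search. The RSCE cycle model below is a simplified stand-in for the Equation 25 estimate (a per-particle term plus a grid-dependent FFT term, with assumed coefficients), compared against the N² cycle count taken for the Standard Ewald Summation.

```python
import math

def rsce_cycles(N, K, P):
    """Simplified stand-in for the Equation 25 RSCE cycle estimate:
    per-particle B-Spline/mesh/force work plus the grid FFT terms."""
    return 3 * N * P**3 + 2 * K**3 * math.log2(K**3) + K**3

def ewald_cycles(N):
    """O(N^2) cycle estimate taken for the Standard Ewald Summation."""
    return N * N

def n_threshold(K, P):
    """Smallest N for which the RSCE estimate beats the Ewald estimate
    (the zero-crossing point of Log(Speedup))."""
    N = 1
    while rsce_cycles(N, K, P) >= ewald_cycles(N):
        N += 1
    return N

print(f"Nthreshold at K=32, P=4: {n_threshold(32, 4)}")
print(f"Nthreshold at K=32, P=6: {n_threshold(32, 6)}")
```

Consistent with the text, the threshold grows with the interpolation order P (and with K, through the FFT term), but remains a fairly small particle count.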
Figure 56 plots the RSCE speedup over the Ewald Summation when the number of particles N is of the same order as the number of grid points K×K×K; this eliminates the dependency of the speedup on independently varying N and K. As observed in the plot, Nthreshold increases with increasing interpolation order P. Table 22 summarizes the results.
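The quantity plotted in Figure 56 can be sketched with the same simplified cycle models, tying N to the grid size. The coefficients below are assumptions for illustration, not the Equation 25 values.

```python
import math

def log_speedup(K, P):
    """Log10 of (Ewald cycles / RSCE cycles) when the number of
    particles N equals the number of grid points K^3, using
    simplified stand-in cycle models (assumed coefficients)."""
    N = K**3
    rsce  = 3 * N * P**3 + 2 * K**3 * math.log2(K**3) + K**3
    ewald = N * N
    return math.log10(ewald / rsce)

for P in (2, 4, 6, 8):
    print(f"P={P}  Log(Speedup) at K=32: {log_speedup(32, P):+.2f}")
```

With N tied to the grid size, Log(Speedup) stays positive for moderate K while shrinking as P grows, matching the trend the plot and Table 22 describe.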