Up to now, when performing the calculations for the Coulombic interactions under periodic boundary conditions, all hardware MD systems [23, 28] have chosen to implement and parallelize the Ewald Summation; there has been no significant effort to implement and parallelize the SPME algorithm in hardware. During the development of this thesis, several observations were made that may help explain this trend.
Firstly, the SPME algorithm itself is more difficult to implement in hardware than the Ewald Summation. Although the direct sum calculations in the SPME algorithm and in the Ewald Summation are identical, the two differ in the reciprocal sum calculations. The reciprocal sum calculation in the SPME algorithm is difficult because it involves complex mathematical operations such as the B-Spline interpolation and the 3D-FFT. Beyond their implementation difficulty, these operations also complicate the analytical precision analysis of the fixed-point arithmetic typically used in hardware.
Secondly, the SPME algorithm is a sequential algorithm in which the energy and forces are computed through a sequence of calculation steps. Although several of these steps can be overlapped, this sequential characteristic still prevents full exploitation of the FPGA's parallelization capability. One of the main reasons the hardware implementation of the SPME algorithm needs to be sequential is that, in each calculation step, the QMM grid needs to be updated before the next step can proceed. For example, the 3D-IFFT operation cannot start before the MC block finishes composing the QMM grid.
Thirdly, from a system-level point of view, if the reciprocal sum calculation is parallelized across multiple FPGAs, the data communication requirement of the SPME algorithm is more demanding than that of the Ewald Summation. With the Ewald Summation, a replicated data scheme (that is, each FPGA holds the information for all N particles) can eliminate the data communication requirement during the energy and force calculation. With the SPME algorithm, however, there is no obvious way to eliminate this requirement, because communication among the FPGAs is needed to exchange the updated QMM grid information in the intermediate steps.
Lastly, while the opportunity for parallelization in the Ewald Summation is limited by the number of particles N, the opportunity for parallelization in the SPME algorithm is limited by the grid size K or the grid plane size K×K. The reason is that the FFT operation in the SPME algorithm is most efficient when each FPGA performs the FFT for one row. Therefore, the grid size K limits the maximum number of FPGAs that can be used in parallelizing the SPME calculations. Table 23 shows the relationship between the grid size K and the maximum number of FPGAs that can be used. For a large grid size K, if each RSCE performs the FFT for one row during the 3D-FFT operation, the maximum number of RSCEs is well over ten thousand. It would be impractical and unrealistic to build a hardware system with more than ten thousand RSCEs.
Table 23 - Maximum Number of RSCEs Used in Parallelizing the SPME Calculation
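The Table 23 relationship can be sketched as a short Python check, under the row-per-RSCE assumption stated above: a K×K×K charge grid has K×K rows in each FFT pass, so K×K bounds the useful RSCE count. The function name `max_parallel_rsces` is illustrative, not part of the RSCE design.

```python
# Sketch of the Table 23 relationship, assuming one RSCE performs the
# 1D-FFT for exactly one grid row: a K x K x K charge grid has K*K rows
# per FFT pass, so K*K bounds the number of RSCEs that can be kept busy.
def max_parallel_rsces(k: int) -> int:
    return k * k

for k in (32, 64, 128):
    print(f"K={k}: at most {max_parallel_rsces(k)} RSCEs")
```

For K=128 the bound is 16384 RSCEs, which matches the "well over ten thousand" observation above.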
Although there are limitations and difficulties in implementing and parallelizing the SPME algorithm in hardware, the main justification for implementing it is the intrinsic N×Log(N) complexity of the SPME algorithm. Although this complexity benefit is offset by the limited parallelization opportunity and the data communication requirement once the number of FPGAs used increases beyond a certain threshold NumPthreshold, this threshold number of FPGAs is very high.
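The magnitude of the complexity benefit can be illustrated with toy operation counts, assuming the O(N²) reciprocal-sum scaling used for the Ewald Summation comparison in this chapter:

```python
from math import log2

# Toy operation counts illustrating the N*log(N) advantage of the SPME
# algorithm over an (assumed) O(N^2) Ewald Summation reciprocal sum.
for n in (10_000, 100_000):
    print(f"N={n}: Ewald ~ {n * n:.1e} ops, SPME ~ {n * log2(n):.1e} ops")
```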
Figure 57 shows the effect of increasing the number of FPGAs on the speedup of the multi-RSCE system over the multi-FPGA Ewald Summation system. As observed in the figure, as the number of FPGAs increases, the speedup provided by the multi-RSCE system decreases. The reason is that the data communication time, which increases with NumP, starts to take its toll on the overall multi-RSCE performance. The zero-crossing point in the graph represents the threshold number of FPGAs NumPthreshold at which the multi-FPGA Ewald Summation system starts to outperform the multi-RSCE system.
Table 24 summarizes the threshold NumPthreshold for different grid sizes. For a grid size K=32, NumPthreshold is 4000, and for a grid size K=64, NumPthreshold is as large as 33000. The threshold number of FPGAs increases with increasing grid size. Although the implementation simplicity of the Ewald Summation should allow more calculation pipelines to be fitted into an Ewald Summation FPGA, the threshold number of FPGAs is still too large for any grid size larger than 64. This suggests that for any realistic hardware system, the multi-RSCE system would outperform the multi-FPGA Ewald Summation system.
Table 24 - Threshold Number of FPGAs when the Ewald Summation starts to be Faster
Figure 57 - RSCE Parallelization vs. Ewald Summation Parallelization
In plotting the graph in Figure 57, several assumptions have been made. Firstly, the total number of grid points is of the same order as N; this provides a fair basis for comparing the two algorithms. Secondly, the maximum number of FPGAs used in the SPME hardware is limited by the number of rows in the charge grid. When the number of FPGAs increases beyond the number of rows, it is assumed that the extra FPGAs are left idle during the 3D-FFT processes and only participate during the mesh composition and force calculation stages. Thirdly, it is assumed that there is no data communication requirement for the multi-FPGA Ewald Summation system; that is, the Ewald Summation hardware is assumed to use data replication. Fourthly, it is assumed that the communication backbone of the multi-RSCE system is non-blocking and has a transfer rate of 2 Gbit/s. Lastly, in plotting the graph, the communication time and the computation time of the multi-RSCE system are estimated with Equation 27 and Equation 28.
Equation 27 - Communication Time of the Multi-RSCE System
Equation 28 - Computation Time of the Multi-RSCE System
The derivation of the equations stems from Section 3.10 of this thesis. As described in Section 3.10, there are six synchronization and communication points in the multi-RSCE operation. In two of them (after the MC step and before the FC step), only the real part of the grid needs to be exchanged among the RSCEs; in the remaining four, both the real and the imaginary parts of the grid need to be exchanged. Therefore, altogether (10×KX×KY×KZ×PrecisionGrid) bits of data need to be exchanged across these synchronization points, and each RSCE should theoretically need (10×KX×KY×KZ×PrecisionGrid)/NumP bits of data for the mini-mesh or for the FFT rows it is responsible for. Furthermore, assuming that the RSCEs are organized in a 2D mesh with nearest-neighbor connections and that the multi-RSCE system has the intelligence to deliver data to any RSCE without blocking, the number of transmission hops for an RSCE to send a request and receive the data can be approximated as 2×√NumP. Therefore, the communication time can be estimated with Equation 27. Equation 28, on the other hand, is derived from the information shown in Table 14 of Section 3.10.
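The derivation above, together with the Figure 57 assumptions, can be sketched as a toy cost model in Python. Only `LINK_RATE_BPS` follows a stated assumption (the 2 Gbit/s backbone); `precision_bits` and `TIME_PER_OP_S` are hypothetical placeholders rather than the Table 14 values, so this sketch reproduces the shape of the trade-off, not the Table 24 thresholds.

```python
from math import sqrt, log2

# Hedged sketch of the timing model behind Figure 57. The exact forms of
# Equation 27 and Equation 28 are not reproduced in this excerpt; this
# follows the surrounding derivation: 10*K^3*PrecisionGrid bits exchanged
# across the synchronization points (two real-only plus four
# real-and-imaginary grid exchanges), roughly 2*sqrt(NumP) hops on a 2D
# nearest-neighbor mesh, and a non-blocking 2 Gbit/s backbone. The
# per-operation time constant is a hypothetical placeholder.
LINK_RATE_BPS = 2.0e9        # assumed 2 Gbit/s backbone
TIME_PER_OP_S = 1.0e-8       # hypothetical per-operation time

def comm_time(k, num_p, precision_bits=24):
    """Per-RSCE share of the grid data, carried over ~2*sqrt(NumP) hops."""
    bits_per_rsce = 10 * k**3 * precision_bits / num_p
    return 2 * sqrt(num_p) * bits_per_rsce / LINK_RATE_BPS

def spme_time(n, k, num_p):
    """N*log(N) work split over at most K*K useful RSCEs, plus comm time."""
    workers = min(num_p, k * k)   # extra FPGAs idle during the 3D-FFT
    return n * log2(n) * TIME_PER_OP_S / workers + comm_time(k, num_p)

def ewald_time(n, num_p):
    """O(N^2) work with replicated data, hence no communication term."""
    return n * n * TIME_PER_OP_S / num_p
```

With these toy constants, the multi-RSCE time is eventually dominated by a communication term that shrinks only as 1/√NumP, while the Ewald time keeps shrinking as 1/NumP, which is why a zero-crossing such as the one in Figure 57 eventually appears.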
Now that the different aspects of the RSCE speedup have been examined, the next chapter discusses the verification and validation effort of the RSCE implementation.