Up to now, when performing the calculations for the Coulombic interactions under the periodic boundary condition, all hardware MD systems [23, 28] chose to implement and parallelize the Ewald Summation. There is no significant effort to implement and parallelize the SPME algorithm in hardware. During the development stage of this thesis, there are several observations that may help explain this trend.
Firstly, the SPME algorithm itself is more difficult to implement in hardware than the Ewald Summation. Although the calculations for the direct sum in the SPME algorithm and in the Ewald Summation are identical, the difference is in the reciprocal sum calculations. The reciprocal sum calculation in the SPME is difficult because it involves some complex mathematical operations such as the BSpline interpolation and the 3DFFT. In addition to their implementation difficulty, these mathematical operations also complicate the analytical precision analysis on the fixedpoint arithmetic typically used in hardware.
Secondly, the SPME algorithm is a sequential algorithm in which the energy and forces are computed with a sequence of calculation steps. Although several calculation steps can be overlapped, this sequential characteristic of the SPME algorithm still prohibits the full exposure to the FPGA parallelization capability. One of the main reasons that the hardware implementation of the SPME algorithm needs to be sequential is that in each calculation step, the QMM grid needs to be updated before the next step can proceed. For example, the 3DIFFT operation cannot be started before the MC block finishes composing the QMM grid.
Thirdly, on a systemlevel view, if the reciprocal sum calculation is parallelized to multiple FPGAs, the data communication requirement of the SPME algorithm is more demanding than that of the Ewald Summation. With the Ewald Summation, a replicated data scheme (that is each FPGA holds the information for all N particles) can eliminate the data communication requirement during the energy and force calculation. However, with the SPME algorithm, there is no obvious way to eliminate the data communication requirement. The reason is that the data communication among multiple FPGAs is needed to exchange the updated QMM grid information in the intermediate steps.
Lastly, while the opportunity of parallelization in the Ewald Summation is limited by the number of particles N, the opportunity of parallelization in the SPME algorithm is limited by the grid size K or the grid plane size K×K. The reason is that the FFT operation in the SPME algorithm is more efficient when each FPGA performs the FFT for a row. Therefore, the grid size K limits the maximum number of FPGAs to be used in parallelizing the SPME calculations. Table 23 shows the relationship between the grid size K and the maximum number of FPGA to be used. For a large grid size K, if each RSCE performs the FFT for a row during the 3DFFT operation, the maximum number of RSCEs is well over ten thousands. It would be impractical and unrealistic to build a hardware system with more than ten thousands RSCEs.
Table 23  Maximum Number of RSCEs Used in Parallelizing the SPME Calculation
Grid Size K
(Assume orthogonal box)

Max. # of RSCEs
(Each RSCE takes a plane
during FFT)

Max. # of RSCEs
(Each RSCE takes a row
during FFT)

32

32

1024

64

64

4096

128

128

16384

256

256

65536

Although there are limitations and difficulties in implementing and parallelizing the SPME algorithm in hardware, the main justification for implementing the SPME algorithm is the intrinsic N×Log(N) complexity of the SPME algorithm. Although this complexity benefit will be offset by the limited parallelization opportunity and data communication requirement when the number of FPGAs used increases to a certain threshold NumP_{threshold}, this threshold number of FPGAs is very high.
Figure 57 shows the effect of increasing the number of FPGAs on the speedup of the multiRSCE system against the multiFPGA Ewald Summation system. As observed in the figure, as the number of FPGAs used increases, the speedup provided by the multiRSCE system decreases. The reason is that the data communication time, which increases with NumP, starts to takes its toll on the overall multiRSCE performance. The zerocrossing point in the graph represents the threshold number of FPGAs NumP_{threshold} when the multiFPGA Ewald Summation system starts to outperform the multiRSCE system.
Table 24 summarizes the thresholds NumP_{threshold} for different grid sizes. For a grid size K=32, the NumP_{threshold} is 4000 and for a grid size of K=64, the NumP_{threshold} is as large as 33000. The threshold number of FPGAs increases with increasing grid size. Although, due to the implementation simplicity, more calculation pipelines should be able to be fitted into an Ewald Summation FPGA, the threshold number of FPGAs is still too large for any grid size of larger than 64,. This suggests that for any realistic hardware system, the multiRSCE system would outperform the multiFPGA Ewald Summation system.
Table 24  Threshold Number of FPGAs when the Ewald Summation starts to be Faster
Grid Size K
(Assume orthogonal box)

Corresponding Number of Particles N

Max. # of RSCEs

Threshold Num. of FPGAs

32

16384

1024

4000

64

131072

4096

33000

128

1048576

16384

1040000

Figure 57  RSCE Parallelization vs. Ewald Summation Parallelization
In plotting the graph in Figure 57, several assumptions have been made. Firstly, the total number of grid points is of the same order as N [2], this provides a fair basis to compare the two algorithms. Secondly, the maximum number of FPGAs used in the SPME hardware is limited by the number of rows in the charge grid. When the number of FPGAs increases beyond the number of rows, it is assumed that the extra FPGAs will be left idle during the 3DFFT processes and they only will participate during the mesh composition and the force calculation stages. Thirdly, it is assumed there is no data communication requirement for the multiFPGA Ewald Summation system, that is, the usage of data replication is assumed for the Ewald Summation hardware. Fourthly, it is assumed that the communication backbone for the multiRSCE system is nonblocking and has a transfer rate of 2GBit/s. Lastly, in plotting the graph, the communication time and the computation time of the multiRSCE system are estimated with Equation 27 and Equation 28.
Equation 27  Communication Time of the MultiRSCE System
Equation 28  Computation Time of the MultiRSCE System
The derivation of the equations stems from Section 3.10 of this thesis. As described in Section 3.10, there are six synchronization and communication points for the multiRSCE operation. In two of them, only the real part of the grid (after the MC step and before the FC step) needs to be exchanged among the RSCEs, and in the remaining four of them, both the real and the imaginary part of the grid need to be exchanged. Therefore, there are altogether (10×K_{X}×K_{Y}×K_{Z}×Precision_{Grid}) bits of data to be exchanged after each synchronization point and each RSCE should theoretically need (10×K_{X}×K_{Y}×K_{Z}×Precision_{Grid})/NumP bits of data for the minimesh or for the FFT rows it is responsible for. Furthermore, assuming that the RSCEs are organized in a 2D mesh with nearestneighbor connection and the multiRSCE system has intelligence to deliver data to any RSCE without blocking; the number of transmission hops for the RSCE to send a request and receive the data can be approximated to be 2×√NumP. Therefore, the communication time can be estimated with Equation 27. On the other hand, Equation 28 is derived from the information shown in Table 14 of Section 3.10.
After the different aspects of the RSCE speedup is examined, the next chapter discusses the verification and the validation effort of the RSCE implementation.
Share with your friends: 