An fpga implementation of the Smooth Particle Mesh Ewald Reciprocal Sum Compute Engine (rsce)

RSCE Parallelization vs. Ewald Summation Parallelization

Download 1.53 Mb.

Page	16/25
Date	09.08.2017
Size	1.53 Mb.
	#29150

1 ... 12 13 14 15 16 17 18 19 ... 25

6.7.RSCE Parallelization vs. Ewald Summation Parallelization

Up to now, when performing the calculations for the Coulombic interactions under the periodic boundary condition, all hardware MD systems [23, 28] chose to implement and parallelize the Ewald Summation. There is no significant effort to implement and parallelize the SPME algorithm in hardware. During the development stage of this thesis, there are several observations that may help explain this trend.
Firstly, the SPME algorithm itself is more difficult to implement in hardware than the Ewald Summation. Although the calculations for the direct sum in the SPME algorithm and in the Ewald Summation are identical, the difference is in the reciprocal sum calculations. The reciprocal sum calculation in the SPME is difficult because it involves some complex mathematical operations such as the B-Spline interpolation and the 3D-FFT. In addition to their implementation difficulty, these mathematical operations also complicate the analytical precision analysis on the fixed-point arithmetic typically used in hardware.
Secondly, the SPME algorithm is a sequential algorithm in which the energy and forces are computed with a sequence of calculation steps. Although several calculation steps can be overlapped, this sequential characteristic of the SPME algorithm still prohibits the full exposure to the FPGA parallelization capability. One of the main reasons that the hardware implementation of the SPME algorithm needs to be sequential is that in each calculation step, the QMM grid needs to be updated before the next step can proceed. For example, the 3D-IFFT operation cannot be started before the MC block finishes composing the QMM grid.
Thirdly, on a system-level view, if the reciprocal sum calculation is parallelized to multiple FPGAs, the data communication requirement of the SPME algorithm is more demanding than that of the Ewald Summation. With the Ewald Summation, a replicated data scheme (that is each FPGA holds the information for all N particles) can eliminate the data communication requirement during the energy and force calculation. However, with the SPME algorithm, there is no obvious way to eliminate the data communication requirement. The reason is that the data communication among multiple FPGAs is needed to exchange the updated QMM grid information in the intermediate steps.
Lastly, while the opportunity of parallelization in the Ewald Summation is limited by the number of particles N, the opportunity of parallelization in the SPME algorithm is limited by the grid size K or the grid plane size K×K. The reason is that the FFT operation in the SPME algorithm is more efficient when each FPGA performs the FFT for a row. Therefore, the grid size K limits the maximum number of FPGAs to be used in parallelizing the SPME calculations. Table 23 shows the relationship between the grid size K and the maximum number of FPGA to be used. For a large grid size K, if each RSCE performs the FFT for a row during the 3D-FFT operation, the maximum number of RSCEs is well over ten thousands. It would be impractical and unrealistic to build a hardware system with more than ten thousands RSCEs.

Table 23 - Maximum Number of RSCEs Used in Parallelizing the SPME Calculation

Grid Size K (Assume orthogonal box)	Max. # of RSCEs (Each RSCE takes a plane during FFT)	Max. # of RSCEs (Each RSCE takes a row during FFT)
32	32	1024
64	64	4096
128	128	16384
256	256	65536

Although there are limitations and difficulties in implementing and parallelizing the SPME algorithm in hardware, the main justification for implementing the SPME algorithm is the intrinsic N×Log(N) complexity of the SPME algorithm. Although this complexity benefit will be offset by the limited parallelization opportunity and data communication requirement when the number of FPGAs used increases to a certain threshold NumP_threshold, this threshold number of FPGAs is very high.

Figure 57 shows the effect of increasing the number of FPGAs on the speedup of the multi-RSCE system against the multi-FPGA Ewald Summation system. As observed in the figure, as the number of FPGAs used increases, the speedup provided by the multi-RSCE system decreases. The reason is that the data communication time, which increases with NumP, starts to takes its toll on the overall multi-RSCE performance. The zero-crossing point in the graph represents the threshold number of FPGAs NumP_threshold when the multi-FPGA Ewald Summation system starts to outperform the multi-RSCE system.
Table 24 summarizes the thresholds NumP_threshold for different grid sizes. For a grid size K=32, the NumP_threshold is 4000 and for a grid size of K=64, the NumP_threshold is as large as 33000. The threshold number of FPGAs increases with increasing grid size. Although, due to the implementation simplicity, more calculation pipelines should be able to be fitted into an Ewald Summation FPGA, the threshold number of FPGAs is still too large for any grid size of larger than 64,. This suggests that for any realistic hardware system, the multi-RSCE system would outperform the multi-FPGA Ewald Summation system.

Table 24 - Threshold Number of FPGAs when the Ewald Summation starts to be Faster

Grid Size K (Assume orthogonal box)	Corresponding Number of Particles N	Max. # of RSCEs	Threshold Num. of FPGAs
32	16384	1024	4000
64	131072	4096	33000
128	1048576	16384	1040000

Figure 57 - RSCE Parallelization vs. Ewald Summation Parallelization

In plotting the graph in Figure 57, several assumptions have been made. Firstly, the total number of grid points is of the same order as N [2], this provides a fair basis to compare the two algorithms. Secondly, the maximum number of FPGAs used in the SPME hardware is limited by the number of rows in the charge grid. When the number of FPGAs increases beyond the number of rows, it is assumed that the extra FPGAs will be left idle during the 3D-FFT processes and they only will participate during the mesh composition and the force calculation stages. Thirdly, it is assumed there is no data communication requirement for the multi-FPGA Ewald Summation system, that is, the usage of data replication is assumed for the Ewald Summation hardware. Fourthly, it is assumed that the communication backbone for the multi-RSCE system is non-blocking and has a transfer rate of 2GBit/s. Lastly, in plotting the graph, the communication time and the computation time of the multi-RSCE system are estimated with Equation 27 and Equation 28.

Equation 27 - Communication Time of the Multi-RSCE System

Equation 28 - Computation Time of the Multi-RSCE System

The derivation of the equations stems from Section 3.10 of this thesis. As described in Section 3.10, there are six synchronization and communication points for the multi-RSCE operation. In two of them, only the real part of the grid (after the MC step and before the FC step) needs to be exchanged among the RSCEs, and in the remaining four of them, both the real and the imaginary part of the grid need to be exchanged. Therefore, there are altogether (10×K_X×K_Y×K_Z×Precision_Grid) bits of data to be exchanged after each synchronization point and each RSCE should theoretically need (10×K_X×K_Y×K_Z×Precision_Grid)/NumP bits of data for the mini-mesh or for the FFT rows it is responsible for. Furthermore, assuming that the RSCEs are organized in a 2D mesh with nearest-neighbor connection and the multi-RSCE system has intelligence to deliver data to any RSCE without blocking; the number of transmission hops for the RSCE to send a request and receive the data can be approximated to be 2×√NumP. Therefore, the communication time can be estimated with Equation 27. On the other hand, Equation 28 is derived from the information shown in Table 14 of Section 3.10.

After the different aspects of the RSCE speedup is examined, the next chapter discusses the verification and the validation effort of the RSCE implementation.

Directory: ~pc
~pc -> The Tablet War: Apple v s The Rest
~pc -> From: object-oriented analysis and design, Grady Booch, Addison-Wesley, 1998
~pc -> Analysis of an Industry Price War: The Tablet price war
~pc -> Biography of Pok Chi Lau Home address: 2600

Download 1.53 Mb.

Share with your friends:

1 ... 12 13 14 15 16 17 18 19 ... 25