Rather than speeding up the whole SPME algorithm in hardware, an alternative is to accelerate only the 3D-FFT operation in a dedicated high-performance FFT co-processor. This alternative architecture is shown in Figure 53.
Figure 53 - CPU with FFT Co-processor
To investigate the speedup potential of the FFT co-processor architecture, the SPME computational complexity estimate in Table 2 is used; it is summarized in Equation 26 for easy reference.
Equation 26 - SPME Computational Complexity (Based on Table 2)
To determine the maximum achievable speedup of the FFT co-processor architecture, the well-known Amdahl's Law is applied, assuming the FFT co-processor can compute the 3D-FFT and the 3D-IFFT in zero time. Based on Equation 26, Table 20 shows the number of computation cycles involved in the SPME computation with and without the 3D-FFT operation. The table also shows the degree of speedup of the FFT co-processor architecture for different grid sizes K. In generating the speedup data, it is assumed that the number of particles N is one half of the total number of grid points K×K×K and that the interpolation order P equals 4. As shown in the table, even if the FFT operation is accelerated to take zero computation cycles, the speedup of the FFT co-processor architecture is still insignificant (< 2x). The main reason is that the B-Spline calculation, the mesh composition, the energy calculation, and the force calculation cannot be overlapped on a CPU, and together they constitute a large portion of the SPME computation time. Therefore, to accelerate the SPME computation, all major computation steps should be accelerated.
Table 20 - Speedup Potential of FFT Co-processor Architecture
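The Amdahl's Law bound above can be sketched numerically. The per-step cycle counts below are hypothetical stand-ins that follow the asymptotic terms discussed in the text (B-Spline, mesh composition, 3D-FFT/IFFT, energy, and force), not the exact coefficients of Equation 26; the qualitative conclusion (speedup below 2x when only the FFT is removed) is what the sketch illustrates.

```python
import math

def spme_cycles(N, K, P):
    """Hypothetical per-step cycle counts following the asymptotic
    terms discussed in the text (not the exact Equation 26 values)."""
    bspline = N * P                        # B-Spline coefficient calculation
    mesh    = N * P**3                     # mesh composition
    fft     = 2 * K**3 * math.log2(K**3)   # 3D-FFT plus 3D-IFFT
    energy  = K**3                         # reciprocal energy calculation
    force   = N * P**3                     # force calculation
    total   = bspline + mesh + fft + energy + force
    return total, fft

def amdahl_speedup(N, K, P):
    """Speedup if the co-processor computes the FFT terms in zero time."""
    total, fft = spme_cycles(N, K, P)
    return total / (total - fft)

# N is half the number of grid points, P = 4, as assumed in the text.
for K in (16, 32, 64, 128):
    print(f"K={K:3d}  speedup={amdahl_speedup(K**3 // 2, K, 4):.2f}x")
```

Under these assumed cycle counts, the non-FFT steps dominate, so eliminating the FFT entirely still yields well under a 2x overall speedup.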
With a lower-bound speedup of 3x, the speedup ability of the single-QMM RSCE is not significant. This is due to the lack of parallelism in the SPME algorithm and to the limited QMM memory access bandwidth. Although the impact of the limited QMM bandwidth is mitigated by using more QMM memories, the lack of parallelism in the SPME algorithm still bottlenecks the RSCE speedup. Furthermore, the Standard Ewald Summation is both easier to implement and easier to parallelize [28, 33]. This raises the question of why the SPME algorithm should be implemented in hardware instead of the Standard Ewald Summation. The question is answered by the data in Table 21 and the plot in Figure 54.
In the plot in Figure 54, a negative Log(Speedup) value means that using the RSCE to perform the energy and force calculations results in a slowdown, while a positive value means a speedup. The plot uses the RSCE computation clock cycle estimate described in Equation 25 and simply takes N² clock cycles as the estimate for the Standard Ewald Summation. The plot shows that, as the number of particles N increases past a threshold point Nthreshold (the zero-crossing point in the graph), the intrinsically O(N×Log(N)) RSCE implementation starts to provide significant speedup over the O(N²) Ewald Summation. The threshold Nthreshold increases as the grid size K and the interpolation order P increase. The relationship between Nthreshold and the interpolation order P is illustrated in Figure 55. Table 21 shows the threshold points for different simulation settings; as the table shows, the RSCE starts to provide speedup over the Ewald Summation when the number of particles N is still fairly small.
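The zero-crossing point can be located with a simple search. The RSCE cycle model below is a simplified stand-in for the Equation 25 estimate (a per-particle term plus a grid-dependent FFT term, with assumed coefficients), compared against the N² cycle count taken for the Standard Ewald Summation.

```python
import math

def rsce_cycles(N, K, P):
    """Simplified stand-in for the Equation 25 RSCE cycle estimate:
    per-particle B-Spline/mesh/force work plus the grid FFT terms."""
    return 3 * N * P**3 + 2 * K**3 * math.log2(K**3) + K**3

def ewald_cycles(N):
    """O(N^2) cycle estimate taken for the Standard Ewald Summation."""
    return N * N

def n_threshold(K, P):
    """Smallest N for which the RSCE estimate beats the Ewald estimate
    (the zero-crossing point of Log(Speedup))."""
    N = 1
    while rsce_cycles(N, K, P) >= ewald_cycles(N):
        N += 1
    return N

print(f"Nthreshold at K=32, P=4: {n_threshold(32, 4)}")
print(f"Nthreshold at K=32, P=6: {n_threshold(32, 6)}")
```

Consistent with the text, the threshold grows with the interpolation order P (and with K, through the FFT term), but remains a fairly small particle count.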
Figure 56 plots the RSCE speedup over the Ewald Summation when the number of particles N is of the same order as the number of grid points K×K×K; this eliminates the dependency of the speedup on independently varying N and K. As observed in the plot, Nthreshold increases with increasing interpolation order P. Table 22 summarizes the results.
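The quantity plotted in Figure 56 can be sketched with the same simplified cycle models, tying N to the grid size. The coefficients below are assumptions for illustration, not the Equation 25 values.

```python
import math

def log_speedup(K, P):
    """Log10 of (Ewald cycles / RSCE cycles) when the number of
    particles N equals the number of grid points K^3, using
    simplified stand-in cycle models (assumed coefficients)."""
    N = K**3
    rsce  = 3 * N * P**3 + 2 * K**3 * math.log2(K**3) + K**3
    ewald = N * N
    return math.log10(ewald / rsce)

for P in (2, 4, 6, 8):
    print(f"P={P}  Log(Speedup) at K=32: {log_speedup(32, P):+.2f}")
```

With N tied to the grid size, Log(Speedup) stays positive for moderate K while shrinking as P grows, matching the trend the plot and Table 22 describe.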