The discrete element finite element method (DEFEM) is a promising approach for creating virtual reality and gaming environments that exhibit highly realistic physical behaviour. However, the DEFEM equations are computationally expensive, and cannot be solved in real time on desktop PCs and workstations for complex virtual environments.
Many numerical problems can be greatly speeded up by using hardware accelerators, specialized integrated circuits that exploit high levels of parallelism to give rapid solution.
Generally speaking, the goals of the first three months are
1. Get familiar with the fundamental theories about the Finite Element Methods and its typical application;
2. Gain a comprehensive perspective throughout FPGA based matrix equation solution arithmetic;
3. Design software simulation and hardware solution using parallel technique for simple finite element.
The traditional field for field programmable gate arrays (FPGAs) was the embedded systems where integer and fixed point arithmetic are prevalent. Benefit from its resource densities incensement, FPGAs become capable to dealing with efficiently implementing large scale scientific applications involving floating point computations. FPGAs have already shown its power in a large number of integer and fixed-point applications which have much greater performance then microprocessors.
According to Moore’s law, FPGA devices may have three to eight more peak floating-point performance than comparable microprocessors by 2009. It seems that moving floating-point arithmetic from microprocessors to FPGAs will have more potential to speed up. Actually, resent extensive researches based on FPGAs arithmetic mainly have two different solutions to deal with floating-point. Some of them choose to use the custom floating-point formats in FPGAs[5-9]. The other way is to translating floating-point to fixed points or to optimizing the bit widths of floating-point formats as an alternative[3, 11].
The first thing stage to achieve on FPGAs is the common Basic Linear Algebra Subroutine (BLAS), typically including vector dot product (DDOT), matrix-vector multiply (DGEMV), and matrix multiply(DGEMM). Several approach[3, 13] shows that FPGAs and Reconfigurable Computing platforms have the ability to perform better then the microprocessor recently, and will be better with the rapid increase of the resource densities. They raised that using more floating-point units in a design on FPGA will leads to smaller latency, but requires more area and higher memory bandwidth. Besides, unlike CPUs, FPGAs are often limited by peak FLOPs rather than by memory bandwidth.
There are lots of arithmetic to solve equation matrix, such as Jacobi Iteration, Gauss-Seidel Iteration, LU decomposition and so on. Unfortunately, not all the methods which can efficiently used on microprocessors can directly move to FPGAs. This is because that the root for the high speed performance on FPGAs is parallel process. Those loop-based or iteration-based methods have to make some changes to meet the features of FPGA. Several efficient improvements have been done on different equation solving arithmetic[14, 15].
There are also some FPGA-based applications about FEM using s on different fields[16, 17]. These efforts gave the common steps to dealing with FEM applications to follow.
The aim of this project is to investigate the issues involved in creating hardware accelerators for real-time solution of the DEFEM equation on low cost plug-in boards for desktop PCs. This will involve investigation of a variety of formulations of the DEFEM in order to identify which can be best accelerated by low cost hardware with a clean partition between hardware and software. It will also involve design of the hardware accelerators and evaluation of their effectiveness and scalability to large-scale problems.
Initial Stages of Work
Based on the research of last three months, I chose the beam model as the first finite element to build the common structure for future work.
My work included two parts, software and hardware. The software part was written in m-files. In these files, the whole procedure was divided into four stages: Data Initialization, Equation Matrix Solution, Discrete Element Position Calculation and Shape Plotting. The most important part within these three is Equation Matrix Solution. At the first time, I tried to use the Jacobi Iterative method to solve the problem. This scheme was widely used on microprocessors because it doesn’t need much hardware resource and have high efficiency. There are also some approaches already done for the FPFA platform. Unfortunately, the result shows that the iteration was really hard to convergent because of its strict conditions. The equations could not get a reasonable result even I use the simplest boundary conditions whose result absolutely can be calculated out. After a number of attempts, the final arithmetic I used was the LDLt scheme, which can be seen as an extension of the LU method. And the LU method was considered the best one to fit the FEM. The software part is the first stage to make sure that the whole design can run correctly on the arithmetic level and get a result of the given problem to do comparison with the hardware generated result.
Discrete Element Position Calculation
Equation Matrix Solution
Position of Discrete Elements
The hardware design was written in VHDL language. It has the same structure with the software one, but was much more difficult and challengeable. From the feature of the different block mentioned in above, the Data Initialization, and Shape Plotting were run on the PC client. Only the procedures concern about large amount of calculations were moved to the FPGA board [Figure ].
Figure Four Blocks of the Hardware
The parallel technique was used in the Discrete Element Position Calculation block. Each sub-block only response for the points within the area given by the Equation Matrix Solution block and defined by the given Finite Element model.
Plan for Next Six Months
This section gives a plan and timetable for next six months. A Gantt chart is represented below showing an approximate schedule [Figure ]. Generally speaking, there are three main stages for the following six months. The first stage is to get familiar with the new hard ware language, Handle-C. The expected duration for complete this stage is about three weeks. The second stage is to find out a highly efficient FPGA-based arithmetic to solve the equation matrix and design a compatible hardware using parallel technique. This stage may take about two months to finish. The final stage is for link the whole system together and make sure it can meet the needs of the practical applications. Apart form the above objects background reading will last throughout the whole period.
Figure Plan for the Following Six Months
 M. C. Smith, J. S. Vetter, and S. R. Alam, "Scientific computing beyond CPUs: FPGA implementations of common scientific Kernels," Oak Ridge National Laboratory, USA 2005.
 T. Kieron, T. Kieron, M. Konstantinos, A. C. George, and A. P. L. Philip Leong, "FPGA Based Acceleration of the Linpack Benchmark: A High Level Code Transformation Approach," in Field Programmable Logic and Applications, 2006. FPL '06. International Conference on, 2006, pp. 1-6.
 K. D. Underwood and K. S. Hemmert, "Closing the gap: CPU and FPGA trends in sustainable floating-point BLAS performance," in Field-Programmable Custom Computing Machines, 2004. FCCM 2004. 12th Annual IEEE Symposium on, 2004, pp. 219-228.
 U. Keith, "FPGAs vs. CPUs: trends in peak floating-point performance," in Proceedings of the 2004 ACM/SIGDA 12th international symposium on Field programmable gate arrays Monterey, California, USA: ACM, 2004.
 N. Shirazi, N. Shirazi, A. Walters, and P. Athanas, "Quantitative analysis of floating point arithmetic on FPGA based custom computing machines," in FPGAs for Custom Computing Machines, 1995. Proceedings. IEEE Symposium on, 1995, pp. 155-162.
 B. Pavle and L. Miriam, "A Library of Parameterized Floating-Point Modules and Their Use," in Proceedings of the Reconfigurable Computing Is Going Mainstream, 12th International Conference on Field-Programmable Logic and Applications: Springer-Verlag, 2002.
 J. Dido, N. Geraudie, L. Loiseau, O. Payeur, Y. Savaria, and D. Poirier, "A flexible floating-point format for optimizing data-paths and operators in FPGA based DSPs," in Proceedings of the 2002 ACM/SIGDA tenth international symposium on Field-programmable gate arrays Monterey, California, USA: ACM, 2002.
 G. Altaf Abdul, L. Wayne, Y. K. C. Peter, S. Nabeel, and H. James, "Automating Customisation of Floating-Point Designs," in Proceedings of the Reconfigurable Computing Is Going Mainstream, 12th International Conference on Field-Programmable Logic and Applications: Springer-Verlag, 2002.
 L. Jian, T. Russell, and M. Oskar, "Floating Point Unit Generation and Evaluation for FPGAs," in Proceedings of the 11th Annual IEEE Symposium on Field-Programmable Custom Computing Machines: IEEE Computer Society, 2003.
 M. P. Leong, M. Y. Yeung, C. K. Yeung, C. W. Fu, P. A. Heng, and P. H. W. Leong, "Automatic Floating to Fixed Point Translation and its Application to Post-Rendering 3D Warping," in Proceedings of the Seventh Annual IEEE Symposium on Field-Programmable Custom Computing Machines: IEEE Computer Society, 1999.
 A. Gaffar, O. Mencer, W. Luk, P. Y. K. Cheung, and N. Shirazi, "Floating-point bitwidth analysis via automatic differentiation," in Field-Programmable Technology, 2002. (FPT). Proceedings. 2002 IEEE International Conference on, 2002, pp. 158-165.
 D. Yong, S. Vassiliadis, G. K. Kuzmanov, and G. N. Gaydadjiev, "64-bit floating-point FPGA matrix multiplication," in Proceedings of the 2005 ACM/SIGDA 13th international symposium on Field-programmable gate arrays Monterey, California, USA: ACM, 2005.
 L. Zhuo and V. K. Prasanna, "Design tradeoffs for BLAS operations on reconfigurable hardware," in Parallel Processing, 2005. ICPP 2005. International Conference on, 2005, pp. 78-86.
 R. M. Gerald and K. P. Viktor, "An FPGA-Based Floating-Point Jacobi Iterative Solver," 2005, p. 420.
 M. Fischborn, P. Kuo-Peng, N. Sadoswki, J. P. A. Bastos, and J. A. T. J. Trevisan, "LU Parallel Preconditioning with Block Intersection Applied to FEM on Computer Clusters," in Electromagnetic Field Computation, 2006 12th Biennial IEEE Conference on, 2006, pp. 404-404.
 K. J. Paar, P. M. Athanas, and C. M. Edwards, "Implementation of a finite difference method on a custom computing platform," in High-Speed Computing, Digital Signal Processing, and Filtering Using reconfigurable Logic, Proc. SPIE 2914, 1996, pp. 44-53.
 E. Motuk, E. Motuk, R. Woods, and S. Bilbao, "Implementation of finite difference schemes for the wave equation on FPGA
Implementation of finite difference schemes for the wave equation on FPGA," in Acoustics, Speech, and Signal Processing, 2005. Proceedings. (ICASSP '05). IEEE International Conference on, 2005, pp. iii/237-iii/240 Vol. 3.