Reconfigurable computing

Three Month Report

Supervised by
Dr. Steven F Quigley
Prof. Andrew H C Chan

Lin Zhang

Dec 2007

Index
1.Introduction 4
2.Literature Review 4
3.Project Aims 5
4.Initial Stages of Work 6
5.Plan for Next Six Months 7
Reference 8
Appendix 10
A.1. Personal Web Page 10
A.2. Risk Declaration 10
Introduction
The discrete element finite element method (DEFEM) is a promising approach for creating virtual reality and gaming environments that exhibit highly realistic physical behaviour. However, the DEFEM equations are computationally expensive, and cannot be solved in real time on desktop PCs and workstations for complex virtual environments.
Many numerical problems can be greatly speeded up by using hardware accelerators, specialized integrated circuits that exploit high levels of parallelism to give rapid solution.
Generally speaking, the goals of the first three months are
1. Get familiar with the fundamental theories about the Finite Element Methods and its typical application;
2. Gain a comprehensive perspective throughout FPGA based matrix equation solution arithmetic;
3. Design software simulation and hardware solution using parallel technique for simple finite element.
Literature Review
The traditional field for field programmable gate arrays (FPGAs) was the embedded systems where integer and fixed point arithmetic are prevalent[1]. Benefit from its resource densities incensement, FPGAs become capable to dealing with efficiently implementing large scale scientific applications involving floating point computations[2]. FPGAs have already shown its power in a large number of integer and fixedpoint applications which have much greater performance then microprocessors[3].
According to Moore’s law, FPGA devices may have three to eight more peak floatingpoint performance than comparable microprocessors by 2009[4]. It seems that moving floatingpoint arithmetic from microprocessors to FPGAs will have more potential to speed up. Actually, resent extensive researches based on FPGAs arithmetic mainly have two different solutions to deal with floatingpoint. Some of them choose to use the custom floatingpoint formats in FPGAs[59]. The other way is to translating floatingpoint to fixed points[10] or to optimizing the bit widths of floatingpoint formats as an alternative[3, 11].
The first thing stage to achieve on FPGAs is the common Basic Linear Algebra Subroutine (BLAS), typically including vector dot product (DDOT), matrixvector multiply (DGEMV), and matrix multiply(DGEMM)[12]. Several approach[3, 13] shows that FPGAs and Reconfigurable Computing platforms have the ability to perform better then the microprocessor recently, and will be better with the rapid increase of the resource densities. They raised that using more floatingpoint units in a design on FPGA will leads to smaller latency, but requires more area and higher memory bandwidth[13]. Besides, unlike CPUs, FPGAs are often limited by peak FLOPs rather than by memory bandwidth[3].
There are lots of arithmetic to solve equation matrix, such as Jacobi Iteration, GaussSeidel Iteration, LU decomposition and so on. Unfortunately, not all the methods which can efficiently used on microprocessors can directly move to FPGAs. This is because that the root for the high speed performance on FPGAs is parallel process. Those loopbased or iterationbased methods have to make some changes to meet the features of FPGA. Several efficient improvements have been done on different equation solving arithmetic[14, 15].
There are also some FPGAbased applications about FEM using s on different fields[16, 17]. These efforts gave the common steps to dealing with FEM applications to follow.
Project Aims
The aim of this project is to investigate the issues involved in creating hardware accelerators for realtime solution of the DEFEM equation on low cost plugin boards for desktop PCs. This will involve investigation of a variety of formulations of the DEFEM in order to identify which can be best accelerated by low cost hardware with a clean partition between hardware and software. It will also involve design of the hardware accelerators and evaluation of their effectiveness and scalability to largescale problems.
Initial Stages of Work
Based on the research of last three months, I chose the beam model as the first finite element to build the common structure for future work.
My work included two parts, software and hardware. The software part was written in mfiles. In these files, the whole procedure was divided into four stages: Data Initialization, Equation Matrix Solution, Discrete Element Position Calculation and Shape Plotting. The most important part within these three is Equation Matrix Solution. At the first time, I tried to use the Jacobi Iterative method to solve the problem. This scheme was widely used on microprocessors because it doesn’t need much hardware resource and have high efficiency. There are also some approaches already done for the FPFA platform[14]. Unfortunately, the result shows that the iteration was really hard to convergent because of its strict conditions. The equations could not get a reasonable result even I use the simplest boundary conditions whose result absolutely can be calculated out. After a number of attempts, the final arithmetic I used was the LDLt scheme, which can be seen as an extension of the LU method. And the LU method was considered the best one to fit the FEM. The software part is the first stage to make sure that the whole design can run correctly on the arithmetic level and get a result of the given problem to do comparison with the hardware generated result.
Discrete Element Position Calculation
Equation Matrix Solution
FPGA
PC
Position of Discrete Elements
Initialized Date
The hardware design was written in VHDL language. It has the same structure with the software one, but was much more difficult and challengeable. From the feature of the different block mentioned in above, the Data Initialization, and Shape Plotting were run on the PC client. Only the procedures concern about large amount of calculations were moved to the FPGA board [Figure ].
Figure Four Blocks of the Hardware
The parallel technique was used in the Discrete Element Position Calculation block. Each subblock only response for the points within the area given by the Equation Matrix Solution block and defined by the given Finite Element model.
Plan for Next Six Months
This section gives a plan and timetable for next six months. A Gantt chart is represented below showing an approximate schedule [Figure ]. Generally speaking, there are three main stages for the following six months. The first stage is to get familiar with the new hard ware language, HandleC. The expected duration for complete this stage is about three weeks. The second stage is to find out a highly efficient FPGAbased arithmetic to solve the equation matrix and design a compatible hardware using parallel technique. This stage may take about two months to finish. The final stage is for link the whole system together and make sure it can meet the needs of the practical applications. Apart form the above objects background reading will last throughout the whole period.
Figure Plan for the Following Six Months
Reference
[1] M. C. Smith, J. S. Vetter, and S. R. Alam, "Scientific computing beyond CPUs: FPGA implementations of common scientific Kernels," Oak Ridge National Laboratory, USA 2005.
[2] T. Kieron, T. Kieron, M. Konstantinos, A. C. George, and A. P. L. Philip Leong, "FPGA Based Acceleration of the Linpack Benchmark: A High Level Code Transformation Approach," in Field Programmable Logic and Applications, 2006. FPL '06. International Conference on, 2006, pp. 16.
[3] K. D. Underwood and K. S. Hemmert, "Closing the gap: CPU and FPGA trends in sustainable floatingpoint BLAS performance," in FieldProgrammable Custom Computing Machines, 2004. FCCM 2004. 12th Annual IEEE Symposium on, 2004, pp. 219228.
[4] U. Keith, "FPGAs vs. CPUs: trends in peak floatingpoint performance," in Proceedings of the 2004 ACM/SIGDA 12th international symposium on Field programmable gate arrays Monterey, California, USA: ACM, 2004.
[5] N. Shirazi, N. Shirazi, A. Walters, and P. Athanas, "Quantitative analysis of floating point arithmetic on FPGA based custom computing machines," in FPGAs for Custom Computing Machines, 1995. Proceedings. IEEE Symposium on, 1995, pp. 155162.
[6] B. Pavle and L. Miriam, "A Library of Parameterized FloatingPoint Modules and Their Use," in Proceedings of the Reconfigurable Computing Is Going Mainstream, 12th International Conference on FieldProgrammable Logic and Applications: SpringerVerlag, 2002.
[7] J. Dido, N. Geraudie, L. Loiseau, O. Payeur, Y. Savaria, and D. Poirier, "A flexible floatingpoint format for optimizing datapaths and operators in FPGA based DSPs," in Proceedings of the 2002 ACM/SIGDA tenth international symposium on Fieldprogrammable gate arrays Monterey, California, USA: ACM, 2002.
[8] G. Altaf Abdul, L. Wayne, Y. K. C. Peter, S. Nabeel, and H. James, "Automating Customisation of FloatingPoint Designs," in Proceedings of the Reconfigurable Computing Is Going Mainstream, 12th International Conference on FieldProgrammable Logic and Applications: SpringerVerlag, 2002.
[9] L. Jian, T. Russell, and M. Oskar, "Floating Point Unit Generation and Evaluation for FPGAs," in Proceedings of the 11th Annual IEEE Symposium on FieldProgrammable Custom Computing Machines: IEEE Computer Society, 2003.
[10] M. P. Leong, M. Y. Yeung, C. K. Yeung, C. W. Fu, P. A. Heng, and P. H. W. Leong, "Automatic Floating to Fixed Point Translation and its Application to PostRendering 3D Warping," in Proceedings of the Seventh Annual IEEE Symposium on FieldProgrammable Custom Computing Machines: IEEE Computer Society, 1999.
[11] A. Gaffar, O. Mencer, W. Luk, P. Y. K. Cheung, and N. Shirazi, "Floatingpoint bitwidth analysis via automatic differentiation," in FieldProgrammable Technology, 2002. (FPT). Proceedings. 2002 IEEE International Conference on, 2002, pp. 158165.
[12] D. Yong, S. Vassiliadis, G. K. Kuzmanov, and G. N. Gaydadjiev, "64bit floatingpoint FPGA matrix multiplication," in Proceedings of the 2005 ACM/SIGDA 13th international symposium on Fieldprogrammable gate arrays Monterey, California, USA: ACM, 2005.
[13] L. Zhuo and V. K. Prasanna, "Design tradeoffs for BLAS operations on reconfigurable hardware," in Parallel Processing, 2005. ICPP 2005. International Conference on, 2005, pp. 7886.
[14] R. M. Gerald and K. P. Viktor, "An FPGABased FloatingPoint Jacobi Iterative Solver," 2005, p. 420.
[15] M. Fischborn, P. KuoPeng, N. Sadoswki, J. P. A. Bastos, and J. A. T. J. Trevisan, "LU Parallel Preconditioning with Block Intersection Applied to FEM on Computer Clusters," in Electromagnetic Field Computation, 2006 12th Biennial IEEE Conference on, 2006, pp. 404404.
[16] K. J. Paar, P. M. Athanas, and C. M. Edwards, "Implementation of a finite difference method on a custom computing platform," in HighSpeed Computing, Digital Signal Processing, and Filtering Using reconfigurable Logic, Proc. SPIE 2914, 1996, pp. 4453.
[17] E. Motuk, E. Motuk, R. Woods, and S. Bilbao, "Implementation of finite difference schemes for the wave equation on FPGA
Implementation of finite difference schemes for the wave equation on FPGA," in Acoustics, Speech, and Signal Processing, 2005. Proceedings. (ICASSP '05). IEEE International Conference on, 2005, pp. iii/237iii/240 Vol. 3.
Appendix
URL: http:\\postgrad.eee.bham.ac.uk\zhangl
Screen shoot
A.2. Risk Declaration
