A simple, quick and new design of scaling circuit for the modulo set
Fateme ghassemi ahangaran
Department of Computer Engineering Islamic Azad University
Kerman, Iran
ghassemi.fateme@iauk.ac.ir
ABSTRACT
In this study a new and unique design of the residue number system scaling circuit for the 3moduli set which has a dynamic range over 5n is presented. Nowadays 5n dynamic range has proved to be of good performance. The relations in this study are obtained by Chinese Remainder Theorem which is one of the most important conversion algorithms in residue number system.
Key words: residue number system, dynamic range, scaling, scale.
1. INTRODUCTION
As number system different from the conventional ones, the residue number system has attracted more attention from computer calculations scientists in recent a decade which is a result of its unique features and their applications for special-purpose processors. Residue number system has played an important role in computer calculations for ages. This advantage is revealed with the limitation of the carry propagation [1]. As a result performing the calculations in this system can lead to a decrease in delay, hardware cost and power consumption [2]. In residue number system production of carry is parallel, though as we know in other systems distribution chain of the carry is long and makes the process slow. Consequently in the residue number system adding, operations of addition, subtraction and multiplication are done quickly. This system has widespread uses such as digital signal processing including digital filters, picture processing, designing code detection and correction algorithms (in designing safety systems), and encoding[1],[3]. This system can be used specifically in designing Inner Product Step Processor. The primary and most important procedure for designing this system is selecting a modulo set which is prime two by tow, and multiplying them by each other results a dynamic range that shows the number of visible numbers in this system[1]. For transforming weighted numbers (conventional system) for this system, we should divide the preferred number by each member of the set and the residues are usable for calculating in this system. So far various modulo sets have been offered in residue number system that can be categorized by dynamic range and number of members. Examples of such sets would be:,, . DRs resulting by these modulo sets are not suitable for applications with greater range. Recently modulo sets with greater dynamic range are popular. As mentioned above in this study we tried to present scaling circuit for modulo set which has a dynamic range over 5n bit. At first RNS was paid a lot of attention by computer science but due to some problems and hardware malfunctioning, its use was abolished. Though in recent years, they came back by extended use of special-purpose processors and increase in hardware capability in implying complex circuits. Types of hardware meant here could be direct and reverse conversion, various full adders used in them, modulo multiplexers, sign detection circuits, comparing circuits and scaling circuits. Performance boost in any of these circuits can play a significant role in system performance. The point here in this system is that some designing some of the mentioned hardware circuits is by considering algorithms that are used in reverse conversion, ones like Chinese Remainder Theorem, conversion in co-based algorithm and Chinese Remainder Theorem 1 and 2. Here Chinese Remainder Theorem can be applied quickly and parallel and as a reason we used it in this study for the anticipated circuits. Aside from the advantages to residue number system it has disadvantages too. RNS is costly in sign detection, division, and competing, reverse conversion and scaling and in fact has a lot of problems. The tow latter are more urgent to be solved because they are the gateway to design the circuit for conversing numbers from residue to binary systems. In fact reverse means return from residue system to binary system which is a costly and complicated operation and scaling circuit is used for preventing overlapping after each calculation procedure[4]. The difference we can see between scaling and division is that in practice division makes each number in the system a fixed value and we seek the quotient, while in scaling we look for the residue from the number in conventional system by each module in the modulo sets and want them to be calculable using the resulting numbers from residue number system. It can be claimed that in scaling, division operation is omitted meaning that a series of individuals helps derive another series of individuals. In scaling, system’s input and output are both residue considering the modulo set. So we should design scale circuits as many as set members.
2. PREVIOUS WORK
Designing scaling circuits was first discussed by Szabo & Tanaka in 1967. At that time calculation period was in a cycle as long as n clock that n resembled the number of modulo set members. Their suggested method was not error free though it was considered advancement in RNS [6]. In 1973 Okeefe & Wright designed a circuit for the modulo set which was more effective than Szabo’s but was also not error free as well, though it gave closer answers[7]. In 1978 Jullien managed to present an algorithm that needed fewer clocks but also accompanied some errors[8]. In 1981 Taylor & Huang made a design that despite all previous ones that were based on CRT algorithm, used MRC method in its calculations [9]. A year later these two presented a model that was an idea based on modulo sets and LUTs, but like others needed a long n clock of time [10]. In 1984 Polky & Miller presented a new algorithm that needed (N+1) clock of time and the value obtained from it was closer to real value. In fact it carried fewer errors than previous offers but claimed more time [11]. Five years later Shenoy & Kumarson offered a design for scaling that needed a period of log n that was much more considerable than the preceding offers [12]. Finally in 1993 Ulman wrote an essay based on SZabo’s work and promoted his method to a lot of extent [13]. From then on, a lot of suggestions were made in this field but the point is that there were errors in all algorithms but the value didn’t exceed 1.5 units [5]. At that time scaling circuits were applied bye ROM memory and this made the processes slow and power consumption high. Due to this reason researchers were always looking for a way to solve this. However ROM matrixes can be replaced by multiplexers, it increased the cost unacceptably. We would like to mention that there are two ways to design algorithms; one using LUTs and the other by accessing through full adders and almost in all designs based on memory, the cost increases as the number of modules do, while designs with full adders are faster and claim less area. In 2011 Chang could offer a new perspective in scaling circuits by modular full adders which resulted in speed increase and less power consumption[5]. It is necessary to mention that there are also other scaling circuits that benefit from both technologies of LUT and full adders, but in among these circuits, Chang’s is the fastest and the most efficient. A weakness for this circuit would be that it is a design for 3member modulo set with a dynamic range of 3n bit which does not suffice for today’s computer needs. Chang also surveyed scaling circuits for signed numbers in a paper in 2012.
3. DISCUSSION AND CONCLUSION
As mentioned above there are two ways to do scaling algorithms, one using LUTs [14],[15]and the other access through full adders[16-19]. At first all algorithms used ROM but scaling algorithms their problem was high power consumption. However ROM matrixes can be replaced with modular multiplexers, this increases the cost unacceptably. This cost increase would be much more for greater modulo sets. As pointed before almost all designs based on memory increase the cost as the modules’ number increases, while if those designs are manipulated with full adders it would be faster and would claim less area[15]. In this study we surveyed designing a scaling circuit for the modulo set.
CRT algorithm is defined as below:
(1)
(2)
As we know dynamic range is obtained from multiplying modulo set members, so:
(3)
Here are some relationships that can be used for simplifying existing scaling relationships and designing offered circuits.
(4)
(5)
(6)
(7)
If z is an n bit number then:
(8)
Supposing, if k is an integer then always applies. For proving this rule we could use the below rules:
Considering rules (4) and (5) it can be inferred that:
(9)
(10)
Since and in other words p is dividable by q, then:
If , and For calculating s we act as below:
(12)
(11)
(13)
(14)
The reverse of needed multiplexers is gotten as below:
(16)
(15)
(17)
Here we suppose (the second modulo). So y is:
(18)
Based on rule (10) for i=1, 2, 3 simplifying continues as below:
(19)
(20)
(21)
(21)
At first we derive values that are repeated in all relationships:
(22)
(23)
In the above statement we omit , to analyze this omission we can write:
If and و and also then:
Using rule (4-8) we could reconstitute the above statement as below:
In a way that is a non-negative digit. Without the replacements of, and rule (23) will be like:
If is named, then we can simplify the above relationship as follows:
Since, then we can write:
It can be concluded from the above relationships that:
Now using the above replacements and notes below, we derive formulas of preferred scaling circuits.
Note 1: if a k bite digit (such as v) is multiplied by a 2n and we want the result in modulo , then the answer equals n-bit circular shift to the left of the desired digit.
Note 2: if v is negative then the result equals the complementary of the digit v.
3.1. Calculating
(24)
To calculate value of A using note (1) we have:
Also for B we can write:
Finally is gotten as below:
This shape below illustrates how value is designed.
Figure . Hardware of
3.2. Calculating
To prevent sophistication of calculation that has the below value, we can divide it into the 3 sub-formulas , , .
(25)
Now we can simplify for each part like this:
3.2.1. Calculating
Using rule (4-6), we can simplify as below:
(26)
2.2.2. Calculating
Using the simplification that resulted in (23), for we can have:
(27)
3.2.1.Calculating
With the above rules and relationships we can also simplify as below:
(28)
Now with having the value of, , we can finally calculate.
(29)
Designing is as follows:
Figure . Hardware of
3.3. Calculating
Like the 2 previous parts, to calculate this part we use the mentioned rules and relationships.
(30)
Considering rule (7), for we can have:
(31)
And lastly for we can have:
(32)
We can define to organize the above rule as below.
And at last for we have:
(33)
And the shape resulted from the above rules is like this.
Figure . Hardware of
4. CONCLUSION
As it was explained in the beginning of the article, using the residue number system in computer systems is one of the imminent problems for the lack of efficient hardware necessary. This problem can be tackled with designing scaling circuits to a great extent. Needing to use modulo sets with larger dynamic ranges, we chose modulo set . This design caused the omission of division operation and the transforming reverse unit for calculating the overlap in residue number system. As we know in RNS we are supposed to take the residues resulted from preferred modulo set to the conventional number system with using the reverse transformer and then do de division on them and formerly transfer the digit to residue system and this can be costly and time consuming. While there is no need to do these operations in scaling circuits and we can get the division residues directly from the residues available in RNS system without need to use reverse transformer or division. As follows we prove scaling circuits improving role in residue number system with presenting an example.
Without using scaling:
Here as an example we assume that n=3, then modulo set will be {63,8,65}. If preferred numbers’ residues of this set are{24,5,51}, we should firstly transfer the preferred digit to conventional number system that is performed by Chinese residue algorithm:
Up to this step we have only developed preferred digit in residue system. Now we should divide it by the scale that is 8 here in this example (it is obvious that division operation in computer calculations is costly and time consuming), and again transfer the digit to residue number system.
Now we survey through the above calculations and relations with the use of scaling.
With using scaling:
It is concluded that using scaling can improve residue system’s efficiency in computer calculations to a great extent.
As a final point we survey the results of implementing scaling circuits for the presented modulo set. We would like to mention that designs were firstly transformed to VHDL codes and then using Modelsim software we plotted them in terms of design performance. After making sure of the accuracy of what we did before, they were simulated for various values using software ISE on different technologies (Virtex4 , Virtex5 & Virtex6).
Charts of implementation results in Area lines according to Slice, LUT and Latency are announced.
Table . Result for moduli set with virtex4
|
|
|
Virtex 4
|
Latency (ns)
|
Area (LUT)
|
Area (Slice)
|
|
6.254
|
156
|
94
|
N=4
|
7.790
|
283
|
171
|
N=8
|
9.219
|
551
|
330
|
N=16
|
11.627
|
1086
|
639
|
N=32
|
16.441
|
2158
|
1308
|
N=64
|
Table . Result for moduli set with virtex5
|
|
|
Virtex 5
|
Latency (ns)
|
Area (LUT)
|
Area (Slice)
|
|
5.371
|
110
|
44
|
N=4
|
7.054
|
230
|
84
|
N=8
|
7.628
|
447
|
164
|
N=16
|
8.012
|
820
|
322
|
N=32
|
8.961
|
1534
|
638
|
N=64
|
Table . Result for moduli set with virtex6
|
|
|
Virtex 6
|
Latency (ns)
|
Area (LUT)
|
Area (Slice)
|
|
2.939
|
84
|
44
|
N=4
|
4.966
|
214
|
84
|
N=8
|
5.592
|
420
|
164
|
N=16
|
6.328
|
796
|
322
|
N=32
|
7.641
|
1492
|
638
|
N=64
|
As shown in charts all the three technologies Virtex4 , Virtex5 & Virtex6, are increased as n value increases (calculated values for Area according to Slice , LUT and latency are per nanosecond).
In chart 1 average rate of Area (slice) column compared to Area (LUT) column is 1.6. It means the calculated area according to LUT equals 1.6 times the calculated area according to slice with the same n value.
In chart 2 increase in Area column values according to Slice compared to Area column according to LUT has an average rate of 2.56. Here again with increasing n value by 2 times, the area acquired from both Slice and LUT units has 1.93 times average rate and latency has had a growth 1.13 times average rate.
Values obtained from chart 3 show that the increase of Area column according to Alice compared to LUT is 2.32. Doubling n value, the area calculated from Slice and LUT columns have growth rate of 1.94. This value for Latency is 1.07.
With an overall look at the calculated numbers, it can be stated that in Virtex5 we observed the most average growth rate compared to areas resulting from Slice and LUT. The highest average growth rate belongs to Latency column of Virtaex6.
References
[1] Molahosseini, A. S.,Navi, K,Dadkhah, C., Kavehei, O.,Timarchi, S., (2010)." Efficient Reverse Converter Designs for the new 4-moduli SetsandBased on New CRTs". IEEE Trans. Circuits and Systems-I, 57, 823.
[2] Molahosseini, A. S., (2011). "Improving the Delay of Residue-to-Binary Converter for a Four-Moduli Set". Advances in Electrical and Computer Engineering, 11.
[3] Chang, C. H., Low, J., Y., S. (2011). "Simple,Fast and Exact RNS Scaler for the Three-Moduli Set ". IEEE Trans. Circuits and Systems-I, 58, 2686-2697.
[4] Safari, A., Cong, Y. (2012). "Simple,Fast and Synchronous Hybrid Scaling Scheme for the 8-bit Moduli Set ". Journal of Emerging Trends in Computing and Information Sciences.
[5] Anders Lindström, Michael Nordseth, Lars Bengtsson, Amos Omondi, "Arithmetic Circuits Combining Residue and Signed-Digit Representations," Proceedings of the Eighth Asia-Pacific Computer Systems Architecture Conference, vol. 2823, 2003.
[6]
|
G. A. Jullien, "Residue number scaling and other operations using ROM array," IEEE Transaction Coniputer, vol. 27, pp. 325-336, 1978.
|
[7]
|
F. J. Taylor and C. H. Huang, "A floating point residue arithmetic unit," vol. 311, pp. 33-43, 1981.
|
[8]
|
F. J. Taylor and C. H. Huang, "An autoscale residue multiplier," IEEE Trunsaction Computer, vol. 31, pp. 321-325, 1982.
|
[9]
|
D. D. Miller and J. W. Polky, "An implementation ol the LMS alalgorithms in residue number system," IEEE Trunsaction Circuits System, vol. 31, pp. 452-561, 1984.
|
[10]
|
A. P. Shenoy and R. Kumaresan, "Fast base extension using a redundant modulus in RNS," IEEE Transaction Computer, vol. 38, pp. 292-297, 1989.
|
[11]
|
Z. D. Ulman, M. Czyzak, J. M. Zurada, "Effective RNS scaling algorithm with the Chinese remainder theorem decomposition," IEEE Pacific Rim Confference Communication, Computer, Signal Process, pp. 528-531, 1993.
|
[12]
|
R. Zimmermenn, "Efficient VLSI Implementation of Modulo (2^n-1) Addition and Multiplication," Computer Arithmetic, pp. 158-167, Apr 1999.
|
[13]
|
R. Zimmermann, "Binary Adder Architectures for Cell-Based VLSI and their Synthesis," SWISS FEDERAL INSTITUTE OF TECHNOLOGY, ZURICH, 1997.
|
[14] Shenoy, M. A. P., Kumaresan, R. (1989) “A fast and accurate rns scaling technique for high speed signal processing,” IEEE Trans. Acoust, Speech, Signal Process, 37, 929–937.
[15] Ulman, M. C. Z. D., Zurada, J. M. (1993) “Effective rns scaling algorithm with the chinese remainder theorem decomposition,” IEEE Pacific Rim Conf. Commun., Computer, Signal Process, 528–531.
[16] Griffin, F. T. M., Sousa, M.( , 1988) “New scaling algorithm for the Chinese remainder theorem,” Proc. Conf. Rec. 22nd Asilomar Conf Signals, System., Computer , 375–378.
[17] Kong, Y. Phillips, B. (2009) “Fast scaling in the residue number system,”Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, 17, 443 –447.
[18] Dasygenis, M., Mitroglou, K., Soudris, D., Thanailakis, D.(2008)“A full-adder-based methodology for the design of scaling operation inresidue number system,” Circuits and Systems I: Regular Papers, IEEE Transactions on. 55, 546 –558.
[19]Benardson, P. (1985) “Fast memoryless, over 64 bits, residue-to binary convertor,” Circuits and Systems, IEEE Transactions on, 32, 298 – 300.
Share with your friends: |