COMPUTER TECHNOLOGY INSTITUTE 2002
________________________________________________________________________________
TECHNICAL REPORT No. 20020103
“Handling zero in diminishedone modulo 2^{n}+1 adders”
C. Efstathiou, H. T. Vergos & D. Nikolos
24^{th} of January, 2002
Abstract
Zero treatment in diminishedone modulo addition has traditionally been performed separately, leading to slow and area consuming implementations. To overcome this, based on an earlier used enhanced number representation, we introduce novel carry look ahead and parallelprefix architectures for diminishedone modulo adders that can also handle operands equal to 0. Translators for the new representation are also given.

Introduction
Modulo arithmetic appears to play an important role in many applications. The Residue Number System (RNS) [Sonderstrand et al. 1986, Bayoumi et al. 1987, Elleithy and Bayoumi 1992, Koren 1993] is a first application field. A set of moduli, suppose , that are pairwise relative prime is used to define a RNS. Any integer , with , where has a unique representation in the RNS, given by the tuple of residues , where , if and ,_{ }if . A RNS operation, suppose , is defined as , where . In most RNS applications is either addition, subtraction or multiplication. Since the computation of only depends upon , and , each is computed in parallel in a separate arithmetic unit, often called channel. Moduli choices of the form [Jenkins and Leon 1977] have received significant attention because they offer very efficient circuits in the area x time^{2} product sense [Paliouras and Stouraitis 1999]. Addition in such systems is performed using three channels, that in fact are a modulo (equivalently one's complement), a modulo and a modulo adder [Sonderstrand et al. 1986, Koren 1993].
Modulo adders are also utilized as the last stage adder of modulo multipliers. Modulo multipliers find applicability in pseudorandom number generation [Lehmer 1951], cryptography [Zimmermann et al. 1994, Curiger 1993, Lai and Massey 1990] and in the Fermat number transform, which is an effective way to compute convolutions [Ma 1998].
The addition delay in a RNS application which uses the moduli, is dictated by the modulo channel, since this requires bit wide operands. For overcoming the problem of dealing with bit wide circuits for the modulo 2^{n}+1 channel, the diminishedone number system [Leibowitz 1976] has been adopted in several residue number system implementations, for example [Jenkins 1979, 1982]. In this system each number is represented by , with . Therefore the adoption of the diminishedone number system leads to modulo adders and multipliers of bits wide operands [Zimmermann 1997, 1999, Efstathiou et al. 2001, Vergos et al. 2001, 2002]. Implementations of diminishedone adders that are as fast as the modulo and modulo ones have been reported in [Vergos et al. 2001, 2002]. However, a new problem arises; the special treatment required for operands equal to 0. The previous proposals on diminishedone adders [Zimmermann 1997, 1999, Efstathiou et al. 2001, Vergos et al. 2001, 2002] totally ignore this subject with the exception of [Agrawal and Rao 1978]. To overcome the problem of zero operands, Agrawal and Rao [1978] proposed to adopt in their representation a zero indication bit and showed how carry look ahead (CLA) adders that can handle zero operands can be implemented. CLA adders that require one or two cycles have been proposed in [Agrawal and Rao 1978]. However, the single cycle CLA adders proposed in [Agrawal and Rao 1978] do not handle correctly neither the modulo carries nor the zero indication bit of the result, as shown in the Appendix.
In this paper, we adopt a similar to [Agrawal and Rao 1978] number representation. We present three architectures for modulo addition, that can also handle zero operands. Our CLA architecture requires a single addition cycle and has been derived by engaging the output carry equation into the CLA unit. Our parallelprefix adders with a fast carry increment stage have the same logical depth and hence speed as well as the same area complexity with the diminishedone adders with a carry increment stage proposed in [Zimmermann 1997, 1999] that can not handle zero operands. To further decrease the number of required prefix levels, we propose a totally parallelprefix architecture that cancels the need for a separate carry increment stage by performing carry recirculation at each prefix level [Kalamboukas et al. 2000, Vergos et al. 2001, 2002]. Our totally parallelprefix adders offer the same logical depth and complexity as the fastest modulo [Kogge and Stone 1973, Ladner and Fischer 1980], modulo [Kalamboukas et al. 2000] and diminishedone adders [Vergos et al. 2001, 2002], with the extra advantage of handling zero operands. Converters between the adopted representation and the modulo binary number representation are finally given.

Problem definition
Suppose that a number , with , is represented by bits, as , where is the zero indication bit and is the diminishedone representation of , that is :
and
Obviously, (the notation is used for logical negation). A similar representation was adopted in [Agrawal and Rao 1978]. If we decide to ignore the representation of 0 and concentrate only on the part, then efficient bits wide adders can be constructed as proposed in [Zimmermann 1997, 1999, Efstathiou et al. 2001, Vergos et al. 2001, 2002]. However, the need to treat 0 distinctly will result in a circuit as the one shown in figure 1.
Figure 1. The architecture of an adder that treats 0 as a special case.
At the output of the adder a row of multiplexers is added in order to select the correct output. When none of the operands is zero (both and are 0) the output of the adder is propagated. When one of the two operands is zero, the other operand is allowed to propagate. When both operands are 0, the all 0s word is propagated. Logic also needs to be added for computing the zero indication bit , of the result.
Obviously, the adders that follow the architecture of figure 1 are not the best choice in RNS applications, since the delay of the multiplexers makes the delay of the modulo channel greater than channels of the form and . The required implementation area is also increased substantially when compared against the other channels.

The new proposed modulo 2^{n}+1 adders
In this section we propose CLA and parallelprefix adders with embedded treatment of zero operands. We first analyse the logic of and , for the adopted number representation.
Relation , with , implies that or that and , or equivalently that and . Therefore, the zero indication bit, , of the result should be one, when either both operands are 0, or when both operands are nonzero but their diminishedone parts are complementary. These two cases are expressed by the following relation for :
(1)
where and , with and the notations are used for the logical AND, OR and exclusiveOR operations respectively. The resulting implementation is presented in figure 2. Note that the part of figure 2 which produces is not required in the case of diminishedone adders whose carry propagate signals are defined as the exclusiveOR of corresponding operands’ bits. Such adders are well known as exclusiveOR adders. On the other hand, the logic producing is required in inclusiveOR adders, that is, in adders whose carry propagate signals are defined as the inclusiveOR of corresponding operands’ bits.
Figure 2. The implementation of .
For the part of the result we can observe that :

If and are both non zero, then according to [Zimmermann 1997, 1999] :
,
or equivalently, in this case can be computed by adding the complement of the carry output () of the modulo addition of and back to the sum. Since in this case , we can instead of , add .

If one or both operands are zero, and is equal to the modulo sum ofand . That is, in this case can also be computed by a modulo adder with a carry input of .
According to the above analysis, the computation of the diminishedone part of the result, , in all cases, can be carried out by a modulo adder whose carry input is connected to a circuit implementing the expression . However, such a direct connection will lead to an oscillating and thus slow circuit. In the following we propose CLA and parallel prefix architectures that do not suffer from oscillations.
3.1. CLA adders
One straightforward way for avoiding oscillations, is to compute two possible sums in two distinct modulo adders and then use function F for choosing between the two sums. However, since this arrangement requires two sets of nbit adders and n multiplexers, a better solution is to design dedicated modular CLA architectures. This can be achieved by engaging the equation of into the CLA unit and simplifying the resulting equations [Agrawal and Rao 1978, Efstathiou et al. 1994, Vergos et al. 2002].
Let and denote the carry generate and propagate terms respectively; that is, and . The carries , with ( is the input carry, while the output carry) in an bit CLA adder are computed by unfolding the recursive equation and in parallel implementing the resulting equations. For the inverted carry at each bit position , we have :
(2)
where . The sum bits are given by , for .
For deriving a CLA architecture for a modulo diminishedone adder that can handle zero operands we engage the equation of in the equations of the carry computation unit of a modulo adder. This procedure will provide us with the equations of the carries for the modulo addition, hereafter denoted by , with . Using (2) recursively, for the least significant carry, we then get :
(3)
Using (3) and the recursive equation we can then derive the rest of the carries’ equations. For and by substituting by (3), we get:
(4)
Given that and that , (4) becomes:
We then can use for computing , _{} for computing _{} and so on, up to:
The above equations sum up to the following general formula, which defines our proposed onelevel CLA diminishedone modulo architecture:
where, (5)
and denotes the modulo of .
Agrawal and Rao [1978] have also proposed single and two cycle CLA adders that can handle zero operands. For their zero indication bit of operand X, , they adopt the complementary logic of the one adopted in this manuscript, that is, . The suggested in [Agrawal and Rao 1978] single cycle CLA adders, do not handle correctly neither the modulo carries nor the zero indication bit of the result, as shown in the Appendix. For example, the zero indication bit is evaluated by (equation 8b of [Agrawal and Rao 1978]) :
which obviously produces the wrong value when the addition operands a and b are complementary.
3.2.ParallelPrefix Adders with carry increment stage
To solve the problem of oscillations in parallelprefix adder architectures, a first solution is to use prefix architectures with fast carry processing as proposed in [Abraham and Gajski 1980] and then utilize the theory developed in [Zimmermann 1997, 1999], for reentering the expression as the carry input. The resulting architecture is outlined in figure 3. Figure 4 presents the gate level implementations of the operators used in figure 3.
Figure 3. The architecture of a parallelprefix adder with a carry increment stage.
Figure 4. Gate level implementations of the operators used in figure 3.
The prefix computation unit can be designed using any of the proposed prefix algorithms [Kogge and Stone 1973, Ladner and Fischer 1980, Brent and Kung 1982, Knowles 2001, BeaumontSmith and Lim 2001]. The notations and are used in figure 3 to denote group generate and group propagate signals of bits . That means that and are respectively the first and the second members of the group relation (assuming that carry input ):
and the operator is defined according to [Brent and Kung 1982] as . In an integer adder, obviously the carry at position is .
Comparing the architecture of figure 3, against the parallelprefix architecture with a carryincrement stage proposed in [Zimmermann 1997, 1999], for diminishedone modulo addition it is obvious that both architectures have similar area and time complexities. However, the architecture of figure 3 in parallel offers treatment of zero operands.
The number of prefix levels required by the architecture of figure 3 is by one more than those of the fastest modulo [Kogge and Stone 1973, Ladner and Fischer 1980], modulo [Kalamboukas 2000] and modulo diminishedone adder architectures [Vergos et al. 2001, 2002], because the latter do not require a carry increment stage. Therefore in an RNS application that uses the moduli set, the architecture of figure 3 limits the maximum execution speed of the system.
3.3.Totally ParallelPrefix Adders
Instead of having a dedicated single stage for the reentering carry, in [Kalamboukas et al. 2000, Vergos et al. 2001, 2002] it has been proposed to perform carry recirculation at each existing prefix level. In this way the need for an extra carry increment stage is cancelled and dedicated totally parallelprefix adder architectures result with one less prefix level.
In the case that the reentering carry is given by the expression , allowing carry recirculation at each existing prefix level, we can get that the carries of the modulo addition are equal to , where is computed by the prefix equations :
(6),
where :
In several cases, the equations indicated by (6) require more than prefix levels for their implementation. As shown in [Vergos et al. 2001, 2002] these equations can be transformed into equivalent ones that can all be implemented within prefix levels. For the sake of completeness, we present the main idea of the theory in [Vergos et al. 2001, 2002] below.
If and , then since and , we get . This implies that a carry whose equation is of the form of can be equivalently computed by an equation of the form . As shown in [Vergos et al. 2002] for areatime efficient parallelprefix modulo adder implementations the above transformation should be applied times recursively to the equations of the form produced by (6), until:
Example. For the implementation of a modulo 17 adder, that can also handle zero operands, from (6) we get the following set of prefix equations :
, where
We then transform the above into the following equations that can all be implemented within a prefix tree with a logical depth of 2 :
.
Figure 5 presents the attained implementation and the newly introduced prefix operators.
Figure 5. Proposed parallelprefix adder with minimal logical depth. ڤ
The number of prefix levels and therefore the delay of the proposed totally parallelprefix adders is equal to that of the fastest reported modulo [Kogge and Stone 1973, Ladner and Fischer 1980], [Kalamboukas et al. 2000] and diminishedone [Vergos et al. 2001, 2002] adders. This makes them highly applicable in RNS applications. Moreover, their hardware (gates and routing) complexity is analogous to the adders reported in [Kalamboukas et al. 2000, Vergos et al. 2001, 2002].

Translator Circuits
4.1 Translator from modulo 2^{n}+1 to the proposed representation
Let be a binary number with and the targeted representation. The zero indication bit can be computed by :
, (7)
while , or equivalently .
The last relation reveals that can be computed by a modulo adder, that accepts as inputs the all 1s operand and the n least significant bits of operand and as carry input the signal. Assuming an inclusiveOR implementation of the adder, we have that and . Therefore, utilizing (7) we get that the carry at each position is given by :
The latter relation reveals that the adder required for implementing a translator from the binary system to the adopted representation is composed by an exclusiveNOR gate per bit and of a carry computation unit easily implemented as trees of NOR gates.
4.2 Translator from the adopted representation to binary
Using the above introduced notation, in this case we have that . The latter relation reveals that translation to binary can be performed by a simple conditional incrementer.

Conclusions
Several architectures had been recently proposed for diminishedone modulo addition. None of these has dealt with the problem of handling zero operands. All of them propose to treat zero operands separately. Unfortunately, such a treatment leads to slow and area consuming implementations.
In this paper, by utilizing a number representation similar to that of [Agrawal and Rao 1978], we proposed CLA and parallelprefix adders able to also handle zero operands. Our CLA adders perform addition in a single cycle and were derived by engaging the equation of the reentering carry into the carry computation equations. The herein proposed parallelprefix adders with a carry increment stage offer the same logical depth and hence speed as well as the same area complexity as the diminishedone adders with a carry increment stage proposed in [Zimmermann 1997, 1999], with the extra advantage of handling zero operands. The proposed totally parallel prefix adders perform carry recirculation at each prefix level and therefore do not need a separate carry increment stage; their number of prefix levels and therefore their execution speed is the same as the fastest modulo , modulo and modulo diminishedone adders. Translators between the adopted number representation and the modulo binary system were finally presented.

Appendix
The adders presented by Agrawal and Rao [1978], were derived based on the following equations (The numbering of the equations is the same with that of the original manuscript) :
(7a)
(7b)
(7c)
(7d)
For avoiding the oscillations resulting by a direct connection of a carry output back to the input, Agrawal and Rao [1978] first propose two cycle CLA adders. In these adders, the zero indication bit of the result is given by the relation :
(6)
where is the registered carry output of an addition with zero carry input, that is :
and the output carry of an addition with a carry input equal to , that is :
, or equivalently that : .
The single cycle CLA adders proposed by Agrawal and Rao [1978] are based on the following relations :
(8a)
(8b)
In the following we show that relations (8a) and (8b) are incorrect, by deriving the correct ones. For the part of (6), we can get that :
Using the latter relation, (6) yields :
The above relation computes in a single cycle and is equivalent to (1) of the current manuscript, when the complementary logic is used for the zero indication bit. Obviously, the above relation is not equivalent to (8b).
According to the work of Agrawal and Rao, the carries of the single cycle CLA adder are given by relation (7c) when the input carry is equal to . We can therefore get that the carries of the single cycle CLA adder are given by :
Since
,
we get that :
The latter relation computes in a single cycle and is equivalent to (5) of the current manuscript, when the complementary logic is used for the zero indication bit. Obviously, the above relation is not equivalent to (8a).

References
Abraham J. A. and Gajski D. D., 1980. Easily testable highspeed realization of registertransferlevel operations. Proceedings of the 10^{th} Fault – Tolerant Computing Symposium (FTCS10), 339 – 344.
Agrawal D. P. and Rao T.R.N., 1978. Modulo (2^{n}+1) arithmetic logic. Electronic circuits and systems, Vol. 2, No. 6, 186188.
Bayoumi M. A., Jullien G.A., and Miller W.C., 1987. A lookup table VLSI design methodology for RNS structures used in DSP applications. IEEE Transactions on Circuits and Systems, vol. CAS34, 604616.
BeaumontSmith A., and Lim C. C., 2001. Parallel prefix adder design. Proceedings of the 15^{th} IEEE Symposium on Computer Arithmetic, Los Alamitos, CA, USA, 218 – 225.
Brent R. P. and Kung H. T., 1982. A regular layout for parallel adders. IEEE Transactions on Computers, Vol. C31 (3), 260–264.
Curiger A., 1993. VLSI architectures for computations in finite rings and fields. PhD thesis, Swiss Federal Institute of Technology.
Efstathiou C., Nikolos D., and Kalamatianos J., 1994. Area–time efficient modulo 2^{n}1 adder design. IEEE Transactions on Circuits and Systems – II, Vol. 41, no. 7, 463 – 467.
Efstathiou C., Vergos H. T. and Nikolos D., 2001. On the design of modulo 2^{n}±1 adders. Proceedings of the 8^{th} IEEE International Conference on Electronics, Circuits & Systems, (ICECS 2001), Malta, Vol. I, 517  520.
Elleithy K. M., and Bayoumi M. A., 1992. Fast and flexible architectures for RNS arithmetic decoding. IEEE Transactions on Circuits and Systems–II : Analog and Digital Signal Processing, vol. CAS39, 226 – 235.
Jenkins W. K., and Leon B. J., 1977. The use of residue number systems in the design of finite impulse response digital filters. IEEE Transactions on Circuits and Systems, vol. CSA24, no. 4, 191201.
Jenkins W. K., 1979. Recent advance in residue number techniques for recursive digital filtering, IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. ASSP27, 19 – 30.
Jenkins W. K., 1982. The design of specialized residue classes for efficient recursive digital filter realization. IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. ASSP30, 370380.
Kalamboukas L., Nikolos D., Efstathiou C., Vergos H. T. and Kalamatianos J., 2000. Highspeed parallelprefix modulo 2^{n}1 adders. IEEE Transactions on Computers, Special Issue on Computer Arithmetic, Vol. 49, No.7, 673  680.
Knowles S., 2001. A family of adders. Proceedings of the 15^{th} IEEE Symposium on Computer Arithmetic, Los Alamitos, CA, USA, 277  284.
Kogge P. M., and Stone H. S., 1973. A parallel algorithm for the efficient solution of a general class of recurrence equations. IEEE Transactions on Computers, Vol. 22 (8), 783–791.
Koren I., 1993. Computer Arithmetic Algorithms (Prentice–Hall).
Lai X., and Massey J. L., 1990. A proposal for a new block encryption standard. Presented at EUROCRYPT '90.
Ladner R. E., and Fischer M. J., 1980. Parallel prefix computation. Journal of the ACM, Vol. 27 (4), 831–838, October.
Lehmer D. H., 1951, Proceedings of the 2^{nd} symposium on large–scale digital calculating machinery. (Cambridge, MA : Harvard University Press), 141 – 146.
Leibowitz L. M., 1976. A simplified binary arithmetic for the Fermat number transform. IEEE Transactions on Acoustics Speech and Signal Processing, Vol. ASSP24, 356359.
Ma Y., 1998. A simplified architecture for modulo (2^{n}+1) multiplication. IEEE Transactions on Computers, Vol. 47, No. 3, 333337.
Paliouras V., and Stouraitis T., 1999. Novel high–radix residue number system multipliers and adders. Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS '99), Orlando, FL, USA, 451 – 454.
Sonderstrand M. A., Jenkins W. K., Jullien G. A., and Taylor F. J., 1986, Residue Number System Arithmetic : Modern Applications in Digital Signal Processing (New York, USA : IEEE Press).
Vergos H. T., Efstathiou C., and Nikolos D., 2001. High speed parallelprefix modulo 2^{n}+1 adders for diminishedone operands. Proceedings of the 15^{th} IEEE Symposium on Computer Arithmetic (ARITH15), Vail, Colorado, pp. 211  217.
Vergos H. T., Efstathiou C. and Nikolos D., 2002. Diminishedone modulo 2^{n}+1 adder design, submitted to the IEEE Transactions on Computers.
Zimmermann R., Curiger A., Bonnenberg H., Kaeslin H., Felber N., and Fichtner W., 1994. A 177 Mb/s VLSI implementation of the international data encryption algorithm. IEEE Journal of Solid  State Circuits, vol. 29 (3), 303 – 307.
Zimmermann R., 1999. Efficient VLSI implementation of modulo (2^{n}1) addition and multiplication. Proceedings of the 14^{th} IEEE Symposium on Computer Arithmetic (ARITH14), Adelaide, Australia, 158–167.
Zimmermann R., 1997. Binary adder architectures for cellbased VLSI and their synthesis, PhD Thesis, Swiss Federal Institute of Technology.
_______________________________________________________________________________________
TECHNICAL REPORT No. 20020103
