Title: Flip-Flop Architectures Tolerant to Multiple-bit Upsets from Cosmic Rays in Autonomic Hardware
Principal researchers: Tapan J. Chakraborty (Bell Labs.), Michael L. Bushnell (Rutgers)
Current collaborators: Wen Yueh (CAC Student)
Status: Ongoing
Summary:
There is an ongoing problem with hardware transient logic errors caused by cosmic ray hits and by α-particle hits. As the solar wind from the Sun hits the Earth’s magnetic field and atmosphere, the proton stream undergoes 4 to 5 levels of reaction, and is converted into a neutron flux. These neutrons then hit transistors in VLSI circuits, and cause a logic upset if their effective charge transfer is greater than the critical charge (Qcrit). Unfortunately, the only way to screen out the neutron flux is with 5 to 6 feet of cement packaging. The α-particle flux comes from radioactive decay of trace elements in the chip package. As Moore’s law scaling took chip features from 90 nm to 45 nm (this year), Qcrit decreased by a factor of 4. Baumann has projected that the transient logic errors due to this bombardment in 45 nm technology will be greater than the logic errors in unprotected static RAM. RAM had to be protected with Error Correcting Codes starting in the 1970s, and now logic must be protected as well.
This project is concerned with autonomic hardware mechanisms that detect these transient faults and automatically correct the resulting single event upset (SEU) hardware errors, with no software intervention. So far we have focused on correcting errors in the flip-flops of the hardware, because these have a much more direct effect on the system failure rate than transient logic errors in combinational logic. We have two patented crosstalk-tolerant flip-flops: XSEUFF 1 and XSEUFF 2. The XSEUFFs are scan flip-flops, meaning that they have a master latch, a slave latch, and a second master and second slave latch used only in the hardware testing modes. We reuse the two extra latches in functional mode by loading the hardware input from the D line into 3 latches, and then voting on the contents of the 3 latches to find the correct flip-flop output, even in the presence of latch corruption by cosmic rays. XSEUFF 2, instead, samples the D line at 3 distinctly different times around the rising clock edge of the flip-flop, and then votes on the result as well. Another digital system problem is inductive and capacitive crosstalk, which is caused by magnetic and electrical coupling of digital wires on chips and also leads to transient logic errors. The XSEUFF 1 flip-flop also corrects for crosstalk delays. The hardware overhead is 54% for XSEUFF 1 and 37% for XSEUFF 2, and the delay overhead is 0% for XSEUFF 1 and 25% for XSEUFF 2.
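The two voting schemes described above can be illustrated with a small behavioral sketch. This is not the patented XSEUFF circuitry; it only models 2-of-3 majority voting, once over three latch copies and once over three time-separated samples of the D line. The function names, the glitch window, and the sample spacing are all hypothetical choices made for illustration.

```python
from typing import Callable, Optional


def majority(a: int, b: int, c: int) -> int:
    """Return the value held by at least two of three 1-bit inputs."""
    return (a & b) | (b & c) | (a & c)


def vote_three_latches(d: int, upset_latch: Optional[int] = None) -> int:
    """Load d into three latches, optionally flip one of them, then vote."""
    latches = [d, d, d]
    if upset_latch is not None:
        latches[upset_latch] ^= 1  # model a single event upset in one latch
    return majority(*latches)


def vote_three_samples(d_line: Callable[[float], int],
                       t_edge: float, spacing: float) -> int:
    """Sample the D line at three times around the clock edge, then vote."""
    samples = [d_line(t_edge - spacing), d_line(t_edge), d_line(t_edge + spacing)]
    return majority(*samples)


def glitchy_d(t: float) -> int:
    """Hypothetical D line: steady 1 with a transient dip to 0 on [9.9, 10.1)."""
    return 0 if 9.9 <= t < 10.1 else 1


# A single corrupted latch is outvoted by the two intact copies.
print(vote_three_latches(1, upset_latch=2))
# A transient shorter than the sample spacing corrupts only one sample.
print(vote_three_samples(glitchy_d, t_edge=10.0, spacing=0.5))
```

In this model the time-sampled vote only helps when the transient is shorter than the spacing between samples, which is why the spacing is an explicit parameter here.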
At present, we have found that multiple event upsets (MEUs), where a cosmic ray corrupts multiple latches at the same time, are becoming a problem, so we are investigating new solutions that will also scale up to multi-core chips with 64 or more processors. In addition, we are investigating a new type of hardware checker that also detects permanent faults in combinational logic. On a full adder, this checker had 125% hardware overhead, compared to 140% for the conventional parity predictor checker. Both checkers achieved 100% fault coverage.
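To see why MEUs are a problem for a three-way vote, the sketch below enumerates every way of corrupting a given number of latches out of three and counts how often a 2-of-3 majority still returns the stored bit. It is a simplified behavioral model under the assumption that each upset flips exactly one latch; the function names are hypothetical and it does not represent the new solutions under investigation.

```python
from itertools import combinations
from typing import Tuple


def majority(a: int, b: int, c: int) -> int:
    """Return the value held by at least two of three 1-bit inputs."""
    return (a & b) | (b & c) | (a & c)


def corrected_count(num_upsets: int) -> Tuple[int, int]:
    """Return (corrected, total) over all patterns that flip num_upsets of 3 latches."""
    corrected = 0
    total = 0
    for d in (0, 1):  # both possible stored values
        for hit in combinations(range(3), num_upsets):
            latches = [d, d, d]
            for i in hit:
                latches[i] ^= 1  # flip each latch struck by the event
            total += 1
            if majority(*latches) == d:
                corrected += 1
    return corrected, total


for n in range(4):
    print(n, corrected_count(n))
```

Under this model every single-latch upset is corrected, while any event that flips two or more of the three latches makes the vote return the wrong value.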
Interested Companies: Alcatel-Lucent, Xilinx, AMD, IBM, and Qualcomm
Patent Holders: Rutgers University and Alcatel-Lucent
Publications:
R. Olivera, A. Jagirdar, and T. Chakraborty, “A TMR Scheme for SEU Mitigation in Scan Flip-Flops,” Proc. of Int’l. Symp. on Quality Electronic Design, pp. 905–910, 2007.
T. Chakraborty, R. Olivera, and A. Jagirdar, “A Robust Architecture for Flip-Flops Tolerant to Soft-Errors and Transients from Combinational Circuits,” Proc. of Int’l. Conf. on VLSI Design, pp. 39–44, 2008.