The low-power IC-design train has long ridden the rails of lowered supply voltage. However, these lowered supply rails are asymptotically approaching transistor threshold voltages and have long been headed for a serious collision because transistors in large, nanometer ICs run closer and closer to their switching limits. When designing these large circuits, chip designers and EDA tools must make allowances for noise or voltage droop on the supply rails and for noise on the signal interconnects within the chip. Those allowances mean that designs can’t really run the transistors as fast as possible or at the lowest possible voltage without risking imperfect operation. And who wants to risk imperfect circuit operation? Well, Intel for one.
In a recent article published in the MIT Technology Review, Katherine Bourzac writes up a report from Intel Labs about an experimental 45nm chip that allows circuits to run at sub-optimum voltage and somewhat-too-fast frequency settings. Most of the time, there’s no problem because there’s not enough noise or droop to cause the circuits to compute incorrectly. However, sometimes, under certain conditions, there will be errors. What to do? Add error-detection circuitry to detect errors when they happen and then back up one step in the calculation, raise the operating voltage a bit or drop the operating frequency a bit, re-run the calculation to get the right result, and then back the supply voltage down to normal. This is research into what Intel Labs calls “resilient circuits.”
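The detect, back up, and replay loop described above can be sketched as a toy simulation. This is only an illustration of the control flow, not Intel's design: the voltage levels, the 5% error rate, the assumption of an ideal detector that catches every error, and the names `V_LOW`, `V_SAFE`, `error_probability`, `execute`, and `run_resilient` are all hypothetical.

```python
import random

V_LOW = 0.9    # hypothetical aggressive supply voltage (volts)
V_SAFE = 1.0   # hypothetical conservative supply voltage (volts)


def error_probability(voltage):
    """Toy model: timing errors occur only below the safe voltage."""
    return 0.05 if voltage < V_SAFE else 0.0


def execute(op, voltage, rng):
    """Run one operation; return (result, error_detected).

    Assumes an ideal detector that flags every timing error.
    """
    if rng.random() < error_probability(voltage):
        return None, True
    return op(), False


def run_resilient(ops, rng):
    """Run ops at V_LOW, replaying any flagged op once at V_SAFE."""
    results, replays = [], 0
    for op in ops:
        result, error = execute(op, V_LOW, rng)
        if error:
            replays += 1
            # Back up one step: rerun the same operation at the safe voltage.
            result, error = execute(op, V_SAFE, rng)
            assert not error  # holds by construction in this toy model
            # The next loop iteration drops back to V_LOW.
        results.append(result)
    return results, replays


if __name__ == "__main__":
    ops = [lambda i=i: i * i for i in range(1000)]
    results, replays = run_resilient(ops, random.Random(0))
    print(f"{replays} replays out of {len(ops)} operations")
```

Every result comes out correct because each flagged operation is rerun at the safe voltage. The only cost of running aggressively is the occasional replay.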
Is there a benefit to this approach? Specifically, is there a power benefit? Apparently, there is. Bourzac quotes Wen-Hann Wang, director of circuits and systems research at Intel and vice president of Intel Labs, who says that even with the extra error-detection circuitry, the net power savings can be a whopping 37%. (Or, if you’re a speed freak, you can get 21% faster operation without reducing operating power.) Wang points out that today’s chips are designed to operate in demanding, multimode scenarios such as “playing a graphics-rich game, uploading video to Facebook, and surfing the Web.” (Isn’t it amazing how cell-phone scenarios have replaced computer-use scenarios these days?) Today’s devices must be designed to handle such scenarios correctly, which means that the chip’s circuits will be overdesigned and will use excessive power most of the time, when simpler operating modes are in use. An error-detection-and-correction scheme allows the design of chips that use additional power only when it’s needed: when there’s an error.
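To see why occasional replays can still leave a net saving, here is a back-of-the-envelope sketch. It assumes that dynamic energy per operation scales with the square of the supply voltage (the usual CMOS approximation) and that each flagged operation is rerun once at the safe voltage. It ignores detector overhead and leakage. The function name `net_energy_saving` and the voltages and error rate below are made-up illustration values, not Intel's figures, and the model makes no attempt to reproduce the 37% number.

```python
def net_energy_saving(v_low, v_safe, error_rate):
    """Fractional energy saved versus always running at v_safe.

    Assumes energy per operation is proportional to V**2 and that each
    flagged operation is rerun once at v_safe.
    """
    baseline = v_safe ** 2
    resilient = v_low ** 2 + error_rate * v_safe ** 2
    return 1.0 - resilient / baseline


# Hypothetical values for illustration only.
saving = net_energy_saving(v_low=0.85, v_safe=1.0, error_rate=0.01)
print(f"net saving: {saving:.1%}")
```

In this simplified model the saving disappears once `error_rate` exceeds `1 - (v_low / v_safe) ** 2`, which is why the operating point would have to be tuned so that errors stay rare.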
There are at least two more factors to consider. First, chips age. As they do, device thresholds change and metal migrates, leading to minute changes in the currents flowing within the chip. These changes deviate from the modeled operating scenarios created during chip design. For devices designed to run perfectly all the time, the normal result is that the circuitry eventually stops running perfectly and the chip effectively dies, even though it could still operate properly at a slightly higher operating voltage or a lower operating frequency. According to Bourzac’s article, the addition of error-detecting-and-correcting circuitry and algorithms also compensates for the problems associated with chip aging.
Second, as Moore’s Law takes the industry down the rabbit hole of shrinking geometries, many more error sources appear. That makes error-detection-and-correction schemes even more attractive and no doubt that is why Intel Labs is looking into the design of such circuitry now rather than later.
I think that the advent of real error-detecting-and-correcting computational circuitry is long overdue. On-chip variability already causes enough headaches to trigger more research into how digital circuitry must deal with errors in a probabilistic world, not the absolutely perfect Boolean world we’ve come to assume over the 70-some years of digital design. The storage and memory worlds got the call long ago. Disk drives became probabilistic with the adoption of PRML (partial-response, maximum-likelihood) coding more than a decade ago and have always had to use error detection and correction to deal with real-world, flawed storage media. DRAM and NAND manufacturers long ago adopted redundant design to allow for dead bits, rows, and columns in their devices. Viterbi decoding, turbo codes, and other error-correction schemes protect digital data from the errors inherent in over-the-air transmission, with all the associated noise and reflections that are part of everyday cellular telephony. So, is digital design at the chip level different? Apparently not.