1972: When scientific calculators truly went low power

Dave Cochran recently wrote about his long engineering career at Hewlett-Packard on the www.hpmemory.org Web site. Who? What Web site? Well, the Web site is an amazing living museum that’s a tribute to Bill and Dave’s HP. And Dave Cochran is likely one of the most important people you’ve never heard about in the annals of low-power design. He spent 25 years at HP, starting as a part-time test technician in 1956 and departing as a celebrated HP engineer in 1981.

Between those two years, Cochran worked on a huge number of projects, including the HP 204B audio oscillator, where he used transistors and a hugely ingenious double-spiral-cam potentiometer actuator to design the famous tungsten light bulb out of Bill Hewlett’s original audio oscillator design. (That double-cam actuator makes a linear pot do unnatural things and you need to see the short, 3-second video to believe it!) But it’s what he did in the 1960s and 1970s that makes him the topic of this particular blog post.

Cochran was working at HP Labs when Malcolm McMillan and Tom Osborne dropped by with two very different calculator prototypes. It was the legendary Barney Oliver, Grand Wizard of HP Labs, who conjured the idea of merging these two very different machines into one massively powerful scientific calculator. McMillan’s design, called Athena, could perform transcendentals using Jack Volder’s CORDIC algorithms, but it had a fixed-point architecture so it was not deemed accurate enough for repeated engineering calculations. Osborne’s design was barely more than a simple 4-banger calculator but it had a really elegant, floating-point hardware design based on the 1960s version of a VLIW processor.

Cochran got the job of trying to come up with a way to unify the two architectures. He writes “I was looking at Osborne’s architecture and trying to figure out what an algorithm was. I even flew down to Southern California to talk with Jack Volder who had developed the CORDIC transcendental functions used in the Athena machine and talked to him for about an hour. He referred me to the original papers by Meggitt where he’d gotten the pseudo division, pseudo multiplication generalized functions.

My job was to determine how many digits and what the operation time was required; what the architecture had to be; how many registers did we need, clock speed, etc? Other people were coming back with their inputs on cathode ray tube display, keyboards [and so on]. Should we use transistors or small-scale integration? There was no large-scale integration, but there was medium-scale integration, MSI which meant maybe 10 transistors in a chip.”

From Cochran’s architectural contributions, plus substantial work from other engineers in HP Labs and Tom Osborne (who remained an HP consultant for many more years), HP introduced the HP 9100 Scientific Calculator in late 1968. It’s a marvelous machine and it was a real design breakthrough for its day. It also weighed 40 pounds and drew 70 Watts out of a wall socket.

Now 70W isn’t all that much compared to the amount of electrical power you’d need to replicate the HP 9100’s computational abilities with a minicomputer or a mainframe, so you could consider this a lower-power design for its day. But it’s what happened next that really takes us to the domain of miraculously low power.

Cochran writes: “As soon as the 9100 started showing success in the market place Hewlett started to bug me personally. I know he also talked to Tom Osborne about it, what do you think, and so on. But he would come into the lab and he’d look for me and he’d say, ‘Hey, how are you coming with putting the 9100 in my shirt pocket?’ He said, ‘I want all that computational power in my shirt pocket.’”

OK. From 40 pounds and 70W to something that runs off batteries and fits in your shirt pocket. Now that’s a stretch. Oh, and one thing I forgot to mention. The HP 9100 calculator design had exactly one integrated circuit in it. That IC was used in the magnetic card reader. Osborne didn’t like the primitive ICs offered at the time. He didn’t design the card reader but the rest of the machine was implemented with discrete transistors, magnetic-core RAM, rope memory (go look that one up), and a large capacitive ROM fabricated from a 16-layer printed-circuit board. None of that technology was headed for a shirt pocket. Not in this universe.

Cochran then writes about his shirt-pocket epiphany: “Tom Whitney and I went down to Fairchild Semiconductor on Ellis Street, Mountain View, and they wanted to show us a calculator architecture that they were planning to provide to various companies that wanted to build calculators and semiconductors. So we went down to look at it. And I looked at it, and oh, this could do the algorithms. See, I knew. By this time, I had already fit the algorithms into a small-scale integrated machine, the 9100. So I knew exactly what architecture I needed, the capabilities of the architecture. I didn’t know what it was going to look like, but I knew what its capabilities had to be.

It was I think September of 1970; I saw a design that was different than anything else. It was not your classic computer architecture as taught at the universities. It was all shift register. It was designed for the technology at the time. When talking to the people at Fairchild I meet a fellow, Rich Whicker, who later came to work at HP. I said, “God, this design, did you think of this?” He says, “No. We got it from Sweda, the cash register company.” Sweda at the time was trying to make an electronic cash register or Point-of-Sale products and they were using shift registers.

Shift registers were the densest form of integration of integrated circuits at the time; you had to keep the clock moving and so on. It had no static memory. So here was a design using shift registers a 20-digit chipset that could satisfy anybody making a four-function machine. Add, subtract, multiply, and divide. It could give you the numbers as big as most people wanted, but it was all fixed point. 20 digits should be more than enough for anybody. You could have the decimal point anywhere in that stream.

I got really cranked up about seeing that architecture at Fairchild, I got very excited. And I’m whispering in Tom Whitney’s ear, ‘God, this is great.’ And I’m trying not to be too excited while I’m there. When we drove away from there and, I said, ‘God, that’s exactly—you know, I can tweak that architecture just a little bit. We don’t need the full 20 digits, we can do this and this and this. And gosh, yes, I can do it, I can do it, I can do it.’”

And that’s when the HP 35 Pocket Scientific Calculator crossed over from the realm of the impossible to the realm of the possible. When David Cochran thought that it could. Two years later, in 1972, it became a reality. A pocket scientific calculator that ran on three NiCd batteries in a pack. The rest, as they say, is history.

Be sure to read the whole story at http://www.hpmemory.org/timeline/dave_cochran/a_quarter_century_at_hp_00.htm#chapter_08.


The return of magnetic memory? A review of the MRAM panel at the Flash Memory Summit

Prior to 1970, magnetic memory in the form of little ferrite cores dominated the computing landscape. According to Wikipedia, the first core memory was installed in the groundbreaking MIT Whirlwind computer in 1953. (Note: I have a copy of the Encyclopedia Britannica Yearbook for 1953 that mentions the development of magnetic-core memory by RCA.) That first core memory installed into Whirlwind held 1024 16-bit words, so it was a 2-Kbyte memory. For roughly the next 20 years, computers used core memory for fast data and instruction storage almost exclusively. Mainframes used it. Minicomputers used it. Even HP’s first desktop calculator, the HP 9100 introduced in 1968, used a 2208-bit lithium-ferrite core memory to store numbers in its three working registers. Intel’s introduction of practical, commercial semiconductor memory in 1970 (the famous and infamous 1103 DRAM) completely altered the landscape because efforts to automate the manufacture of core memory always met with failure. Core memories always had to be hand-threaded while semiconductor memory, once debugged, enjoyed the immense advantages of lithographic mass production. Semiconductor memory in the form of DRAM, SRAM, EPROM, EEPROM, and Flash memory has dominated the digital world for the last 40 years.

Wikipedia also says that the age of magnetic memory ended in 1975. But perhaps magnetic memory didn’t die in 1975. Perhaps it was just sleeping. That could easily have been the theme of the panel on MRAM (magnetic RAM) that took place at the Flash Memory Summit, held at the Santa Clara Convention Center last month, which featured representatives from several companies developing MRAMs including Avalanche Technology, Crocus Technology, Everspin Technologies, and newcomer Spin-Transfer Technologies LLC.

The panel opened with some remarks from moderator Alan Niebel of Web-Feet Research, who noted that research on MRAM started 30 years ago with early experiments on GMR (giant magnetoresistive) MRAM cells and on magnetic tunnel junction (MTJ) MRAM cells. Motorola Semiconductor was an early pioneer in developing MRAM and its work passed through to the semiconductor spinoff now called Freescale, which further spun off the MRAM efforts to Everspin in 2008. Note, I’m pretty sure that “Everspin” is supposed to refer to the magnetic moment caused by spinning electrons that makes MRAM (and all other magnetic devices) work and not to the apparent desire to constantly spin off the MRAM technology to another company.

Increasingly, said Niebel, the interest in MRAM is centering around the latest and greatest incarnation of the MRAM cell, called STT (spin-torque transfer), which promises low-power operation and smaller cells that could lead to immense density improvements. Currently, Everspin’s devices are in the Mbit density range. STT cells will bump that density to Gbits.

The first panelist to speak was Dr. Rajiv Ranjan, CTO and founder of Avalanche Technology—founded in 2006. He started his talk by saying that the materials science behind MRAM technology is well understood because it’s the same magnetic material used in hard drives. In fact, said Ranjan, storage built from MRAM could be viewed as “hard drives that don’t spin [mechanically].” The physics of this magnetic material was all published 20 years ago and that material has no wearout-failure mechanisms. Consequently, MRAM cells have essentially infinite endurance in contrast to other competing nonvolatile-memory technologies such as NAND Flash memory or phase-change memory.

Ranjan noted some important gating factors for MRAM to become successful. First, it must be compatible with CMOS logic because that’s the process driver for the underlying circuitry that will drive the MRAM cells. Second, the MRAM cells must therefore have switching voltages on the order of 0.5V. Currently, the company has MRAM cells working at a size of 15 F-squared (the area measurement favored by semiconductor memory makers) and is shooting for 8 F-squared for its ultimate cell design.

The next panelist to speak was Barry Hoberman, Chief Marketing Officer of Crocus Technology. Hoberman said that Crocus has developed a “Magnetic Logic Unit” (MLU) architecture that can be used to implement MRAM cells or pulsed logic circuits (as opposed to the non-pulsed logic gates generally in use today). The MLU cell architecture could be used for CAM, secure memory, pattern matching, and look-up tables, said Hoberman.

He then started to discuss the stability of the magnetic material used for MRAMs. You need to make the material “soft” enough so that you can write the cell but not so soft that it can be changed by external factors. With conventional MRAM, said Hoberman, “you lose stability as you scale.” Crocus currently uses a thermally assisted switching (TAS) mechanism for its MTJ cells, which decouples scaling from stability. Each cell has a heater that makes the cell softer magnetically during a write.

Steffen Hellmold, VP of marketing at Everspin, chose to emphasize Everspin’s early entrance into the MRAM market. He said that Everspin currently offers the only commercial MRAM parts to date. Everspin has shipped 3 million devices to more than 300 active customers so far and expects to ship another 300 million pieces during 2011. The company currently offers more than 70 different part numbers in x8, x16, and serial I/O configurations.

Hellmold echoed Ranjan’s earlier assertion that MRAM doesn’t offer “virtually” unlimited read/write endurance; it offers genuinely unlimited endurance. Further, he said, MRAM offers instant on/off capability. System shutdown can therefore be immediate. There’s effectively no latency involved with the final write during a shutdown because the write cycle for an MRAM cell is on the order of nanoseconds, not milliseconds as it is for NAND Flash. In addition, there’s no boot time required during power-up because MRAM used as the local RAM saves the state of the system in place when powered down. Finally, Hellmold predicted that Everspin would be the first company to commercialize STT MRAM.

Steve Cliadakis, the General Manager of newcomer Spin Transfer Technologies, didn’t have much to say at the panel because his company is quite new. Like the other companies mentioned above, Spin Transfer Technologies is working on spin-torque-transfer MRAM because of the small cell size and small write currents. The “twist” in Spin Transfer Technologies’ work is that it is developing a specially aligned magnetic layer that is polarized orthogonally to the write current through the MRAM cell. When questioned, he said this was a matter of changing the way the electrons are spin-polarized as they pass through the polarization layers that form part of any STT MRAM cell. Spin Transfer Technologies believes that this differentiating technology will permit the development of even smaller, faster MRAM cells that require even less write current. In experiments, the company has seen individual MRAM cells that switch in 100 psec and 99% of the cells created in early experiments switch in less than 1 nsec. Also, said Cliadakis, the company believes that sub-100-psec devices should be feasible.


Imagine no uninterruptible power supplies. I wonder if you can. A sad story of six fried hard disk drives

This is the story of six fried hard disk drives and why they died needlessly of heat failure as told to me by my good friend Ron Sartore, founder and CEO of AgigA Tech, at this month’s Flash Memory Summit. It’s also the story of the disaster’s aftermath and why it shouldn’t happen again—to anyone. Finally, it’s the story of why we just might need to reconsider our views regarding the use of uninterruptible power supplies to supply emergency back-up power to servers. Sometimes, truth is stranger than fiction.

AgigA Tech’s servers sit in a small room in the company’s San Diego corporate headquarters. One recent Saturday, these servers were quietly doing their jobs when an unscheduled power outage occurred. Thanks to uninterruptible power systems backing up the servers, the servers continued to do their jobs. Unfortunately, the air conditioning system at AgigA Tech headquarters has no backup power and did not continue to do its job. The temperature in the server room at AgigA Tech started to climb from the waste heat being thrown off by the equipment.

There was no one in the building to notice.

Eventually, power from the grid came back on and the cooling system restarted. However, by that time the temperature in the server room had climbed high enough to cook six of the hard disk drives in the server room’s RAID arrays. Fortunately, the RAID arrays performed as designed and there was no data loss. But if the power outage had lasted longer and if the servers had continued to run without cooling, eventually all of the RAID drives would have died. Even the best RAID system cannot preserve data when all of its drives fail. As it stands, even the hard disk drives that did not fail are suspect because they’ve been heat stressed. They too need to be replaced quickly before they fail prematurely as well.

After this small weekend near-disaster, my friend Ron Sartore started to ponder the ramifications and lessons of the incident. First, he realized that his servers should be sensing ambient temperature and shutting down gracefully when the server room overheats. In fact, Ron sort of assumed that’s what would happen. Bad assumption, as it turned out. “How often do you test that?” asked Ron. “No one wants to pull the plug on these things to find out,” he added. I’ll bet a lot of business owners with little server rooms make precisely the same assumptions that Ron and his IT team did when they designed their server room.

Next, Ron began to think about the uninterruptible power supplies from an engineering perspective, which really calls into question the entire concept of uninterruptible power supplies for servers. There’s clearly no reason to continue to operate servers during a power outage if there’s no cooling available. In fact, there’s a clear reason to shut them down as quickly as possible to prevent overheating and hardware failure due to lack of cooling.

Now large data centers—like the ones operated by Amazon, Microsoft, and Google—have on-site Diesel generators that operate both the servers and the data center’s cooling systems during a power outage. These companies cannot afford to have their servers shut down. Every minute—actually every second—that the servers are down means lost revenue, lost profit, and lost customers. But most companies are not Amazon, Microsoft, or Google.

There are hundreds of thousands of companies in the US and millions in the world that run their servers in small server rooms or even closets where there are uninterruptible power systems designed to keep servers running as long as possible but where there is no backup power for cooling. For these companies, their server-system designs will cook and kill their hard drives rather quickly in the event of a power failure. We design smaller server systems this way almost without thinking. The UPS is a check-box item, meaning we don’t even think critically about including one.

Ron’s story made me think about UPS costs, both the acquisition cost and the operating cost over the life of the UPS. You see, most uninterruptible power systems are designed so that they always supply power to the attached servers. Even while a UPS is running from the power grid, it’s still consuming and wasting energy. Today’s best UPS designs are perhaps 95% efficient. That means they consume about 5% of their input energy. All…the…time.

Less expensive UPS designs might be only 90% efficient. They waste about 10% of the energy consumed by the attached server(s).

In reality, a UPS that’s not operating at maximum load is likely to run at somewhat reduced efficiency because UPS manufacturers tend to rate their products’ efficiency at maximum load. In addition, IT managers prefer to overspecify the capacity of a UPS, which is normally good engineering practice but here it pretty much guarantees that the UPS will not operate at maximum efficiency.

Where does that wasted energy from the UPS go? It’s converted to waste heat, of course. Ron told me that his home UPS (purchased from Costco) runs hot even though it is lightly loaded—13W for a small PC appliance. Lightly loaded, a UPS might convert just as much power into waste heat as it delivers to the load.

And where does the waste heat from the UPS go? Right into the cooling system, assuming the cooling system is operational. During a power failure, it probably isn’t.

How much energy is needed for cooling? It depends on the cooling system. I’ve often heard from various presenters talking about energy needs for cooling data centers that it’s a 1:1 ratio—for every Watt of power emitted by equipment as waste heat, you need another Watt to remove the waste heat. Ron tells me that many of today’s cooling systems are actually more efficient than that. Some need only half a Watt to remove a Watt of waste heat from the server room—K factor of 2 for you HVAC engineers. Some high-efficiency cooling systems achieve a K factor of 3. So you can multiply the waste heat generated by a UPS by a factor of between 1.33 and 2 to determine the actual energy cost of UPS inefficiency.
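The multiplier arithmetic above can be sketched in a few lines of Python. This is only an illustration and assumes the K factor means watts of heat removed per watt of cooling power, as described in the paragraph; the function name `cooling_multiplier` is made up for this example.

```python
def cooling_multiplier(k_factor: float) -> float:
    """Total watts consumed per watt of waste heat.

    One watt is the waste heat itself; removing it costs a further
    1 / k_factor watts of cooling power (assumed definition of K).
    """
    if k_factor <= 0:
        raise ValueError("k_factor must be positive")
    return 1 + 1 / k_factor


# K = 1 is the often-quoted 1:1 ratio; K = 2 and K = 3 are the more
# efficient cooling systems mentioned in the text.
for k in (1, 2, 3):
    print(f"K = {k}: multiply UPS waste heat by {cooling_multiplier(k):.2f}")
```

A K factor of 1 gives the upper end of the range (2), and a K factor of 3 gives the lower end (about 1.33).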

Finally, we get down to computing the actual costs for using a UPS in a server room. Let’s start with a 2700W Dell UPS Short Depth Rack High Efficiency Online power-backup unit. When I looked it up, the purchase cost for the UPS was $1360.99 plus $115.68 for tax for a grand total of $1476.67. Shipping, at least, appears to be free. The UPS has a 3-year expected life and is rated as 95% efficient. About 5% of the energy it consumes is converted to waste heat when it’s fully loaded.

How much will it cost to run this UPS over its three-year expected life? Let’s use northern-California electric rates for commercial/general service from Pacific Gas and Electric where I live. PG&E charges about $0.20/kWh in the summer and about $0.15/kWh in the winter. On average, that’s about $0.175/kWh over the course of a year. Waste heat from the 2700W Dell UPS is about 135W (95% efficient) with another 70W or so needed to remove the waste heat through the air-conditioning system (assuming a K factor of 2). Total power required to have the UPS online all the time in case of a power-grid failure is about 200W continuously because the power for the servers is always flowing through the UPS.

Do the math and it works out to about $0.84 per day just to run the UPS, which is $306.60 per year in electricity cost for power-failure insurance. Over the three-year rated life of the UPS, you’ll spend another $919.80 to run this UPS continuously, which is more than 60% of what you spent on the initial UPS purchase. The total cost of adding this UPS to your system is about $2400 every three years using a back-of-the-envelope sort of calculation. If the UPS or air-conditioning systems are less efficient than the ones used in this example calculation, then the energy costs will be higher.
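For anyone who wants to rerun this back-of-the-envelope calculation with their own numbers, here is a minimal Python sketch. The inputs are the figures quoted in this post; the function and variable names are invented for illustration, and the 200W figure is the rounded overhead used in the text rather than the unrounded result.

```python
def ups_overhead_watts(rating_w: float, efficiency: float, k_factor: float) -> float:
    """Waste heat of a fully loaded UPS plus the power needed to cool it."""
    waste_w = rating_w * (1 - efficiency)   # heat dissipated by the UPS
    cooling_w = waste_w / k_factor          # power spent removing that heat
    return waste_w + cooling_w


def energy_cost_usd(power_w: float, rate_usd_per_kwh: float, days: float) -> float:
    """Cost of drawing power_w continuously for the given number of days."""
    return power_w / 1000 * 24 * days * rate_usd_per_kwh


RATE = (0.20 + 0.15) / 2            # simple summer/winter average, $/kWh
PURCHASE = 1360.99 + 115.68         # UPS price plus tax

exact_w = ups_overhead_watts(2700, 0.95, 2)   # unrounded overhead
rounded_w = 200                                # figure used in the post

per_day = energy_cost_usd(rounded_w, RATE, 1)
per_year = energy_cost_usd(rounded_w, RATE, 365)
lifetime = energy_cost_usd(rounded_w, RATE, 3 * 365)

print(f"overhead: {exact_w:.1f} W unrounded, {rounded_w} W as used in the post")
print(f"per day: ${per_day:.2f}, per year: ${per_year:.2f}, three years: ${lifetime:.2f}")
print(f"purchase plus three years of energy: ${PURCHASE + lifetime:.2f}")
```

Swapping in a lower efficiency or a K factor of 1 shows how quickly the energy cost grows for less efficient equipment.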

Now let’s be crystal clear here. Ron Sartore isn’t professionally disinterested in this story. He’s not exactly objective. His company, AgigA Tech, makes a line of AGIGARAM DDR2 and DDR3 memory modules that combine DRAM and NAND Flash memory with a controller on board that can independently transfer data back and forth between the module’s DRAM and NAND Flash without going through the server processor. In the case of a power outage, the on-board AGIGARAM controller backs up the data in the DRAM and puts it into the NAND Flash on the memory module using a small amount of standby power supplied by an independent bank of ultra capacitors connected directly to the memory module. Once backed up, the server data is safely stored for 10 years in the NAND Flash even without power. Standard servers and server-management software aren’t designed to exploit the features of this type of server memory that can safely back itself up. So even AgigA Tech’s IT department can’t configure a standard server system using AGIGARAM. At least not yet.

Bottom line, you or your customers may well be spending thousands of dollars for acquiring and powering a server UPS but you will not get the power-failure insurance you expect. You will not get a system that protects your data very well in the event of a power failure, as AgigA Tech has discovered. Instead of reliable backup, equipping a small data center with a UPS can cause hard disk drives to fry should there actually be a power failure, as they did at AgigA Tech, and it costs thousands of dollars extra in UPS costs to allow this to happen. “We paid good money to self-destruct ourselves,” said Ron. Even though no data was lost in this example because the drives that failed were installed in RAID arrays, this approach seems like very poor engineering design. Ron has convinced me.

PS: While writing this blog entry, I received a letter from the IEEE. The envelope prominently featured this statement on the outside: “The BEST project plans include dependable backups for ‘out-of-the-blue’ accidents.” Although it might appear that the IEEE was thinking about this very blog entry when it mailed this letter to me, it turns out they’re just trying to sell me accidental death and dismemberment insurance. However, the coincidence is uncanny. In reality, that’s exactly what we’re discussing here—reducing the cost and improving the effectiveness and energy efficiency of accidental death and dismemberment insurance for servers.


Texas Instruments’ PowerStack package proves that low power is a killer app for 3D packaging

When you’re trying to eke out every percentage point of efficiency from a design, you leave no stone unturned. Yet an IC package is often a stone left unturned because it’s often entirely under the control of the chip vendor and besides, how much difference could the package make? Actually, it could make a significant amount of difference as it turns out. Recently, Texas Instruments revealed that it has shipped more than 30 million dual-MOSFET half-bridge drivers in a 3D PowerStack package and the company has published a report detailing the critical physical characteristics of this 3D package including the power-enhancing aspects of the design.

First, understand that the current PowerStack package houses two MOSFET die—as in two transistors. Normally for 3D discussions about stacking logic, memory, analog, and RF die we talk about millions and even billions of transistors. Here, we’re talking two transistors. However, this approach to 3D assembly is applicable to other hybrid power devices that include smart driver chips, which might contain thousands of transistors or more and could include entire processor-based, firmware-driven embedded systems if needed. So the TI PowerStack technology has broader scope than the immediate application might imply.

The PowerStack package marries a high-side MOSFET driver and a low-side MOSFET to create a half-bridge driver as shown in the following diagram:

In the top diagram, the two MOSFET die are connected side-by-side, which requires longer wire bonds that result in unwanted inductance and resistance. Contrast that conventional approach with the PowerStack package shown in the lower part of the above diagram. Here the two MOSFET die are stacked, providing a very low impedance path from ground to the low-side MOSFET’s source and from the low-side MOSFET’s drain to the high-side MOSFET’s source. The reason these paths have such low impedance is that they’re realized with relatively large copper bus bars, which TI calls “clips.” A cross-section microphotograph of the assembly looks like this:

Low-impedance bus bars effectively make the parasitic resistances and inductances of the wire-bond leads in the side-by-side package disappear in the ground leg and in the low-side-to-high-side MOSFET circuit connection. The following three circuit diagrams compare the equivalent circuits of the side-by-side and PowerStack packaging approaches:

These three circuit diagrams show three equivalent circuits. The left-hand circuit shows the conventional schematic for a wire-bonded two-MOSFET side-by-side hybrid and this circuit diagram includes all of the resistive and inductive parasitic components introduced by conventional, wire-bonded packaging. TI’s PowerStack 3D packaging with its large copper bus bars makes four of those parasitic components (two resistors and two inductors) “effectively” disappear as shown in the diagram’s center schematic, drawn with red “X”es through four of the parasitic elements. The resulting, much-simpler circuit equivalent appears in the right-hand schematic of the above diagram.

The elimination of these resistances and inductances makes an appreciable reduction in the amount of energy converted to waste heat by the half-bridge driver, which can operate at upwards of 25A. When you’re switching that much current, milliohms make a difference. How much difference? The following graph of the efficiency curves from the TI report tells the story:

The red line in the graph shows the efficiency of the side-by-side half-bridge package. The upper dark line shows the efficiency of the PowerStack package. Note that with 20A to 25A output current, there’s about a 2% difference in operating efficiency.

Is 2% a big deal?

Well, look at it this way. The half-bridge circuit’s efficiency rises from about 90% to about 92% at 20A because of the PowerStack 3D package. That’s only a 2-percentage-point increase in efficiency, but for waste heat you might be better off considering the inefficiency (100% minus the efficiency), which drops from about 10% to about 8%. That’s a 20% reduction in inefficiency.
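The same arithmetic as a short Python sketch, using the approximate 90% and 92% figures read off the graph; the function name `loss_reduction` is just an illustrative choice.

```python
def loss_reduction(eff_before: float, eff_after: float) -> float:
    """Fractional drop in losses when efficiency improves.

    Efficiencies are fractions between 0 and 1; losses are taken as
    (1 - efficiency), so the result is relative to the original loss.
    """
    loss_before = 1 - eff_before
    loss_after = 1 - eff_after
    return (loss_before - loss_after) / loss_before


# Approximate side-by-side vs. PowerStack efficiencies at 20A.
reduction = loss_reduction(0.90, 0.92)
print(f"Losses fall by about {reduction:.0%}")
```

The same function shows why small efficiency gains matter more as designs get better: going from 95% to 96% cuts losses by the same 20%.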

So does that matter, or is that TI marketing math? Well, here’s another figure to tell the story.

These two images show the thermal signatures of the driver MOSFET packaged in the side-by-side and PowerStack packages. Note that in the top image, the hybrid half-bridge made with side-by-side packaging runs at 118°C while carrying 25A in still air. The lower image shows that the PowerStack version of the half-bridge driver runs at 88.4°C under the same operating conditions. One reading is hot enough to boil water. The other isn’t. The difference is the amount of heat wasted in the package parasitics. That’s indeed concrete proof that packaging alone can make a power difference in a design.

TI’s PowerStack construction delivers substantial benefits to designers: better power efficiency, improved thermal management, and board-space savings. The above discussion focused on the power savings and improved thermal management. However, the space savings can also be important for mobile and portable devices. Because the two power-MOSFET die in the TI PowerStack package are stacked, the resulting packaged device has roughly half of the footprint of a similar component that arranges the MOSFETs side-by-side. That’s especially important for mobile and portable devices where every cubic millimeter not occupied by electronics tends to be filled with battery. However, even a Web server’s overall volume becomes important when you’re sticking hundreds of thousands of them in a data center. The power and thermal efficiencies are also important in data centers where half of the centers’ energy budgets go to powering and half to cooling the servers racked in the building.

For a very informative video describing this 3D Packaging technology, click here.


Will Compaan’s HotSpot Parallelizer technology take us to the promised land of parallel computing?

In connection with my just-written blog entry on the massively parallel SpiNNaker project (see below), I want to relate some information about another meeting I had last March at the DATE (Design, Automation and Test in Europe) conference in Grenoble, France. I met with Compaan (www.compaandesign.com) and got a presentation on the company’s HotSpot Parallelization technology.

Here’s how it works. You start with application code written in C. You add pragmas around known code hotspots to switch on Compaan’s HotSpot Parallelizer and to switch it off. You discover these hotspots using regular code-analysis techniques already used for other sorts of software-specific optimizations. So far, nothing new here.

Then you submit the code to the Compaan HotSpot Parallelizer for analysis. The Parallelizer analyzes the code and creates a Kahn Process Network (KPN, http://en.wikipedia.org/wiki/Kahn_process_networks) that consists of many independently executable processes and the communications linkages needed to pass data between these processes. What you then end up with is several independent C programs that can be compiled and run on one processor, run on several processors, or run on some mix of processors and hardware built using a C-to-hardware compiler. Here’s a picture of the process:

The advantage of this approach is that it’s entirely automatic once you mark the hotspots with pragmas. To use this approach, your design will need to consist of deterministic, independent processes. Parallelization consists of creating a Kahn Process Network and then generating C code for the various independent programs. That generated code must include all of the inter-program communications needed to operate the KPN. Inter-program communications take place through FIFOs, which might be real hardware FIFOs or, more likely, FIFOs implemented in shared memory.
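To make the KPN idea a little more concrete, here’s a minimal sketch in Python. It is not Compaan’s generated C code and it says nothing about how the HotSpot Parallelizer works internally. The three processes (`producer`, `scale`, `consumer`), the `END` marker, and the `run_network` helper are all names I made up for illustration. Each process runs in its own thread and talks to its neighbors only through FIFOs with blocking reads, which is what keeps the result deterministic no matter how the threads get scheduled.

```python
import queue
import threading

END = object()  # hypothetical end-of-stream marker (an assumption of this sketch)


def producer(out_fifo, values):
    """Process 1: push each input value into the outgoing FIFO."""
    for v in values:
        out_fifo.put(v)
    out_fifo.put(END)


def scale(in_fifo, out_fifo, factor):
    """Process 2: blocking read, multiply, write to the next FIFO."""
    while True:
        v = in_fifo.get()  # blocks until a token is available
        if v is END:
            out_fifo.put(END)
            return
        out_fifo.put(v * factor)


def consumer(in_fifo, results):
    """Process 3: collect tokens until the end-of-stream marker arrives."""
    while True:
        v = in_fifo.get()
        if v is END:
            return
        results.append(v)


def run_network(values, factor=3):
    """Wire the three processes together with two FIFOs and run them."""
    fifo_a, fifo_b = queue.Queue(), queue.Queue()
    results = []
    threads = [
        threading.Thread(target=producer, args=(fifo_a, values)),
        threading.Thread(target=scale, args=(fifo_a, fifo_b, factor)),
        threading.Thread(target=consumer, args=(fifo_b, results)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(run_network([1, 2, 3, 4]))  # prints [3, 6, 9, 12]
```

Because the processes share nothing but the FIFOs, each one could just as easily run on its own core, or be replaced by a hardware block, without changing the output.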

You could do all of this by hand, and in a simple system that’s practical. In a complex system, you’ll want all the automation you can muster because otherwise the complexity will kill you, your team, and your project.

Once you have the several C programs that constitute the KPN, you can decide where each will execute. Some might execute on the same processor. That’s convenient because the inter-program communication is simple and takes place in the processor’s memory space. However, you’ll get no acceleration by running everything on one processor. In fact, you’ll likely slow things down with the added inter-program communications overhead. So, you might choose a multicore processor. Compaan’s HotSpot Parallelizer would seem to be a fast way to accelerate code execution for multicore designs. You might also wish to take some of the C programs in the KPN and transform them into hardware to maximize the acceleration potential. It’s your choice, based on cost/performance tradeoffs that are familiar to every System Realization team.

Compaan’s got some case histories that are certain to interest you. Just ask them to share.

Now the reason I’m writing about this product in my Low-PowerDesign blog is that you must use parallelization to drop power consumption. Stacking every possible task on one multi-GHz processor is not going to result in low-power operation and we should all know that by now. However, there are naysayers who tremble and say “we don’t know how to code for parallel execution.” Well, Compaan’s HotSpot Parallelizer apparently does.
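Here’s a back-of-the-envelope sketch of why going parallel can cut power. It uses the familiar first-order CMOS dynamic power model (power proportional to C times V squared times f) and assumes that four slower cores deliver the same aggregate throughput as one fast core and can run at a lower supply voltage. The capacitance, voltages, and frequencies are invented for illustration; they are not measurements of any real chip, and the model ignores static power and inter-core communication overhead.

```python
def dynamic_power(c_eff, vdd, freq_hz, cores=1):
    """First-order CMOS dynamic power: cores * C * Vdd^2 * f.

    The switching-activity factor is folded into c_eff for simplicity.
    """
    return cores * c_eff * vdd ** 2 * freq_hz


# Illustrative numbers only (assumptions, not data from any real processor).
C_EFF = 1e-9  # effective switched capacitance per core, in farads

single = dynamic_power(C_EFF, vdd=1.2, freq_hz=2e9)           # one fast core
quad = dynamic_power(C_EFF, vdd=0.8, freq_hz=0.5e9, cores=4)  # four slow cores

if __name__ == "__main__":
    print(f"one core   @ 2 GHz,   1.2 V: {single:.2f} W")
    print(f"four cores @ 500 MHz, 0.8 V: {quad:.2f} W")
    print(f"parallel / single ratio: {quad / single:.2f}")
```

With these made-up numbers the four-core version burns less than half the dynamic power, and the whole saving comes from the V-squared term. If you can’t lower the voltage at the lower clock rate, this simple model shows no gain at all.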

Posted in Design, Low-Power, Multicore

Think Globally, Act in Parallel. What can you do with one million ARM cores acting in parallel and how do you get there?

Professor Steve Furber’s SpiNNaker project is in the news again. I wrote about Furber’s massively parallel brain-emulation project back on March 30 after listening to his keynote at this year’s DATE (Design, Automation and Test in Europe) conference in Grenoble, France. (See “The incredible vanishing power of a machine instruction. Is this the way to the brain?”) Furber’s DATE keynote title says it all: “Biologically-inspired massively-parallel architectures—computing beyond a million processors.” Furber and his team are referencing nature to help them tackle the really hard processing problems we need to solve in the future through massively parallel, brain-like computing. Brain-like computing (go slow, go wide, go massively parallel) seems to offer a proven, low-power approach to solving some of these big computational problems.

The SpiNNaker project is again in the news at EETimes Europe (see “A million ARM cores to host brain simulator”) and the idea of harnessing one million ARM processor cores is certainly a big idea. It excites me. However, we’re still at the humble beginnings of the project.

The SpiNNaker project’s first test chip harnesses 18 ARM9 cores on one 130nm chip manufactured by UMC in Taiwan. This is a 100M-transistor chip and, like most many-processor SoCs, the SpiNNaker SoC mostly consists of memory. The memory needs to be close to the processors for speed and for low power consumption and there are 55 32Kbyte SRAM blocks on the SpiNNaker die. That’s about 14 million bits of SRAM and, frankly speaking, that’s really not very much SRAM. Eighteen processors isn’t really a large number of processors either when your stated goal is one million.
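If you want to check the SRAM arithmetic, here’s a quick sketch. The only assumption is that “32Kbyte” means 32 × 1024 bytes; the helper name `sram_bits` is mine.

```python
def sram_bits(blocks, kbytes_per_block):
    """Total SRAM capacity in bits, taking 1 Kbyte = 1024 bytes (assumption)."""
    return blocks * kbytes_per_block * 1024 * 8


if __name__ == "__main__":
    bits = sram_bits(blocks=55, kbytes_per_block=32)
    print(f"{bits:,} bits")  # prints 14,417,920 bits
```

That works out to roughly 14.4 million bits, or about 1.8 million bytes, shared by everything on the die.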

The ARM processors on the SpiNNaker chip use packet communications to emulate the electrical spike communications that occur among the neurons in human and animal brains. From a hardware perspective, I think it’s easy to conceive of a system-level design like this and even conceptually scaling the design to a million connected ARM9 processors isn’t really hard, as long as you don’t try to enumerate all of the processors in your mind. However, with 18 processors per chip, you’ll need approximately 55,600 chips to build an interconnected network of one million processors. That’s still a mighty big box of hardware. More on that in a bit.

The rub is that we really don’t have many good ideas for programming such a massively parallel system. The SpiNNaker project seems to be mostly a hardware endeavor with the explicitly stated intent of developing a hardware testbed for brain researchers who will use SpiNNaker systems for studying various theories of brain function. Presumably, we’ll learn more about massively parallel programming by working with these systems and no doubt we will. As Furber says in a quote published in the EETimes Europe article, “We don’t know how the brain works as an information-processing system, and we do need to find out. We hope that our machine will enable significant progress towards achieving this understanding.”

Each SpiNNaker chip in the current design is bundled with a 166MHz, 1Gbit DDR SDRAM and packaged in a 300-pin BGA package. But we’re not going to be building million-processor testbeds with 18 processors per packaged chip. I’m almost absolutely, positively certain about that. This first SpiNNaker prototype just doesn’t scale to one million processors very easily. So the question is, how to get there?

Well, possible clues to answer that question can be found in two recent entries that I wrote on the EDA360 Insider blog. First, Samsung has just announced successful tapeout of a 20nm test chip incorporating an ARM Cortex-M0 processor core. (See “Samsung 20nm test chip includes ARM Cortex-M0 processor core. How many will fit on the head of a pin?”) Now an ARM Cortex-M0 processor is not as powerful as an ARM9 processor, but then it’s not supposed to be. It’s designed for control-oriented applications and its 3-stage execution pipeline isn’t designed to get maximum speed from any given process technology. However, we’re building a system that emulates a brain that operates at a few hundred Hertz (that’s Hertz, not kilohertz, megahertz, or gigahertz) so I really don’t think the clock speed is all that critical when you’re talking about a million processors. The ARM Cortex-M0 processor core is still a 32-bit RISC processor and I am guessing with a high degree of confidence that it’s fully up to the task of executing the required electrical-spike calculations, albeit not quite as quickly as an ARM9 processor.

What’s interesting about a 12-to-14Kgate ARM Cortex-M0 processor implemented in 20nm process technology is that my calculations suggest that more than half a million ARM Cortex-M0 processors would fit on a chip the size of an Intel “Tukwila” Itanium processor (OK, that’s a big chip, but it’s a commercial one). That calculation is based on the published number for the area required by an ARM Cortex-M0 implemented in 90nm process technology, not 20nm. Now there’s a lot of slop in this calculation. First, there’s the disparity of using 90nm numbers instead of 20nm numbers. Then there’s the disparity caused by putting no memory at all into the calculation. I just mentally tiled processors edge to edge. Ditto, there’s no on-chip interconnect.

So you probably won’t get half a million ARM Cortex-M0 processor cores on one 20nm chip. But you might get 100,000 or 200,000 ARM Cortex-M0 processor cores on a chip along with an interesting amount of memory and the required interconnect. Now we’re talking about only a handful of chips to get to one million processors. We’re talking about a tabletop box. Now we’re getting into the realm of the feasible for million-processor systems.
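Here’s the chip-count arithmetic as a small sketch. The 18-cores-per-chip figure is the current SpiNNaker test chip; the 100,000 and 200,000 figures are just my guesses from above, not anybody’s product plan. The helper name `chips_needed` is made up for this example.

```python
TARGET_CORES = 1_000_000


def chips_needed(cores_per_chip, target=TARGET_CORES):
    """Smallest whole number of chips that provides at least `target` cores."""
    return -(-target // cores_per_chip)  # integer ceiling division


if __name__ == "__main__":
    for cores in (18, 100_000, 200_000):
        print(f"{cores:>7,} cores/chip -> {chips_needed(cores):>6,} chips")
```

At 18 cores per chip you need 55,556 chips. At 100,000 or 200,000 cores per chip, you need 10 or 5.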

The second related blog entry I recently wrote in EDA360 Insider that also bears on this very interesting endeavor was about an announcement from Imec, a global research company. Just days ago, Imec announced that it and its partners successfully assembled a custom logic chip with two DRAMs in a stacked 3D configuration. (See “3D Thursday: IMEC prototypes 3D chip stack, finds some thermal surprises”.) This 3D stacked-chip prototype allowed Imec to test out some process ideas for manufacturing 3D stacked chip assemblies and to make some critical thermal tests to verify thermal models that will be so necessary when 3D assembly goes mass market. The 3D chip stack uses copper-tin micro-bumps and compression bonding for the electrical and mechanical assembly of the chip stack and you can see photos of the assembled stack below.

Here’s a photo of the overall chip stack:

And here’s a close-up of the edge of the chip stack to show the three stacked die.

The 3D Stack’s base chip is approximately 750µm thick. The two top components in the chip stack are each 25µm thick. There’s more technical info in the referenced EDA360 Insider blog.

I am convinced that 3D stacking of logic and RAM chips will be absolutely essential to developing massively parallel, low-power systems like the ones envisioned by the SpiNNaker project. The only way to feed data and instructions to massively parallel processing chips is through large amounts of on-chip memory and through high-bandwidth, low-energy channels connected to large off-chip memories. 3D assembly techniques permit both Wide I/O and high-speed serial I/O channels to work most effectively and at minimal energy levels, and I expect to see rapid adoption of 3D assembly in the next few years, even and perhaps especially in high-volume, cost-sensitive applications such as mobile phone handsets. This is precisely the sort of manufacturing technology we require to think seriously about million-processor systems.

Now all we need to do is figure out how to program them.

Posted in ARM, CMOS, Design, DRAM, Low-Power, Networking, SDRAM, SOC, SRAM

Cadence’s Qi Wang discusses the use of good methodology for low-power, advanced IC designs

You can read Qi Wang’s writeup of a paper on low-power IC design presented by Global Unichip’s Alex Kuo here.

Posted in Low-Power

Richard Goering discusses the low-power aspects of 40nm and 28nm design with Global Unichip’s Alex Kuo

Cadence blogger and long-time EDA editor Richard Goering spent some time at the recent DAC event in San Diego discussing the finer points of 40nm and 28nm design with Global Unichip’s Alex Kuo. Among the interesting tidbits from the interview are the increased use of IP blocks and how that complicates clock trees, the use of DVFS (dynamic voltage and frequency scaling), and how low-power description formats help reduce power consumption during design. You’ll find the interview here.

Posted in Low-Power

Need to cut IP power? (Who doesn’t?) “Press here” says Calypto

All SoCs are built with IP blocks. Some of those are legacy IP blocks. Some are purchased from other vendors. Some are developed in-house. All of them draw power—static and dynamic power. At nanometer lithographies, the way to cut static power is through circuit tricks like high-Vt transistors and by powering down entire blocks when not needed. The way to cut dynamic power within an IP block is to stop clocking anything that doesn’t need to be clocked. Designers can gate clocks during the development of an IP design but what about existing IP blocks? Some can be retrofitted with clock gating but the ease of that exercise depends on how familiar the IP designer is with that IP block and how well documented the block is.

Face it, most IP blocks aren’t that well documented. You may never know enough about the internals of a purchased IP block to fiddle with its clocking. Legacy IP blocks may have been long abandoned by their designers who have gone off to other tasks, other companies, other planes of existence. Even a block you’ve designed yourself may have scrolled off your own internal memory window long ago.

Designers everywhere have a common solution for these sorts of problems. “Give me a tool to do this” they demand from EDA vendors. “I just want to push the button.”

Usually, that’s easier said than done. However, Calypto’s got a tool you can try. It’s called PowerPro and comes in two flavors: CG and MG. The CG flavor is based on the company’s SLEC sequential logic equivalency checker. That’s a tool that checks to see if modified IP block “A prime” works the same as original IP block “A.” It’s a general-purpose EDA tool with a variety of uses and one of those uses is for comparing an IP block’s function before and after clock gating.

Calypto’s PowerPro CG encapsulates the SLEC EDA tool to produce a “done for you” tool that can automatically insert clock gating into an IP design. It also checks to make sure the IP block’s behavior doesn’t change as a result of the added clock gating. Usually, the insertion process takes 4 to 8 hours, according to Calypto CEO Doug Aitelli, who spoke to me about the product at DAC 2011 in San Diego. What do you get for this overnight run? Usually a 10% to 30% reduction in dynamic power, said Aitelli. Sometimes as much as 60%. Not bad for “pushing the button” I’d say.
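For anyone who hasn’t looked at clock gating lately, here’s a toy Python model of the idea. It is emphatically not how PowerPro CG or SLEC work inside; it’s a sketch of one enable-controlled register, with a function name (`run_register`) and stimulus that I invented. Without gating, the register is clocked every cycle and simply reloads its old value when the enable is low. With gating, the clock edge never reaches the register unless the enable is high. The outputs match; the number of clock edges does not.

```python
def run_register(data, enables, gated):
    """Simulate one enable-controlled register over a stream of cycles.

    Returns (outputs, clock_events): the register value after each cycle and
    the number of clock edges that actually reached the register.
    """
    value = 0
    clock_events = 0
    outputs = []
    for d, en in zip(data, enables):
        if gated:
            # Clock gating: the edge is suppressed whenever enable is low.
            if en:
                clock_events += 1
                value = d
        else:
            # No gating: clocked every cycle; a feedback mux reloads the
            # old value when enable is low.
            clock_events += 1
            value = d if en else value
        outputs.append(value)
    return outputs, clock_events


if __name__ == "__main__":
    data = [5, 7, 7, 9, 1, 1, 1, 4]
    enables = [1, 0, 0, 1, 0, 0, 0, 1]
    ref_out, ref_clk = run_register(data, enables, gated=False)
    new_out, new_clk = run_register(data, enables, gated=True)
    print("same behavior:", ref_out == new_out)          # True
    print("clock edges:", ref_clk, "ungated vs", new_clk, "gated")  # 8 vs 3
```

Comparing outputs for one stimulus, as this sketch does, is only a simulation spot check. A sequential equivalence checker is supposed to establish the match for all input sequences, which is why you want a tool for it.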

There’s another flavor of PowerPro called PowerPro MG. Nope, it’s not named for a cute little British sports car; “MG” stands for “memory gating.” We tend to forget that today’s SoCs are more than half memory measured by die area. Usually SRAM. We sort of allude to this fact when we talk about MPSoCs (multiple-processor SoCs). With each of those processors comes a boatload of on-chip SRAM for fast execution. However, we don’t seem to explicitly call out the memory. We tend to ignore it. I guess MMSoC (mostly memory SoC) doesn’t have the same cachet as MPSoC in our processor-centric world.

However, if more than half of an SoC is SRAM, it makes sense to pay some attention to reducing the power consumption of an SoC’s on-chip SRAM blocks. That’s what Calypto’s PowerPro MG does. It can automatically add memory gating to an SoC design by evaluating the design’s behavior across many cycles.

It also goes a step further. Many SRAM blocks for SoCs now have a sleep mode where the memory’s operating power can be reduced by shutting down peripheral circuitry such as address decoders and sense amps while keeping the memory storage array alive. According to Calypto’s Aitelli, most SoC designers find these sleep modes too hard to use, so they simply don’t use them. They don’t have the time. But those sleep modes are still there just waiting to be used. PowerPro MG will add the necessary sleep/wake-up state machine to exploit this little-used memory feature. Push the button, save power.
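To give a feel for what a sleep/wake-up state machine does, here’s a minimal Python sketch. It is my own illustration, not Calypto’s design: the two states, the `idle_threshold` parameter, and the function name `simulate_sleep` are all assumptions, and the model ignores wake-up latency entirely. The memory drops into sleep after a run of idle cycles and wakes when the next access shows up.

```python
def simulate_sleep(trace, idle_threshold=3):
    """Run a per-cycle access trace (1 = access, 0 = idle) through a
    two-state sleep controller.

    Returns (sleep_cycles, wakeups): cycles spent with the memory periphery
    asleep and the number of times an access had to wake it up.
    """
    state = "ACTIVE"
    idle = 0
    sleep_cycles = 0
    wakeups = 0
    for access in trace:
        if state == "SLEEP":
            if access:
                state = "ACTIVE"  # wake up to serve the request
                idle = 0
                wakeups += 1
            else:
                sleep_cycles += 1
        else:  # ACTIVE
            if access:
                idle = 0
            else:
                idle += 1
                if idle >= idle_threshold:
                    state = "SLEEP"  # periphery powers down from the next cycle
    return sleep_cycles, wakeups


if __name__ == "__main__":
    trace = [1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0]
    print(simulate_sleep(trace))  # prints (3, 1)
```

Even this toy version shows the tradeoff a real controller has to manage: a low threshold buys more sleep cycles but also more wake-ups, and each wake-up costs time and energy in real hardware.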

Just a story from a chance meeting at DAC. Par for the course. There’s always something new to learn, something new to try.

To read my blog on the Low-Power Report Card Panel at DAC, click here.

Posted in Clock Gating, CMOS, Design, EDA, Low-Power, SRAM

IBM Researchers Develop Planar, Monolithic, 1-Transistor Graphene IC—Make Graphene Party Like It’s 1959

This week in Science Magazine, IBM researchers published an article documenting the first graphene IC built using recognizable IC processing techniques. The simple 1-transistor, 2-inductor monolithic circuit operates as an RF mixer with a useful operating frequency of 10GHz. The operating speed is not especially impressive. The lithographic geometries are also unimpressive: a 550nm FET gate length corresponds to a process node that dates back well over a decade for silicon IC processing, even if the researchers used e-beam lithography to draw the patterns.

IBM Graphene chip 

What is impressive? It’s the entire package. The thing that differentiated Fairchild Semiconductor’s planar IC concept from Kilby’s IC concept at TI back in 1959 was that the planar IC process could build circuits of increasing, arbitrary complexity using highly automated lithographic printing techniques. It was the beginning of mass production for electronics.

That’s what’s different about this process as discussed in the latest issue of Science Magazine. IBM’s researchers started with a silicon carbide wafer. They grew a two- or three-layer graphene film on the silicon face of the SiC wafer using high-temperature epitaxy. They then patterned the FET gate using PMMA (a transparent acrylic plastic commonly used for e-beam and nanoimprint lithographic processing) and hydrogen silsesquioxane (HSQ, a high-resolution e-beam resist), which they exposed with an electron beam. (The authors admit they could also have used more conventional optical lithography for the geometries used in this experiment.) Researchers removed the excess graphene with an oxygen plasma etch. The FET’s gate dielectric is aluminum oxide, since silicon oxides aren’t normally found in this process. The inductors are patterned aluminum. The entire 3-element circuit operates similarly to an RF mixer built from more conventional silicon counterparts.

Don’t expect to see graphene ICs rolling off the production lines this year or next. That’s not what this demonstration is about. What this exercise proves is that you can indeed make graphene ICs with processing techniques familiar to anyone in the silicon IC manufacturing business. You can also make graphene FETs that operate at interesting frequencies using fairly large geometries by 21st-century standards even though graphene doesn’t have a natural band gap. As IBM’s press release points out, these same researchers have built graphene FETs with much higher operating frequencies using smaller gate lengths, but these earlier experiments did not employ assembly techniques resembling those in common use today for making silicon ICs. Now there’s an initial manufacturing process for mass production of graphene ICs. Now, it gets interesting.

Posted in Graphene