<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Steve Leibson &#187; SOC</title>
	<atom:link href="http://low-powerdesign.com/sleibson/index.php/category/soc/feed/" rel="self" type="application/rss+xml" />
	<link>http://low-powerdesign.com/sleibson</link>
	<description>Leibson's Laws and the Penalties for Breaking Them</description>
	<lastBuildDate>Wed, 01 Feb 2012 00:01:15 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.1.3</generator>
		<item>
		<title>Is 2012 going to be another breakout year for NAND Flash and Low-Power Design?</title>
		<link>http://low-powerdesign.com/sleibson/2012/01/09/is-2012-going-to-be-another-breakout-year-for-nand-flash-and-low-power-design/</link>
		<comments>http://low-powerdesign.com/sleibson/2012/01/09/is-2012-going-to-be-another-breakout-year-for-nand-flash-and-low-power-design/#comments</comments>
		<pubDate>Mon, 09 Jan 2012 13:00:04 +0000</pubDate>
		<dc:creator>sleibson321</dc:creator>
				<category><![CDATA[EDA]]></category>
		<category><![CDATA[Flash]]></category>
		<category><![CDATA[SDRAM]]></category>
		<category><![CDATA[SOC]]></category>
		<category><![CDATA[Video]]></category>
		<category><![CDATA[cadence]]></category>
		<category><![CDATA[IBM]]></category>
		<category><![CDATA[Micron]]></category>
		<category><![CDATA[NAND]]></category>
		<category><![CDATA[Nikon]]></category>
		<category><![CDATA[Samsung]]></category>
		<category><![CDATA[Sony]]></category>

		<guid isPermaLink="false">http://low-powerdesign.com/sleibson/?p=754</guid>
		<description><![CDATA[It’s just one week into the year, and I am increasingly getting the feeling that 2012 is going to be a momentous, tumultuous year for semiconductor technology and low-power system design. Among the many recent events that are giving me this &#8230; <a href="http://low-powerdesign.com/sleibson/2012/01/09/is-2012-going-to-be-another-breakout-year-for-nand-flash-and-low-power-design/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>It’s just one week into the year, and I am increasingly getting the feeling that 2012 is going to be a momentous, tumultuous year for semiconductor technology and low-power system design. Among the many recent events that are giving me this feeling are the changes taking place in the NAND Flash arena. Nearly all low-power system designers depend on NAND Flash in some form because it is currently the technology of choice for storing code and data when a system is in deep low-power/sleep mode or when switched off. We use NAND Flash on chip for microcontrollers. We use NAND Flash chips on board for main storage in mobile phone handsets, tablets, eBook readers, and many other embedded systems. We use NAND Flash cards for removable storage in cameras, camcorders, mobile phone handsets, voice recorders, and media players. Any changes to NAND Flash technology ripple widely through the low-power design landscape like earth tremors.</p>
<p>At least three major changes to NAND Flash technology in the recent past have caught my attention. The first such event I want to discuss in this blog entry is the HMC or Hybrid Memory Cube that Micron first announced last year and is now in joint development with major partners including Samsung and IBM.</p>
<p><a href="http://low-powerdesign.com/sleibson/wp-content/uploads/2012/01/Micron-Hybrid-Memory-Cube.png"><img class="alignright size-full wp-image-756" style="margin: 10px;" title="Micron Hybrid Memory Cube" src="http://low-powerdesign.com/sleibson/wp-content/uploads/2012/01/Micron-Hybrid-Memory-Cube.png" alt="" width="252" height="186" /></a>I previously wrote about the HMC (see “<a href="http://eda360insider.wordpress.com/2011/12/01/3d-thursday-hybrid-memory-cube-does-anyone-know-whats-happening-with-ibm-and-micron/" target="_blank">3D Thursday: Hybrid Memory Cube—Does anyone know what’s happening with IBM and Micron?</a>”) and its design is for high-performance computing systems that require extremely high throughput: 1 Tbit/sec. (See “<a href="http://eda360insider.wordpress.com/2011/08/22/want-to-know-more-about-the-micron-hybrid-memory-cube-hmc-how-about-its-terabitsec-data-rate/" target="_blank">Want to know more about the Micron Hybrid Memory Cube (HMC)? How about its terabit/sec data rate?</a>”) The HMC is a DRAM example of the kinds of memory modules we’re likely to see from the marriage of 3D IC assembly techniques and advanced NAND Flash devices.</p>
<p>The HMC runs many, many TSVs (through silicon vias) up through a stack of as many as four SDRAM die to access the inherent parallelism of the multiple DRAM arrays on each die. Each proprietary DRAM die in the HMC stack has 16 separate memory arrays, resulting in substantial potential parallelism and consequently, substantial potential memory throughput.</p>
<p>However, the high-performance approach of the HMC is not the only way to harness 3D assembly and semiconductor memory. For example, at the end of last year, I wrote an extended blog post describing a thought experiment that employed the HMC design concepts using Wide I/O SDRAM instead of the special DRAM chips in the HMC. (See “<a href="http://eda360insider.wordpress.com/2011/12/28/3d-thursday-lets-end-2011-with-a-high-performance-dram-memory-stack-design-how-would-you-improve-it/" target="_blank">3D Thursday: Let’s end 2011 with a high-performance DRAM memory stack design. How would you improve it?</a>”) Wide I/O SDRAM presents four independent 128-bit DRAM channels to the host system, resulting in a high level of memory parallelism. Just not as high as for the HMC. In fact, the performance is about half that of the HMC but it’s still pretty good. The same parallelism concepts could be applied to NAND Flash devices designed to a similar Wide I/O specification for NAND Flash. The lower interface speeds enabled by a Wide I/O memory interface port really drop power consumption while maintaining good performance through the parallelism uncovered by the access to the multiple on-chip memory arrays.</p>
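<p>To make the “wide and slow” trade-off concrete, here is a rough back-of-the-envelope sketch in Python. The four 128-bit channels come from the paragraph above; the 200 MHz single-data-rate clock and the 32-bit comparison bus are assumptions chosen only for illustration.</p>

```python
# Illustrative arithmetic only. The 200 MHz single-data-rate clock and the
# 32-bit comparison bus are assumptions for this sketch, not figures from the post.

def peak_bandwidth_gbytes(channels: int, bits_per_channel: int, transfers_per_sec: float) -> float:
    """Peak bandwidth in Gbytes/sec for a parallel memory interface."""
    bits_per_sec = channels * bits_per_channel * transfers_per_sec
    return bits_per_sec / 8 / 1e9


def required_transfer_rate(target_gbytes: float, bus_bits: int) -> float:
    """Transfers/sec that a bus of the given width needs to reach the target bandwidth."""
    return target_gbytes * 1e9 * 8 / bus_bits


ASSUMED_CLOCK_HZ = 200e6  # assumption: one transfer per clock on every pin

wide_io = peak_bandwidth_gbytes(channels=4, bits_per_channel=128,
                                transfers_per_sec=ASSUMED_CLOCK_HZ)
narrow_rate = required_transfer_rate(wide_io, bus_bits=32)

print(f"Four 128-bit channels at 200 MHz: {wide_io:.1f} Gbytes/sec")
print(f"A 32-bit bus needs {narrow_rate / 1e9:.1f} Gtransfers/sec to match")
```

<p>Under these assumptions, the narrow bus has to toggle 16 times faster per pin to deliver the same bandwidth, which is the kind of interface speed a wide, parallel port lets you avoid.</p>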
<p>I have not heard of any efforts to adopt the Wide I/O interface spec to NAND Flash devices. Not yet. But the move to extracting parallelism from the arrays on all memory chips is too attractive to ignore in a world that perpetually thirsts for bandwidth at low power.</p>
<p>At the end of the year, two other announcements directly related to NAND Flash memory have caught my eye: the introduction of the XQD memory card format and the ONFI 3.0 interface spec. The CompactFlash Association <a href="http://compactflash.org/2011/compactflash-association-announces-the-first-video-performance-guarantee-vpg-profile-specification/" target="_blank">introduced</a> the XQD memory card format in December 2011. The XQD memory card has a slightly larger footprint than an SD memory card and a somewhat smaller footprint than a CompactFlash (CF) memory card. It’s as thick as a CF card. But the really big difference here is the interface to the memory card. The XQD memory card uses a PCIe (PCI Express) interface clocked initially at 2.5 Gbits/sec, resulting in a maximum write speed of 125 Mbytes/sec.</p>
<p><a href="http://low-powerdesign.com/sleibson/wp-content/uploads/2012/01/Nikon-D4-DSLR.png"><img class="size-full wp-image-757 alignright" style="border: 0px;" title="Nikon D4 DSLR" src="http://low-powerdesign.com/sleibson/wp-content/uploads/2012/01/Nikon-D4-DSLR.png" alt="" width="248" height="238" /></a>That’s really fast and speed is important when you’re shooting large images at a fast rate, which occurs during HD video recording and at high burst speeds in high-resolution digital still cameras. Both such conditions exist in the new Nikon D4 DSLR, which Nikon <a href="http://www.dpreview.com/news/2012/01/06/NikonD4" target="_blank">launched</a> just last week. The Nikon D4 DSLR can shoot 16.2 Mpixel frames at 10 to 11 frames per second. Normally, DSLRs use in-camera RAM to buffer burst-mode still captures but the Nikon D4 DSLR can accept the new XQD memory cards and Sony <a href="http://www.dpreview.com/news/2012/01/06/sony-xqd-memory-cards" target="_blank">introduced</a> the first series of such cards last week, concurrent with Nikon’s introduction of the Nikon D4 DSLR.</p>
<p><a href="http://low-powerdesign.com/sleibson/wp-content/uploads/2012/01/Sony-H-Series-XQD-card.png"><img class="alignright size-full wp-image-758" title="Sony H Series XQD card" src="http://low-powerdesign.com/sleibson/wp-content/uploads/2012/01/Sony-H-Series-XQD-card.png" alt="" width="162" height="227" /></a>Sony claims that its H Series XQD card can accept bursts of 100 uncompressed still images from the Nikon D4 DSLR in continuous shot mode. That’s a huge jump in burst length for a digital still camera and will be invaluable in shooting images of sports activities, for example.</p>
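<p>For a feel for the data volumes involved in such a burst, here is a rough Python estimate. The 16.2 Mpixel frame size, the 10 frames/sec rate, the 100-image burst, and the 125 Mbytes/sec write speed come from the text above. The 14 bits per pixel and the absence of any file overhead are assumptions for illustration.</p>

```python
# Rough burst-mode estimate. ASSUMED_BITS_PER_PIXEL is an assumption for this
# sketch; the other constants are quoted in the post.

SENSOR_PIXELS = 16.2e6          # Nikon D4 frame size
FRAMES_PER_SEC = 10             # low end of the quoted 10 to 11 frames/sec
BURST_FRAMES = 100              # Sony's claimed burst length
CARD_WRITE_MBYTES_PER_SEC = 125 # quoted maximum XQD write speed
ASSUMED_BITS_PER_PIXEL = 14     # assumption: uncompressed raw data, no headers

frame_mbytes = SENSOR_PIXELS * ASSUMED_BITS_PER_PIXEL / 8 / 1e6
sensor_mbytes_per_sec = frame_mbytes * FRAMES_PER_SEC
burst_mbytes = frame_mbytes * BURST_FRAMES
capture_seconds = BURST_FRAMES / FRAMES_PER_SEC
write_seconds = burst_mbytes / CARD_WRITE_MBYTES_PER_SEC

print(f"One frame:        {frame_mbytes:.2f} Mbytes")
print(f"Sensor data rate: {sensor_mbytes_per_sec:.1f} Mbytes/sec")
print(f"100-frame burst:  {burst_mbytes:.0f} Mbytes")
print(f"Capture time:     {capture_seconds:.1f} sec")
print(f"Card write time:  {write_seconds:.2f} sec")
```

<p>With these assumed numbers, the sensor produces data faster than the card’s quoted maximum write speed, so how long a burst can run still depends on how quickly the card drains whatever buffering sits in front of it. Real raw files may be compressed or packed differently, which would change the result.</p>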
<p>One of the secrets behind the XQD card format’s performance is that PCIe interface port, which is also unique in that it is a memory interface and is not derived from a disk interface. That should mean that a host processor doesn’t need a disk controller to operate an XQD card. The card can be mapped to the host processor’s memory bus and the controller can reside in each memory card. Eliminating the disk controller from the serial chain between the processor and the Flash memory chips should cut costs, reduce power consumption, and boost performance.</p>
<p>All of those benefits are welcome in the world of low-power design. After all, do we really need controllers controlling controllers in an efficient system design? I don’t think so.</p>
<p>Now before you bemoan the need for a controller in each memory card, you should be aware that there already is a controller in each CF and SD memory card. You don’t think that NAND Flash arrays already look like disk drives, do you? We do indeed currently have controllers controlling controllers in existing NAND Flash memory subsystems.</p>
<p>A PCIe interface spec should simplify things somewhat.</p>
<p>The third development that’s caught my eye in the Flash memory arena is the announcement of the ONFI 3.0 interface specification for Flash memory. The ONFI (Open NAND Flash Interface) Working Group <a href="http://onfi.org/news-events/onfi-announces-publication-of-the-3-0-standard-pushes-data-transfer-speeds-to-400-mbsec/" target="_blank">introduced</a> the third major revision of the ONFI spec nearly a year ago, in March 2011. What’s new is that there are now products appearing that use ONFI 3.0.</p>
<p>The advantage of the new ONFI specification is that it doubles transfer rates to 400 Mtransfers/sec using the NV-DDR2 200MHz double-data-rate (DDR) protocol while adopting 1.8V SSTL_18 signaling to cut the power dissipation of the interface. See a pattern evolving here? More performance and less power consumption. The question is whether ONFI 3.0 is real. Well, the memories now seem real because <a href="http://low-powerdesign.com/sleibson/wp-content/uploads/2012/01/Intel-Micron-128Gbit-ONFI-3-Flash-chip.png"><img class="alignright size-full wp-image-759" title="Intel Micron 128Gbit ONFI 3 Flash chip" src="http://low-powerdesign.com/sleibson/wp-content/uploads/2012/01/Intel-Micron-128Gbit-ONFI-3-Flash-chip.png" alt="" width="300" height="261" /></a>Intel and Micron jointly <a href="http://newsroom.intel.com/community/intel_newsroom/blog/2011/12/06/intel-micron-extend-nand-flash-technology-leadership-with-introduction-of-worlds-first-128gb-nand-device-and-mass-production-of-64gb-20nm-nand" target="_blank">previewed</a> a 128Gbit NAND Flash device in December with the derivative 64Gbit NAND Flash device going into production now. According to the joint Intel/Micron announcement, the 128Gbit device will be in volume production later this year after a “rapid transition” from the 64Gbit device.</p>
<p>However, an ONFI 3.0 memory device isn’t sufficient. You also need a controller on an SoC that can operate ONFI 3.0 devices. Cadence just <a href="http://www.cadence.com/cadence/newsroom/press_releases/pages/pr.aspx?xml=010912_onfi3" target="_blank">introduced</a> an ONFI 3.0 NAND Flash controller IP block and companion PHY IP today along with appropriate verification IP so it’s now possible to include an ONFI 3.0 NAND Flash controller in an SoC design using the standard ASIC flow.</p>
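<p>The transfer-rate arithmetic behind those ONFI 3.0 numbers can be sketched in a few lines of Python. The 200MHz clock, the DDR protocol, and the doubling claim come from the paragraph above. The 8-bit data bus is an assumption for this sketch.</p>

```python
# ONFI 3.0 transfer-rate arithmetic. ASSUMED_BUS_BITS is an assumption made
# for this sketch; the clock rate and the "doubles" claim come from the post.

CLOCK_HZ = 200e6          # NV-DDR2 clock quoted in the post
TRANSFERS_PER_CLOCK = 2   # double data rate: one transfer on each clock edge
ASSUMED_BUS_BITS = 8      # assumption: byte-wide NAND data bus

transfers_per_sec = CLOCK_HZ * TRANSFERS_PER_CLOCK
mbytes_per_sec = transfers_per_sec * ASSUMED_BUS_BITS / 8 / 1e6
previous_transfers_per_sec = transfers_per_sec / 2  # implied by "doubles"

print(f"Transfer rate:      {transfers_per_sec / 1e6:.0f} Mtransfers/sec")
print(f"Peak bandwidth:     {mbytes_per_sec:.0f} Mbytes/sec on a byte-wide bus")
print(f"Implied prior rate: {previous_transfers_per_sec / 1e6:.0f} Mtransfers/sec")
```

<p>On a byte-wide bus, each transfer moves one byte, so the Mtransfers/sec figure and the Mbytes/sec figure are the same number.</p>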
<p>As you can see, there’s a tremendous amount of new technological development going into NAND Flash memory and I see big things ahead this year, all to the benefit of low-power system designers.</p>
]]></content:encoded>
			<wfw:commentRss>http://low-powerdesign.com/sleibson/2012/01/09/is-2012-going-to-be-another-breakout-year-for-nand-flash-and-low-power-design/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>2011: A great year for low-power design, wasn’t it?</title>
		<link>http://low-powerdesign.com/sleibson/2011/12/17/2011-a-great-year-for-low-power-design-wasn%e2%80%99t-it/</link>
		<comments>http://low-powerdesign.com/sleibson/2011/12/17/2011-a-great-year-for-low-power-design-wasn%e2%80%99t-it/#comments</comments>
		<pubDate>Sat, 17 Dec 2011 18:46:25 +0000</pubDate>
		<dc:creator>sleibson321</dc:creator>
				<category><![CDATA[2.5D]]></category>
		<category><![CDATA[ARM]]></category>
		<category><![CDATA[Design]]></category>
		<category><![CDATA[Low-Power]]></category>
		<category><![CDATA[Microcontroller]]></category>
		<category><![CDATA[Multicore]]></category>
		<category><![CDATA[SOC]]></category>
		<category><![CDATA[2000T]]></category>
		<category><![CDATA[NXP]]></category>
		<category><![CDATA[Virtex-7]]></category>
		<category><![CDATA[Xilinx]]></category>

		<guid isPermaLink="false">http://low-powerdesign.com/sleibson/?p=734</guid>
		<description><![CDATA[2011 was a great year for low-power design. I don’t think I can remember a year as good to low-power designers. I thought I’d devote this blog to a review of some major developments in 2011 that made low-power designers’ &#8230; <a href="http://low-powerdesign.com/sleibson/2011/12/17/2011-a-great-year-for-low-power-design-wasn%e2%80%99t-it/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>2011 was a great year for low-power design. I don’t think I can remember a year as good to low-power designers. I thought I’d devote this blog to a review of some major developments in 2011 that made low-power designers’ lives easier. In fact, there’s so much to talk about that I’m splitting this blog post in two. In this first part, I’ll write about significant developments in standard silicon offerings including microcontrollers, embedded application processors, and FPGAs. In Part B, I’ll discuss some of the year’s most significant developments in design at the silicon level and the implications for people who design ASICs, SoCs, and ASSPs. It truly was a bountiful year.</p>
<h3><strong>Low-power microcontrollers</strong></h3>
<p>If there ever was a year for microcontroller advancement, this was it. Every major microcontroller vendor had something new and exciting on the low-power front. So many developments that I can only hit the highlights:</p>
<p>In August, ARM’s Alan Rampon wrote a blog post listing 17 microcontroller vendors that were offering a broad range of low-power devices based on various ARM Cortex-M series processor cores. The vendor list includes:</p>
<ul>
<li>Analog Devices (Cortex-M3)</li>
<li>Atmel (Cortex-M3)</li>
<li>Broadcom (Cortex-M3)</li>
<li>Cypress Semiconductor (Cortex-M3)</li>
<li>Dust Networks (Cortex-M3)</li>
<li>Ember (Cortex-M3)</li>
<li>Energy Micro (Cortex-M0, M3)</li>
<li>Freescale Semiconductor (Cortex-M4)</li>
<li>Fujitsu (Cortex-M3)</li>
<li>Holtek (Cortex-M3)</li>
<li>Nuvoton (Cortex-M0)</li>
<li>NXP (Cortex-M0, M3, M4)</li>
<li>ON Semiconductor (Cortex-M3)</li>
<li>Samsung (Cortex-M0, M3)</li>
<li>ST Microelectronics (Cortex-M3)</li>
<li>Texas Instruments (Cortex-M3)</li>
<li>Toshiba (Cortex-M3)</li>
</ul>
<p>That list is probably somewhat dated already, but you get the idea. The proliferation of low-power microcontrollers greatly accelerated during 2011. One such development that really sticks in my mind (because it’s recent) is the onset of shipments of the <a href="http://eda360insider.wordpress.com/2011/12/05/asymmetric-dual-core-nxp-lpc4300-microcontrollers-split-tasks-between-arm-cortex-m4-and-m0-cores-cost-3-75-and-up/" target="_blank">NXP Semiconductor LPC4350</a>, which packs an ARM Cortex-M4 and an ARM Cortex-M0 into one microcontroller that costs less than $4 in quantities of 10,000.</p>
<p><a href="http://low-powerdesign.com/sleibson/wp-content/uploads/2011/12/NXP-LPC4350-block-diagram.jpg"><img class="aligncenter size-full wp-image-738" title="NXP LPC4350 block diagram" src="http://low-powerdesign.com/sleibson/wp-content/uploads/2011/12/NXP-LPC4350-block-diagram.jpg" alt="" width="560" height="454" /></a></p>
<p>This microcontroller is on the forefront of a new wave of processor design called “asymmetric multiprocessing” and there’s a real “wave-of-the-future” look to this development. (See “<a href="http://eda360insider.wordpress.com/2011/12/07/more-news-on-the-asymmetric processing-soc-front/" target="_blank">More news on the asymmetric processing SoC front</a>”)</p>
<h3><strong>Asymmetric Multiprocessing</strong></h3>
<p>The microprocessor is 40 years old (last month!) and silicon microprocessor implementations have really advanced over those four decades while many of our design memes have not. In particular, I’m thinking of the meme that says “processors are expensive, so layer as many tasks as possible on a processor to save money.” The net effect of this meme is to make us develop increasingly complex multitasking schemes in an attempt to get processor utilization up to 80% or 90% or perhaps even 95%.</p>
<p>Now any engineer can tell you that when you load any component to near 100%, you have just sent an engraved, gold-plated invitation to Murphy, asking for an audience. In other words, something will go wrong. You won’t always get the latency you expected. You won’t always get the bandwidth you need.</p>
<p>So you’d better ask yourself: Are complex multitasking systems really worth the effort when I can get two 32-bit microprocessor cores in one device for less than $4? You’d better be serious about coming up with that answer. I believe that asymmetric multiprocessing will remake all of design, including low-power design, during this coming decade.</p>
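<p>The warning above about loading a processor to near 100% can be illustrated with a textbook queueing model. The Python sketch below uses the M/M/1 mean-response-time formula, which is an assumption brought in for illustration; real multitasking systems rarely match it exactly, but the shape of the curve is the point. The 1 ms service time is a hypothetical value.</p>

```python
# M/M/1 queue: mean response time = service time / (1 - utilization).
# The model and the 1 ms service time are assumptions for illustration only.

def mean_response_time_ms(service_time_ms: float, utilization: float) -> float:
    """Average time a task spends waiting plus being served in an M/M/1 queue."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in the range [0, 1)")
    return service_time_ms / (1.0 - utilization)


SERVICE_TIME_MS = 1.0  # hypothetical task that needs 1 ms of processor time

for utilization in (0.50, 0.80, 0.90, 0.95, 0.99):
    latency = mean_response_time_ms(SERVICE_TIME_MS, utilization)
    print(f"{utilization:4.0%} loaded -> {latency:6.1f} ms average response time")
```

<p>In this model, going from 80% to 95% utilization quadruples the average response time, and the last few percent are far worse. Moving some tasks to a second inexpensive core lowers the utilization of each one and pulls both back toward the flat part of the curve.</p>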
<p>Asymmetric multiprocessor design wasn’t the only innovation that loomed in 2011. Xilinx finally <a href="http://low-powerdesign.com/sleibson/2011/03/01/xilinx-zynq-epps-create-a-new-category-that-fits-in-among-socs-fpgas-and-microcontrollers/" target="_blank">announced</a> the first four members of its new Zynq 7000 EPP (Extensible Processing Platform) family, which fuses a processor complex containing two ARM Cortex-A9 processor cores with an FPGA fabric.</p>
<p><a href="http://low-powerdesign.com/sleibson/wp-content/uploads/2011/12/Xilinx-EPP-Block-Diagram-v4.jpg"><img class="aligncenter size-full wp-image-737" title="Xilinx EPP Block Diagram v4" src="http://low-powerdesign.com/sleibson/wp-content/uploads/2011/12/Xilinx-EPP-Block-Diagram-v4.jpg" alt="" width="587" height="469" /></a></p>
<p>Now assembling systems with microprocessors and FPGAs isn’t new. In fact, putting processor cores and FPGA fabrics onto the same piece of silicon isn’t particularly new either. However, doing it right? That is new. And this development fits into the low-power design world because putting the processor complex and the FPGA fabric on chip with a massive on-chip interconnect between the two cuts interface power significantly by reducing the interconnect frequency. You don’t need GHz interconnect clock rates when you have thousands of wires for parallel interconnect.</p>
<h3><strong>2.5D IC Assembly</strong></h3>
<p>Speaking of Xilinx, the company started shipping engineering samples of the Virtex-7 2000T FPGAs to customers last month and this too is a low-power design story. The story is completely told in this graphic:</p>
<p>&nbsp;</p>
<p><a href="http://low-powerdesign.com/sleibson/wp-content/uploads/2011/12/Xilinx-Virtex-7-2000T.jpg"><img class="aligncenter size-full wp-image-735" title="Xilinx Virtex 7 2000T" src="http://low-powerdesign.com/sleibson/wp-content/uploads/2011/12/Xilinx-Virtex-7-2000T.jpg" alt="" width="560" height="319" /></a></p>
<p>&nbsp;</p>
<p>The Xilinx Virtex-7 2000T is a very large FPGA with two million logic elements. But it’s not a monolithic piece of silicon. Rather, the Virtex-7 2000T consists of four “identical” FPGA tiles, each with half a million logic elements (and a ton of other stuff). The FPGA tiles are mounted on a silicon interposer, which establishes more than 10,000 connections between adjacent tiles (56,000 connections in total). The silicon interposer is a fascinating piece of technology. It’s a 65nm IC with four layers of metal on each side of the die and no transistors. It’s a silicon circuit board that must be made in a wafer fab. In this case, TSMC owns the fab. The interposer is as large as the stepper reticle will allow. The advantage here is that each FPGA tile is a quarter of the size of the interposer, and die yield has an exponential relationship to die size. The smaller the die, the better the yield percentage. So 2.5D assembly makes a lot of sense in several different ways.</p>
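<p>The exponential relationship between die size and yield can be sketched with the simple Poisson yield model, Y = exp(-A × D0). The model choice, the 8 cm² area for a hypothetical reticle-sized die, and the defect density of 0.25 defects/cm² are all assumptions for illustration. None of them are Xilinx or TSMC figures.</p>

```python
import math

# Poisson yield model: yield = exp(-die_area * defect_density).
# The model, the die area, and the defect density are illustrative assumptions.

def poisson_yield(die_area_cm2: float, defects_per_cm2: float) -> float:
    """Fraction of dice expected to have zero killer defects."""
    return math.exp(-die_area_cm2 * defects_per_cm2)


ASSUMED_BIG_DIE_CM2 = 8.0        # hypothetical monolithic, reticle-sized die
ASSUMED_DEFECTS_PER_CM2 = 0.25   # hypothetical defect density
TILES = 4                        # the 2000T uses four tiles

big_die_yield = poisson_yield(ASSUMED_BIG_DIE_CM2, ASSUMED_DEFECTS_PER_CM2)
tile_yield = poisson_yield(ASSUMED_BIG_DIE_CM2 / TILES, ASSUMED_DEFECTS_PER_CM2)

print(f"Monolithic die yield: {big_die_yield:.1%}")
print(f"Quarter-size tile:    {tile_yield:.1%}")
print(f"Good silicon gained:  {tile_yield / big_die_yield:.1f}x")
```

<p>Note that the chance of four particular adjacent tiles all being good equals the monolithic yield in this model. The gain comes from testing tiles individually, discarding only the bad ones, and assembling known-good tiles on the interposer.</p>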
<p>The 2.5D IC assembly-with-interposer approach taken to create the Xilinx Virtex-7 2000T allows the FPGA tiles to use lower power I/O drivers because these drivers will only be driving short, closely controlled traces between adjacent tiles. That system-design knowledge saves power. Although the Xilinx Virtex-7 2000T uses four identical die fabricated with a 28nm process technology to realize the active elements, 2.5D IC assembly permits heterogeneous die assembly as well, as shown in this <a href="http://eda360insider.wordpress.com/2011/11/16/3d-thursday-how-xilinx-developed-a-2-5d-strategy-for-making-the-worlds-largest-fpga-and-what-the-company-might-do-next-with-the-technology/" target="_blank">image</a> from Xilinx:</p>
<p>&nbsp;</p>
<p><a href="http://low-powerdesign.com/sleibson/wp-content/uploads/2011/12/Xilinx-2000T-next-step.jpg"><img class="aligncenter size-full wp-image-736" title="Xilinx 2000T next step" src="http://low-powerdesign.com/sleibson/wp-content/uploads/2011/12/Xilinx-2000T-next-step.jpg" alt="" width="560" height="171" /></a></p>
<p>&nbsp;</p>
<p>As you can see, 2.5D IC assembly allows designers the freedom to intermix die from radically different IC technologies such as logic, memory (DRAM, Flash, SRAM, etc.), analog, and RF. It’s a pc-board-like technology but on a much smaller scale. The resulting 2.5D device may well be better optimized and cost less than it might if the design team attempted to place everything on one monolithic die. That’s a topic I’ll take up in Part B of this blog entry.</p>
]]></content:encoded>
			<wfw:commentRss>http://low-powerdesign.com/sleibson/2011/12/17/2011-a-great-year-for-low-power-design-wasn%e2%80%99t-it/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>“Watt’s Next?” asks Chris Malachowsky, co-founder, NVIDIA Fellow, and Senior VP of Research</title>
		<link>http://low-powerdesign.com/sleibson/2011/11/10/%e2%80%9cwatt%e2%80%99s-next%e2%80%9d-asks-chris-malachowsky-co-founder-nvidia-fellow-and-senior-vp-or-research/</link>
		<comments>http://low-powerdesign.com/sleibson/2011/11/10/%e2%80%9cwatt%e2%80%99s-next%e2%80%9d-asks-chris-malachowsky-co-founder-nvidia-fellow-and-senior-vp-or-research/#comments</comments>
		<pubDate>Thu, 10 Nov 2011 05:04:49 +0000</pubDate>
		<dc:creator>sleibson321</dc:creator>
				<category><![CDATA[ARM]]></category>
		<category><![CDATA[Low-Power]]></category>
		<category><![CDATA[Multicore]]></category>
		<category><![CDATA[SOC]]></category>
		<category><![CDATA[Cortex-A9]]></category>
		<category><![CDATA[GeForce]]></category>
		<category><![CDATA[GPU]]></category>
		<category><![CDATA[graphics]]></category>
		<category><![CDATA[Kal-El]]></category>
		<category><![CDATA[NVIDIA]]></category>
		<category><![CDATA[QUADRO]]></category>
		<category><![CDATA[Supercomputer]]></category>
		<category><![CDATA[Supercomputing]]></category>
		<category><![CDATA[Tegra]]></category>
		<category><![CDATA[Tesla]]></category>

		<guid isPermaLink="false">http://low-powerdesign.com/sleibson/?p=712</guid>
		<description><![CDATA[Everything, literally everything, we design today is defined by its power consumption, said Chris Malachowsky, an NVIDIA co-founder, fellow, and senior VP of research. Malachowsky spoke yesterday at a luncheon during the ICCAD conference held this week in San Jose, California. At &#8230; <a href="http://low-powerdesign.com/sleibson/2011/11/10/%e2%80%9cwatt%e2%80%99s-next%e2%80%9d-asks-chris-malachowsky-co-founder-nvidia-fellow-and-senior-vp-or-research/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>Everything, literally everything, we design today is defined by its power consumption, said Chris Malachowsky, an NVIDIA co-founder, fellow, and senior VP of research. Malachowsky spoke yesterday at a luncheon during the ICCAD conference held this week in San Jose, California. At the low end of the system spectrum, mobile devices are defined by how much you can do with a Watt. At the high end, supercomputers and supercomputer performance are now defined by how much electricity you can afford. Malachowsky joked that supercomputers now use so much electricity that the local power company is giving them away for free when you sign up for a 2-year service contract, just like mobile phone handsets are subsidized by the carriers here in the US. That’s funny enough to hurt.</p>
<p>Coincidentally, NVIDIA makes chips that go into both wireless handsets at the low end (the company’s Tegra series of processors) and supercomputers at the high end (the company’s Tesla series of GPU—graphics processing unit—processing chips). In between are the original NVIDIA products, the GeForce and QUADRO series of graphics chips and boards. Here’s a graphic of the NVIDIA product line from Malachowsky’s talk:</p>
<p><a href="http://low-powerdesign.com/sleibson/wp-content/uploads/2011/11/NVIDIA-Product-Line.jpg"><img class="aligncenter size-full wp-image-713" title="NVIDIA Product Line" src="http://low-powerdesign.com/sleibson/wp-content/uploads/2011/11/NVIDIA-Product-Line.jpg" alt="" width="560" height="266" /></a></p>
<p>Tegra mobile application processors go into mobile handsets that cost roughly $100 but deliver about 2x the CPU performance and 4x the GPU performance of a PC that sold 10 years ago for about $3200. Here are the specs for comparison:</p>
<p><a href="http://low-powerdesign.com/sleibson/wp-content/uploads/2011/11/PC-versus-handset.jpg"><img class="aligncenter size-full wp-image-714" title="PC versus handset" src="http://low-powerdesign.com/sleibson/wp-content/uploads/2011/11/PC-versus-handset.jpg" alt="" width="560" height="334" /></a> </p>
<p>The quest for performance isn’t going to stop in the handset space so NVIDIA has a roadmap for future processors. The product currently on the market is the Tegra 2 and NVIDIA has already previewed the next step up, code-named Kal-El (Superman’s original name on Krypton). (Note: for more information on Kal-El, see my blog posts “<a href="http://eda360insider.wordpress.com/2011/09/20/processor-wars-nvidia-reveals-a-phantom-fifth-arm-cortex-a9-processor-core-in-kal-el-mobile-processor-ic-guess-why-it%E2%80%99s-there/" target="_blank">Processor Wars: NVIDIA reveals a phantom fifth ARM Cortex-A9 processor core in Kal-El mobile processor IC. Guess why it’s there?</a>” and “<a href="http://eda360insider.wordpress.com/2011/09/23/friday-video-why-do-you-need-four-arm-cortex-a9-processorcores-in-a-mobile-processor-soc/" target="_blank">Friday Video: Why do you need four ARM Cortex-A9 processor cores in a mobile processor SoC?</a>”)</p>
<p>Here’s an NVIDIA roadmap for Tegra processors from Malachowsky’s talk:</p>
<p><a href="http://low-powerdesign.com/sleibson/wp-content/uploads/2011/11/Tegra-Roadmap.jpg"><img class="aligncenter size-full wp-image-715" title="Tegra Roadmap" src="http://low-powerdesign.com/sleibson/wp-content/uploads/2011/11/Tegra-Roadmap.jpg" alt="" width="560" height="352" /></a></p>
<p>Note that NVIDIA likes to use superhero alter-ego names for future Tegra processors.</p>
<p>One thing that’s not progressing quickly on the mobile handset front is battery capacity. Batteries are just not getting better as fast as we’re adding transistors to silicon die thanks to Moore’s Law. As a result, the Tegra processors, like all mobile application processors, are constrained by the amount of power available in a handset.</p>
<p>On the supercomputer front, NVIDIA Tesla GPU chips already power three of the five fastest supercomputers in the world: the Tianhe-1A, the Titan (evolved from the x86 Jaguar), and the Nebulae. These supercomputers use a lot of processors, as you can see from this image:</p>
<p><a href="http://low-powerdesign.com/sleibson/wp-content/uploads/2011/11/Top-Five-supercomputers.jpg"><img class="aligncenter size-full wp-image-716" title="Top Five supercomputers" src="http://low-powerdesign.com/sleibson/wp-content/uploads/2011/11/Top-Five-supercomputers.jpg" alt="" width="560" height="323" /></a></p>
<p>The reason that NVIDIA chips are in supercomputers at all is that researchers and students recognized that NVIDIA’s evolving line of graphics chips contained a lot of parallel processing power and if certain tough math problems and algorithms could be re-expressed to look like problems of drawing and shading triangles, then GPUs could be pressed into service for these other sorts of problems. This conceptual leap resulted in the development of the NVIDIA Tesla line of supercomputing GPU chips.</p>
<p>However, supercomputers are also being constrained by power. Not in how much power is available—it takes megaWatts to run a supercomputer—but by how much power is affordable. And don’t forget, for every megaWatt needed to power the supercomputer, you need a comparable amount of power to cool the supercomputer.</p>
<p>Even the US Department of Energy (DOE) is concerned. It recently put out a Request for Information (RFI) to find out how we might build a 1-Exaflop (an Exaflop is a billion Gigaflops) supercomputer that “only” consumes 20MW (!!!). On the current commercial trajectory, with no extra DOE help, we will eventually be able to build an Exaflop supercomputer, but it will consume four or five times that amount of power, said Malachowsky.</p>
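<p>It helps to turn the DOE target into an efficiency number. The short Python sketch below uses only figures quoted above: 1 Exaflop, 20MW, and the “four or five times” estimate for the commercial trajectory. Treating cooling power as equal to compute power is an assumption based on the “comparable amount” remark earlier.</p>

```python
# Efficiency implied by a 1-Exaflop machine in a 20 MW power budget.
# ASSUMED_COOLING_FACTOR reads "comparable" as "equal", which is an assumption.

EXAFLOP = 1e18             # floating-point operations per second
GIGAFLOP = 1e9
TARGET_POWER_WATTS = 20e6  # the DOE's 20MW goal
ASSUMED_COOLING_FACTOR = 2 # assumption: cooling draws as much as the computer

gigaflops_per_exaflop = EXAFLOP / GIGAFLOP
target_gflops_per_watt = EXAFLOP / TARGET_POWER_WATTS / GIGAFLOP
trajectory_mw = [TARGET_POWER_WATTS * factor / 1e6 for factor in (4, 5)]
trajectory_with_cooling_mw = [mw * ASSUMED_COOLING_FACTOR for mw in trajectory_mw]

print(f"Gigaflops in an Exaflop:   {gigaflops_per_exaflop:.0e}")
print(f"Required efficiency:       {target_gflops_per_watt:.0f} Gigaflops/Watt")
print(f"Commercial trajectory:     {trajectory_mw[0]:.0f} to {trajectory_mw[1]:.0f} MW")
print(f"With cooling (assumption): {trajectory_with_cooling_mw[0]:.0f} to {trajectory_with_cooling_mw[1]:.0f} MW")
```

<p>The 20MW goal works out to 50 Gigaflops per Watt, while the four-to-five-times trajectory corresponds to 10 to 12.5 Gigaflops per Watt for the same machine.</p>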
<p>Why do we need an Exaflop supercomputer? Because simulation has replaced the wet lab, said Malachowsky. Science, all science, needs simulation and the more the better. The more the faster. As Malachowsky said in his talk, science needs 1000x more computing (but without 1000x the power consumption) because simulation or “computational science” has become the third pillar of science.</p>
<p><a href="http://low-powerdesign.com/sleibson/wp-content/uploads/2011/11/Third-Pillar-of-Science.jpg"><img class="aligncenter size-full wp-image-717" title="Third Pillar of Science" src="http://low-powerdesign.com/sleibson/wp-content/uploads/2011/11/Third-Pillar-of-Science.jpg" alt="" width="560" height="353" /></a></p>
<p>(Note: Theory and Experimentation are the first two pillars. Yeah, I didn’t know that either, but I have it on the authority of the <a href="http://www.nitrd.gov/pitac/reports/20050609_computational/computational.pdf" target="_blank">President’s Information Technology Advisory Committee</a>.)</p>
<p><a href="http://low-powerdesign.com/sleibson/wp-content/uploads/2011/11/Process-Fairy.jpg"><img class="alignright size-full wp-image-718" title="Process Fairy" src="http://low-powerdesign.com/sleibson/wp-content/uploads/2011/11/Process-Fairy.jpg" alt="" width="120" height="161" /></a>And get this: no more lazy, lazy processor or system architects. The “process fairies” aren’t working as hard as they used to, said Malachowsky. Oh sure, they’re still bringing us 2x the transistor count with each new IC process step just like Gordon Moore promised way back in 1965. Sure, the process fairies are keeping that promise. But poor Dennard. His observation about power and speed scaling with lithographic geometry—that’s dead. It died at 90nm. Party’s over.</p>
<p>So what? Here’s what. We’re going to have to rethink our approaches to getting more processing performance using less power. Scaling is out and here’s the graphic proof from Malachowsky’s talk:</p>
<p><a href="http://low-powerdesign.com/sleibson/wp-content/uploads/2011/11/A-new-approach-is-needed.jpg"><img class="aligncenter size-full wp-image-719" title="A new approach is needed" src="http://low-powerdesign.com/sleibson/wp-content/uploads/2011/11/A-new-approach-is-needed.jpg" alt="" width="560" height="377" /></a></p>
<p>Without architectural innovation, the average annual rate of processor performance improvement appears to be dropping from 52% to 20%. Architecture is in and we’ve got to get smarter because the process fairies aren’t working as hard as they used to.</p>
<p>What can we do? Well, one approach is already evident in the design of the multicore NVIDIA Kal-El mobile application processor. The Kal-El chip contains five ARM Cortex-A9 processor cores. All five cores share the same architecture, but one of them is synthesized for low-power operation. The other four identical cores are synthesized for maximum performance and consequently draw more power. When the Kal-El chip has a lot of work to do, one or more of the high-performance cores is operating. When there’s just a little work to do, the operating system transfers the work load to the low-power core and shuts down all four of the high-performance cores. The Android OS already knows how to do this.</p>
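<p>The core-switching idea can be sketched in a few lines of Python. This is only a toy model of the behavior described above: the load threshold, the units, and the function name are hypothetical, and it is not how NVIDIA or Android actually make the decision.</p>

```python
import math

# Hypothetical policy sketch; the threshold and names are illustrative,
# not NVIDIA's or Android's actual scheduling logic.
FAST_CORES = 4            # high-performance Cortex-A9 cores (from the text)
LIGHT_LOAD_LIMIT = 0.25   # assumed: at or below this, the companion core suffices


def active_cores(load):
    """Map a workload (in units of one fast core's capacity) to powered cores."""
    if load <= LIGHT_LOAD_LIMIT:
        # Light work: run on the low-power core and power down all fast cores.
        return {"companion": 1, "fast": 0}
    # Heavier work: power up just enough fast cores, up to all four.
    return {"companion": 0, "fast": min(FAST_CORES, math.ceil(load))}


print(active_cores(0.1))  # {'companion': 1, 'fast': 0}
print(active_cores(2.3))  # {'companion': 0, 'fast': 3}
print(active_cores(9.0))  # {'companion': 0, 'fast': 4}
```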
<p>Kal-El’s low-power “companion” ARM Cortex-A9 core is an example of an emerging SoC design style called “dark silicon.” Fortunately, dark silicon is much easier to understand than dark matter or dark energy. Dark silicon simply describes sections of an SoC that are shut down and powered off. In earlier days when there weren’t enough transistors to go around, letting a piece of silicon go dark was unthinkable. In fact, we loaded up a processor with as much work as it could do and perhaps even a little more if we needed to push things. Dark silicon? Fugetaboutit. But now in the multicore era, we’re getting quite used to the idea.</p>
<p>However, dark silicon isn’t going to save us by itself. We need to get smarter about what we do inside of a single core as well, said Malachowsky. We’re going to have to get smart about the energy cost of everything we do inside of a processor core. Malachowsky didn’t directly explain what this means but he did provide a clue.</p>
<p>Here’s a table from Malachowsky’s presentation that shows the energy cost of typical processor transactions:</p>
<p><a href="http://low-powerdesign.com/sleibson/wp-content/uploads/2011/11/Energy-cost-of-actions.jpg"><img class="aligncenter size-full wp-image-720" title="Energy cost of actions" src="http://low-powerdesign.com/sleibson/wp-content/uploads/2011/11/Energy-cost-of-actions.jpg" alt="" width="560" height="366" /></a></p>
<p>Note the pattern in this table. The energy costs for moving operands around on chip are comparable to those for performing a computation. This ratio actually gets worse for data movement as lithographic scaling progresses because gates get smaller but the average wire length and cross-sectional resistance get larger. The energy cost for moving an operand on or off chip is higher still. It takes power to wiggle those printed-circuit board traces.</p>
<p>As I said, it was just a clue.</p>
<p>One of the most interesting parts of Malachowsky’s talk for me was where the funding for this architectural research will come from. I would never have guessed.</p>
<p>Video games.</p>
<p>That would be the two middle NVIDIA product lines shown in the first image in this blog post—GeForce and QUADRO. It seems that the video gaming market is pretty big—about $35 billion per year. That’s bigger than the movie market (and way bigger than EDA). Hard-core gamers will apparently pay handsomely for architectural advances as long as those advances let them shoot faster.</p>
<p>So when we cure cancer, you can thank a gamer. Meanwhile, give some thought to Malachowsky’s words. There are a lot of really sharp ideas for designers of low-power systems in this presentation.</p>
]]></content:encoded>
			<wfw:commentRss>http://low-powerdesign.com/sleibson/2011/11/10/%e2%80%9cwatt%e2%80%99s-next%e2%80%9d-asks-chris-malachowsky-co-founder-nvidia-fellow-and-senior-vp-or-research/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Altera introduces SoC FPGA melding ARM Cortex-A9 dual-core processor complex with a 28nm FPGA fabric</title>
		<link>http://low-powerdesign.com/sleibson/2011/10/11/altera-introduces-soc-fpga-melding-arm-cortex-a9-dual-core-processor-complex-with-a-28nm-fpga-fabric/</link>
		<comments>http://low-powerdesign.com/sleibson/2011/10/11/altera-introduces-soc-fpga-melding-arm-cortex-a9-dual-core-processor-complex-with-a-28nm-fpga-fabric/#comments</comments>
		<pubDate>Tue, 11 Oct 2011 12:00:33 +0000</pubDate>
		<dc:creator>sleibson321</dc:creator>
				<category><![CDATA[ARM]]></category>
		<category><![CDATA[FPGA]]></category>
		<category><![CDATA[Low-Power]]></category>
		<category><![CDATA[SDRAM]]></category>
		<category><![CDATA[SOC]]></category>
		<category><![CDATA[Altera]]></category>
		<category><![CDATA[SoC FPGA]]></category>
		<category><![CDATA[Xilinx]]></category>
		<category><![CDATA[Zynq]]></category>

		<guid isPermaLink="false">http://low-powerdesign.com/sleibson/?p=668</guid>
		<description><![CDATA[Xilinx first started to talk publicly about the fusion of processors and FPGAs—a product now known as Zynq—in 2010 and has announced plans to roll out parts by the end of this year. It was inevitable that Altera would eventually &#8230; <a href="http://low-powerdesign.com/sleibson/2011/10/11/altera-introduces-soc-fpga-melding-arm-cortex-a9-dual-core-processor-complex-with-a-28nm-fpga-fabric/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>Xilinx first started to talk publicly about the fusion of processors and FPGAs—a product now known as Zynq—in 2010 and has announced plans to roll out parts by the end of this year. It was inevitable that Altera would eventually counter with a competing product line. Today the company revealed plans for a line of chips called SoC FPGAs and comparisons between the Altera and Xilinx offerings are inevitable, but let’s look at the details for the Altera offerings.</p>
<p>The SoC FPGA line will include at least 18 different chips with various configurations for the “Hard Processor System” (HPS) and various sizes for the FPGA fabrics connected to the HPS block. In addition, the SoC FPGA product line will be based on two of the Altera 28nm FPGA fabrics—Cyclone V and Arria V—for two different speed grades within the product line. Here’s a generalized block diagram of a device in the product line:</p>
<p><a href="http://low-powerdesign.com/sleibson/wp-content/uploads/2011/10/Altera-SoC-FPGA-Block-Diagram.jpg"><img class="aligncenter size-full wp-image-669" title="Altera SoC FPGA Block Diagram" src="http://low-powerdesign.com/sleibson/wp-content/uploads/2011/10/Altera-SoC-FPGA-Block-Diagram.jpg" alt="" width="513" height="757" /></a></p>
<p>The SoC FPGAs’ HPS is based on two 800MHz ARM Cortex-A9 processor cores with ARM Neon and single/double-precision FPU extensions. Each ARM Cortex-A9 processor has its own L1 caches—separate 32Kbyte L1 caches for instructions and data. The two processor cores share a unified 512Kbyte L2 cache. Each processor also has private interval and watchdog timers. To keep the two processor cores fed with instructions and data, there’s a hard-core, multiport DDR SDRAM controller in the HPS that supports the DDR2 and DDR3 and the LPDDR1 and LPDDR2 SDRAM interface protocols. There’s also a Flash memory controller with a built-in DMA engine. The Flash controller supports NOR and NAND Flash memories including ONFi 1.0 devices and SD, SDIO, and MMC memory cards. In addition, there’s ECC support for the SDRAM and the NAND Flash interfaces.</p>
<p>Next up are the hard-core peripherals within the HPS. There are a lot of them:</p>
<ul>
<li>Two 10/100/1000 Ethernet MACs with DMA</li>
<li>Two USB 2.0 On-the-Go (OTG) controllers with DMA</li>
<li>Four I2C controllers</li>
<li>Two CAN (Controller Area Network) controllers</li>
<li>SPI Master and SPI Slave ports</li>
<li>Two UARTs</li>
<li>General-purpose ports</li>
</ul>
<p>On-chip memory includes 64Kbytes of RAM and a boot ROM.</p>
<p>That’s already quite a lot but then there’s the FPGA section of the SoC FPGA to consider. On-chip FPGA capacity varies depending on whether the particular SoC FPGA device is based on the Cyclone V or Arria V FPGA fabrics. Devices based on the Cyclone V FPGA fabric will be offered with 25K, 40K, 85K, and 110K logic elements. Devices based on the Arria V FPGA fabric will be offered with 350K and 460K logic elements.</p>
<p>The HPS in the Altera SoC FPGA connects to the on-chip FPGA fabric through two 128-bit AXI buses—one for reads and one for writes. As you can see from the block diagram above, the hard-core peripherals not included in the HPS block separately connect to the FPGA fabric. What’s not apparent from the diagram is that the two ARM Cortex-A9 processors share a Snoop Control Unit (SCU) and there&#8217;s an ACP (accelerator coherency port) linking the HPS to the FPGA fabric so it’s possible to engineer accelerators that maintain coherency with the ARM Cortex-A9 processor cores&#8217; caches and implement them using the on-chip FPGA fabric.</p>
<p>In addition to the six FPGA array sizes (four for the devices based on the Cyclone V FPGA fabric and two for devices based on the Arria V FPGA fabric), Altera plans to offer parts with three HPS subsystem configurations: base, mid, and high. Combined with the six FPGA fabric sizes, that means there are at least 18 Altera SoC FPGA parts planned for the initial product lineup. Altera says that there will also be 1-processor variants in the SoC FPGA lineup. Just in case you suspect that’s perhaps a bit underpowered, keep in mind that essentially 100% of all system designs based on microcontrollers use a far less capable processor core than one 800MHz ARM Cortex-A9 core. You might want to check to make sure you’re not becoming overly acclimatized to multicore designs. On the other hand, if you’re running Android then two capable processor cores will come in handy.</p>
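<p>The "at least 18" count is just the six fabric sizes crossed with the three HPS configurations. A quick Python sketch of that enumeration follows; the tuple layout and variable names are illustrative assumptions, and the entries are not Altera part numbers.</p>

```python
from itertools import product

# Fabric sizes (logic elements) and HPS configurations quoted in the text.
cyclone_v_sizes = ["25K", "40K", "85K", "110K"]
arria_v_sizes = ["350K", "460K"]
hps_configs = ["base", "mid", "high"]

fabrics = [("Cyclone V", size) for size in cyclone_v_sizes]
fabrics += [("Arria V", size) for size in arria_v_sizes]

# Every fabric paired with every HPS configuration.
lineup = [(family, size, hps) for (family, size), hps in product(fabrics, hps_configs)]

print(len(fabrics))  # 6
print(len(lineup))   # 18
```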
<p>As the block diagram above shows, there are additional hard-core peripherals connected to the SoC FPGA chip’s FPGA array: as many as three more multiport SDRAM controllers, a Gen2 x4 PCIe port (supplemented with the possibility of implementing a soft Gen2 x8 PCIe port in the FPGA fabric), as many as six 10Gbps high-speed, differential serial transceivers, and as many as thirty 6Gbps high-speed, differential serial transceivers. These additional peripheral ports have separate access paths into the FPGA fabric of the SoC FPGA devices.</p>
<p>Perhaps the most interesting news is that low-end members of the Altera SoC FPGA family will sell for $15 in “high volumes.” That’s a lot of capability for a relatively low price. In fact, that’s a very low price in the FPGA world. The bad news is that Altera doesn’t plan to ship devices until the second half of 2012.</p>
<p>So that’s the Altera SoC FPGA. Now for the inevitable comparison based on my previous write-ups of the Xilinx Zynq. (See “<a href="http://low-powerdesign.com/sleibson/2011/03/01/xilinx-zynq-epps-create-a-new-category-that-fits-in-among-socs-fpgas-and-microcontrollers/" target="_blank">Xilinx Zynq EPPs create a new category that fits in among SoCs, FPGAs, and microcontrollers</a>”.) First, there’s the processor complex—what Altera calls the HPS. The two products are remarkably similar here: two 800MHz ARM Cortex-A9 processor cores with Neon DSP and FPU extensions, 512Kbytes of unified L2 cache, Flash controller, one SDRAM controller, Snoop Control Unit, timers and watchdog, DMA, etc. Both processor complexes support an ACP (Accelerator Coherency Port) interface into the FPGA fabrics.</p>
<p>There’s some difference in the processor complex-to-FPGA connection scheme: Altera offers one 128-bit read and one 128-bit write AXI bus and a 32-bit APB (Advanced Peripheral Bus) port plus additional ports that go directly from the FPGA fabric to the multiport SDRAM controller in the HPS. Xilinx offers four 32-bit and four 64-bit AXI ports plus direct access from the FPGA fabric to the SDRAM controller. So the Xilinx parts theoretically provide more raw interconnect bandwidth between the processor complex and the FPGA fabric than do the Altera parts. It remains to be seen if that raw capability can deliver more bandwidth in practice, but the potential is clearly there.</p>
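<p>As a back-of-the-envelope check on that "raw interconnect" claim, the Python sketch below simply adds up the port widths listed above. It assumes every port runs at the same clock rate and ignores the ACP and the direct fabric-to-SDRAM-controller paths, so treat it as a width comparison only, not a bandwidth measurement.</p>

```python
# Port widths in bits, as listed in the paragraph above.
altera_ports = {"AXI read": 128, "AXI write": 128, "APB": 32}
xilinx_ports = {"AXI 32-bit (x4)": 4 * 32, "AXI 64-bit (x4)": 4 * 64}

altera_bits = sum(altera_ports.values())
xilinx_bits = sum(xilinx_ports.values())

print(altera_bits)  # 288
print(xilinx_bits)  # 384
print(round(xilinx_bits / altera_bits, 2))  # 1.33
```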
<p>But wait! Hold on there! The Altera SoC FPGA parts offer as many as three more SDRAM controllers outside of the hard-core HPS processor complex and those SDRAM controllers can be connected either to devices implemented as soft cores in the on-chip FPGA fabric or through the FPGA fabric to the HPS. That added SDRAM control capability could really be an advantage in systems with extremely high SDRAM bandwidth requirements.</p>
<p>Then there’s the PCIe controller. On the Altera SoC FPGAs, there’s one hard-core Gen2 x4 PCIe port and the possibility of implementing a second, soft-core Gen2 x8 PCIe port in the FPGA fabric. The Xilinx Zynq parts will provide a hard-core Gen2 x4 or x8 PCIe port, depending on the family member. There are additional 10.3Gbps serial channels available on the Xilinx Zynq components, so a soft-core PCIe controller is a possibility, as it is for the Altera SoC FPGAs.</p>
<p>Since I’ve brought up the topic of the FPGA fabric, let’s compare those as well. The various Altera SoC FPGA family members offer six FPGA fabric sizes: 25K, 40K, 85K, 110K, 350K, and 460K logic elements. The announced Xilinx Zynq family offers four fabric sizes: 30K, 85K, 125K, and 235K logic elements. So if you need really big FPGA fabrics to complement the capabilities provided by the processor complexes, then the Altera SoC FPGA family seems to offer more capacity for now. However, should a battlefield form at the high end, you can bet that Xilinx will be filling out the product line at the high end, where there’s more margin to be made.</p>
<p>Finally, there’s pricing and availability. Both companies have announced high-volume unit pricing “below $15” but the Xilinx parts are supposed to be available this year and the Altera parts are scheduled to appear in the latter part of next year.</p>
<p>Together, today&#8217;s Altera SoC FPGA announcement and the previous Xilinx Zynq announcements create a truly exciting new product category—one that fuses FPGAs with high-performance microprocessors in a way guaranteed to dramatically extend the reach of FPGAs. The resulting mixture of capability, performance, power consumption, and cost simply cannot be replicated with a 2-chip design.</p>
<p>I predict that many system designers will be unable to resist this combination. Naysayers will point to previous failed attempts at merging FPGAs and hard microprocessor cores and some will predict a similar fate for this new generation of parts. But much has changed. First, the embedded industry has adopted the ARM architectures and there is a large body of programming talent available for this architecture. Second, these new parts are not FPGAs with processor cores tacked on. They are very capable and complete processor complexes, application processors in their own right, augmented with FPGA fabrics. From my perspective, the Altera SoC FPGAs and Xilinx Zynq parts stand a very good chance of defining a new and vibrant component category.</p>
]]></content:encoded>
			<wfw:commentRss>http://low-powerdesign.com/sleibson/2011/10/11/altera-introduces-soc-fpga-melding-arm-cortex-a9-dual-core-processor-complex-with-a-28nm-fpga-fabric/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Think Globally, Act in Parallel. What can you do with one million ARM cores acting in parallel and how do you get there?</title>
		<link>http://low-powerdesign.com/sleibson/2011/07/16/think-globally-act-in-parallel-what-can-you-do-with-one-million-arm-cores-acting-in-parallel-and-how-do-you-get-there/</link>
		<comments>http://low-powerdesign.com/sleibson/2011/07/16/think-globally-act-in-parallel-what-can-you-do-with-one-million-arm-cores-acting-in-parallel-and-how-do-you-get-there/#comments</comments>
		<pubDate>Sat, 16 Jul 2011 23:47:06 +0000</pubDate>
		<dc:creator>sleibson321</dc:creator>
				<category><![CDATA[ARM]]></category>
		<category><![CDATA[CMOS]]></category>
		<category><![CDATA[Design]]></category>
		<category><![CDATA[DRAM]]></category>
		<category><![CDATA[Low-Power]]></category>
		<category><![CDATA[Networking]]></category>
		<category><![CDATA[SDRAM]]></category>
		<category><![CDATA[SOC]]></category>
		<category><![CDATA[SRAM]]></category>
		<category><![CDATA[Cortex-M0]]></category>
		<category><![CDATA[Intel]]></category>
		<category><![CDATA[Samsung]]></category>
		<category><![CDATA[SpiNNaker]]></category>
		<category><![CDATA[UMC]]></category>

		<guid isPermaLink="false">http://low-powerdesign.com/sleibson/?p=615</guid>
		<description><![CDATA[Professor Steve Furber’s SpiNNaker project is in the news again. I wrote about Furber’s massively parallel brain-emulation project back on March 30 after listening to his keynote at this year’s DATE (Design Automation and Test Europe) conference in Grenoble, France. &#8230; <a href="http://low-powerdesign.com/sleibson/2011/07/16/think-globally-act-in-parallel-what-can-you-do-with-one-million-arm-cores-acting-in-parallel-and-how-do-you-get-there/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>Professor Steve Furber’s SpiNNaker project is in the news again. I wrote about Furber’s massively parallel brain-emulation project back on March 30 after listening to his keynote at this year’s DATE (Design Automation and Test Europe) conference in Grenoble, France. (See “<a href="http://low-powerdesign.com/sleibson/2011/03/30/the-incredible-vanishing-power-of-a-machine-instruction-is-this-the-way-to-the-brain/" target="_blank">The incredible vanishing power of a machine instruction. Is this the way to the brain?</a>”) Furber’s DATE keynote title says it all: “Biologically-inspired massively-parallel architectures—computing beyond a million processors.” Furber and his team are referencing nature to help them tackle the really hard processing problems we need to solve in the future through massively parallel, brain-like computing. Brain-like computing—go slow, go wide, go massively parallel—seems to offer a proven, low-power approach to solving some of these big computational problems.</p>
<p>The SpiNNaker project is again in the news at EETimes Europe (see “<a href="http://www.electronics-eetimes.com/en/a-million-arm-cores-to-host-brain-simulator.html?cmp_id=7&amp;news_id=222908354&amp;vID=209" target="_blank">A million ARM cores to host brain simulator</a>”) and the idea of harnessing one million ARM processor cores is certainly a big idea. It excites me. However, we’re still at the humble beginnings of the project.</p>
<p>The SpiNNaker project’s first test chip harnesses 18 ARM9 cores on one 130nm chip manufactured by UMC in Taiwan. This is a 100M-transistor chip and, like most many-processor SoCs, the SpiNNaker SoC mostly consists of memory. The memory needs to be close to the processors for speed and for low-power consumption and there are 55 32Kbyte SRAM blocks on the SpiNNaker die. That’s 14 million bits of SRAM and, frankly speaking, that’s really not very much SRAM. Eighteen processors isn’t really a large number of processors either when your stated goal is one million.</p>
<p>The ARM processors on the SpiNNaker chip use packet communications to emulate the electrical spike communications that occur among the neurons in human and animal brains. From a hardware perspective, I think it’s easy to conceive of a system-level design like this and even conceptually scaling the design to a million connected ARM9 processors isn’t really hard, as long as you don’t try to enumerate all of the processors in your mind. However, with 18 processors per chip, you’ll need approximately 55,600 chips to build an interconnected network of one million processors. That’s still a mighty big box of hardware. More on that in a bit.</p>
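<p>The SRAM and chip-count figures above are easy to reproduce. Here is a minimal Python sketch of the arithmetic, assuming a Kbyte means 1024 bytes; the constant names are illustrative.</p>

```python
import math

SRAM_BLOCKS = 55           # SRAM blocks on the SpiNNaker die
BLOCK_BYTES = 32 * 1024    # 32 Kbytes per block (assuming 1 Kbyte = 1024 bytes)
CORES_PER_CHIP = 18        # ARM9 cores on the first test chip
TARGET_CORES = 1_000_000   # the project's stated goal

sram_bits = SRAM_BLOCKS * BLOCK_BYTES * 8
chips_needed = math.ceil(TARGET_CORES / CORES_PER_CHIP)

print(sram_bits)     # 14417920, i.e. roughly 14 million bits
print(chips_needed)  # 55556, i.e. approximately 55,600 chips
```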
<p>The rub is that we really don’t have many good ideas for programming such a massively parallel system. The SpiNNaker project seems to be mostly a hardware endeavor with the explicitly stated intent of developing a hardware testbed for brain researchers who will use SpiNNaker systems for studying various theories of brain function. Presumably, we’ll learn more about massively parallel programming by working with these systems and no doubt we will. As Furber says in a quote published in the EETimes Europe article, “We don&#8217;t know how the brain works as an information-processing system, and we do need to find out. We hope that our machine will enable significant progress towards achieving this understanding.&#8221;</p>
<p>Each SpiNNaker chip in the current design is bundled with a 166MHz, 1Gbit DDR SDRAM and packaged in a 300-pin BGA package. But we’re not going to be building million-processor testbeds with 18 processors per packaged chip. I’m almost absolutely, positively certain about that. This first SpiNNaker prototype just doesn’t scale to one million processors very easily. So the question is, how to get there?</p>
<p>Well, possible clues to answer that question can be found in two recent posts that I wrote on the <strong>EDA360 Insider</strong> blog. First, Samsung has just announced successful tapeout of a 20nm test chip incorporating an ARM Cortex-M0 processor core. (See “<a href="http://eda360insider.wordpress.com/2011/07/12/samsung-20nm-test-chip-includes-arm-cortex-m0-processor-core-how-many-will-fit-on-the-head-of-a-pin/" target="_blank">Samsung 20nm test chip includes ARM Cortex-M0 processor core. How many will fit on the head of a pin?</a>”) Now an ARM Cortex-M0 processor is not as powerful as an ARM9 processor, but then it’s not supposed to be. It’s designed for control-oriented applications and its 3-stage execution pipeline isn’t designed to get maximum speed from any given process technology. However, we’re building a system that emulates a brain that operates at a few hundred Hertz (that’s <strong>Hertz</strong>, not kilohertz, megahertz, or gigahertz) so I really don’t think the clock speed is all that critical when you’re talking about a million processors. The ARM Cortex-M0 processor core is still a 32-bit RISC processor and I am guessing with a high degree of confidence that it’s fully up to the task of executing the required electrical-spike calculations, albeit not quite as quickly as an ARM9 processor.</p>
<p>What’s interesting about a 12-to-14Kgate ARM Cortex-M0 processor implemented in 20nm process technology is that my calculations suggest that more than half a million ARM Cortex-M0 processors would fit on a chip the size of an Intel “Tukwila” Itanium processor (OK, that’s a big chip, but it’s a commercial one) and that calculation is based on the published number for the area required by an ARM Cortex-M0 implemented in 90nm process technology, not 20nm. Now there’s a lot of slop in this calculation. First, there’s the disparity of using 90nm numbers instead of 20nm numbers. Then there’s the disparity caused by putting no memory at all into the calculation. I just mentally tiled processors edge to edge. Ditto, there’s no on-chip interconnect.</p>
<p>So you probably won’t get half a million ARM Cortex-M0 processor cores on one 20nm chip. But you might get 100,000 or 200,000 ARM Cortex-M0 processor cores on a chip along with an interesting amount of memory and the required interconnect. Now we’re talking about only a handful of chips to get to one million processors. We’re talking about a tabletop box. Now we’re getting into the realm of the feasible for million-processor systems.</p>
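<p>Here is the "handful of chips" estimate as a short Python sketch. The 100,000 and 200,000 cores-per-chip figures are the guesses from the paragraph above, not measured densities, and the helper name is an assumption.</p>

```python
import math

TARGET_CORES = 1_000_000


def chips_for(cores_per_chip):
    """Whole chips needed to reach the million-core target at a given density."""
    return math.ceil(TARGET_CORES / cores_per_chip)


# Guessed Cortex-M0 densities for a large 20nm die, from the text.
estimates = {density: chips_for(density) for density in (100_000, 200_000)}

print(estimates)  # {100000: 10, 200000: 5}
```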
<p>The second related blog entry I recently wrote in <strong>EDA360 Insider</strong> that also bears on this very interesting endeavor was about an announcement from Imec, a global research company. Just days ago, Imec announced that it and its partners successfully assembled a custom logic chip with two DRAMs in a stacked 3D configuration. (See “<a href="http://eda360insider.wordpress.com/2011/07/14/3d-thursday-imec-prototypes-3d-chip-stack-finds-some-thermal-surprises/" target="_blank">3D Thursday: IMEC prototypes 3D chip stack, finds some thermal surprises</a>”.) This 3D stacked-chip prototype allowed Imec to test out some process ideas for manufacturing 3D stacked chip assemblies and to make some critical thermal tests to verify thermal models that will be so necessary when 3D assembly goes mass market. The 3D chip stack uses copper-tin micro-bumps and compression bonding for the electrical and mechanical assembly of the chip stack and you can see photos of the assembled stack below.</p>
<p>Here’s a photo of the overall chip stack:</p>
<p><a href="http://low-powerdesign.com/sleibson/wp-content/uploads/2011/07/Imec-3D-Chip.bmp"><img class="aligncenter size-full wp-image-616" title="Imec 3D Chip" src="http://low-powerdesign.com/sleibson/wp-content/uploads/2011/07/Imec-3D-Chip.bmp" alt="" /></a></p>
<p>And here’s a close-up of the edge of the chip stack to show the three stacked die.</p>
<p><a href="http://low-powerdesign.com/sleibson/wp-content/uploads/2011/07/Imec-3D-Chip-Closeup.bmp"><img class="aligncenter size-full wp-image-617" title="Imec 3D Chip Closeup" src="http://low-powerdesign.com/sleibson/wp-content/uploads/2011/07/Imec-3D-Chip-Closeup.bmp" alt="" /></a></p>
<p>The 3D Stack’s base chip is approximately 750µm thick. The two top components in the chip stack are each 25µm thick. There’s more technical info in the referenced <strong>EDA360 Insider</strong> blog.</p>
<p>I am convinced that 3D stacking of logic and RAM chips will be absolutely essential to developing massively parallel, low-power systems like the ones envisioned by the SpiNNaker project. First, the only way to feed data and instructions to massively parallel processing chips is through large amounts of on-chip memory and through high-bandwidth, low-energy channels connected to large off-chip memories. 3D assembly techniques permit both Wide I/O and high-speed serial I/O channels to work most effectively and at minimal energy levels and I expect to see rapid adoption of 3D assembly—even and perhaps especially in high-volume, cost-sensitive applications such as mobile phone handsets—in the next few years. This is precisely the sort of manufacturing technology we require to think seriously about million-processor systems.</p>
<p>Now all we need to do is figure out how to program them.</p>
]]></content:encoded>
			<wfw:commentRss>http://low-powerdesign.com/sleibson/2011/07/16/think-globally-act-in-parallel-what-can-you-do-with-one-million-arm-cores-acting-in-parallel-and-how-do-you-get-there/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The DDR4 SDRAM spec and SoC design. What do we know now?</title>
		<link>http://low-powerdesign.com/sleibson/2011/05/12/the-ddr4-sdram-spec-and-soc-design-what-do-we-know-now/</link>
		<comments>http://low-powerdesign.com/sleibson/2011/05/12/the-ddr4-sdram-spec-and-soc-design-what-do-we-know-now/#comments</comments>
		<pubDate>Thu, 12 May 2011 23:42:12 +0000</pubDate>
		<dc:creator>sleibson321</dc:creator>
				<category><![CDATA[DDR4]]></category>
		<category><![CDATA[DRAM]]></category>
		<category><![CDATA[Low-Power]]></category>
		<category><![CDATA[SOC]]></category>
		<category><![CDATA[cadence]]></category>
		<category><![CDATA[Hynix]]></category>
		<category><![CDATA[Nanya]]></category>
		<category><![CDATA[Samsung]]></category>

		<guid isPermaLink="false">http://low-powerdesign.com/sleibson/?p=571</guid>
		<description><![CDATA[DDR4 SDRAM is coming. JEDEC may not have released the final spec yet but Samsung made the first DDR4 memory chip announcement in January of this year—a 2133MHz device built with a 30nm process technology—and Hynix followed suit in April &#8230; <a href="http://low-powerdesign.com/sleibson/2011/05/12/the-ddr4-sdram-spec-and-soc-design-what-do-we-know-now/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>DDR4 SDRAM is coming. JEDEC may not have released the final spec yet but Samsung made the first DDR4 memory chip announcement in January of this year—a 2133MHz device built with a 30nm process technology—and Hynix followed suit in April by announcing a 2400MHz device, also built with a 30nm process technology. Cadence announced a complete DDR4 IP package for SoC designers the same month. (See: “<a href="http://eda360insider.wordpress.com/2011/04/11/memory-to-processors-%E2%80%9Cwithout-me-you%E2%80%99re-nothing-%E2%80%9D-ddr4-is-on-the-way/" target="_blank">Memory to processors: ‘Without me, you’re nothing.’ DDR4 is on the way.</a>”) Nanya “sort of announced” a DDR4 memory device when it appeared in their most recent quarterly report. So there’s visible momentum for the DDR4 specification already even if JEDEC has yet to roll it out.</p>
<p> </p>
<p>At today’s EETimes Virtual SoC event Marc Greenberg from Cadence pulled back the veil on DDR4 a bit more. Here’s what he had to say.</p>
<p> </p>
<p>First, even though we don’t have a final specification, some details are public. DDR4 SDRAMs will have double the maximum capacity of DDR3 SDRAMs. They’ll also have twice the maximum clock frequency. Like DDR3 SDRAMs, DDR4 SDRAMs will have an 8n prefetch (important for cache-line-filling operations) but a DDR4 memory controller must alternate or rotate between SDRAM bank groups for maximum SDRAM performance. That’s a new restriction.</p>
<p> </p>
<p>The DDR4 I/O voltage has been reduced to 1.2V—DDR3 SDRAMs use 1.5V—so you can expect that the DDR4 SDRAMs will consume less power and energy than DDR3 SDRAMs simply from the lower operating voltage and from the more advanced process technology. However, Greenberg warned that some systems might not realize such savings due to architectural issues. In addition, DDR4 SDRAMs will not use stub-series terminated logic drivers. Instead, they’ll use pseudo-open drain (POD) drivers with Vdd terminations. DDR4 memories also have new features to improve signal integrity. They’ll use data bus inversion (DBI, more on that below), on-chip parity detection for the command/address bus, and CRC error detection for the data.</p>
<p> </p>
<p>Because of the higher maximum clock rate, DDR4 memories may permit a pin-count reduction for some SoC designs. How? At double the clock rate, SoC designs can get the same data bandwidth with 16 data bits clocked at 1600MHz (3.2 Gtransfers/sec) as DDR3 designs get with 32 data bits clocked at 800MHz (1.6 Gtransfers/sec). However, there’s a design caveat or two. First, SPB (silicon, package, board) design for DDR4-3200 SDRAM is going to be considerably harder than for DDR3-1600 SDRAM. In addition, most memory experts predict that designs with multiple DDR4 DIMMs on each memory channel will stop working reliably (or at all) at data transfer rates considerably below the 3.2 Gtransfers/sec maximum. Similarly, DIMMs with multiple memory ranks on the board may also fail before the data transfer rate reaches 3.2 Gtransfers/sec.</p>
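<p>The pin-count argument is a simple bandwidth trade: halve the bus width and double the clock. The Python sketch below checks the peak numbers, assuming the DDR3 design being compared has a 32-bit data bus (the width that makes the two bandwidths equal). The function name is illustrative and the result is a theoretical peak, not a sustained rate.</p>

```python
def peak_bandwidth_gbytes(data_bits, clock_mhz):
    """Peak transfer rate in Gbytes/sec for a double-data-rate interface."""
    transfers_per_sec = clock_mhz * 1e6 * 2   # two transfers per clock cycle
    return data_bits / 8 * transfers_per_sec / 1e9


ddr3 = peak_bandwidth_gbytes(32, 800)    # assumed 32-bit DDR3-1600 interface
ddr4 = peak_bandwidth_gbytes(16, 1600)   # 16-bit DDR4-3200 interface

print(ddr3, ddr4)  # 6.4 6.4
```

<p>Same peak bandwidth, 16 fewer data pins, which is where the potential pin-count savings come from.</p>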
<p> </p>
<p>There are a couple of possible solutions to these DDR4 signal-integrity challenges. The first and simplest solution is to allow only one DIMM slot per DDR4 memory channel and allow only single-rank DDR4 DIMMs. The problem with this solution is that it increases the number of SoC memory channels for a given memory capacity and thus drives up the SoC’s pin count, cost, and board-level real estate.</p>
<p> </p>
<p>No one likes any of those consequences. Not at all. So an alternative solution is the use of load-reduced DIMMs (LRDIMMs) as shown in the following figure.</p>
<p> </p>
<p><img class="aligncenter size-full wp-image-572" title="Greenberg - LRDIMM" src="http://low-powerdesign.com/sleibson/wp-content/uploads/2011/05/Greenberg-LRDIMM.jpg" alt="Greenberg - LRDIMM" width="540" height="187" /> </p>
<p>Now you may be familiar with RDIMMs (registered DIMMs) used in servers. RDIMMs have an extra register chip soldered to the board that stores and buffers the address/control information from the memory controller and distributes that information to the memory chips on the DIMM. LRDIMMs also buffer the data lines to present a single load to the memory controller even when multiple memory ranks are soldered to the DIMM. RDIMMs and LRDIMMs increase memory latency, so the DDR4 controller must be able to understand and accommodate this kind of buffering.</p>
<p> </p>
<p>Finally, in the what-we-know category, DDR4 SDRAMs will stay with the 8n prefetch used for DDR3 memories but they will add an extra level of multiplexing so that the memory controller must manage traffic to and from the SDRAM even more carefully than before to extract maximum performance from the device.</p>
<p> </p>
<p>Here is where we leave the known DDR4 world and enter into the realm of conjecture.</p>
<p> </p>
<p>Although there are no public details on how DDR4 SDRAM’s extra multiplexing level works, GDDR5 memory already employs a bank-grouping scheme with an extra level of multiplexing. GDDR5 memory adds new command timings that differ depending on whether successive commands address the same or different bank groups. These extra timings mean that a DDR4-optimized memory controller must be a bit more complex than the controller used for DDR3 memories. The controller needs better command scheduling and it must deal even more efficiently with high-priority memory commands. The Cadence DDR4 memory controller that was just introduced last month has several new features to accommodate the new complexities of the upcoming DDR4 memory protocol, said Greenberg.</p>
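<p>To see why bank-group timings push a controller toward alternating or rotating between groups, here is a toy model. It is a sketch only: the two gap values are invented for illustration and are not GDDR5 or DDR4 timing parameters. The only thing it takes from the description above is that back-to-back commands to the same bank group cost more than commands to different groups.</p>

```python
GAP_SAME_GROUP = 6   # hypothetical cycles between commands to the same bank group
GAP_DIFF_GROUP = 4   # hypothetical cycles between commands to different bank groups

def issue_cycles(bank_groups):
    """Return the cycle at which the last command in the sequence issues."""
    cycle = 0
    for prev, cur in zip(bank_groups, bank_groups[1:]):
        cycle += GAP_SAME_GROUP if prev == cur else GAP_DIFF_GROUP
    return cycle

# The same eight commands, ordered two different ways
clustered = [0, 0, 0, 0, 1, 1, 1, 1]   # stay in one bank group, then the other
rotated = [0, 1, 0, 1, 0, 1, 0, 1]     # alternate between bank groups

print("clustered:", issue_cycles(clustered), "cycles")
print("rotated:  ", issue_cycles(rotated), "cycles")
```

<p>With these made-up numbers the rotated ordering finishes sooner, and that kind of reordering is exactly the job that falls to the controller’s command scheduler.</p>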
<p> </p>
<p>Here’s a table of enhancements made to the controller’s command queue to accommodate DDR4 requirements and maximize memory-subsystem performance:</p>
<p> </p>
<p><img class="aligncenter size-full wp-image-573" title="Greenberg - DDR4 command queue table" src="http://low-powerdesign.com/sleibson/wp-content/uploads/2011/05/Greenberg-DDR4-command-queue-table.jpg" alt="Greenberg - DDR4 command queue table" width="540" height="342" /></p>
<p> </p>
<p>One key feature here is a new command-prioritizing scheme that prioritizes DDR4 commands when they enter the command queue (like the DDR3 version of this controller) and then reprioritizes the commands when they’re about to exit from the queue, to be issued to the DDR4 memory. That part’s new. This new feature allows high-priority commands to go straight to the head of the command queue when they’re received, but the controller can delay the command’s exit from the queue (and the issue of that command to the memory) until the target DDR4 memory page and bank are ready to accept that command. This capability reduces the impact of high-priority commands and helps to maximize memory bandwidth and throughput.</p>
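<p>A small model helps show how prioritizing on entry and reprioritizing on exit work together. This is a sketch of the general idea, not the Cadence controller’s design: the class, the function names, the priority encoding, and the notion of a “ready bank” set are all assumptions made for illustration.</p>

```python
from dataclasses import dataclass

@dataclass
class Command:
    name: str
    priority: int   # larger number = more urgent (hypothetical encoding)
    bank: int       # target bank (hypothetical numbering)

def enqueue(queue, cmd):
    """Entry-side prioritization: keep the queue sorted, most urgent first."""
    queue.append(cmd)
    # Python's sort is stable, so equal-priority commands keep arrival order
    queue.sort(key=lambda c: -c.priority)

def select_for_issue(queue, ready_banks):
    """Exit-side reprioritization: issue the most urgent command whose bank is ready."""
    for i, cmd in enumerate(queue):
        if cmd.bank in ready_banks:
            return queue.pop(i)
    return None  # nothing can issue this cycle

queue = []
enqueue(queue, Command("read A", priority=1, bank=0))
enqueue(queue, Command("write B", priority=1, bank=1))
enqueue(queue, Command("urgent read C", priority=9, bank=2))  # jumps to the head

first = select_for_issue(queue, ready_banks={0, 1})      # bank 2 is still busy
second = select_for_issue(queue, ready_banks={0, 1, 2})  # bank 2 is now ready
print(first.name, "|", second.name)
```

<p>The urgent command sits at the head of the queue but does not stall the memory while its bank is busy. A lower-priority command to a ready bank goes out first, and the urgent one follows as soon as its bank can accept it.</p>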
<p> </p>
<p>Another new controller feature is support for DBI. The following figure illustrates the problem:</p>
<p> </p>
<p><img class="aligncenter size-full wp-image-574" title="Greenberg - DBI" src="http://low-powerdesign.com/sleibson/wp-content/uploads/2011/05/Greenberg-DBI.jpg" alt="Greenberg - DBI" width="540" height="273" /> </p>
<p>The left side of this figure shows four consecutive data transfers. In the first transfer, all of the data bits are “1.” In the second transfer, they’re all “0.” As a result, all data bits change state from the first transfer to the next. This is a bad thing, especially at multi-GHz transfer rates. The effects of capacitive charge and discharge for all data lines at high speeds create a problem called simultaneous switching output (SSO), which stresses the DRAM’s power-distribution system on the chip, in the package, and on the board. The next transfer shows a transition from all zeroes to all ones except for one data bit. Because of capacitive coupling, the data lines making the zero-to-one transition become aggressors that try to induce that lone holdout bit to also make the transition even though it does not want to do so. The fourth transfer exhibits a similar problem. All of the bits make a transition but one bit steadfastly wants to make the transition in the opposite direction. Again, it’s up against a number of aggressors.</p>
<p> </p>
<p>The way to solve this problem, implemented in GDDR5, is to add a DBI bit. The right side of the figure shown above illustrates the same state transitions, but with the addition of a DBI bit. When asserted, the DBI bit indicates that the data bus should be inverted. The inversion state can change from transfer to transfer and it is changed to minimize the number of data-bit state changes and thus minimize the I/O switching current and the number of aggressor bits from one transfer to the next. Again, this is how it’s done for GDDR5 memory. The DBI method used for DDR4 SDRAM is not yet public.</p>
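<p>Here is a rough sketch of the transition-minimizing idea in Python. It is not the GDDR5 encoding rule, and certainly not the unpublished DDR4 one. It simply inverts an 8-bit transfer whenever more than half of the data lines would otherwise toggle. The function names, the 8-bit width, the all-zero starting state, and the four-transfer test pattern (modeled loosely on the figure) are all assumptions.</p>

```python
def popcount(x):
    """Number of 1 bits in x."""
    return bin(x).count("1")

def dbi_encode(words, width=8):
    """Return (wire_value, dbi_flag) pairs chosen to reduce data-line toggles."""
    mask = 2 ** width - 1
    prev = 0  # assume all data lines start low
    encoded = []
    for word in words:
        toggles = popcount((word ^ prev) & mask)
        invert = toggles > width // 2      # more than half the lines would switch
        wire = (word ^ mask) if invert else word
        encoded.append((wire, int(invert)))
        prev = wire                        # compare against what was actually driven
    return encoded

def dbi_decode(encoded, width=8):
    """Undo the inversion at the receiving end."""
    mask = 2 ** width - 1
    return [(wire ^ mask) if dbi else wire for wire, dbi in encoded]

def count_toggles(values, start=0):
    """Total number of line transitions across a sequence of bus values."""
    total, prev = 0, start
    for value in values:
        total += popcount(value ^ prev)
        prev = value
    return total

words = [0xFF, 0x00, 0xFE, 0x01]          # all ones, all zeroes, lone holdouts
encoded = dbi_encode(words)
wires = [wire for wire, _ in encoded]
flags = [flag for _, flag in encoded]

print("raw data-line toggles:    ", count_toggles(words))
print("encoded data-line toggles:", count_toggles(wires))
print("DBI-line toggles:         ", count_toggles(flags))
```

<p>This simple version ignores the toggling of the DBI line itself when deciding whether to invert, so it is not optimal. Even so, it shows how one extra signal can take most of the simultaneous switching off the data bus.</p>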
<p> </p>
<p>With these and other changes to the memory interface specification, SoC designers will need a new tool set to add DDR4 memory interfaces to their designs. That’s why Cadence has introduced a DDR4 IP package and design kit now—because SoC designers preparing early designs that incorporate DDR4 memory need to start now. The Cadence DDR4 offerings include a DDR4-enhanced memory controller (based on the existing, configurable SDRAM controller Cadence obtained when it purchased Denali Software last year), hard and soft DDR4 PHYs, design kits for board- and package-level DDR4 design, and verification IP and memory models for DDR4 memory.</p>
<p> </p>
<p>Over the next several years, we will see DDR4 SDRAM gradually enter and then take over the SDRAM memory market. It’s happened with the DDR, DDR2, and DDR3 SDRAM generations and there’s little reason to believe this won’t happen with DDR4 as well. Greenberg said to expect to see designs from early adopters who need the maximum possible memory-subsystem performance in 2013, early majority adoption for high-performance (but not the most bleeding-edge) designs in 2014, majority adoption in desktop and laptop PCs in 2015, and then pretty much total market penetration all the way down to low-cost devices by 2016.</p>
<p> </p>
<p>Time to get going, isn’t it?</p>
]]></content:encoded>
			<wfw:commentRss>http://low-powerdesign.com/sleibson/2011/05/12/the-ddr4-sdram-spec-and-soc-design-what-do-we-know-now/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The incredible vanishing power of a machine instruction. Is this the way to the brain?</title>
		<link>http://low-powerdesign.com/sleibson/2011/03/30/the-incredible-vanishing-power-of-a-machine-instruction-is-this-the-way-to-the-brain/</link>
		<comments>http://low-powerdesign.com/sleibson/2011/03/30/the-incredible-vanishing-power-of-a-machine-instruction-is-this-the-way-to-the-brain/#comments</comments>
		<pubDate>Wed, 30 Mar 2011 03:25:56 +0000</pubDate>
		<dc:creator>sleibson321</dc:creator>
				<category><![CDATA[Design]]></category>
		<category><![CDATA[Low-Power]]></category>
		<category><![CDATA[Networking]]></category>
		<category><![CDATA[SOC]]></category>

		<guid isPermaLink="false">http://low-powerdesign.com/sleibson/?p=525</guid>
		<description><![CDATA[I attended DATE (Design, Automation and Test in Europe) this month in Grenoble and was fascinated by Steve Furber’s keynote titled “Biologically-inspired massively-parallel architectures—computing beyond a million processors.” Furber’s introductory remarks really clarify what’s been happening to the energy cost per instruction &#8230; <a href="http://low-powerdesign.com/sleibson/2011/03/30/the-incredible-vanishing-power-of-a-machine-instruction-is-this-the-way-to-the-brain/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>I attended DATE (Design, Automation and Test in Europe) this month in Grenoble and was fascinated by Steve Furber’s keynote titled “Biologically-inspired massively-parallel architectures—computing beyond a million processors.” Furber’s introductory remarks really clarify what’s been happening to the energy cost per instruction executed over the past 60 years, and what’s likely to happen in the future. Strike that: make it “what’s got to happen.” Just in case you didn’t know, Furber was the principal designer of the original ARM processor back when “ARM” stood for “Acorn RISC Machine.” Acorn was a leading UK personal computer maker and in the early 1980s, it decided it needed its own microprocessor. The rest, as they say, is history. Acorn is gone. ARM is here, big time.</p>
<p>But back to Furber. Today, he’s the ICL Professor of Computer Engineering at the School of Computer Science, Manchester University, UK and his CV sports a long list of impressive achievements. Let’s just say he’s been busy since leaving ARM. These days, he and his group at Manchester University are developing digital ways to emulate organic brain functions. In essence, his group is developing digital analogs of neural networks. Now electronic neural networks aren’t something new. I can remember discussing them when I was a college freshman. That was 1971. Not new. Not recent.</p>
<p>The Manchester University team is developing an SoC with a “massively” parallel network of eighteen ARM 968 RISC processors all mutually interconnected through a Silistix self-clocked network on chip (NoC). Furber had a hand in the early development of this NoC, also at Manchester University. (See, he’s been busy, like I said.) The project is called SpiNNaker. (<a href="http://apt.cs.man.ac.uk/projects/SpiNNaker/" target="_blank">http://apt.cs.man.ac.uk/projects/SpiNNaker/</a>)</p>
<p>Now there’s a reason for repeatedly emphasizing Furber’s connections to Manchester University and he discussed it in his keynote. Any serious discussion of the history of computing must include the Manchester University Mark I “Baby,” which was the first fully programmable, stored-program digital computer to go online. Baby executed its first program in 1948. ENIAC, developed at the Moore School of Electrical Engineering at the University of Pennsylvania and usually called the first fully electronic computer, was operational two years before the Manchester Baby. But ENIAC was physically programmed with wires—at least initially. Eventually, ENIAC was retrofitted with some programmability but the Manchester Baby was first.</p>
<p>When operational, the Manchester Baby computer executed roughly 800 instructions per second. That was a heck of a lot faster than the mechanical calculators and punched-card equipment of the day but it’s laughably slow when compared to today’s processors. (Even the Intel 4004, the world’s first commercial microprocessor introduced in 1971, executed 108,000 instructions/second.) More to the point for the purposes of this blog, the Manchester Baby consumed approximately 5 Joules of energy to execute each instruction.</p>
<p>Fast forward to today and those ARM 968 microprocessors in the SpiNNaker chip. An ARM 968 processor executes roughly 20 million instructions per second, dissipating 10^-10 Joules per instruction. In other words, the energy needed to execute a machine instruction has improved by a factor of about 50 billion in 60 years.</p>
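<p>If you want to check the factor of 50 billion yourself, the arithmetic is short. The sketch below uses only the figures quoted above. The “implied power” lines are my own back-of-the-envelope multiplication of those figures, not numbers from Furber’s keynote.</p>

```python
import math

# Figures quoted in the text
baby_instr_per_sec = 800          # Manchester Baby
baby_joules_per_instr = 5.0
arm_instr_per_sec = 20e6          # ARM 968
arm_joules_per_instr = 1e-10

# Improvement in energy per instruction
improvement = baby_joules_per_instr / arm_joules_per_instr
print(f"improvement factor: {improvement:.1e}")

# Implied average power = instructions/second x Joules/instruction
baby_watts = baby_instr_per_sec * baby_joules_per_instr
arm_watts = arm_instr_per_sec * arm_joules_per_instr
print(f"implied Baby power: {baby_watts:.0f} W")
print(f"implied ARM 968 power: {arm_watts * 1000:.1f} mW")
```

<p>The ratio comes out at 5 x 10^10, which is the 50 billion quoted above.</p>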
<p>Now the old, worn comparison usually asks you to consider what the world would be like today if automobile manufacturers had improved the energy consumption of their products by a factor of 50 billion in 60 years. That’s not the point here.</p>
<p>Furber’s point is this: if the energy cost per instruction had not improved by such a huge amount since 1948, this world would be a very different place. There would be no cell phones, no iPads, no personal computers, no personal music players, and very few embedded systems of any sort. These would simply be impractical for reasons of all three “P”s: price, performance, and power.</p>
<p>We have relied almost exclusively on Moore’s Law to get to this point.</p>
<p>That ride’s over.</p>
<p>At today’s bleeding-edge IC fabrication process lithographies, 28nm, we’re imaging individual atoms. Layers are a handful of atoms thick. The number of atoms in a transistor is so shockingly few that dopant atoms no longer operate statistically. The resulting on-chip parametric variability is becoming a very real problem that forces physical designers to use bigger and bigger guard bands on design rules. Speed and power gains are slowing from IC generation to generation. We have arrived at the point of rapidly diminishing returns and we’re clearly not getting another factor of 50 billion improvement in the power needed to execute a machine instruction from here on.</p>
<p>Yet the guidepost pointing to lower power operation is frustratingly close and familiar. It sits between your ears. We have chosen to design processors that execute one (or perhaps a few) instructions at one time, but at a very high execution rate. The higher the better. The brain is designed with an entirely different approach. It’s a highly parallel machine where “parallel” means a lot more than 18 processors. The brain contains approximately 10^11 neurons with 10^15 synapses. The neurons are the brain’s processors and the synapse connectivity is the brain’s memory and programming.</p>
<p>Neurons are very simple and very slow processors, but there are a lot of them working in parallel.</p>
<p>The entire human brain operates at roughly 100W (about the power consumption of a PC processor) but the brain runs at 100Hz. Although we can certainly get a lot of processing done with 100W, it’s a drop in the bucket compared to the brain’s audio and visual processing abilities, let alone its ability for abstract thought. And we can’t get anything done at 100Hz. Our programming models cannot currently accommodate brain-style processing. We do not yet understand parallelism on the brain’s scale.</p>
<p>In addition, our processing systems are remarkably intolerant of failure. Microprocessors represent single-point failure nodes in most embedded designs. There are a few exceptions, such as majority-voting avionics systems, where a single-point failure usually means death, so we go “massively” parallel with three processors.</p>
<p>The brain, however, is very tolerant of failure. Our brains lose neurons all the time. In fact, some of us hurry that process a bit by regularly drinking alcohol and killing off a few extra neurons a day. So what? When you’ve got 10^11 neurons, you’re not going to miss a few and of course the brain doesn’t.</p>
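<p>For readers who haven’t met majority voting, the idea fits in a few lines. This is a minimal sketch with a hypothetical function name. Real avionics voters are built in hardware with far more care than this.</p>

```python
from collections import Counter

def majority_vote(outputs):
    """Return the value reported by more than half of the redundant units.

    Returns None when no value has a strict majority.
    """
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) / 2 else None

# Three redundant processors compute the same result; one has failed
print(majority_vote([42, 42, 17]))

# All three disagree, so there is nothing to trust
print(majority_vote([1, 2, 3]))
```

<p>Three units tolerate one faulty unit and no more. Compare that with a system that shrugs off the loss of neurons every day.</p>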
<p>The goal of the SpiNNaker project is to create an early parallel platform that will allow brain researchers to study the operation of a machine that can digitally emulate mechanisms that the brain uses to process a wide range of sensory data, to control an incredibly complex system of muscles and organs, to deal with the complex issues of written and spoken language, and to make huge leaps in abstract thought. SpiNNaker will not produce a leap by a factor of 50 billion, but perhaps it will get us going on the right path, now that we’ve managed to come this far.</p>
]]></content:encoded>
			<wfw:commentRss>http://low-powerdesign.com/sleibson/2011/03/30/the-incredible-vanishing-power-of-a-machine-instruction-is-this-the-way-to-the-brain/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Xilinx Zynq EPPs create a new category that fits in among SoCs, FPGAs, and microcontrollers</title>
		<link>http://low-powerdesign.com/sleibson/2011/03/01/xilinx-zynq-epps-create-a-new-category-that-fits-in-among-socs-fpgas-and-microcontrollers/</link>
		<comments>http://low-powerdesign.com/sleibson/2011/03/01/xilinx-zynq-epps-create-a-new-category-that-fits-in-among-socs-fpgas-and-microcontrollers/#comments</comments>
		<pubDate>Tue, 01 Mar 2011 11:30:14 +0000</pubDate>
		<dc:creator>sleibson321</dc:creator>
				<category><![CDATA[FPGA]]></category>
		<category><![CDATA[IP]]></category>
		<category><![CDATA[Low-Power]]></category>
		<category><![CDATA[LPDDR]]></category>
		<category><![CDATA[LPDDR2]]></category>
		<category><![CDATA[SOC]]></category>

		<guid isPermaLink="false">http://low-powerdesign.com/sleibson/?p=502</guid>
		<description><![CDATA[After telegraphing its punch at ESC last spring, Xilinx has now introduced the first four members of its EPP product line and named them Zynq to differentiate them from the company’s FPGAs. (See “Xilinx redefines the high-end microcontroller with its &#8230; <a href="http://low-powerdesign.com/sleibson/2011/03/01/xilinx-zynq-epps-create-a-new-category-that-fits-in-among-socs-fpgas-and-microcontrollers/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>After telegraphing its punch at ESC last spring, Xilinx has now <a href="http://www.prnewswire.com/news-releases/xilinx-introduces-zynq-7000-family-industrys-first-extensible-processing-platform-117132003.html" target="_blank">introduced</a> the first four members of its EPP product line and named them Zynq to differentiate them from the company’s FPGAs. (See “<a href="http://low-powerdesign.com/sleibson/2010/05/01/xilinx-redefines-the-high-end-microcontroller-with-its-extensible-processing-platform-%E2%80%93-part-1/" target="_blank">Xilinx redefines the high-end microcontroller with its ARM-based Extensible Processing Platform – Part 1</a>” and “<a href="http://low-powerdesign.com/sleibson/2010/05/01/xilinx-redefines-the-high-end-microcontroller-with-its-arm-based-extensible-processing-platform-%e2%80%93-case-studies-%e2%80%93-part-2/" target="_blank">Xilinx redefines the high-end microcontroller with its ARM-based Extensible Processing Platform – Case Studies – Part 2</a>”.) Two of the four Zynq family members are designed for low-power applications and the other two emphasize performance over power. “What’s an EPP?” you might ask. It’s an “Extensible Processing Platform,” a new IC category Xilinx hopes to create. Think of an EPP as an embedded processor with an attached FPGA fabric. “Haven’t they tried this before?” you’re now asking. Yes, they have. This time, the difference is that Xilinx is emphasizing the “processor” aspect of the device over the FPGA aspect—and you can expect that change in emphasis to make all the difference.</p>
<p>The Xilinx Zynq EPP family is designed to wedge in between ASICs or SoCs, microcontrollers, and FPGAs. Xilinx has leveraged its 28nm expertise (earned from developing the company’s Artix/Kintex/Virtex-7 FPGAs) to develop a new type of product that’s mostly hardened processor cores (with associated memory and peripherals), and then added a layer of FPGA fabric, like icing on a cake, to produce a new confection. With the smaller Zynq parts selling for less than $15 in volume, these confections will clearly catch the eye of many, many system designers trying to get the most bang for their silicon buck. Zynq EPPs will be available in first silicon starting in the second half of 2011 with general engineering samples available in 1H2012.</p>
<p>Here’s a family block diagram of the Xilinx Zynq EPPs:</p>
<p><img class="aligncenter size-full wp-image-523" title="Xilinx Zynq Block Diagram v2" src="http://low-powerdesign.com/sleibson/wp-content/uploads/2011/03/Xilinx-Zynq-Block-Diagram-v2.jpg" alt="Xilinx Zynq Block Diagram v2" width="580" height="494" /></p>
<p>At their hearts, each of the four Xilinx Zynq EPPs is a dual-core embedded processor based on two 800MHz ARM Cortex-A9 processors. Each processor is augmented with a copy of ARM’s NEON SIMD engine, a double-precision floating-point unit, 32 Kbytes of instruction cache, and 32 Kbytes of data cache. The two processor cores share a 512Kbyte unified L2 cache. Separate memory controllers, one for DRAM and one for Flash, connect the processor cores to external memory. You need two controllers because DDR DRAMs and Flash devices require radically different control algorithms for optimum operation.</p>
<p>There are a large number of additional peripherals on these chips—all in hard-core form—including two Gigabit Ethernet controllers; two USB 2.0 ports (with USB On-The-Go capability); two SDIO ports for talking to SD Flash media cards; two UARTs; two CAN bus controllers for automotive applications; two 12-bit 1Msample/sec A/D converters with 17 analog inputs; two I2C ports and two SPI ports for talking to serial peripherals; some GPIO pins for whatever else you need to talk to; and an 8-channel DMA controller to move data around the chip.</p>
<p>So far, the Zynq EPPs look like very nice, dual-core embedded processors. What happens next is part of Xilinx’ strategy to create an entirely new product category. Using the ARM AMBA 4 AXI4 interconnect as a connection matrix, Xilinx has driven four 32-bit and four 64-bit AXI4 ports into a block of FPGA fabric. The point of the included FPGA fabric is to allow system designers to create peripheral devices not already on the chip in hard-core form. (Note, Cadence introduced a <a href="http://eda360insider.wordpress.com/2011/02/28/cadence-rolls-out-huge-vip-catalog-merging-verification-ip-from-cadence-with-vip-from-denali-acquisition/" target="_blank">new verification IP catalog</a> with an AMBA4 VIP model just yesterday.)</p>
<p>The actual FPGA fabric capacity included on the Zynq EPPs ranges from 30,000 to 235,000 logic cells, depending on the Zynq family member. Xilinx will tell you that those logic-cell capacities are approximately equivalent to 430,000 to 3.5 million ASIC gates. How did Xilinx get these equivalent numbers? By multiplying by 15. Where did “15” come from? It’s an average, derived from the observation that one logic cell appears to do the job of 10 to 20 ASIC gates across a range of designs. Are the “ASIC gates” equivalencies accurate? Looks like plus or minus 33% to me. The Zynq FPGA fabrics also house block RAMs ranging in capacity from 240 Kbytes to 1.86 Mbytes and they include the usual MACs now commonly found in FPGA fabrics.</p>
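<p>The gate-equivalence arithmetic is easy to reproduce. The sketch below applies the 10, 15, and 20 gates-per-cell factors to the two logic-cell counts quoted above. The function name is made up for illustration. Note that 15 times 30,000 is 450,000 rather than 430,000, so the quoted logic-cell count at the low end is presumably rounded.</p>

```python
GATES_PER_CELL_LOW = 10    # low end of the observed range
GATES_PER_CELL_AVG = 15    # the multiplier used for the quoted equivalents
GATES_PER_CELL_HIGH = 20   # high end of the observed range

def asic_gate_estimate(logic_cells):
    """Hypothetical helper: (low, average, high) ASIC-gate equivalents."""
    return (logic_cells * GATES_PER_CELL_LOW,
            logic_cells * GATES_PER_CELL_AVG,
            logic_cells * GATES_PER_CELL_HIGH)

for cells in (30_000, 235_000):
    low, avg, high = asic_gate_estimate(cells)
    print(f"{cells:,} logic cells: {low:,} to {high:,} gates, average {avg:,}")

# How far the ends of the range sit from the average
spread = (GATES_PER_CELL_HIGH - GATES_PER_CELL_AVG) / GATES_PER_CELL_AVG
print(f"uncertainty: plus or minus {spread:.0%}")
```

<p>The spread works out to plus or minus 33%, so treat any “ASIC gates” figure as a ballpark number and not as a specification.</p>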
<p>Each AMBA4 AXI4 port that bridges the processor complex to the FPGA fabric has a dual arbiter to handle simultaneous accesses from the various masters on the chip. A ninth port, based on the ARM Cortex-A9 ACP (accelerator coherency port) connects the processors’ snoop control unit to the FPGA fabric. The ACP provides a device, such as an external DMA controller, with direct access to CPU-coherent data regardless of where the data is in the CPU cache and memory hierarchy.</p>
<p>The two members of the Zynq family designed for low-power applications incorporate an FPGA fabric based on Xilinx’ Artix-7 FPGAs and the two high-performance members of the Zynq family incorporate an FPGA fabric based on the company’s Kintex-7 FPGAs. The two high-performance Zynq devices also sport either four or twelve 10.3Gbps serial transceiver channels and a PCIe Gen2 controller (4- or 8-lane depending on the Zynq family member).</p>
<p>Notably, it’s the hard-core processor section of the Zynq device that powers up first after a reset, which allows the OS to boot and some of the application code to start executing. This is a familiar environment for any embedded software team. After the processors are up and running, the code can then configure the FPGA fabric.</p>
<p>Here’s a table of the key attributes for the four initial members of the Xilinx Zynq EPP family:</p>
<p><img class="aligncenter size-full wp-image-504" title="Xilinx Zynq Family Table" src="http://low-powerdesign.com/sleibson/wp-content/uploads/2011/02/Xilinx-Zynq-Family-Table.jpg" alt="Xilinx Zynq Family Table" width="600" height="357" /></p>
<p>Enough about the Zynq silicon. The development tools are equally important for such an extensively programmable and configurable device. Xilinx will be providing a $495, Eclipse-based Platform Studio Software Development Kit for the Zynq family. The on-chip ARM Cortex-A9 processor cores open the wide world of ARM’s development ecosystem to design teams using Zynq parts.</p>
<p>There are at least a couple of alternatives for developing peripheral blocks in the Zynq EPP FPGA fabrics. The Xilinx ISE Design Suite is the company’s standard FPGA development environment so any designer accustomed to developing logic designs with Xilinx FPGAs will feel at home. The design suite includes both development tools and plug-and-play peripheral IP with AMBA4 AXI4 interfaces that can be dropped into place on the chips. Xilinx has standardized on the AMBA4 AXI4 interconnect standard for its IP block interfaces for both EPPs and FPGAs. Hence the eight AMBA4 AXI4 ports on the Zynq parts. The Xilinx IP blocks also include bus-functional models for system simulation.</p>
<p>Xilinx has created a compelling value proposition with the new Zynq EPPs. It’s quite common for system-design teams to couple some sort of embedded processor with an FPGA in many designs that haven’t the volume needed to justify the design of a custom SoC. The Zynq EPPs offer yet another alternative, one that merges a dual-core embedded processor with a state-of-the-art FPGA fabric and connects the two with a high-bandwidth connection. Moreover, the Xilinx Zynq EPPs give system designers access to 28nm process technology at a relatively low component cost, low NRE (no need to redesign the processor complex), and zero mask and fab costs.</p>
<p>This mixture of capability, performance, and cost simply cannot be replicated with a 2-chip design. Going forward, few system-design teams will be able to avoid at least considering Zynq EPPs in their preliminary architectural explorations. Sure, if you’re building a mobile telephone handset, then a Zynq EPP clearly isn’t for you. If a low-cost microcontroller selling for a buck or so will do the job, that’s an obvious right choice. Custom SoCs still win the day for high-volume, low-power, high-performance applications. For in-between system designs, Zynq EPPs seem like they’re going to be mighty attractive.</p>
]]></content:encoded>
			<wfw:commentRss>http://low-powerdesign.com/sleibson/2011/03/01/xilinx-zynq-epps-create-a-new-category-that-fits-in-among-socs-fpgas-and-microcontrollers/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>18th Annual Electronic Design Process Symposium brings together the top thinkers of the EDA world, April 7-8, Monterey, CA</title>
		<link>http://low-powerdesign.com/sleibson/2011/02/25/18th-annual-electronic-design-process-symposium-brings-together-the-top-thinkers-of-the-eda-world-april-7-8-monterey-ca/</link>
		<comments>http://low-powerdesign.com/sleibson/2011/02/25/18th-annual-electronic-design-process-symposium-brings-together-the-top-thinkers-of-the-eda-world-april-7-8-monterey-ca/#comments</comments>
		<pubDate>Fri, 25 Feb 2011 07:30:13 +0000</pubDate>
		<dc:creator>sleibson321</dc:creator>
				<category><![CDATA[Design]]></category>
		<category><![CDATA[EDA]]></category>
		<category><![CDATA[Low-Power]]></category>
		<category><![CDATA[SOC]]></category>

		<guid isPermaLink="false">http://low-powerdesign.com/sleibson/?p=497</guid>
		<description><![CDATA[EDPS (The Electronic Design Process Symposium) provides an exchange of ideas among the top thinkers, movers, and shakers who focus on how chips and systems are designed in the electronics industry. Attendees of this elite workshop have met each year &#8230; <a href="http://low-powerdesign.com/sleibson/2011/02/25/18th-annual-electronic-design-process-symposium-brings-together-the-top-thinkers-of-the-eda-world-april-7-8-monterey-ca/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>EDPS (The Electronic Design Process Symposium) provides an exchange of ideas among the top thinkers, movers, and shakers who focus on how chips and systems are designed in the electronics industry. Attendees of this elite workshop have met each year in Sand City (Monterey, CA) since 1993. It has attracted some of the most far-seeing people in electronics as speakers. (Plus me.) It’s a forum for the Design community to discuss EDA’s state-of-the-art with an eye towards improving electronics design processes and EDA/CAD methodologies, rather than focusing on individual tools themselves.</p>
<p><strong>Schedule:</strong></p>
<p><strong>Thursday, April 7, 2011 </strong></p>
<ul>
<li><strong>8:30 AM:</strong> Check-ins and On-site Registration</li>
<li><strong>9:00 AM:</strong> Morning Keynote Speaker</li>
<li><strong>10:00 AM:</strong> Parallel EDA</li>
<li><strong>Noon:</strong> Keynote Speaker</li>
<li><strong>1:30 PM:</strong> High-Level Design</li>
<li><strong>4:30 PM:</strong> Cloud Computing</li>
</ul>
<p><strong>Friday, April 8, 2011</strong></p>
<ul>
<li><strong>9:00 AM:</strong> Low-Power Design</li>
<li><strong>11:30 AM:</strong> 3D ICs</li>
</ul>
<p>This is an inexpensive conclave with an emphasis on discussion and networking. It’s only $280 for IEEE Members; $100 for unemployed IEEE members. Save $50 through March 18th.</p>
<p>Full information and registration: <a href="http://www.eda.org/edps" target="_blank">www.eda.org/edps</a></p>
]]></content:encoded>
			<wfw:commentRss>http://low-powerdesign.com/sleibson/2011/02/25/18th-annual-electronic-design-process-symposium-brings-together-the-top-thinkers-of-the-eda-world-april-7-8-monterey-ca/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The hidden low-power virtues of IP Subsystems</title>
		<link>http://low-powerdesign.com/sleibson/2011/02/08/the-hidden-low-power-virtues-of-ip-subsystems/</link>
		<comments>http://low-powerdesign.com/sleibson/2011/02/08/the-hidden-low-power-virtues-of-ip-subsystems/#comments</comments>
		<pubDate>Tue, 08 Feb 2011 03:39:23 +0000</pubDate>
		<dc:creator>sleibson321</dc:creator>
				<category><![CDATA[Design]]></category>
		<category><![CDATA[IP]]></category>
		<category><![CDATA[Low-Power]]></category>
		<category><![CDATA[SOC]]></category>

		<guid isPermaLink="false">http://low-powerdesign.com/sleibson/?p=493</guid>
		<description><![CDATA[I’ve been writing about Semico’s new IP report for the last couple of weeks over on my EDA360 blog called the EDA360 Insider. (See: “Are IP subsystems the next big IP category?”, “Semico report lists the six biggest issues challenging &#8230; <a href="http://low-powerdesign.com/sleibson/2011/02/08/the-hidden-low-power-virtues-of-ip-subsystems/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>I’ve been writing about Semico’s new IP report for the last couple of weeks over on my EDA360 blog called the EDA360 Insider. (See: “<a href="http://eda360insider.wordpress.com/2011/01/31/are-ip-subsystems-the-next-big-ip-category/">Are IP subsystems the next big IP category?</a>”, “<a href="http://eda360insider.wordpress.com/2011/02/01/semico-report-lists-the-six-biggest-issues-challenging-asic-and-ic-design-today-do-you-agree/">Semico report lists the six biggest issues challenging ASIC and IC design today. Do you agree?</a>”, “<a href="http://eda360insider.wordpress.com/2011/02/03/semico%e2%80%99s-list-of-10-reasons-why-it%e2%80%99s-taken-so-long-for-soc-design-teams-to-adopt-ip-how-many-apply-to-your-team/">Semico’s list of 10 reasons why it’s taken so long for SoC design teams to adopt IP. How many apply to your team?</a>”, and “<a href="http://eda360insider.wordpress.com/2011/02/07/more-on-ip-subsystems-from-semico%e2%80%99s-ip-report/">More on IP Subsystems from Semico’s IP report</a>”.) All of these discussions revolve around Semico Senior Analyst Rich Wawrzyniak’s concept of IP Subsystems. If you’re not familiar with this concept, you’re not alone, but most of the top 50 semiconductor vendors employ IP Subsystems to save them project time, energy, resources, and time to market when they design chips.</p>
<p>You’re no doubt quite familiar with IP blocks by now if you are at all involved in IC and SoC development. IP blocks are chunks of proven design, consisting of RTL descriptions or possibly hardened designs targeting specific IC process technologies. These IP blocks are equivalent to the 24-, 28-, and 40-pin DIPs of yesteryear—ICs that embodied specific functions such as DMA, video, floppy, and hard-disk controllers. Those chips helped system designers quickly build microprocessor-based systems by providing substantial design expertise from the semiconductor vendors but pre-packaged in silicon and epoxy. IP blocks are a lot like that in my mind, although there are some people who have spent their entire lives in the world of IC design who dispute any suggested similarity.</p>
<p>Beyond saving time, (human) energy, resources, and time to market, IP design blocks from reputable vendors also provide some pretty substantial benefits that often seem underappreciated. That’s what I want to discuss in this blog post. First, there’s the correctness of the design for purchased IP. Now if you’ve licensed a few IP blocks already, that last statement might cause you to spew your chocolate milk (or your Starbucks Iced Skinny Caramel Macchiato) from your nose. Not all IP is quality IP and when it’s bad, as the saying goes, it’s awful. On the other hand, when it’s good, it’s very, very good.</p>
<p>What makes it good? All of the pioneers in front of you who took the arrows in the chest trying out the IP. Those pioneers discovered the rough edges so you didn’t need to. Think there are never any rough edges? You must be the only engineer in the world who produces first-time-right designs every time. Everything needs to be debugged if it’s complex enough. And make no mistake, everything we do at 90nm and below is complex. Really, really complex. And the further you go down the lithographic rabbit hole, the more complex things become.</p>
<p>But “good” used as a descriptor for IP has many facets and error-free functional operation is only one of them. You expect purchased IP to work—at least you should expect that. Actually, you should demand it. But you can get more from purchased IP and for the purposes of this Web site, LowPowerDesign.com, you should expect one of those extra facets of goodness to be proven low-power operation. That starts with, but certainly doesn’t end with, clock gating. There are excellent automated tools that can discover many opportunities to gate clocks that humans simply won’t find—especially in complex IP blocks. The opportunities are especially good for pounding the extraneous clocks out of processor IP because all processors do is execute instructions. Put a big enough, varied enough instruction stream through a processor and you’ll find the vast majority of clock-gating opportunities.</p>
<p>But there are other opportunities to cut power consumption as well. Slow nodes, the ones with timing slack to spare, can be detected; there, drive strength can be reduced and transistors with higher threshold voltages can be substituted to cut leakage.</p>
<p>IP Subsystems can provide even more opportunity to cut operating power if the subsystems are large enough. IP block designers can insert power islands into subsystems, allowing the SoC to power up only the portions of an IP Subsystem needed to meet the immediate performance demands. For example, video and audio subsystems can be made amenable to this sort of power savings.</p>
<p>However, you usually won’t see this sort of deep, detailed design in use-once IP. Why? I think because it’s just not worth the effort and there’s no time to do it for a one-off IP design. The motivation’s only there if the IP block is going to be licensed over and over again. In general, I think that only a commercial IP vendor can expend the required effort because of the expected return on investment. Going forward, the power-saving abilities of complex IP blocks (IP Subsystems) will help determine the commercial viability of the IP.</p>
]]></content:encoded>
			<wfw:commentRss>http://low-powerdesign.com/sleibson/2011/02/08/the-hidden-low-power-virtues-of-ip-subsystems/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>


