Solve Portable Design Problems Using Convenient Concurrency
SMP multicore processors offer many advantages in portable products—if they’re properly designed.
Discussions of multicore chips, multiprocessors, and associated programming models for portable system design continue to be narrowly bounded by a focus on individual, general-purpose processor architectures, DSPs, and RTL blocks, which severely limits the possible ways in which you might use multiple computing resources to attack problems. Big semiconductor and server vendors offer symmetric multiprocessing (SMP) multicore processors, with each core supporting multiple threads. Such multicore chips are found in large servers and laptops. However, these power-hungry, general-purpose multiprocessor arrays do not serve well as processing models for many portable systems.
Large servers and farms support applications such as web query requests that follow a “SAMD” model: single application, multiple data (an oversimplification, perhaps, but a useful one). SAMD applications date back to early mainframe days when computers were dedicated to one application such as real-time airline reservations and check-in systems or real-time banking. These big applications now run on servers (many are web-based), and they are particularly suitable for SMP multicore processors: all of the processors run the same kind of code, the programs do not exhibit data locality, and the number of cores running the application makes no material difference other than execution speed.
What stimulates a lot of interest, excitement, and worry these days is the application of the same SMP multicore chips to embedded designs, particularly portable products. Here, the main concern is that very few applications running on such machines are “embarrassingly parallel” applications that can be cut up into multiple threads, each acting in parallel on part of the data. Graphics and video processing are embarrassingly parallel and that parallelism is exploited in special-purpose graphics engines such as the IBM-Toshiba-Sony Cell processor (interestingly, not really an SMP multicore machine) and PC graphics chips offered by Nvidia and others.
Software Engineers Don’t Think in Parallel
Thinking about applying such SMP architectures to portable systems immediately draws attention to a tool problem: practical software-development tools that can automatically distribute a large, single-threaded application across many processors are simply not available. While hardware-description languages such as Verilog easily express parallel operations, software languages such as C, the current king of embedded software languages, are specifically designed to express single-threaded algorithms. A dilemma.
There have been many attempts—such as Concurrent C, Unified Parallel C (UPC), mpC, pC, and others—to extend C into the parallel-programming domain. Sometimes these approaches use special libraries and APIs to allow explicit identification of parallel processes and the communications between them. MPI and OpenMP come to mind. Other researchers have attempted to create entirely new software languages that implicitly incorporate parallel programming structures, or explicitly allow concurrency to be expressed (remember Occam for the Transputer?). It may be that we are so steeped in single-tasking algorithmic culture (recipes, business procedures, first-aid techniques, etc.) that we have a hard time visualizing concurrent processes. For whatever reason, it appears to be very hard to train software programmers to think in terms of parallel operations. Barring breakthroughs in programmer training or in automated software parallelization, the future economics of SMP multicore chips remains perplexing for most portable applications.
However, expanding our architectural thinking beyond SMP multicores uncovers at least two kinds of easily used concurrency that rely on heterogeneous, not homogeneous, processors. These approaches are better suited to portable applications. Both architectural styles fit very well into most 21st-century consumer devices including cell phones, portable multimedia players, and multifunction devices.
You might call the first sort of parallelism “compositional concurrency,” where various subsystems—each containing one or more processors optimized for a particular set of tasks—are woven together into a product. Communications within this architectural design style are structured so that subsystems interact only when needed. For example, a user-interface subsystem running on a controller may need to switch audio processing on or off; to control the digital camera; or to manage video processing by stopping, pausing, or changing video playback in some other manner. In this kind of concurrent system, many subsystems operate simultaneously but they interact at only a high level and do not clash.
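The interaction pattern just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Subsystem` class, the command strings, and the use of threads and queues to stand in for separate on-chip processors are all hypothetical and are not taken from any particular product. The point is that the user-interface controller touches the other subsystems only through a few high-level commands.

```python
import queue
import threading


class Subsystem(threading.Thread):
    """Hypothetical stand-in for a task-specific subsystem.

    Each subsystem owns its state and reacts only to high-level
    commands arriving in its private inbox.
    """

    def __init__(self, name):
        super().__init__(name=name)
        self.inbox = queue.Queue()
        self.state = "off"
        self.log = []

    def run(self):
        while True:
            command = self.inbox.get()
            if command == "shutdown":
                break
            # The only coupling to the rest of the system is this command.
            self.state = command
            self.log.append(command)


def run_demo():
    """Play the role of the user-interface controller."""
    audio = Subsystem("audio")
    video = Subsystem("video")
    camera = Subsystem("camera")
    subsystems = [audio, video, camera]
    for subsystem in subsystems:
        subsystem.start()

    # High-level control messages; no shared data structures.
    audio.inbox.put("on")
    video.inbox.put("play")
    video.inbox.put("pause")
    camera.inbox.put("capture")

    for subsystem in subsystems:
        subsystem.inbox.put("shutdown")
    for subsystem in subsystems:
        subsystem.join()
    return {s.name: (s.state, s.log) for s in subsystems}


if __name__ == "__main__":
    print(run_demo())
```

Because each subsystem sees only its own inbox, each one can be tested alone by feeding it commands, which is the property that makes this style easy to validate.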
Figure 1: Super 3G Cell Phone Block Diagram
Figure 1 shows a block diagram of a Super 3G mobile phone that illustrates this idea. There are 18 identified processing blocks (shown in gray), each with a clearly defined task. In this example, it’s easy to see how one might use as many as 18 processors (or more for sub-task processing) to divide and conquer this problem.
Some criticize this sort of architectural design style because it’s theoretically inefficient in terms of gate and processor count. Ten, twenty, or more processor cores could, at least in theory, be replaced with just a few general-purpose cores (perhaps SMP coherent multicores) running at much higher clock rates.
This criticism is misplaced. While Moore’s Law (providing more transistors at each fabrication node) marched in lockstep with Dennard (classical) scaling (which provided faster, lower-power transistors at each fabrication node), the big, fast processor design style held sway. Dennard scaling faltered at 90nm; power dissipation and energy consumption became unmanageable at high clock rates; and system designers must now adopt design styles that reduce system clock rates.
A compositionally concurrent design style offers tremendous advantages:
- Distributing computing tasks over more on-chip processors trades additional transistors for a lower clock rate to reduce overall power and energy consumption. Given the continued progress of Moore’s Law and the end of Dennard scaling, this is a good engineering trade-off because energy consumption rises superlinearly with clock frequency (higher clock rates generally demand higher supply voltages). In addition, lower clock rates remove the need to run in the fastest possible process technology. Using a low-power process technology at any given fabrication node can reduce leakage current by as much as three orders of magnitude! That’s why lowering clock rate is critically important for portable systems, which are in standby mode most of the time, so leakage currents largely determine energy consumption and therefore battery life.
- Dedicated subsystems can be easily powered down when not used. They can also be shut off and restarted quickly.
- Because these subsystems are task-specific, each processor in the system can be an application-specific instruction-set processor (ASIP) that is much more area- and power-efficient than a general-purpose processor, so the gate-count advantage of using fewer general-purpose cores may be much smaller than it first seems.
- This design style avoids complex interactions and synchronizations between subsystems that are common with SMP and multithreaded designs. Proving that a 4-core SMP system running a cell phone and its audio, video, and camera functions will not drop a 911 emergency call when other applications are running, or that low-priority applications will be properly suspended when a high-priority task interrupts, often invokes an analysis nightmare—“death by simulation.” Reasonably independent subsystems interacting at a high level are far easier to validate both individually and compositionally.
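The transistors-for-clock-rate trade in the first advantage above can be made concrete with a rough calculation. The sketch below uses the standard first-order model for CMOS dynamic power, P proportional to C x V^2 x f. The specific figures (four cores, a quarter of the clock rate, 70% of the supply voltage) are illustrative assumptions, not measurements, and the model ignores leakage and any overhead of splitting the work.

```python
def relative_dynamic_power(n_cores, freq_scale, voltage_scale):
    """Dynamic power relative to one core at full clock and full voltage.

    First-order CMOS model: P ~ N * C * V^2 * f, with the switched
    capacitance C per core assumed unchanged.
    """
    return n_cores * voltage_scale ** 2 * freq_scale


# Baseline: one fast general-purpose core.
baseline = relative_dynamic_power(1, 1.0, 1.0)

# Assumed alternative: the same work spread over four cores, each at a
# quarter of the clock rate, with supply voltage lowered to 70%
# (a hypothetical figure; real values depend on the process).
spread = relative_dynamic_power(4, 0.25, 0.7)

if __name__ == "__main__":
    print(f"baseline: {baseline:.2f}, four slow cores: {spread:.2f}")
```

Under these assumptions the four slower cores deliver the same aggregate cycles per second for roughly half the dynamic power. All of the saving comes from the lower voltage that the lower clock rate permits; with the voltage left unchanged, the two cases would draw the same dynamic power.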
Divide and Conquer
Design tools to support this type of system-design style already exist in the form of system-simulation tools based on SystemC. Various subsystems can be written in C (reminder: that’s the software-programming language that everyone already knows how to use), can be proven individually, and can then be simulated as a system using instruction-set simulators that are hundreds or thousands of times faster than the gate-level simulators needed for RTL simulation. This speed advantage grants system designers the luxury of trying different system architectures and choosing the best one, instead of today’s situation where system architecture is often selected through the application of “Kentucky windage” (see www.microwaves101.com/encyclopedia/slang.cfm for a proper definition of this technical term). The various simulated subsystems communicate with each other using messaging protocols and the entire design style lends itself well to the strongest practice in all design engineering: divide and conquer.
Pipelined dataflow, the second kind of concurrency, complements compositional concurrency. Computation often can be divided into a pipeline of individual task engines. Each task engine processes and then emits processed data blocks (frames, samples, etc.). Once a task completes, the processed data block passes to the next engine in the chain. Such asymmetric multiprocessing algorithms appear in many signal- and image-processing applications from cell-phone baseband processing to video and still-image processing. Pipelining permits substantial concurrent processing and also allows even sharper application of ASIP principles: each of the heterogeneous processors in the pipeline can be highly tuned to just one part of the task.
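A minimal sketch of such a pipeline follows, assuming Python threads and queues as stand-ins for the task engines and the links between them. The three stage functions (`scale`, `clip`, `accumulate`) are hypothetical placeholders for real signal-processing steps. Each engine sees only the block handed to it by its predecessor.

```python
import queue
import threading


def scale(block):
    """Hypothetical stage 1: apply a gain of 2 to every sample."""
    return [2 * sample for sample in block]


def clip(block):
    """Hypothetical stage 2: limit every sample to a maximum of 10."""
    return [min(sample, 10) for sample in block]


def accumulate(block):
    """Hypothetical stage 3: reduce the block to a single value."""
    return sum(block)


def engine(task, inbox, outbox):
    """One task engine: take a block, process it, pass it downstream."""
    while True:
        block = inbox.get()
        if block is None:          # end-of-stream marker
            outbox.put(None)
            break
        outbox.put(task(block))


def run_pipeline(blocks, tasks):
    # One queue between each pair of neighbors, plus input and output.
    links = [queue.Queue() for _ in range(len(tasks) + 1)]
    engines = [
        threading.Thread(target=engine, args=(task, links[i], links[i + 1]))
        for i, task in enumerate(tasks)
    ]
    for worker in engines:
        worker.start()
    for block in blocks:
        links[0].put(block)
    links[0].put(None)

    results = []
    while True:
        out = links[-1].get()
        if out is None:
            break
        results.append(out)
    for worker in engines:
        worker.join()
    return results


if __name__ == "__main__":
    print(run_pipeline([[1, 2, 3], [4, 5, 6]], [scale, clip, accumulate]))
```

While one engine works on a block, its upstream neighbor can already be processing the next one, which is where the concurrency comes from. The only dependency between stages is the block passed along the link.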
Combining the compositional-subsystem style of design (as just described) with pipelined, asymmetric multiprocessing (AMP) in each subsystem makes it apparent that products in the consumer, portable, and media spaces may need 10 to 100 processors—each one optimized to a specific task in the product’s function set. Programming AMP applications is easier than programming multithreaded SMP applications because there are far fewer intertask dependencies (if any) to worry about. Experience shows it is possible to cleanly write software in this manner and many optimization issues arising from the use of multiple application threads running on a limited set of identical processors are simply avoided.
Get Off the Bus
The use of large numbers of configured processors greatly accelerates individual tasks, as shown above. The way these processors are interconnected also greatly affects system performance. Although the usual way to hook all the processors together is to use one central bus, this aged design approach makes little sense with today’s nanometer SOCs. The more processors you saddle on that one bus, the more bus contention you’ll have. You’ll then need to schedule and arbitrate bus access. Suddenly, you’ve created a problem that need not exist because in many systems, particularly AMP systems like the ones just described, each processor need not talk to every other processor.
Figure 2: Direct Connect Versus Buses
Figure 2 shows a system that illustrates this point. Some specific communication paths are needed but most possible processor-to-processor connection paths made possible by a global bus are not. If each of these blocks were simple RTL hardware blocks, we’d simply connect them as shown by the large arrows. For some reason, when the blocks become processors, we feel the need to hook them all to one bus. That’s simply the wrong approach. We intuitively know this for RTL blocks but become oblivious when the blocks become processors.
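A quick count shows how few of a bus's possible paths an AMP system might actually need. The six-processor topology below is hypothetical (the names and links are illustrative and are not taken from Figure 2), and the one-transfer-per-link-per-frame traffic model is a deliberate simplification.

```python
# Hypothetical six-processor AMP subsystem; illustrative only.
processors = ["control", "demod", "decode", "scale", "blend", "display"]

# Directed links the assumed architecture actually requires.
needed_links = [
    ("demod", "decode"),
    ("decode", "scale"),
    ("scale", "blend"),
    ("blend", "display"),
    ("control", "decode"),
    ("control", "display"),
]


def paths_offered_by_bus(n):
    """A global bus lets any processor address any other one."""
    return n * (n - 1)


def serialized_transfers_on_bus(links):
    """Simplified model: one transfer per link per frame, and a shared
    bus carries only one transfer at a time."""
    return len(links)


if __name__ == "__main__":
    offered = paths_offered_by_bus(len(processors))
    print(f"paths offered by a global bus: {offered}")
    print(f"paths the design needs:        {len(needed_links)}")
    print(f"bus transfers queued per frame: "
          f"{serialized_transfers_on_bus(needed_links)}")
```

In this assumed example the bus offers 30 directed paths but the design uses only 6. Those 6 transfers must take turns on a shared bus, whereas dedicated links would let all of them proceed at once.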
The right system-design approach is to connect the on-chip processors as demanded by the system architecture. Make your interconnection scheme match the actual needs of the system using buses, queues, and simple parallel ports. “Won’t that add a lot of wires?” you might ask. Yes, it will. So let’s look at where our system-design thinking has come from and where it’s taken us.
Intel housed the first commercial microprocessor, the 4004, in a 16-pin package. Intel was primarily a memory-chip vendor at the time and the company’s most economical package was a 16-pin DIP. The 4004’s designers developed a multiplexed, 4-bit bus to fit the available package. It may be hard to believe now, but few hardware designers used buses before the advent of the 4004 microprocessor. Earlier systems were built using point-to-point connections with very little or no sharing or multiplexing. Then, wires were cheaper than transistors. After the introduction of the microprocessor, buses came to dominate the way we connected devices in a system. We now use them by instinct without really thinking things through.
The bus has always been a processor bottleneck. Over time, packaged microprocessors have evolved from 16-pin packages to nearly 1000 pins in an attempt to alleviate this bottleneck. In the world of packaged processors, every pin on the package costs money so there’s a bit of give and take between cost and performance. However, that’s simply not the situation with on-chip processors.
Figure 3: Nanometer I/O Routability
If we do the math, we see that nanometer silicon gives you a lot of raw I/O routability. At 90nm, you can route more than 100,000 wires into a square millimeter of silicon. At 65nm, you can route nearly 200,000 wires into and out of each square millimeter. Figure 3 illustrates this idea. Practically speaking, use of wide point-to-point interconnections between on-chip blocks is not outrageous and can be beneficial.
SOC design gives us the ability to make the interconnect scheme match the problem. Perhaps a shared bus is the right approach, but perhaps not. Other schemes for regular interconnect include on-chip networks and cross bars. But in many cases, an approach that connects blocks as demanded by the target application is the most economical, delivers the best performance, and is therefore the best choice.
Those worried that the future will not allow large-scale use of many processors or cores for a wide range of applications should take heart. Indeed, this will clearly be possible, even likely! But it is important for everyone working in these areas to recognize that “There are more things in heaven and earth, Horatio, than are dreamt of in your philosophy.” (Hamlet, Act I, scene v). Taking the wide view, the world truly is conveniently concurrent!
Santa Clara, CA
This article originally appeared in the February, 2008 issue of Portable Design. Reprinted with permission.