Tuesday, May 18. 2010
Modern Graphics Processing Units (GPUs) are experiencing considerable hype around their stream processing capabilities. In fact, GPU companies like AMD (ATI) and NVIDIA invest serious money to transform their former rasterizer hardware into general-purpose computation engines. General-Purpose GPU (GPGPU) computing is en vogue and attracts individuals and businesses from completely different application areas. The term GPGPU has become synonymous with number crunching, high-performance/super computing, financial Monte Carlo simulation, etc.
Where does this popularity come from? The top three reasons for the fast adoption of GPGPU are:
Graphics Processing Units (GPUs)
The evolution of GPUs is rooted in the 3D graphics pipeline and was heavily disrupted by the introduction of shaders. Shaders are small programs written in a low-level, assembler-like language. As the name implies, shaders were used to process pixels and vertices before the result was rasterized into graphics memory. Figure 1 sketches the general dataflow. Big/small arrows indicate bandwidth capacities between pipeline stages. The vertex and pixel processing stages are composed of several sub-steps, which I've omitted for brevity.
Note that the GPU can reach its peak performance only if the data flows from/to its tightly connected video memory. The bottleneck is obviously the interface to system memory. But this constitutes no penalty for a graphics system: the asymmetric throughput is optimized for generating high-resolution images at frame rates greater than 30 fps from 3D primitives. (To define a rectangle of arbitrary size you need two coordinate pairs (x1, y1) and (x2, y2), but the number of pixels covered by the rendered rectangle on screen (in display memory) is proportional to width times height. You get the idea.) Textures and vertices are uploaded in advance before being fed into the pipeline. Each pipeline stage increases the amount of data by adding information like surface normals, texture coordinates etc.
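The rectangle example translates into a quick back-of-the-envelope calculation. A toy sketch with assumed coordinates and a 32-bit RGBA pixel format:

```python
# Data amplification when rasterizing a rectangle:
# input is two (x, y) coordinate pairs, output is width * height pixels.
x1, y1 = 100, 100
x2, y2 = 740, 580

input_bytes = 4 * 4                    # four 32-bit coordinates
width, height = x2 - x1, y2 - y1       # 640 x 480 pixels on screen
output_bytes = width * height * 4      # 32-bit RGBA per pixel

print(input_bytes)                     # 16
print(output_bytes)                    # 1228800
print(output_bytes // input_bytes)     # 76800
```

Sixteen bytes in over the system bus fan out to more than a megabyte of pixel data, which is why the pipeline writes to the fast, tightly coupled video memory side.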
Since the fixed-function pipeline was limited to a certain set of pre-defined operations, a more flexible and programmable pipeline was derived by introducing vertex and pixel shaders (see figure 2). Shaders were a big win for GPU developers: programmers regained control over the graphics pipeline, and GPU vendors were forced to disclose information about the underlying hardware, yielding better and more efficient software.
Pixel shaders are often called kernels because they apply the same set of operations to every pixel without knowledge of neighboring pixels. (Vertex shaders work in a similar way but process vertices instead.) This data independence makes them very powerful because multiple kernels can run at the same time without interference. Thus shaders implicitly scale very well (in terms of parallelization) as more hardware processing resources are added. And that's exactly what GPU manufacturers did to meet the demands of their customers: they added more shader processors, introduced vector operations and supported standard floating-point formats. They effectively created a new platform for data-parallel stream processing.
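The idea can be sketched in a few lines of Python (a toy model, not real GPU code; `grayscale_kernel` is a made-up example kernel): the same pure function is mapped over all pixels, and because no pixel depends on its neighbors, the runtime is free to distribute the work across as many workers as it likes.

```python
from multiprocessing.dummy import Pool  # stdlib thread pool

def grayscale_kernel(pixel):
    """Per-pixel luma conversion; reads only its own pixel, no neighbors."""
    r, g, b = pixel
    y = int(0.299 * r + 0.587 * g + 0.114 * b)
    return (y, y, y)

pixels = [(255, 0, 0), (0, 255, 0), (0, 0, 255)]

# The map is embarrassingly parallel: adding workers scales it, just like
# adding shader processors scales a real pixel pipeline.
with Pool(4) as pool:
    result = pool.map(grayscale_kernel, pixels)

print(result)  # [(76, 76, 76), (149, 149, 149), (29, 29, 29)]
```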
Field Programmable Gate Arrays (FPGAs)
FPGAs never had a fixed processing pipeline. These devices were not designed for signal or image processing in the first place; early FPGAs served as glue logic or replaced standard logic ICs. Hence they are categorized as programmable logic.
The internal layout of FPGAs is dominated by a large routing matrix capable of connecting thousands of tiny processing elements called lookup tables (LUTs) or function generators (FGs). They are arranged in a rectangular grid sometimes called the fabric (see figure 3). Each LUT or FG has one or more dedicated flip-flops (FFs) for storing data or creating deeply pipelined designs. A LUT can generate any logic function of n inputs (where n is vendor dependent and ranges from 3 to 6). It can also have accompanying multiplexers or fast lines for carry signal propagation to support arithmetic operations. The routing matrix and LUT functions are freely programmable, enabling the FPGA developer to build complex logic. Additional on-chip memory (several Mbit in size) can be used for FIFOs or buffering purposes. Lately, multiply-accumulate (MAC) units have been added to the fabric to win market share from traditional DSPs. High-end devices contain approx. 1000 MAC units. Running all of them in parallel leads to very impressive performance specs.
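To illustrate the LUT concept, here's a toy model in Python (the function names are my own): an n-input LUT is nothing more than a 2^n-entry truth table, and the routed input signals simply form an address into it.

```python
def program_lut(func, n=4):
    """'Configure' an n-input LUT: store func's value for all 2**n inputs."""
    table = []
    for addr in range(2 ** n):
        bits = [(addr >> i) & 1 for i in range(n)]
        table.append(func(bits))
    return table

def lut_eval(table, inputs):
    """'Run' the LUT: the input signals form an address into the table."""
    addr = sum(bit << i for i, bit in enumerate(inputs))
    return table[addr]

# Program one 4-input LUT as a 4-way XOR (parity). Any other boolean
# function of four inputs would fit into the same 16-entry table.
xor4 = program_lut(lambda bits: sum(bits) % 2)

print(lut_eval(xor4, [1, 0, 0, 0]))  # 1
print(lut_eval(xor4, [1, 1, 0, 0]))  # 0
```

Reprogramming the device just means loading different table contents and routing configurations, which is exactly what makes the fabric so flexible.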
In theory, these building blocks enable you to build entire CPUs or even GPUs, as long as there are enough LUTs. But the evolution of FPGAs went a different way. Programmability comes at a high cost: imagine all the transistors needed to configure the routing matrix and LUTs. And who will start to implement a CPU if you can get microcontrollers with full development toolchain support for single-digit bucks? FPGA vendors reacted by placing commonly used blocks like memory controllers, CPU cores, Ethernet MACs and high-speed serial transceivers (e.g. for PCIe) close to the fabric. No more bandwidth bottlenecks or memory shortage. This clever mix of high-speed I/O and arithmetic/logic units turned FPGAs into today's reconfigurable systems-on-chip (see figure 4).
Many ventures have tried, but two companies dominate the market today: Xilinx and Altera. Compared to GPU vendors, these companies operate in a completely different market segment, and their target audience is not the end customer. They compete with ASICs, ASSPs and digital signal processors (DSPs). You will find their FPGAs in many electronic systems, primarily embedded systems, e.g. backbone telecom routers. FPGAs are complex semiconductors tailored for specific applications. However, these devices remain freely programmable and reconfigurable. It's the designer who defines the internal operation, and that makes them so valuable and unique.
Both technologies, GPU and FPGA, are built for stream processing. Depending on the problem you're facing, one or the other may be the better fit. Development cycles are usually shorter for GPUs because of their well-known programming model. FPGAs perform better in real-time environments where latencies need to be low. The most significant difference is that a compiled FPGA design results in real hardware.
But it's not only technical factors constraining your design decision. Either technology requires developers with very specific skills and experience. And those are hard to get...
Sunday, March 8. 2009
You can guess it from the title: I've attended embedded world again this year. This is my personal review of Europe's biggest exhibition on embedded technologies, held in Nuremberg (Germany) from March 3rd to 5th.
I like the narrow focus on embedded hardware, software and services. The whole event fits into only four pavilions but has plenty to offer. Unlike some big events (e.g. electronica, CeBIT) it's still growing: a 25% increase in exhibitors compared to last year. (I haven't found any official visitor statistics yet.)
Like last year, I took the train and arrived in Nuremberg around 11am. Passing by the first busy booths gave me a good feeling. It seemed as if the recession had been banished from these pavilions. Maybe I'm wrong, but the general mood of the visitors showed no sign of crisis. Of course, everybody is aware of the exceptional situation, and I expect more people to be laid off in the electronics industry, but despite all the bad news the booths were well attended.
I've noticed an interesting trend going on in the computer-on-module (COM) business. Multiple vendors are jumping on the bandwagon of increased system integration. In order to reduce the number of components and save PCB space and power, they have started to offer modules with on-board FPGAs. The FPGA is connected to the chipset through a 1x PCIe lane and provides external (serial) connectivity for I2C, CAN, Ethernet etc. Sounds contradictory to reducing power? Wait. The best news is that you have access to the unused FPGA fabric and can insert your own logic there. Even better, get rid of the CAN core and friends and occupy it all!
In the software pavilion I went to the Trolls (aka Qt Software) and talked about the latest Qt release. Starting with version 4.5, the Qt framework is also offered under the LGPL. That means you can link your closed-source applications against the library without paying any license fees. That's good news for independent developers and micro ISVs. I was wondering how Qt Software is making money now. I got a lengthy explanation which can be summarized as: more developers, more mobile applications, higher Nokia phone sales. Nokia pays the bills. Additionally, the Qt Extended stand-alone product is discontinued. The last release will be 4.4.3. All Qt Extended features will later be merged into the Qt framework.
That's it for now. In case things should get worse next year, we can still ask Gov. Schwarzenegger for a keynote...erm, click here.
Sunday, March 1. 2009
Recently Xilinx caught my attention with the announcement of their next-generation Virtex and Spartan FPGA platforms. Having worked with the Spartan series, I was really excited to see the version number double (!) from 3 to 6. Yes, say hello to Spartan-6. And...Virtex-6, of course, but I'm focusing on the low-cost family for now, since it got the overall (and long-awaited) face lift.
The first three things I noticed were (1) the addition of serial transceivers, (2) an increased MUL : LUT ratio and (3) memory controller/PCIe endpoint blocks. Finally! Finally, Xilinx made it and added built-in support for high-speed serial communication. No need to look with envy at Altera or Lattice and their low-cost families anymore...wait a minute. Altera announced the new Arria II GX devices and Lattice is about to release the ECP3M family. But don't expect a comparison now. More or less, the three points apply to all low-cost families across vendors:
(1) The Spartan-6 platform is split into two classes: seven devices without high-speed serial I/O and four devices with 2/4/8 GTP low-power transceivers/PCIe. While Altera and Lattice are pushing their second generation of low-cost devices with serial transceivers out the door, this is Xilinx's first such device in the price-sensitive market. And it's about time. High-speed serial connectivity is becoming the standard in complex embedded systems. And believe me, systems will get even more complex and demand more bandwidth in the future. To overcome complexity and reduce development time, vendors need to support a well-established communication standard. What USB is for the desktop market, PCIe will become for embedded/industrial applications.
(2) Marketing folks identified mass-market audio and video applications as typical innovation areas. These applications are driven by digital signal processing (DSP) and require lots of computing power. So give 'em multipliers + accumulators aka DSP slices aka (sys)DSP blocks! That's my second point: an increased number of multipliers per logic resource. Up to 182 DSP48A1 slices in the biggest Spartan-6 devices are waiting for your DSP algorithms. And DSP is everywhere, especially inside today's FPGAs. Or are you (mis)using them for glue logic?
(3) Memory controller blocks are just another step towards systems-on-chip. They come in handy when you synthesize a soft CPU core into the fabric or when large amounts of data need to be buffered. Lattice already built them into the ECP2M family and really set the benchmark in the price-sensitive market back then. I can imagine that Xilinx lost some customers to Lattice here...
However, the competition goes on. You can meet all three FPGA vendors at embedded world in Nuremberg this week, from Tuesday to Thursday (3-5 March 2009), and give them feedback on their latest technology.
Saturday, November 29. 2008
Granted, I've left out the important numbers in the title. To be more precise: this post is only for those of you maintaining an existing PCB design containing one or more Xilinx Virtex-II devices.
It often happens that you're forced to touch the board again because of discontinued parts (no, not the Virtex-II), minor circuit improvements or feature requests from your customers. So why not upgrade the Virtex-II to a package-compatible Spartan-3A at the same time?
No way!? Why should you do that? Valid points, but before I explain why it could make sense in certain situations, I want to introduce the prerequisites for the upgrade:
If these (heavy) requirements apply to your design this could be your migration path:
With the extension of the Spartan-3A family in August 2008, Xilinx now ships the entire device family in the FT256 package. The FT256 is slightly thinner but FG256 "compatible". It is now possible to replace a small to mid-range Virtex-II with a large Spartan-3A.
In the table above I did the logic, multiplier and DCM resource comparison for you to find the appropriate replacement candidates (see third column). The second column lists all Spartan-3A devices that have the same amount of CLB logic as their Virtex-II "counterpart" but fewer multiplier/BlockRAM resources. So if your design doesn't use 100% of these non-CLB resources, they might be an alternative, too. That's why I've called them potential upgrade parts.
Now back to the question: why? I know it's a lot of work and not as easy as just replacing the FPGA in the next board revision. It requires adjusting the core voltage, changing the layout, rewriting constraints, updating production specs and so on. But it will pay off in the following points:
I consider these valid reasons for small to large volume products and especially battery-powered applications. Please note that this post is meant neither as a HOWTO nor a guide, and I cannot guarantee that it works for all designs. I wanted to share the idea and research work with you. In case you find this interesting or want to exchange experiences, I would be glad to hear from you.