Tuesday, May 18. 2010
Modern Graphics Processing Units (GPUs) are surrounded by a lot of hype about their stream processing capabilities. In fact, GPU companies like AMD (ATI) and NVIDIA invest serious money to transform their former rasterizer hardware into general-purpose computation engines. General Purpose GPU (GPGPU) computing is en vogue and attracts individuals and businesses from completely different application areas. The term GPGPU has become synonymous with number crunching, high-performance/supercomputing, financial Monte Carlo simulation, etc.
Where does that popularity come from? The top three reasons for the fast adoption of GPGPU are:
Graphics Processing Units (GPUs)
The evolution of GPUs is rooted in the 3D graphics pipeline and was heavily disrupted by the introduction of shaders. Shaders are small programs written in a low-level, assembler-like language. As the name implies, shaders were used to process pixels and vertices before they are rasterized into graphics memory. Figure 1 sketches the general dataflow. Big/small arrows indicate bandwidth capacities between pipeline stages. Vertex and pixel processing stages are composed of several sub-steps. I've omitted them for brevity.
Note that the GPU can reach its peak performance only if data is routed from/to its tightly connected video memory. The bottleneck is obviously the interface to system memory. But this constitutes no penalty for a graphics system. The asymmetric throughput is optimized for generating high-resolution images at frame rates greater than 30 fps from 3D primitives. (To define a rectangle of arbitrary size you need two coordinate pairs (x1, y1) and (x2, y2), but the number of pixels covered by the rendered rectangle on screen (in display memory) is proportional to width times height. You get the idea.) Textures and vertices are uploaded in advance before being fed into the pipeline. Each pipeline stage increases the amount of data by adding information like surface normals, texture coordinates etc.
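To put a number on the rectangle example, here is a minimal sketch. The function name and the Full HD resolution are my own choices for illustration, nothing here comes from a real graphics API.

```python
def rectangle_amplification(x1, y1, x2, y2):
    """Compare what goes into the pipeline with what comes out.

    Input:  two coordinate pairs, i.e. four numbers.
    Output: every pixel covered by the rectangle (width times height).
    """
    width = abs(x2 - x1)
    height = abs(y2 - y1)
    input_values = 4              # x1, y1, x2, y2
    covered_pixels = width * height
    return input_values, covered_pixels


# A rectangle filling a 1920x1080 screen (resolution chosen for illustration)
inputs, pixels = rectangle_amplification(0, 0, 1920, 1080)
print(inputs, "input values become", pixels, "pixels")
```

Four values on the slow side of the pipeline turn into roughly two million pixels on the fast side, which is why the narrow interface to system memory does not hurt a graphics workload.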
Since the fixed-function pipeline was limited to a certain set of pre-defined operations, a more flexible and programmable pipeline was derived by introducing vertex and pixel shaders (see figure 2). Shaders were a big win for GPU developers. Programmers regained control over the graphics pipeline, and GPU vendors were forced to disclose information about the underlying hardware, yielding better and more efficient software.
Pixel shaders are often called kernels because they apply the same set of operations to every pixel without knowledge of neighboring pixels. (Vertex shaders work in a similar way but process vertices instead.) This data independence makes them very powerful because multiple kernels can run at the same time without interference. Thus shaders scale very well (in terms of parallelization) when more hardware processing resources are added. And that's exactly what GPU manufacturers did to meet the demands of their customers: they added more shader processors, introduced vector operations and supported standard floating point formats. They effectively created a new platform for data-parallel stream processing.
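The kernel idea can be sketched in a few lines. This is only an illustration of the programming model, not real shader code: brighten and run_kernel are hypothetical names, and Python threads stand in for shader processors without giving GPU-like speedups.

```python
from concurrent.futures import ThreadPoolExecutor


def brighten(pixel, amount=40):
    """Kernel: sees exactly one pixel and knows nothing about its neighbors."""
    r, g, b = pixel
    return (min(r + amount, 255), min(g + amount, 255), min(b + amount, 255))


def run_kernel(kernel, pixels, workers=4):
    """Apply the same kernel to every pixel.

    Because the kernel has no shared state, the pixels can be handed to any
    number of workers in any order; map() returns results in input order.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(kernel, pixels))


image = [(0, 0, 0), (250, 100, 30), (10, 20, 30)]
print(run_kernel(brighten, image))
```

Changing workers changes only how the work is spread out, never the result. That is the property that lets vendors add shader processors without touching the software.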
Field Programmable Gate Arrays (FPGAs)
FPGAs never had a fixed processing pipeline. These devices were not designed for signal or image processing in the first place. Early FPGAs served as glue logic or replaced standard logic ICs. Hence they are categorized as programmable logic.
The internal layout of FPGAs is dominated by a large routing matrix capable of connecting thousands of tiny processing elements called lookup tables (LUTs) or function generators (FGs). They are arranged in a rectangular grid sometimes called the fabric (see figure 3). Each LUT or FG has one or more dedicated flip-flops (FFs) for storing data or creating deeply pipelined designs. A LUT can generate every logic function of n inputs (where n is vendor dependent and ranges from 3 to 6). It can also have accompanying multiplexers or fast lines for carry signal propagation to support arithmetic operations. The routing matrix and LUT functions are freely programmable, enabling the FPGA developer to build complex logic. Additional on-chip memory (several Mbit in size) can be used for FIFOs or buffering purposes. Lately, Multiply-Accumulate (MAC) units were added to the fabric to take market share from traditional DSPs. High-end devices contain approx. 1000 MAC units. Running all of them in parallel leads to very impressive performance specs.
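Why a LUT can generate every logic function of n inputs is easiest to see in software: the LUT is nothing more than a truth table with 2^n entries, and the inputs form the address into that table. The sketch below is a simplified model with made-up names (make_lut, lut_eval). Real devices add flip-flops, carry chains and routing that are not modeled here.

```python
def make_lut(n_inputs, func):
    """'Program' a LUT: store func's output for every input combination.

    The resulting list of 2**n_inputs bits plays the role of the
    configuration bits of one lookup table.
    """
    table = []
    for address in range(2 ** n_inputs):
        bits = [(address >> i) & 1 for i in range(n_inputs)]
        table.append(func(*bits) & 1)
    return table


def lut_eval(table, *bits):
    """Evaluate a LUT: the input bits simply select one table entry."""
    address = sum(bit << i for i, bit in enumerate(bits))
    return table[address]


# A full adder built from two 3-input LUTs (sum and carry-out)
SUM_LUT = make_lut(3, lambda a, b, cin: a ^ b ^ cin)
CARRY_LUT = make_lut(3, lambda a, b, cin: (a & b) | (cin & (a ^ b)))


def full_adder(a, b, cin):
    return lut_eval(SUM_LUT, a, b, cin), lut_eval(CARRY_LUT, a, b, cin)


print(full_adder(1, 1, 0))  # sum bit and carry bit for 1 + 1 + 0
```

Reprogramming the device amounts to loading different table contents. The evaluation step stays the same no matter which function is stored.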
In theory, these building blocks enable you to build entire CPUs or even GPUs as long as there are enough LUTs. But the evolution of FPGAs went a different way. Programmability comes at a high cost. Imagine all the transistors needed to configure the routing matrix and the LUTs. And who would start implementing a CPU when you can get microcontrollers with full development toolchain support for single-digit bucks? FPGA vendors reacted by placing commonly used blocks like memory controllers, CPU cores, Ethernet MACs and high-speed serial transceivers (e.g. for PCIe) close to the fabric. No more bandwidth bottlenecks or memory shortage. This clever mix of high-speed IO and arithmetic/logic units turned FPGAs into today's reconfigurable Systems-on-a-Chip (see figure 4).
Many ventures tried, but two companies dominate the market today: Xilinx and Altera. Compared to GPU vendors, these companies operate in a completely different market segment and their target audience is not the end customer. They compete with ASICs, ASSPs and digital signal processors (DSPs). You will find their FPGAs in many electronic systems, primarily in embedded systems, e.g. backbone telecom routers. FPGAs are complex semiconductors tailored for specific applications. However, these devices remain freely programmable and reconfigurable. It is the designer who defines the internal operation, and that makes them so valuable and unique.
Both technologies, GPU and FPGA, are built for stream processing. Depending on the problem you're facing, one may solve it better than the other. Development cycles are usually shorter for GPUs because of their well-known programming model. FPGAs perform better in real-time environments when latencies need to be low. The most significant aspect is that a compiled FPGA design results in real hardware.
But it's not only technical factors that constrain your design decision. You also need developers with very specific skills and experience. And those are hard to get...
Friday, February 6. 2009
While writing my last post I remembered how I came across parallel programming for the first time. I was at university at the time, looking into cracking passwords from hashes, e.g. from /etc/shadow on UNIX-like systems, using John the Ripper. BTW, when I say cracking I mean the real brute-force approach, no dictionaries and stuff. The Sun workstations were too slow at that time. So I bought a bunch of 3G base station processor boards on eBay (see picture), each armed with 4 DSPs and a PowerPC. Pretty hot stuff at that time (and ridiculously cheap, about 10 EUR each). They were used by a huge German company (starting with S) in a development project and sold off at the end.
I had no clue about FPGAs at that time and DSPs were my natural choice. I was young, I knew how to program in C, I was using an open source password cracking tool and the DSPs simply matched my number crunching requirements. What more could I need? Erm...board manuals, schematics, professional development tools (yep, no gcc port for TMS320C6X available), debugging cables, probably a VME backplane and more knowledge about these architectures, as I found out. Finally I managed to power the board using an old PC power supply and connected to the console port. Woohoo...it was still working and the bootloader was looking for an FTP server to pull the OS image from. My first steps into the field of parallel computing...
To bring the story to an end: I never ran any code on the DSPs due to the lack of board manuals, schematics, professional development tools...you name it. However, I learned that (1) certain computing tasks can be broken down into a restricted set of arithmetic operations but need to run at the maximum achievable speed and (2) for a group of problems it makes sense to split the work among multiple processors (or cores) to finish computations in less time.
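Lesson (2) is easy to sketch for the brute-force case: the keyspace is cut into disjoint ranges and each range goes to its own worker. Everything below is a toy under loud assumptions. SHA-256 over lowercase letters is a stand-in and not how /etc/shadow hashes or John the Ripper actually work, all function names are invented, and Python threads only illustrate the split (real speedup would need separate processes or processors, such as the four DSPs).

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def index_to_candidate(index, length):
    """Map an integer to a fixed-length string (digits in base len(ALPHABET))."""
    chars = []
    for _ in range(length):
        index, digit = divmod(index, len(ALPHABET))
        chars.append(ALPHABET[digit])
    return "".join(chars)


def search_range(target_hash, length, start, stop):
    """Work of one processor: try every candidate in [start, stop)."""
    for index in range(start, stop):
        candidate = index_to_candidate(index, length)
        if hashlib.sha256(candidate.encode()).hexdigest() == target_hash:
            return candidate
    return None


def split_keyspace(total, workers):
    """Cut range(total) into contiguous, non-overlapping chunks."""
    chunk = (total + workers - 1) // workers  # ceiling division
    return [(start, min(start + chunk, total)) for start in range(0, total, chunk)]


def crack(target_hash, length, workers=4):
    """Search all strings of the given length, one chunk per worker."""
    total = len(ALPHABET) ** length
    ranges = split_keyspace(total, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda r: search_range(target_hash, length, *r), ranges)
        for found in results:
            if found is not None:
                return found
    return None


target = hashlib.sha256(b"dsp").hexdigest()
print(crack(target, 3))
```

The workers never need to talk to each other, so the search time shrinks roughly with the number of processors, as long as each one really runs in parallel.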
From today's perspective I've discovered another interesting aspect which is more related to the overall system concept. If you take a closer look at the architecture of the telco blade, you'll find similarities to modern processor architectures. There are specialized processing units (4 DSPs) grouped around a central processing unit (PowerPC) on the PCB. Does this remind you of something? No? Replace PowerPC with PPE, DSPs with SPEs and PCB with die and you'll get pretty close to the CellBE architecture (see figure below, taken from "Introduction to the Cell Broadband Engine").
Lessons learned from application-specific processor boards lead to mainstream processor architectures on a single die. I'm wondering if the telecommunications industry is still the number one driver of processor evolution. Or is it the gaming industry, with its demand for high-bandwidth graphics hardware, that has the most influence on modern processor designs?