Tuesday, November 2. 2010
This year I've attended Qt DevDays again, and found a much bigger venue than in previous years. The event is still growing: roughly 1000 attendees this year, and according to the organizers the figures double every two years. DevDays moved outside of Munich, halfway to the airport, which is not too bad because you don't have to spend an hour traveling into the city centre anymore. So I was on time to listen to the keynotes and had a chance to grab a coffee beforehand.
Two main topics dominated the event: Qt Quick (Qt User Interface Creation Kit) and the planned Open Governance model for Qt. The first was demonstrated by Lars Knoll in his keynote and presented in several sessions on both days. Basically, it's Nokia's technology for app development across all of its mobile platforms. The web page reads:
Qt Quick (Qt User Interface Creation Kit) is a high-level UI technology that allows developers and UI designers to work together to create animated, touch-enabled UIs and lightweight applications.
This is an interesting strategic shift towards lightweight applications, a.k.a. apps, to address mobile (Nokia) devices. Because C++ is too heavy for that, they've added QML, "an easy to use, declarative language", deeply integrated with the Qt Creator IDE, of course. Obviously, the Qt technology stack will be driven by trends in the mobile world. Feature development for niche *nix platforms will be phased out.
The Open Governance model was discussed in the keynote by Nokia CTO Rich Green and in the Qt Labs. It's good to see a company like Nokia embracing open source and its community. It's not only the discussion that will move into the public; the QA process will be opened up as well. In the Labs, I got the impression that many details still need to be fleshed out. It's not implementation-ready, but you can clearly identify the trend. And I don't believe it will be easy, because this means true change.
Speaking of QA reminds me of something: on day three I listened to Rohan McGovern's talk about "The Qt Continuous Integration System". It was a really impressive and very interesting presentation. Thumbs up, Rohan! It confirmed my experience with QA and CI: choose your tools wisely, dedicate time to it and prepare for a lot of work!
Tuesday, May 18. 2010
Modern graphics processing units (GPUs) are experiencing a big hype around their stream processing capabilities. In fact, GPU companies like AMD (ATI) and NVIDIA invest serious money to transform their former rasterizer hardware into general-purpose computation engines. General-purpose GPU (GPGPU) computing is en vogue and attracts individuals and businesses from completely different application areas. The term GPGPU has become synonymous with number crunching, high-performance/super computing, financial Monte Carlo simulation, etc.
Where does this popularity come from? The top three reasons for the fast adoption of GPGPU are:
Graphic Processing Units (GPUs)
The evolution of GPUs is rooted in the 3D graphics pipeline and was heavily disrupted by the introduction of shaders. Shaders are small programs written in a low-level, assembler-like language. As the name implies, shaders were used to process pixels and vertices before the results are rasterized into graphics memory. Figure 1 sketches the general dataflow. Big/small arrows indicate bandwidth capacities between pipeline stages. The vertex and pixel processing stages are composed of several sub-steps, which I've omitted for brevity.
Note that the GPU can reach its peak performance only if the data flows from/to its tightly connected video memory. The bottleneck is obviously the interface to system memory. But this constitutes no penalty for a graphics system: the asymmetric throughput is optimized for generating high-resolution images at frame rates greater than 30 fps from 3D primitives. (To define a rectangle of arbitrary size you need two coordinate pairs (x1, y1) and (x2, y2), but the number of pixels covered by the rendered rectangle on screen (in display memory) is proportional to width times height. You're getting the idea.) Textures and vertices are uploaded in advance before being fed into the pipeline. Each pipeline stage increases the amount of data by adding information like surface normals, texture coordinates etc.
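To put rough numbers on that asymmetry, here is a small sketch; the byte sizes are illustrative assumptions of mine (32-bit float coordinates going in, RGBA8 pixels coming out), not figures from any particular GPU:

```cpp
#include <cstdio>

// Bytes needed to describe an axis-aligned rectangle as two (x, y)
// coordinate pairs of 32-bit floats: 4 values * 4 bytes each.
constexpr int rect_input_bytes() { return 4 * 4; }

// Bytes written to display memory when the rectangle is rasterized,
// assuming 4 bytes per pixel (RGBA8).
constexpr long long rect_output_bytes(int width, int height) {
    return 4LL * width * height;
}

void print_amplification() {
    // A 640x480 rectangle: 16 bytes in, over a megabyte out.
    std::printf("in: %d bytes, out: %lld bytes\n",
                rect_input_bytes(), rect_output_bytes(640, 480));
}
```

Sixteen bytes of input turn into more than a megabyte of pixel traffic, which is why the pipeline's output side gets the fat arrows.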
Since the fixed-function pipeline was limited to a certain set of pre-defined operations a more flexible and programmable pipeline was derived by introducing vertex and pixel shaders (see figure 2). Shaders were a big win for GPU developers. Programmers regained control over the graphics pipeline and GPU vendors were forced to disclose information about the underlying hardware yielding better and more efficient software.
Pixel shaders are often called kernels because they apply the same set of operations to every pixel without knowledge of neighboring pixels. (Vertex shaders work in a similar way but process vertices instead.) This data independence makes them very powerful, because multiple kernels can run at the same time without interference. Thus shaders implicitly scale very well (in terms of parallelization) as more hardware processing resources are added. And that's exactly what GPU manufacturers did to meet the demands of their customers: they added more shader processors, introduced vector operations and supported standard floating-point formats. They effectively created a new platform for data-parallel stream processing.
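To make the kernel idea concrete, here is a minimal C++ sketch of a per-pixel "shader"; the brighten function sees exactly one pixel and nothing else, and that independence is what makes the loop trivially parallel:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// A "kernel" in the shader sense: it operates on one pixel with no
// knowledge of its neighbours. Here: add a brightness offset, clamped.
std::uint8_t brighten(std::uint8_t p) {
    int v = p + 40;
    return static_cast<std::uint8_t>(v > 255 ? 255 : v);
}

// Because each output pixel depends only on its own input pixel, the
// loop body could run on thousands of shader cores at once;
// std::transform expresses exactly that independence.
std::vector<std::uint8_t> run_kernel(const std::vector<std::uint8_t>& img) {
    std::vector<std::uint8_t> out(img.size());
    std::transform(img.begin(), img.end(), out.begin(), brighten);
    return out;
}
```

A blur kernel, by contrast, needs neighbouring pixels and immediately complicates this picture, which is why pure per-pixel operations map so cleanly onto shader hardware.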
Field Programmable Gate Arrays (FPGAs)
FPGAs never had a fixed processing pipeline. These devices were not designed for signal or image processing in the first place. Early FPGAs served as glue logic or replaced standard logic ICs. Hence they are categorized as programmable logic.
The internal layout of FPGAs is dominated by a large routing matrix capable of connecting thousands of tiny processing elements called lookup tables (LUTs) or function generators (FGs). They are arranged in a rectangular grid sometimes called the fabric (see figure 3). Each LUT or FG has one or more dedicated flip-flops (FFs) for storing data or creating deeply pipelined designs. A LUT can generate any logic function of n inputs (where n is vendor-dependent and ranges from 3 to 6). It can also have accompanying multiplexers or fast lines for carry signal propagation to support arithmetic operations. The routing matrix and LUT functions are freely programmable, enabling the FPGA developer to build complex logic. Additional on-chip memory (several MBit in size) can be used for FIFOs or buffering purposes. Lately, multiply-accumulate (MAC) units were added to the fabric to win market share from traditional DSPs. High-end devices contain approx. 1000 MAC units; running all of them in parallel leads to very impressive performance specs.
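A LUT can be modeled in software as nothing more than a truth table indexed by its inputs. Here is a minimal sketch of a hypothetical 4-input LUT; the Lut4 name and layout are my own model, not any vendor's:

```cpp
#include <cstdint>

// A LUT with n inputs is just a 2^n-entry truth table: the FPGA
// bitstream fills the table, and the input signals select one bit
// of it. Modeled here for n = 4 as a 16-bit mask.
struct Lut4 {
    std::uint16_t table;  // bit i = output for input pattern i
    bool eval(bool a, bool b, bool c, bool d) const {
        unsigned idx = (a ? 1u : 0u) | (b ? 2u : 0u) |
                       (c ? 4u : 0u) | (d ? 8u : 0u);
        return (table >> idx) & 1u;
    }
};

// "Programming" the LUT as a 4-input AND: only pattern 0b1111 (index
// 15) outputs 1, so exactly bit 15 of the table is set.
constexpr Lut4 make_and4() { return Lut4{static_cast<std::uint16_t>(1u << 15)}; }
```

Any other 4-input function, XOR, majority vote, whatever, is just a different 16-bit constant, which is why the fabric is so flexible.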
In theory, these building blocks enable you to build entire CPUs or even GPUs, as long as there are enough LUTs. But the evolution of FPGAs went a different way. Programmability comes at a high cost: imagine all the transistors needed to configure the routing matrix or LUTs. And who would start to implement a CPU if you can get microcontrollers with full development toolchain support for single-digit bucks? FPGA vendors reacted by placing commonly used blocks like memory controllers, CPU cores, Ethernet MACs and high-speed serial transceivers (e.g. for PCIe) close to the fabric. No more bandwidth bottlenecks or memory shortage. This clever mix of high-speed I/O and arithmetic/logic units turned FPGAs into today's reconfigurable systems-on-chip (see figure 4).
Many ventures have tried, but two companies dominate the market today: Xilinx and Altera. Compared to GPU vendors, these companies operate in a completely different market segment, and their target audience is not the end customer. They compete with ASICs, ASSPs and digital signal processors (DSPs). You will find their FPGAs in many electronic systems, primarily embedded systems, e.g. backbone telecom routers. FPGAs are complex semiconductors tailored for specific applications. However, these devices remain freely programmable and reconfigurable. It's the designer who defines the internal operation, and that makes them so valuable and unique.
Both technologies, GPUs and FPGAs, are built for stream processing. Depending on the problem you're facing, one or the other may solve it better. Development cycles are usually shorter for GPUs because of their well-known programming model. FPGAs perform better in real-time environments where latencies need to be low. The most significant difference is that a compiled FPGA design results in real hardware.
But it's not only technical factors constraining your design decision: either technology requires developers with very specific skills and experience. And those are hard to get...
Monday, March 15. 2010
This series of posts is dedicated to the Altium NanoBoard 3000. I had the opportunity to play with it and would like to share my experience with you.
The NanoBoard was released in September 2009 by Altium Ltd. It's a rapid prototyping platform for digital electronic designs, consisting of an evaluation board, design software and royalty-free IP for use in the onboard FPGA. In short, all the tools you need to start implementing your ideas. You can stop reading now, buy it from e.g. Newark for $395 and get started. The 3000 is part of Altium's NanoBoard family, a complementary product line besides the well-known EDA tools. Maybe you're already working with Designer.
When I received the delivery and removed the outer packaging I was impressed. Note, I had not even opened the box yet. It reminded me of a consumer lifestyle product; it could have been a mobile phone or a high-end notebook. Apple is well known for this kind of great packaging. Not bad for an evaluation board!
Opening the box reveals the board and the software. The black PCB with gold contacts has an undeniable elegance. Underneath is a separate box containing the accessories: desktop stand, speaker board, IR remote (yes, a remote control) and the power supply.
The Quickstart Guide provides instructions for mounting the desktop stand and connecting the speaker board. After 10 minutes I had the final setup sitting on my desk. The software is shipped on a DVD; installation on Windows completed successfully after a couple of minutes. No license key is required, because the eval board is your license/dongle.
Without having worked with it yet, this is clearly a highlight among the evaluation kits on my shelf. It's obvious that this product was designed by a team of professionals with a precise and common vision in mind. Every detail has been fine-tuned, from the packaging to the PCB. A good example of holistic product design. Thumbs up.
And now you know why the out-of-the-box experience is worth considering. More details in the next post...
Thursday, February 19. 2009
Aside from reverse engineering FPGA bitstreams, I finished (yes!) another pending programming project last week. I was helping out a fellow programmer, let's call him Bob, who was busy doing other things and had no time to hack away at a set of high-priority items from his TODO list. So I pulled these items from his list over to mine. I have plenty of experience with these troubleshooting jobs, and I always feel like a surgeon carefully inspecting the innards (source code) of the patient (application) before cutting out tumors (bugs) or implanting organs (patches)...
OK, back to the problem: I was supposed to extend a desktop application that was designed and written from the ground up by Bob two years ago in C++ (using Qt). Hmmm... C++ and templates. I can hear your brain working now. Right, we're not talking about document, XML or whatever templates here. It's the well-known C++ programming language feature (see Template Metaprogramming). I use it regularly. It's a great feature, if the compiler fully supports it. If you're familiar with templates, feel free to skip the next paragraph.
Templates implement the concept of parameterized types in C++ (Bruce Eckel, Thinking in C++). It's a syntax extension that tells the compiler how to define a type from a generic type declaration. Wow... that was abstract. How does it work? The first time you instantiate a template with a given type (as parameter) in your code, the compiler inserts this parameter into your template declaration and creates an entirely new type. Ideally, templates are designed in such a way that you can throw almost any type at them (as parameter). Yes, even user-defined types. Think of it as another way of reusing code besides the OOP-based inheritance approach. The syntax can also be applied to function definitions (a.k.a. function templates). Templates are perfectly suited for container classes like lists, queues etc. of arbitrary objects. Personally, I've used templates in digital signal processing algorithms to create number-format-independent filters, amongst other things. Other classic/good examples are the C++ Standard Template Library (STL), the Boost C++ libraries or Intel's Threading Building Blocks (TBB).
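A minimal example of a class template, loosely in the spirit of the number-format-independent filters mentioned above; the Averager name and design are mine, not from the STL or from the app in question:

```cpp
#include <cstddef>

// A class template: the compiler stamps out an entirely new type for
// every parameter T it is instantiated with (int, double, a fixed-point
// user type, ...). The averaging logic is written exactly once.
template <typename T>
class Averager {
public:
    void add(T sample) { sum_ += sample; ++count_; }
    T average() const {
        return count_ ? sum_ / static_cast<T>(count_) : T{};
    }
private:
    T sum_{};
    std::size_t count_ = 0;
};
```

Averager<int> and Averager<double> are two distinct types generated from the same declaration; any T supporting +=, / and a default value will do, including user-defined number formats.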
The challenge with templates is that they're very convenient and easy to use, but significantly harder to create in a structured and maintainable way. Did I mention that templates are entirely implemented in header files? That may become an issue, as I'll explain later. Now guess what happened... the app I was working on was full of templates. Driven by the idea of increasing runtime performance, Bob had applied the template concept to nearly 100% of the data-path-related code. His intention was to let the compiler optimize (e.g. inline) the nested function calls introduced by the object hierarchy and flatten them out at compile time. No more virtual function lookups and such in the binary. Highly instruction-level-optimized code. The perfect solution for maximum performance...
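The kind of design described above, replacing virtual dispatch with compile-time dispatch, can be sketched with the curiously recurring template pattern (CRTP); all names here are invented for illustration, this is not Bob's actual code:

```cpp
// Static polymorphism via CRTP: the base class template knows its
// derived type at compile time, so the "polymorphic" call resolves
// statically and can be inlined. No vtable, no indirect call.
template <typename Derived>
struct Stage {
    int process(int x) {
        return static_cast<Derived*>(this)->process_impl(x);
    }
};

struct Doubler : Stage<Doubler> {
    int process_impl(int x) { return 2 * x; }
};

struct AddOne : Stage<AddOne> {
    int process_impl(int x) { return x + 1; }
};
```

The price is exactly what the post describes: every stage lives in a header, every caller must know the concrete type, and the whole hierarchy recompiles on any change.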
Problem 1: The template concept is no silver bullet for your performance problems. The 80/20 rule still applies: 80 percent of the execution time is spent in 20 percent of your code. (Or was it 90/10?) Use a profiler and analyze your code carefully before delving into advanced and unreadable templates-of-templates-of-templates constructs.
Problem 2: While it might be a (good) academic exercise to create an entirely template-based object hierarchy to gain a deeper understanding of the language concept, it's probably a bad idea to apply this pattern to your entire framework. First, it makes it hard or even impossible to later put your framework into a library, and second, the guy next door (patching your stuff) appreciates readable code.
Problem 3: Have you ever tried to debug and/or patch this stuff? When 98% of the files are header files, you'll trigger a recompile of the entire project just because you've changed a variable name from i to tableIdx, not to mention the weird error messages you might get in seemingly unrelated sections of the code (although compilers have gotten better, and precompiled headers may reduce the pain a bit). Yes, I'm exaggerating here, but you get the point.
Don't get me wrong, this is no rant against templates! Letting the compiler write type-safe code for you is of great value. Also, generic datatypes tremendously reduce the amount of duplicated code. But be careful when using this powerful language feature. Use it only where applicable, i.e. to maximize code reuse; restrict it to subsets of your framework; keep in mind who might read or use the code; and please, please don't misuse it for your global optimization strategies.
You want to know the end of the story? Well, just another surgical intervention.
Saturday, January 24. 2009
This is the last post in my series (see part 1 and part 2) about watchdog timers (WDTs). After this one I will stop bugging you with the topic, but you've probably already got the impression that I have a certain weakness for WDTs. Anyway, let's take this to the next level and find out how to improve the solution from my last post even further.
Right now we have our application pinging the watchdog hardware directly, i.e. there's a watchdog module/component linked into the main executable doing everything required to prevent a timeout. That was easy to do because it only required us to insert two function calls for (de)initialization and one for pinging inside our main loop. Unfortunately, it also introduced a tight coupling between the main application and the watchdog handling logic (see figure 1).
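The tightly coupled setup can be sketched as follows; wdt_init, wdt_ping and wdt_deinit are invented names standing in for whatever your driver actually provides, stubbed here as a software simulation so the structure is self-contained:

```cpp
// Stand-ins for a real watchdog driver API (names invented). The
// stubs just count pings so the main-loop structure can be shown.
static int g_pings = 0;
static bool g_armed = false;

bool wdt_init(unsigned timeout_ms) { (void)timeout_ms; g_armed = true; return true; }
void wdt_ping()   { if (g_armed) ++g_pings; }
void wdt_deinit() { g_armed = false; }

// The tight coupling the post describes: watchdog calls woven
// directly into the application's main loop.
int run_main_loop(int iterations) {
    if (!wdt_init(2000))          // arm: reset fires if no ping within 2 s
        return -1;
    for (int i = 0; i < iterations; ++i) {
        // ... one slice of application work ...
        wdt_ping();               // must be reached before the timeout expires
    }
    wdt_deinit();                 // disarm before an orderly exit
    return g_pings;
}
```

Note how the application cannot even be built, let alone run, without the watchdog module linked in; that is exactly the coupling we want to get rid of.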
Other drawbacks of this approach are:
To overcome these drawbacks, we put the watchdog handling logic into a separate executable and introduce a small, lightweight layer that wraps the interprocess communication (IPC) that is now required. I'll leave the choice of an appropriate IPC API to you; it should be non-blocking and easy to use (SOAP is probably a bad idea). Certainly we've added slightly more complexity to the big picture (see figure 2), but at the same time it removes any dependency of the main application on the WDT hardware, and the application can now run even when there is no watchdog present at all.
Having made these architectural changes, we're now ready to introduce more logic to improve the WDT behavior. We can even add a user interface to the newly created watchdog application. Please note that the term watchdog application/process now refers to a piece of software sitting on top of the WDT hardware.
The non-obvious outcome of inserting a watchdog process between the main app and the WDT hardware is that we get an additional high-level fallback layer without degrading the original WDT functionality. Both applications plus the OS remain under WDT control (remember, this is a piece of hardware hooked up to the reset line). This gives us more opportunities to take action when the main app stops pinging the watchdog process: e.g. we could try to kill and restart the main application process once, and ultimately initiate a shutdown sequence, which is healthier than a hard reset. And that's not all! Further enhancements may include:
I'm pretty sure you'll have more features to add, but always remember: Keep it simple! It's definitely not the right application to bloat with features.
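The kill/restart-then-shutdown escalation described above could be sketched like this inside the watchdog process; all names and thresholds are invented for illustration:

```cpp
// Escalation policy of the watchdog process: tolerate a little
// jitter, try restarting the main app once, then fall back to an
// orderly shutdown. The hardware reset stays as the last resort.
enum class Action { None, RestartApp, Shutdown };

struct Supervisor {
    int missed = 0;         // consecutive missed pings from the main app
    bool restarted = false; // have we already tried a restart?

    Action on_missed_ping() {
        if (++missed < 3) return Action::None;  // tolerate brief jitter
        missed = 0;
        if (!restarted) {
            restarted = true;
            return Action::RestartApp;          // first escalation step
        }
        return Action::Shutdown;                // restart didn't help
    }

    void on_ping() { missed = 0; }              // app is alive again
};
```

Keeping this policy in a tiny state machine like the one above also honors the "keep it simple" rule: the whole decision logic stays auditable at a glance.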