5/21/2013

Introducing Intel Xeon Phi - hardware and installation

A few days ago we got an Intel Xeon Phi 5110P coprocessor, a 60-core x86 accelerator that competes directly with GPGPU in the HPC market and is based on the Intel Many Integrated Core (MIC) architecture.
As Xeon Phi is quite a new piece of hardware, there is very little information on the Internet about its usage, installation and compatible hardware (especially motherboards). We would like to fill that gap and write about our experience of assembling a custom server with a Xeon Phi installed.
Physically, Xeon Phi looks very much like a high-end 2-slot PCIe graphics card, except that it has no DVI output on its back. We have the 5110P model with passive cooling; there are also models with active cooling. Without additional cooling there might be temperature issues, so make sure that your case has an additional cooler. Our server case has additional cooling installed and sits in a server room with powerful air conditioning, so we had no problems there.
Not every motherboard can support this kind of accelerator. The trick is that the board has to support 64-bit PCIe addressing, and you can find this feature only on modern high-end motherboards. As Intel doesn't publish an official list of supported motherboards, choosing one for your Xeon Phi server can be quite tricky if you buy the parts separately and assemble the server yourself. We managed to run the accelerator on an ASUS P9X79 WS motherboard (thanks to ASUS for stating clearly that it supports Xeon Phi). Our motherboard had an outdated BIOS, so we had to flash the latest version and change the settings there to enable Xeon Phi support. ASUS states that the board supports only the Xeon Phi 3100 series with active cooling, but in fact it works perfectly with the 5110P; just don't forget about additional cooling in your case.
At peak load the Xeon Phi 5110P consumes 225 W of power. 3100 series devices use even more, up to 300 W, so make sure that your power supply unit can deliver that much power.
Officially Xeon Phi works only with RHEL and SLES, but in our case it works perfectly with CentOS (an open-source clone of RHEL).
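As a rough sanity check of the power budget, the arithmetic can be sketched as follows. Only the 225 W and 300 W figures come from the text above; the function name, the draw of the rest of the system and the 20% headroom are hypothetical placeholders that you should replace with values for your own build.

```python
# Rough PSU sizing sketch. Only the coprocessor figures come from the post;
# every other number is a hypothetical placeholder.
import math

XEON_PHI_PEAK_W = {"5110P": 225, "3100 series": 300}


def required_psu_watts(coprocessor, rest_of_system_w, headroom=0.2):
    """Return a minimum PSU rating in watts, rounded up to a multiple of 50."""
    peak = XEON_PHI_PEAK_W[coprocessor] + rest_of_system_w
    with_headroom = peak * (1 + headroom)
    return math.ceil(with_headroom / 50) * 50


if __name__ == "__main__":
    # 350 W for CPU, board, drives and fans is an assumed value.
    print(required_psu_watts("5110P", rest_of_system_w=350))
```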
We are going to write more posts about Xeon Phi very soon. Stay tuned!

4/22/2013

Parallelism in Testing – To Be or Not To Be?


Today I would like to explore the topic of parallel testing in the context of test automation: its potential benefits and possible scenarios.

So let’s look at test automation as we know it. What is its common benefit? The most obvious one can be called “resource savings”. Let’s stop for a minute and define the tasks that are most frequently automated: repetitive inputs, UI navigation, performance measurement and so on. The next point is which product areas are eligible for automation at all. Someone who has to achieve 100% test automation coverage may claim, “It is possible to automate every bit of the program functionality!” Fine by me, but is it reasonable? Is it worth spending hours of work on an automated test that will run only once or twice, when the same check can be done manually in just a few minutes? Don’t expect a unanimous answer, but in the majority of such cases the resources spent are not worth the results. Another difficulty in implementing automated tests is the maintenance of test cases. This problem mostly affects products that change frequently during development or over the product lifecycle.
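The trade-off can be made concrete with a simple break-even estimate. The sketch below is a minimal model, assuming that automation has a one-off authoring cost plus a per-run maintenance cost and that each manual run has a fixed cost; the function name and all numbers are illustrative, not measurements.

```python
import math


def break_even_runs(authoring_min, manual_min_per_run, maintenance_min_per_run=0.0):
    """Return the smallest number of runs at which automation is no more
    expensive than manual testing, or None if it never pays off."""
    saving_per_run = manual_min_per_run - maintenance_min_per_run
    if saving_per_run <= 0:
        return None
    return math.ceil(authoring_min / saving_per_run)


if __name__ == "__main__":
    # Assumed figures: 4 hours to automate, 5 minutes to test by hand,
    # 1 minute of upkeep per automated run.
    print(break_even_runs(authoring_min=240, manual_min_per_run=5,
                          maintenance_min_per_run=1))
```

With these assumed numbers the test has to run 60 times before it pays for itself, so a test that runs only once or twice is cheaper to do by hand.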

4/09/2013

Speeding up MR Image Reconstruction with GeneRalized Autocalibrating Partially Parallel Acquisitions (GRAPPA)


This time we would like to share some details about the implementation of one of our projects in medical imaging.
Currently, speed is one of the bottlenecks in MR image reconstruction. Improving it is difficult from an algorithmic point of view, so it is becoming more popular to improve algorithm performance by using GPUs.

Introduction

In magnetic resonance (MR) image reconstruction, the raw data measured by the scanner correspond to the Fourier coefficients of the target image, so the fast Fourier transform (FFT) is used for reconstruction. A procedure called density compensation is needed at the very beginning to account for non-uniform sampling. GeneRalized Autocalibrating Partially Parallel Acquisitions (GRAPPA) is a partially parallel acquisition (PPA) method used to reduce scan times: only part of the data is acquired, and the missing data is calculated mathematically from the available data.
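To illustrate the relationship between raw data and image, here is a minimal sketch in pure Python, assuming a tiny one-dimensional "image" and fully sampled data. A naive discrete Fourier transform stands in for the scanner and its inverse stands in for reconstruction. Production code would use an FFT library such as FFTW or cuFFT, and the sketch covers neither density compensation nor GRAPPA.

```python
import cmath


def dft(samples, inverse=False):
    """Naive O(n^2) discrete Fourier transform of a list of numbers."""
    n = len(samples)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        total = 0j
        for t, value in enumerate(samples):
            total += value * cmath.exp(sign * 2j * cmath.pi * k * t / n)
        # The inverse transform carries the 1/n normalization.
        out.append(total / n if inverse else total)
    return out


if __name__ == "__main__":
    image = [0.0, 1.0, 4.0, 1.0, 0.0, 0.0, 2.0, 0.0]  # made-up intensity profile
    k_space = dft(image)                        # what the scanner would measure
    reconstructed = dft(k_space, inverse=True)  # reconstruction step
    print([round(v.real, 6) for v in reconstructed])
```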

Subject for optimization

The original implementation was based on the FFTW library for fast Fourier transforms and on a Singular Value Decomposition (SVD) algorithm adapted from ACM Collected Algorithms for GRAPPA preprocessing. These two algorithms are the most computationally intensive parts of the whole image reconstruction. FFTW is claimed to be the fastest CPU implementation, using all possible optimizations such as SSE2/3 and hyper-threading; however, it does not leverage the power of modern GPUs. The SVD algorithm also ran on the CPU. SVD is known to parallelize poorly for small matrices, but in GRAPPA we have many image frames of the same size that can be processed in parallel. Besides, there are many intermediate steps that consume a lot of CPU time and can easily be parallelized on a GPU.

Technical analysis

The performance of the FFTW library is comparable with that of the Intel MKL implementation. NVIDIA provides a comparison of its CUDA-based cuFFT library with the MKL implementation (Figure 1):

Figure 1 Comparison of CUDA based cuFFT library with MKL implementation
According to this, we should achieve up to 8-10x faster FFT processing when using the GPU-accelerated cuFFT library from NVIDIA. GPU-accelerated SVD implementations are also available, for example the CULA library by EM Photonics. However, the current CULA implementation does not support batch mode, so we would need to process all image frames sequentially. Brief testing showed that 64 image frames (256*256) are processed even more slowly than by the CPU-based version. Since we did not find a good alternative to CULA, we decided to implement our own GPU-accelerated version of the SVD algorithm.

Implementation

The FFT part of the image reconstruction was straightforward with the cuFFT library; however, we had to deal with image frames that do not fit into the available GPU memory. We had to write an algorithm that runs the FFT over portions of a large data frame and then aggregates the results. Figure 2.1 below shows the case when all data fits into GPU memory.
Figure 2.1 
Figure 2.2 illustrates the case when a huge amount of data is processed. Solid lines in the figure below show measured performance; dashed lines show the estimated time if the data fit into GPU memory.
Figure 2.2
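One way to process a frame that is larger than device memory, sketched here under assumptions, relies on the fact that a 2D Fourier transform is separable: rows can be transformed in independent portions, and the columns of the intermediate result can then be transformed in portions as well. The text above does not say how the portions were formed or aggregated, so the helper names (`dft_1d`, `transform_rows_in_chunks`, `dft_2d_in_chunks`) and the `chunk` parameter are hypothetical, and a naive pure-Python DFT stands in for cuFFT.

```python
import cmath


def dft_1d(samples):
    """Naive O(n^2) DFT; a stand-in for one FFT call on a small portion."""
    n = len(samples)
    return [sum(v * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, v in enumerate(samples))
            for k in range(n)]


def transform_rows_in_chunks(matrix, chunk):
    """Transform every row, handling at most `chunk` rows per pass."""
    result = []
    for start in range(0, len(matrix), chunk):
        portion = matrix[start:start + chunk]  # the part that "fits in memory"
        result.extend(dft_1d(row) for row in portion)
    return result


def transpose(matrix):
    """Swap rows and columns of a rectangular list of lists."""
    return [list(col) for col in zip(*matrix)]


def dft_2d_in_chunks(frame, chunk):
    """2D DFT as chunked row transforms followed by chunked column transforms."""
    rows_done = transform_rows_in_chunks(frame, chunk)
    cols_done = transform_rows_in_chunks(transpose(rows_done), chunk)
    return transpose(cols_done)
```

The result does not depend on `chunk`; only the amount of data handled per pass changes, which is what lets a large frame be processed in pieces.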
Implementing a GPU-accelerated SVD algorithm with batch mode was much more interesting. All the implementations we found focus on maximum parallelization of a single SVD run, so we had to change the approach. Basically, the SVD algorithm consists of Householder reduction, QR diagonalization and back transformation steps. All of them are iterative processes in which each step depends on the results of the previous one. For small matrices, a single CUDA kernel cannot effectively utilize all the parallel processing units of a modern GPU. So we wrote the kernels in such a way that each iteration is processed for all matrices by a single kernel run. This way, in the case of 64 matrices of size 128x128 each, we can process 64*128 elements at a time instead of 128. Figures 3.1 and 3.2 show a performance comparison between the CULA library and our implementation.
Figure 3.1

Figure 3.2
With more than 8 frames per batch our implementation performs much better than sequential CULA calls, although it is less efficient for a single frame.
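The batching idea can be illustrated with a toy model in plain Python rather than CUDA. It assumes that each iteration of the algorithm is one "kernel launch" and that a launch over a single 128x128 matrix exposes 128 elements of parallel work, as in the figures quoted above. The function `count_launches` and the iteration count are hypothetical; the model describes only the launch pattern, not the SVD arithmetic.

```python
def count_launches(num_matrices, size, iterations, batched):
    """Return (number of launches, elements processed per launch)."""
    if batched:
        # One launch per iteration covers the same step of every matrix.
        return iterations, num_matrices * size
    # One launch per iteration per matrix, each covering a single matrix.
    return iterations * num_matrices, size


if __name__ == "__main__":
    for batched in (False, True):
        launches, width = count_launches(64, 128, iterations=128, batched=batched)
        print("batched" if batched else "sequential", launches, width)
```

With an assumed 128 iterations, the sequential pattern issues 8192 launches of 128 elements each, while the batched pattern issues 128 launches of 8192 elements each. The total work is the same, but each launch can keep far more processing units busy.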

Results

As a result, we developed a pure C++ library with a set of specialized functions that perform the various stages of image reconstruction. It requires only the CUDA runtime libraries and the free cuFFT library provided by NVIDIA. In addition, we implemented a lightweight C# wrapper for convenient usage. We also ran a lot of benchmarks on various GPU cards and different platforms. On the test cases provided by the customer we achieved up to a 150x speedup compared to the original single-threaded CPU implementation. However, a significant part of that speedup came from the poorly optimized original code, which was completely rewritten and ported to CUDA wherever possible.
While it is usually clear what the FFT stage does in image reconstruction, the GRAPPA stage is less obvious. Parallel acquisition of different frames distorts the acquired data, and the GRAPPA stage effectively eliminates this distortion. Figure 4 shows the images before and after reconstruction.

Figure 4 The image before the reconstruction (left), image after reconstruction (right)
Additionally, you can find a case study on the ELEKS website or download it as a PDF. Stay tuned!
by Volodymyr Kozak, Principal Developer, ELEKS R&D

ELEKS open-source projects directory

Even though ELEKS is a software development services company and we mainly develop proprietary software for our customers, we have our own open-source projects and contribute to projects maintained by other people and organizations. We are proud to be a part of the open-source community. Recently we launched a GitHub page where you can find a list of our projects and the projects we contribute to: http://eleks.github.io/.
We are going to publish more open-source projects soon. Stay tuned!

4/04/2013

20x performance for the same HW price: GPGPU-based Machine Learning at Yandex

Russian search engine Yandex has disclosed some details about its machine-learning framework, FML. The most interesting detail is that it runs on an 80 TFLOPS cluster powered by GPGPU. This is quite an unusual application for GPUs, as ML algorithms are usually hard to parallelize. However, Yandex has managed to adapt its decision tree algorithm for a high level of parallelism. As a result, it has achieved more than a 20x speed-up for the same hardware price.
They are going to upgrade their cluster to 300 TFLOPS. Yandex expects its cluster to be in the list of top 100 most powerful supercomputers in the world after that upgrade.

4/01/2013

The end of JavaScript domination

The world is changed,
I feel it in the water,
I feel it in the earth.
I smell it in the air
The Lord of the Rings: The Fellowship of the Ring


Mozilla's ASM.js release last week created serious buzz in the programming community. Some people are excited by it, some criticize it, but almost everybody agrees that it may have a significant impact on the future of the web.
I consider it one more nail in the coffin of the monolanguage web. Today there are more than 200 languages and tools related to JavaScript generation, and this number is increasing every week. Of course, most of them are no more than amateur projects designed to solve some particular JavaScript problem. However, there is a clear trend, and the reason for it is obvious: the complexity of the web client side is increasing, and JavaScript is not always ready to support this complexity. Most of these tools have emerged over the last few years. Here are a few that I consider the most important:
Language/Tool     Year of appearance
CoffeeScript      2009
Dart              2011
TypeScript        2012
Emscripten        2012
ASM.js            2013

What is really important is that three major browser vendors seem to support this movement: Google (with its Dart), Microsoft (TypeScript) and Mozilla (ASM.js). Together they hold 90% of the browser market and definitely have enough power and influence to change the client-side programming landscape over the next few years.
I don't mean that Dart, TypeScript or some other language will replace JavaScript completely. It will always be here, but it is hard to argue with the fact that JavaScript is no longer the only client-side programming language on the web. It still dominates, but things are going to change over the next few years.
Any thoughts?

3/21/2013

Game Localization: For Gamers by Gamers?


While revising our approach to game localization, we decided to ask the opinion of people who have more than once been seen spending their free evenings in virtual worlds, loudly shouting at each other and totally oblivious to anyone else still trying to finish their work. What we got back was an overview of the nuances gamers care about, with the culminating quote “For gamers by gamers”, meaning that gamers know better how and what to localize. (I subconsciously thought of doctors and their reaction to a patient saying “I know better which medicine to take in my case”.) But the guy might be right to some extent. It is not the translator who will consume the final product. To tell you the truth, when working as a translator I never went back to work done long ago to re-evaluate my performance. Life was always too short. Job's done, no feedback to implement, money paid: adieu! I did not care anymore.

But does that mean that a guy who lives more in his virtual life than in reality would do the localization better than me? I did spend five and a half years at university studying languages. I did spend nights in my student years doing freelance translation.

The truth is in the middle, as always. At least I tend to think so. In an ideal world I would combine the guy’s passion for meaningful game dialogs with my passion for doing this the right way. Thank God, we live in the 21st century. Obviously I am speaking about crowdsourcing, or more precisely about controlled crowdsourcing. Controlled in the sense that there is a dedicated person who watches the deadlines, progress and quality: a person who can step in to choose the more suitable option, who ensures there are no grammar or style mistakes and who, finally, takes responsibility for the quality of the target translation. This still allows game developers to engage the community in creating the local version of their product. This still makes the localization cheaper. On the other hand, the process is being watched, and in case of unpleasant or pleasant surprises the producer is informed and advised on the actions to take. I would feel safer if I were a game developer.

The traditional approach of engaging professional translators in game localization is still in place. And of course there are translators who specialize precisely in games and are native speakers of the target languages (we have been cooperating with many such people, and they are very proficient). This approach guarantees quality and consistency but is more expensive. I would recommend the traditional approach to game developers who want to localize but do not have a community yet, or whose community is only at the inception stage.

by Katia Kosovan, Department Manager
Localization and QA unit