Wednesday, May 5, 2010

The performance of JPEG implementations

1. DCT ALGORITHMS

JPEG compression is a method consisting of the following steps:

  1. Transform the color coordinates to an opponent color system
  2. Subsample the chromatic coordinates, for which the human visual system's (HVS) modulation transfer function (MTF) extends to only about half the spatial frequency
  3. Decorrelation: perform a discrete cosine transformation (DCT) to de-correlate the spatial
    information
  4. Quantization: truncate the coordinates to exclude spatial information invisible to the HVS
  5. Entropy coding: compress the data using run-length encoding followed by Huffman encoding
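As a concrete illustration of steps 1 and 2, here is a minimal sketch in Python. It is not taken from any JPEG library; the coefficients are the common ITU-R BT.601 values used by JFIF, and the function names are mine:

```python
def rgb_to_ycbcr(r, g, b):
    """Step 1: opponent color transform (JFIF/BT.601 coefficients)."""
    y  =  0.299    * r + 0.587    * g + 0.114    * b
    cb = -0.168736 * r - 0.331264 * g + 0.5      * b + 128
    cr =  0.5      * r - 0.418688 * g - 0.081312 * b + 128
    return y, cb, cr

def subsample_2to1(row):
    """Step 2: halve the chroma resolution by averaging pixel pairs."""
    return [(row[i] + row[i + 1]) / 2 for i in range(0, len(row) - 1, 2)]
```

For a neutral gray (r = g = b) the chroma outputs are exactly 128, since the Cb and Cr coefficient rows sum to zero.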

JPEG goes back to when PCs had 640K bytes of memory and processors had a clock rate
of a few MHz. Since the computational bottleneck is the DCT, considerable effort went
into optimizing it. Computed directly, the 2-dimensional DCT of an 8x8 block requires 4,096
multiply-accumulate operations. The straightforward optimization is to exploit the DCT's
separability and compute two passes of 1-dimensional DCTs, for a total of 1,024
multiplications and 896 additions.
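The separability argument can be checked numerically. The sketch below (plain Python, not library code) computes the 8-point 1-D DCT-II directly and builds the 2-D transform as row passes followed by column passes; the result matches the direct double-sum definition:

```python
import math

N = 8

def c(k):
    # Normalization factor of the orthonormal DCT-II.
    return math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)

def dct_1d(x):
    # Direct 1-D DCT-II: 64 multiply-accumulates for N = 8.
    return [c(k) * sum(x[n] * math.cos((2 * n + 1) * k * math.pi / (2 * N))
                       for n in range(N)) for k in range(N)]

def dct_2d_separable(block):
    # 8 row passes, then 8 column passes: 16 x 64 = 1,024 multiplications.
    rows = [dct_1d(row) for row in block]
    cols = [dct_1d([rows[i][v] for i in range(N)]) for v in range(N)]
    return [[cols[v][u] for v in range(N)] for u in range(N)]

def dct_2d_direct(block):
    # Direct 2-D definition: 64 coefficients x 64 inputs = 4,096 MACs.
    return [[c(u) * c(v) * sum(block[i][j]
                  * math.cos((2 * i + 1) * u * math.pi / (2 * N))
                  * math.cos((2 * j + 1) * v * math.pi / (2 * N))
                  for i in range(N) for j in range(N))
             for v in range(N)] for u in range(N)]
```

Both routines produce the same coefficients to within floating-point rounding; only the operation counts differ.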

The key strategy for optimizing the DCT is to express it in vector-matrix form, especially
when targeting array computation as is done in CUDA on a GPU [6]. This leads to a number of
fast algorithms that trade some arithmetic precision for fewer operations; for the details refer
to [2, Section 3.5]. Today the algorithm by Arai, Agui, and Nakajima [1] is considered to be the
best for computation on CPU cores; its complexity for an 8x8 block is 464 additions and 80 multiplications.
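The quoted complexity follows directly from AA&N's per-pass counts, assuming the published figures of 5 multiplications and 29 additions for one 1-D 8-point scaled DCT, applied over the 16 row and column passes of an 8x8 block:

```python
# AA&N's published counts for one 1-D 8-point scaled DCT.
MULS_PER_PASS = 5
ADDS_PER_PASS = 29

# An 8x8 block needs 8 row passes plus 8 column passes.
passes = 8 + 8
muls = passes * MULS_PER_PASS   # 80 multiplications
adds = passes * ADDS_PER_PASS   # 464 additions
print(muls, adds)
```

Note that AA&N produce a scaled DCT; the missing scaling factors are folded into the quantization table, which is why the multiplication count can be this low.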

2. JPEG IMPLEMENTATIONS

In 1991 Thomas G. Lane implemented most of the JPEG standard in a free library released by the Independent JPEG Group; it is generally known as the IJG library or libjpeg. The most recent version is 6b, released 27 March 1998. It contains several DCT implementations, all based on AA&N's algorithm. IJG's aim was to be very efficient, portable, and maintainable.

IJG is generally considered to be the fastest portable CPU implementation around. According to an article by Mark Nelson in Dr. Dobb's Journal, it is twice as fast as Microsoft's implementation. There is also agreement that anybody wishing to study a JPEG implementation should study the IJG library because it is readable (see e.g. the web site of Prof. Norishige Fukushima [3] at Nagoya University). In fact, it appears that most if not all implementations are derived from the IJG library.

More recently, a SIMD version of the IJG library has become available. Various comments and postings on the Internet suggest it is a drop-in replacement for the IJG library. However, it is not very readable, and the general advice is to develop with the plain IJG library first and substitute the SIMD version only once one's code is debugged and finished.

Intel sells a $199.00 library called the Integrated Performance Primitives (IPP). Two JPEG implementations are based on this library, both claiming to be drop-in compatible with IJG: an example by Intel on how to use IPP, and the open source project TurboJPEG.

nVIDIA offers only an example of the inverse DCT in CUDA, not a full JPEG implementation. As late as June 30, 2009, Christian Buchner posted on an nVIDIA forum that it is not known whether JPEG can be decoded on a GPU. A bid request on RentACoder for a CUDA-based JPEG resizer was cancelled due to a time-out.

The student Ramazan Dinçer has translated the JPEG decoder in the IJG library to CUDA, but his implementation also requires an old deprecated version of IPP and an old version of DirectX. As far as I know, this is the only open source code available.

3. PERFORMANCE

Performance comparison between the conventional IJG library, its SIMD version, and the Intel example based on IPP. The test image is Lenna at 512x512 pixels; the processor is an Intel Core 2 Duo at 2.8 GHz with 5 GB of RAM. Credit: Norishige Fukushima.

The figure above is from Fukushima's web site. As can be seen, the SIMD version of IJG is twice as fast as the conventional implementation. The only reasons not to use it are code maintainability, or when the performance is not needed and one wants to keep some of the CPU cores free for other tasks.

If the IPP implementation is more than 3 times faster than IJG, why is it not used exclusively? The reason is that IPP drives all cores at 100%, which can cause thermal problems when the function is used for extended periods. Also, the computer cannot be used for anything else, as IPP completely takes over all resources.

It is not possible to implement the full JPEG algorithm on a GPU in CUDA, because the entropy coding step is inherently sequential. Generally, three of the steps (see Sec. 1) can be executed on a GPU: color transformation and subsampling in one kernel and the DCT in a second kernel. However, due to a contention issue when accessing the local memory, a straight CUDA implementation of the IJG library code is slower than when executed on the CPU [6]. Obukhov et al. suggest a completely different implementation, which on my PC (HP xw4600 with GeForce 8800 GTS) decreases the execution time from 0.44 ms to 0.14 ms on the Barbara target image.

4. WHY BOTHER WITH THE GPU?

Stefan Lietsch [4] found a performance bug in IJG, namely that it does not scale. When the image is about 1680x1050 pixels or larger, the performance suddenly drops considerably and there is large variation depending on the spatial image contents [5, p. 665]. Lietsch and Lensing have not been able to find the problem, but speculate it might be caused by memory or cache limitations.

Since they could not use TurboJPEG because it saturates the CPU, they decided to bite the bullet and port IJG to CUDA. This can be done cleanly because IJG uses a registration mechanism for the subroutines implementing each step, and the wrappers for the CUDA kernels can be given the same interface as the IJG subroutines, even though the number and sequence of steps differ.
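The registration idea can be sketched as follows. This is a hypothetical Python pseudostructure, not IJG's actual C API: a codec object holds a method table of per-step subroutines, so a GPU wrapper with the same signature can be swapped in without touching the driver loop:

```python
def cpu_dct(block):
    """Stand-in for the CPU forward-DCT subroutine."""
    return ("cpu_dct", block)

def cuda_dct(block):
    """Stand-in for a wrapper that would launch a CUDA DCT kernel."""
    return ("cuda_dct", block)

class Codec:
    def __init__(self):
        # Method table: each pipeline step registers its implementation here.
        self.methods = {"forward_dct": cpu_dct}

    def register(self, step, fn):
        # A GPU port swaps in a wrapper with the same interface.
        self.methods[step] = fn

    def compress_block(self, block):
        # The driver loop dispatches through the table and never needs
        # to know which implementation is installed.
        return self.methods["forward_dct"](block)
```

Replacing an implementation is then a one-line call, e.g. `codec.register("forward_dct", cuda_dct)`, which is what makes a step-by-step port feasible.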

Implementing the forward JPEG transformation took six months of Lensing's time and five months of Lietsch's time (see here for the details). The figure below compares the performance of their CUDA implementation with that of the IJG library and the IPP-based implementation. The CUDA version is about twice as fast as the IJG version, but only half as fast as the IPP version. However, there is a large error bar on the IJG measurement, reflecting the high variability due to the problem described at the beginning of this section.

Performance comparison compressing a 1680x1050 pixel image using the IJG library, its CUDA implementation, and Intel's IPP library. Credit: Stefan Lietsch

The CUDA implementation is just a straight port of the IJG library. As noted in Sec. 3, the inverse DCT can be made three times as fast. Since Lietsch and Lensing do not itemize their performance data, it is not possible to estimate the impact of the faster inverse DCT on the decompression of an entire image. Lietsch notes [4, p. 98] that Huffman decoding is already fast on the CPU and the quantization step is not present in decompression, so parallel decompression does not achieve a significant performance increase, other than freeing up cycles on the CPU.

REFERENCES

  1. Yukihiro Arai, Takeshi Agui, and Masayuki Nakajima, A fast DCT-SQ scheme for images, IEICE Transactions E71–E (1988), no. 11, 1095–1097
  2. Vasudev Bhaskaran and Konstantin Konstantinides, Image and video compression standards, second ed., Kluwer Academic Publishers, Norwell (MA), 1997
  3. Norishige Fukushima, Encoding and decoding JPEG files in main memory, Web site, 2009, http://nma.web.nitech.ac.jp/fukushima/opencv/jpegio.html
  4. Stefan Lietsch, A novel approach to interactive, distributed visualization and simulation on hybrid cluster systems, Ph.D. thesis, Universität Paderborn, Fakultät für Elektrotechnik, Informatik und Mathematik, Institut für Informatik, July 2008
  5. Stefan Lietsch and Paul Hermann Lensing, GPU-supported image compression for remote visualization — realization and benchmarking, ISVC '08: Proceedings of the 4th International Symposium on Advances in Visual Computing, Springer-Verlag, Las Vegas, NV, 2008, pp. 658–668
  6. Anton Obukhov and Alexander Kharlamov, Discrete cosine transform for 8x8 blocks with CUDA, White paper, nVIDIA, October 2008

4 comments:

  1. With all those GPU cycles, why DCT and not FFT?

  2. Because: ISO/IEC 10918-3

  3. Interesting paper. I was the creator of TurboJPEG/IPP and am currently maintaining libjpeg-turbo, which builds upon the libjpeg/SIMD code developed by Miyasaka Masaru (I presume that is the "IJG SIMD" implementation you refer to above.)

    I would be interested to see how the current libjpeg-turbo compares. Relative to libjpeg/SIMD, we added SSE2 instructions and improved Huffman performance (especially on 64-bit platforms, where the algorithms can take advantage of the increased word size.)

  4. Fast CUDA JPEG encoder
    http://www.fastvideo.ru/english/products/software/cuda-jpeg-encoder.htm
