Wednesday, May 5, 2010

The performance of JPEG implementations

1. DCT ALGORITHMS

JPEG compression is a method consisting of the following steps:

  1. Transform the color coordinates to an opponent color system
  2. Subsample the chromatic coordinates, for which the human visual system's (HVS) MTF is about half
  3. Decorrelation: perform a discrete cosine transformation (DCT) to de-correlate the spatial
    information
  4. Quantization: truncate the coordinates to exclude spatial information invisible to the HVS
  5. Entropy coding: compress the data using run-length encoding followed by Huffman encoding

JPEG goes back to when PCs had 640K bytes of memory and processors had a clock rate
of a few MHz. Since the computational bottleneck is the DCT, a considerable effort went
into optimizing it. The 2-dimensional DCT requires 4,096 multiply-accumulate operations. The
straightforward optimization is to exploit the DCT's separability and compute two 1-dimensional
DCTs for a total of 1,024 multiply-accumulate operations and 896 additions.

The key strategy to optimize the DCT is to express it in vector-matrix form, especially for
optimizing it for array computation as is done in CUDA on a GPU [6]. This leads to a number of
fast algorithms, which trade some arithmetic precision for less operations; for the details refer
to [2, Section 3.5]. Today the algorithm by Arai, Agui, and Nakajima [1] is considered to be the
best for computation on CPU cores; its complexity is 464 additions and 80 multiplications.

2. JPEG IMPLEMENTATIONS

In 1991 Thomas G. Lane implemented most of the JPEG standard in a free library by the Independent JPEG Group and is generally known as IJG Library or libjpeg. The most recent version is 6, released 27 March 1998. It contains several DCT implementations, all based on AA&N's algorithm. IJG's aim was to be very efficient, portable, and maintainable.

IJG is generally considered to be the fastest portable CPU implementation around. According to an article by Mark Nelson in Dr. Dobb's Journal, it is twice as fast as Microsoft's implementation. There is also agreement that anybody wishing to study a JPEG implementation should study the IJG library because it is readable (see e.g. the web site of Prof. Norishige Fukushima [3] at Nagoya University). In fact, it appears that most if not all implementations are derived from the IJG library.

More recently, a SIMD version of the IJG has become available. Various comments and postings on the Internet suggest this is a drop-in replacement for the IJG library. However, is is not very readable and it is generally suggested to use IJG first and when one's code is debugged and finished to substitute in the SIMD IJG.

Intel sells a $199.00 library called Integrated Performance Primitives (IPP). Based on this library there are two JPEG implementations, both claiming to be drop-in compatible with IJG: an example by Intel on how to use IPP and the open source project turboJPEG.

nVIDIA has only an example of the inverse DCT, not a full JPEG implementation on CUDA. As late as June 30, 2009 Christian Buchner posted on an nVIDIA forum that it is not known if JPEG can be decoded on a GPU. A bid request on RentACoder for a CUDA based JPEG resizer was cancelled due to time-out.

The student Ramazan Dinçer has translated the JPEG decoder in the IJG library to CUDA, but his implementation also requires an old deprecated version of IPP and an old version of DirectX. As far as I know, this is the only open source code available.

3. PERFORMANCE

Performance comparison between conventional IJG, the SIMD version, and the Intel example based on IPP. The test image is Lenna at 512x51$ pixels and the processor is an Intel Core2Duo 2.8 GHz and 5G bytes of RAM. Credit: Norishige Fukushima.

The figure above is from Fukushima's web site. As can be seen, the SIMD version of IJG is twice as fast as the conventional implementation. The only reason not to use it is code maintainability or when the performance is not needed and one needs to use some of the CPU cores for other tasks.

The IPP implementation is more than 3 times faster than IJG, why would it not be used exclusively? The reason is that IPP uses all cores at 100% performance, which can cause thermal problems when the function is used for extended periods. Also, the computer cannot be used for anything else, as IPP completely takes over all resources.

It is not possible to implement the full JPEG algorithm on a GPU in CUDA. Generally, three of the steps (see Sec. 1) can be executed on a GPU: color transformation and subsampling in one kernel and DCT in a second kernel. However, due to a contention issue accessing the local memory, a straight CUDA implementation of the IJG library code is slower than when executed on the CPU [6]. Obukov et al. suggest a completely different implementation, which on my PC (HP xw4600 with GeForce 8800 GTS) decreases the execution time from 0.44 ms to 0.14 ms on the Barbara target image.

4. WHY BOTHER WITH THE GPU?

Stefan Lietsch [4] found a performance bug in IJG, namely that is does not scale. When the image is about 1680x1050 pixels or larger, the performance suddenly drops considerably and there is large variation depending on the spatial image contents [5, p. 665]. Liestch and Lensing have not been able to find the problem, but speculate it might be caused by memory or cache limitations.

Since they were not able to use turboJPEG due to its saturating the CPU, they decided to bite the bullet and port IJG to CUDA. This can be done cleanly because IJG uses a registration mechanism for the subroutines implementing each step and the wrapper for the CUDA kernels can be made to have the same interface as the IJG subroutines, although the number and sequence of steps are different.

Implementing the forward JPEG transformation took six months of Lensing's time and five months of Lietsch's time (see here for the details). The figure below compares the performance of their CUDA implementation with that of the IJG library and the IPP-based implementation. The CUDA version is about twice as fast as the IJG version, but is only half as fast as the IPP version. However, there is a large error bar in the IJG implementation reflecting the high variability due to the problem described at the beginning of this section.

Performance comparison compressing a 1680x1050 pixel image using the IJG library, its CUDA implementation, and Intel's IPP library. Credit: Stefan Lietsch

The CUDA implementation is just a straight port of the IJG library. As noted in Sect. 3, the inverse DCT can be made three times as fast. Since Lietsch and Lensing do not itemize their performance data, it is not possible to estimate the impact achievable on the entire decompression of an image when using the faster inverse DCT. Lietsch notes [4, p. 98] that Huffman decoding is already fast on the CPU and the quantization step is not present, so parallel decompression does not achieve a significant performance increase, other than freeing up cycles on the CPU.

REFERENCES

  1. Yukihiro Arai, Takeshi Agui, and Masayuki Nakajima, A fast DCT-SQ scheme for images, IEICE Transactions E71–E (1988), no. 11, 1095–1097
  2. Vasudev Bhaskaran and Konstantin Konstantinides, Image and video compression standards, second ed., Kluwer Academic Publishers, Norwell (MA), 1997
  3. Norishige Fukushima, Encoding and decoding JPEG files in main memory, Web site, 2009, http: //nma.web.nitech.ac.jp/fukushima/opencv/jpegio.html
  4. Stefan Lietsch, A novel approach to interactive, distributed visualization and simulation on hybrid cluster systems, Ph.D. thesis, Universität Paderborn, Fakultät für Elektrotechnik, Informatik und Mathematik, Institut für Informatik, July 2008
  5. Stefan Lietsch and Paul Hermann Lensing, GPU-supported image compression for remote visualization — realization and benchmarking, ISVC '08: Proceedings of the 4th International Symposium on Advances in Visual Computing, Springer-Verlag, Las Vegas, NV, 2008, pp. 658–668
  6. Anton Obukhov and Alexander Kharlamov, Discrete cosine transform for 8x8 blocks with CUDA, White paper, nVIDIA, October 2008