GSoC - GPU Assisted Video Decoding

GPU Hardware Accelerated Video Decoding of H.264 (MPEG-4 Part 10, a.k.a. MPEG-4 AVC) encoded video.


 * Proposed by student: Rudd
 * Proposed primary mentor: D4rk
 * Proposed backup mentor: Elupus


 * XBMC community forum discussion threads:
 * GPU assisted H.264 Decoding (GSoC Project 2008)

= GSoC proposal for 2008 =
This is a discussion about a proposal for XBMC's Google Summer of Code 2008.

Name
Robert Rudd (IRC & XBMC.org username: Rudd)

E-Mail
ruddrob@gmail.com

Project Title
GPU Hardware Accelerated Video Decoding of MPEG-4 encoded video

Benefits

 * 1) This project will improve the end-user experience tremendously. XBMC is a prime choice for many users who have an HTPC. These machines traditionally don't have much processing power to spare, so any computation that can be offloaded to otherwise idle hardware is a direct benefit.
 * 2) Exploring new territory: There has been much talk about using GPUs for general-purpose computation. An actual implementation of a video decoder that is not specific to any one GPU vendor would be quite the achievement!

Project Abstract
It is not atypical for a modern computer to have two primary processing units: a Central Processing Unit (CPU), capable of on the order of 1 GFLOPS, and a Graphics Processing Unit (GPU), capable of on the order of 100 GFLOPS. Despite this, when playing a high-definition H.264 video, which requires on the order of tens of millions of operations per second, only the CPU is used. Even accounting for the fact that the GPU is highly optimized for floating-point operations, while the H.264 codec strives to avoid them, it is clear that a large amount of processing power is being left unused.

I propose that hardware-assisted H.264 decoding be integrated into XBMC. While I plan on spending some of the time leading up to this summer researching the ideal architecture for the implementation, I am currently leaning towards an OpenGL Shading Language (GLSL) implementation to keep it as portable and vendor-neutral as possible.

Detailed Description
This proposal outlines an implementation of hardware H.264 decoding for the DVDPlayer (XBMC's own in-house developed video player based on FFmpeg). This project can be broken into two primary components: architecture research and implementation.

Architecture Research
While it pains me to include something as nebulous as "research" in my project proposal, I do feel that the time spent researching alternatives is necessary and will be well spent. Using a GPU for general-purpose computing (the GPGPU movement) is a rather new field of inquiry. There are very few accepted standards, with each vendor proposing its own solution. Thus far, possible initiatives for GPU-accelerated video coding/decoding include Intel's VAAPI, Nvidia's CUDA/FreeVideo, and ATI’s Avivo/CTM. This is in addition to any video capabilities within the DirectX and OpenGL APIs and several GPGPU libraries such as Brook and libsh.

At the time of this writing, I am strongly leaning towards an implementation done in the OpenGL Shading Language (GLSL). Unlike many of the other architectures, GLSL was not made with the express purpose of allowing a GPU to perform general-purpose calculations. However, it provides enough functionality to do video decoding calculations, and it is part of an API that is almost universally supported by GPUs and operating systems. Since XBMC strives to be as portable as possible, I feel that GLSL is the most advantageous choice. That said, I will not overlook other architectures; the GPGPU libraries in particular look very promising.

Implementation
While the above research is not to be underestimated, the core of the project will be the implementation of hardware acceleration in whatever architecture I choose. Over the course of the summer, I hope to offload as much of the H.264 processing workload as I can onto the GPU. I will concentrate on the following 2 areas:

Motion Compensation
Motion compensation is one of the most computationally intensive portions of the H.264 decoding process. Because H.264 supports up to quarter-pixel interpolation, there may be an enormous amount of computation for a single frame. Additionally, each macroblock is allowed to point not only to any position in a given reference picture, but also to different reference pictures, which can wreak havoc on the system's cache.

However, motion compensation is parallelizable across macroblocks. Each macroblock can be interpolated independently of the others, presenting a problem that maps easily onto the GPU.
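To make the cost and the independence concrete, here is a minimal one-dimensional sketch of H.264 luma sub-sample interpolation in Python. It uses the standard 6-tap filter (1, -5, 20, 20, -5, 1) for half-sample positions and a rounded average for quarter-sample positions. The function names and the edge-padding helper are illustrative assumptions and are not taken from FFmpeg or XBMC; a real decoder works in two dimensions and on the GPU this logic would live in a fragment shader.

```python
def clip8(value):
    """Clamp a value to the 8-bit sample range 0..255."""
    return max(0, min(255, value))


def sample(row, x):
    """Read a reference sample, repeating the edge sample outside the row."""
    return row[max(0, min(len(row) - 1, x))]


def half_pel(row, x):
    """Half-sample value between row[x] and row[x + 1] (6-tap filter)."""
    taps = (1, -5, 20, 20, -5, 1)
    acc = sum(t * sample(row, x - 2 + i) for i, t in enumerate(taps))
    return clip8((acc + 16) >> 5)  # round, divide by 32, clip


def quarter_pel(row, x):
    """Quarter-sample value between row[x] and the half-sample to its right."""
    return (sample(row, x) + half_pel(row, x) + 1) >> 1


def interpolate_row(row):
    """Each output depends only on the read-only reference row, so every
    position could be computed by an independent GPU thread."""
    return [half_pel(row, x) for x in range(len(row))]


edge = [0, 0, 0, 0, 255, 255, 255, 255]
print(half_pel(edge, 3))     # 128
print(quarter_pel(edge, 3))  # 64
```

Because no output sample reads another output sample, the loop in `interpolate_row` has no ordering constraints, which is the property that makes the problem a good fit for the GPU.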

The approach that will be taken will most likely be one of the 2 presented in the following papers:


 * Performance evaluation of H.264/AVC decoding and visualization using the GPU

Outlines a 6-pass strategy: 1 pass for fullpel, 2 for halfpel, and 3 for quarterpel. A quad is rendered for each block/subblock in the frame. The motion vector is repeated at each of its vertices, and the vertex shader translates the texture coordinates appropriately. The vertex shader also handles pushing irrelevant blocks out of the viewing frustum to reduce computation. Afterwards, a fragment shader is used to perform the actual interpolation.


 * Real-time high definition H.264 video decode using the Xbox 360 GPU

Outlines a 3-pass strategy: 1 pass for fullpel, 1 for off-center interpolations, and 1 for center interpolations. The approach is similar to the one above, but the blocks are batched differently.
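Both strategies need to know what kind of sample position each motion vector points to before blocks can be assigned to a pass. The following Python sketch shows one way to do that sorting on the CPU, relying on the fact that H.264 luma motion vectors are stored in quarter-sample units. The three bucket names and the function names are assumptions made for illustration; the exact batching rules of the two papers are not reproduced here.

```python
def classify_motion_vector(mv_x, mv_y):
    """Split a quarter-sample motion vector into its integer part and a
    label describing the interpolation it needs."""
    frac_x, frac_y = mv_x & 3, mv_y & 3    # fractional part, 0..3
    int_x, int_y = mv_x >> 2, mv_y >> 2    # whole-sample part (floored)
    if frac_x == 0 and frac_y == 0:
        kind = "fullpel"
    elif frac_x % 2 == 0 and frac_y % 2 == 0:
        kind = "halfpel"
    else:
        kind = "quarterpel"
    return int_x, int_y, kind


def group_blocks_by_pass(motion_vectors):
    """Return block indices grouped by the kind of interpolation needed."""
    groups = {"fullpel": [], "halfpel": [], "quarterpel": []}
    for index, (mv_x, mv_y) in enumerate(motion_vectors):
        groups[classify_motion_vector(mv_x, mv_y)[2]].append(index)
    return groups


print(group_blocks_by_pass([(8, -4), (6, 0), (-1, 2), (0, 0)]))
# {'fullpel': [0, 3], 'halfpel': [1], 'quarterpel': [2]}
```

Each group could then be drawn in its own pass, so that every fragment in a pass runs the same shader path.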

Inverse Transformation and Reconstruction
The H.264 codec uses a frequency transform similar to the DCT, but based on integer math, to reduce spatial redundancy. The decoder must perform the inverse transform to recover the residual frame, that is, the frame containing the error between the actual frame and the motion-compensated frame. This residual frame is added to the motion-compensated frame to get the final frame to be displayed.
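As a rough illustration of these two steps, the Python sketch below applies the H.264 4x4 inverse integer transform to one block and then adds the residual to a prediction. It assumes the coefficients have already been dequantized (scaled), covers only the 4x4 transform, and uses function names invented for this example rather than anything from FFmpeg.

```python
def inverse_1d(d):
    """One-dimensional 4-point inverse core transform (adds and shifts only)."""
    e0 = d[0] + d[2]
    e1 = d[0] - d[2]
    e2 = (d[1] >> 1) - d[3]
    e3 = d[1] + (d[3] >> 1)
    return [e0 + e3, e1 + e2, e1 - e2, e0 - e3]


def inverse_transform_4x4(coeffs):
    """Transform rows, then columns, then round and divide by 64."""
    rows = [inverse_1d(r) for r in coeffs]
    cols = [inverse_1d([rows[i][j] for i in range(4)]) for j in range(4)]
    # cols[j][i] holds the value for row i, column j
    return [[(cols[j][i] + 32) >> 6 for j in range(4)] for i in range(4)]


def reconstruct(prediction, residual):
    """Add the residual to the motion-compensated block, clipping to 0..255."""
    return [[max(0, min(255, p + r)) for p, r in zip(p_row, r_row)]
            for p_row, r_row in zip(prediction, residual)]


dc_only = [[64, 0, 0, 0]] + [[0, 0, 0, 0] for _ in range(3)]
residual = inverse_transform_4x4(dc_only)
print(residual[0])                                # [1, 1, 1, 1]
print(reconstruct([[100] * 4] * 4, residual)[0])  # [101, 101, 101, 101]
```

Every 4x4 block is transformed independently of its neighbors, so this step is parallel across blocks in the same way motion compensation is.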

FFmpeg makes heavy use of SIMD operations and is relatively fast when it comes to the inverse transform. Ideally, however, this portion will also need to be moved onto the GPU, if only to reduce GPU<->CPU transfers.
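A back-of-the-envelope estimate gives a feel for the transfer cost of keeping the inverse transform on the CPU while motion compensation runs on the GPU. The Python sketch below computes the size of one dense residual frame. The frame size (1920x1080), 4:2:0 chroma subsampling, 16-bit residual samples, and the 24 frames per second rate are all assumptions chosen for illustration, not measurements.

```python
def bytes_per_frame(width, height, bytes_per_sample):
    """Size of one 4:2:0 frame: a full-size luma plane plus two
    chroma planes at half resolution in each direction."""
    luma = width * height
    chroma = 2 * (width // 2) * (height // 2)
    return (luma + chroma) * bytes_per_sample


residual_frame = bytes_per_frame(1920, 1080, 2)  # assumed 16-bit residuals
print(residual_frame)       # 6220800 bytes per frame
print(residual_frame * 24)  # 149299200 bytes per second at an assumed 24 fps
```

Under these assumptions, uploading residuals alone would cost roughly 150 MB per second, which is the kind of traffic that performing the transform on the GPU is meant to avoid.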

Deadlines and Deliverables
This is only a very tentative outline. Any given module may be completed in a longer or shorter period. It simply serves as a guideline for what I hope to achieve and the order I plan to set about doing it. At the very least I am hoping I can get the two primary modules (motion compensation and IDCT) implemented and integrated with the SVN repository over the summer.


 * 1) Architecture Research - 2 Weeks - Researching and choosing the architecture of the hardware decoder. At the end of this period I will have code running from within XBMC that uses the chosen architecture.
 * 2) H.264 Decoder Skeleton - 1 Week - A skeleton of my H.264 decoder will be written. It will be composed mostly of the software implementation currently within XBMC (which I believe is handled by FFmpeg at the moment), whose parts will be replaced with the hardware components as they are written.
 * 3) Motion Compensation - 3 Weeks - Implementation of motion compensation using the chosen architecture.
 * 4) Inverse Transformation - 3 Weeks - Implementation of the inverse transformation. Success will be judged by lowering the number of CPU cycles required to decode a given H.264 video relative to the initial software implementation.
 * 5) Testing and Final Commit - 2 Weeks - Extensive testing, as well as modifying the code to fit within the XBMC repository. Any leftover time will be spent researching other areas to accelerate with hardware.

Personal Statement
I am currently a 21-year-old senior at the Massachusetts Institute of Technology majoring in Computer Science and Engineering. In the fall of 2008, I will enter an MIT Master’s program in Electrical Engineering and Computer Science. Systems programming and signal processing have always been interests of mine. I feel that this project will offer me a unique chance to blend my interest in low-level systems programming with my passion for signal processing.

I have a lot of experience coding at all levels of complexity, from creating chess software in Java to implementing an operating system on the level of Unix Version 6 in C and x86 assembly. Working on this project will allow me to grow as a programmer as well as contribute to excellent open source software.