Video Coding Walkthrough
This example shows how a video frame is encoded (compressed) and decoded (decompressed). The frame is processed using a simple motion compensated transform encoder, similar to an early MPEG codec. However the basic principles apply to more sophisticated codecs such as H.264, VP8, etc.
The original frame is taken from a CIF video sequence, which means that the luma or luminance component shown in Fig. 1 is 352 samples wide and 288 samples high. The red/blue chroma components, not shown, are each 176x144 samples.
Fig. 2 shows the reconstructed previous frame in the video sequence. This has been encoded and decoded (=reconstructed) and some distortion can be seen. The diﬀerence between Fig 1 and Fig 2 without motion compensation is shown in Fig. 3. There is clearly still significant energy in Fig. 3, especially around moving areas.
Fig 1 - Input frame
Fig 2 - Reconstructed reference frame
Fig 3 - Residual: no motion compensation
Motion estimation is carried out with a 16x16 block size and half-pixel accuracy, producing the set of vectors shown in Fig. 4, superimposed on the current frame for clarity. Many of the vectors are zero and are shown as dots, which means that the best match for the macroblock is in the same position in the reference frame. Around moving areas, the vectors point in the direction that blocks have moved from. The man standing on the left of the picture is walking to the left; the vectors therefore point to the right, i.e. where he has come from. Some of the vectors do not appear to correspond to “real” movement, for example those on the surface of the table, but indicate simply that the best match is not at the same position in the reference frame. “Noisy” vectors like these often occur in regions of the picture where there are no clear object features in the reference frame.
Fig 4 - 16x16 motion vectors superimposed on frame
Fig. 5 shows the motion compensated reference frame, i.e. the reference frame with each block shifted according to the motion vectors. The walking person has been moved to the left to provide a better match for the same person in the current frame and the hand of the left-most person has been moved down to provide an improved match. Subtracting the motion compensated reference frame from the current frame gives the motion-compensated residual (Fig. 6). Compared with Fig. 3, the energy has clearly been reduced, particularly around moving areas.
Fig 5 - Motion compensated reference frame
Fig 6 - Motion compensated residual frame
Fig. 7 shows a macroblock from the original frame, taken from around the head of the figure on the right. Fig. 8 is the luma component of the same macroblock after motion compensation. This is known as the residual.
Fig 7 - Original macroblock: luminance
Fig 8 - Residual macroblock: luminance
Each block within the residual macroblock is transformed using a Discrete Cosine Transform (DCT). Applying a two-dimensional DCT to the top-right 8x8 block of luminance samples (Table 1) produces the DCT coeﬃcients listed in Table 2. The magnitude of each coeﬃcient is shown in Fig. 9. Note that the larger coeﬃcients are clustered around the top-left or DC coeﬃcient.
Fig 9 - DCT coefficient magnitudes: top-right 8x8 block
Table 1 - Residual luminance samples : top-right 8x8 block
Table 2 - DCT coefficients
A forward quantizer is applied:
Qcoeff = round(coeff/Qstep)
where Qstep is the quantizer step size, Qstep=12 in this example. The output of the quantizer is shown in Table 3. Small-valued coefficients become zero in the quantized block and the non-zero outputs are clustered around the top-left (DC) coefficient.
Table 3 - Quantized coefficients
The quantized block is re-ordered in a zigzag scan starting at the top-left to produce the following list of coefficients:
-1, 2, 1, -1, -1, 2, 0, -1, 1, -1, 2, -1, -1, 0, 0, -1, 0, 0, 0, -1, -1, 0, 0, 0, 0, 0, 1, 0,… (all zeros after this)
This array is processed to produce a series of (run, level) pairs:
The first number in each pair indicates the number of preceding zeros and the second number indicates the non-zero “level”. “EOB” (End Of Block) indicates that the remainder of the coefficients are zero.
Each (run, level) pair is encoded as a variable length code (VLC). Using the variable length code tables from MPEG-4 Visual, the VLCs shown in Table 4 are produced.
Table 4 - Variable length coding example
The final VLC signals that LAST=1, indicating that this is the end of the block. The motion vector for this macroblock is (0, 1), i.e. the vector points downwards. The motion vector X and Y components are (differentially) coded as (1) and (0010) respectively.
The macroblock is transmitted as a series of VLCs, including a macroblock header, coded motion vector and transform coeﬃcients for each 8x8 block.
At the decoder, the VLC sequence is decoded to extract header parameters, coded motion vectors and (run,level) pairs for each block. The 64-element array of coeﬃcients is reconstructed by inserting (run) zeros before every (level). The array is then ordered to produce an 8x8 block identical to Table 3. The quantized coeﬃcients are rescaled using:
Rcoeﬀ = Qstep.Qcoeﬀ
Rescaling or inverse quantization produces the block of coeﬃcients shown in Table 5. This block is significantly diﬀerent from the original DCT coeﬃcients (Table 2) due to the quantization process. An Inverse DCT is applied to create a decoded residual block (Table 6) which is similar but not identical to the original residual block (Table 1). Comparing the original and decoded residual blocks in Fig. 10, it is clear that the decoded block has less high-frequency variation because of the loss of high-frequency DCT coeﬃcients through quantization.
Table 5 - Rescaled coeﬃcients
Table 6 - Decoded residual luminance samples
Fig 10 - Original and decoded residual blocks
The decoder recreates the original motion vector (0, 1). Using this vector, together with its own copy of the previously decoded frame (Fig. 2), the decoder reconstructs the macroblock. The complete decoded frame is shown in Fig. 11. Because of the quantization process, some distortion has been introduced, for example around detailed areas such as the faces and the writing on the board and there are some obvious edges along 8x8 block boundaries. The complete sequence was compressed by around 300 times, i.e. the coded sequence occupies less than 1/300 the size of the uncompressed video. Significant compression is achieved at the expense of relatively poor image quality.
Fig 11 - Decoded Frame
Iain E Richardson, “The H.264 Advanced Video Compression Standard”, John Wiley & Sons, 2010.
About the author
Vcodex is led by Professor Iain Richardson, an internationally known expert on the MPEG and H.264 video compression standards. Based in Aberdeen, Scotland, he frequently travels to the US and Europe.
Professor Richardson is the author of “The H.264 Advanced Video Compression Standard”, a widely cited work in the research literature. He has written three further books and over 50 journal and conference papers on image and video compression. He regularly advises companies on video codec technology, video coding patents and mergers/acquisitions in the video coding industry. Professor Richardson leads an internationally renowned image and video coding research team, contributes to the MPEG industry standards group and is sought after as an expert witness and litigation consultant.