[Modified and abridged] Analysis of x264 multithreading on Windows

Source: Internet  Published: 林宥嘉 知乎  Editor: 程序博客网  Time: 2024/06/12 01:21

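1. Building x264 with parallel encoding

x264 can run several encoding threads, that is, encode in parallel, when started with --threads n (n ≥ 2). However, a build made with the Visual Studio compiler on Windows does not actually encode with multiple threads at first. Adding --threads n only makes the console print x264 [ warning ] : not compiled with pthread support !, as shown in the figure below. The reason is that x264 uses pthreads for its threading and Windows has no native pthread implementation, so POSIX Threads (pthreads) for Win32 is needed.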

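(1 ) Download the Win32 version of pthreads from http://sourceware.org/pthreads-win32/ and add its include and lib directories to the VC++ include and library directories.

(2 ) In the project properties, under "C/C++ -> Preprocessor -> Preprocessor Definitions", add HAVE_PTHREAD and SYS_MINGW.

(3 ) In osdep.h, add #pragma comment(lib, "pthreadVC2.lib") immediately after #ifdef USE_REAL_PTHREAD, as shown below: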

#ifdef USE_REAL_PTHREAD
#pragma comment(lib, "pthreadVC2.lib")
#define x264_pthread_t               pthread_t
#define x264_pthread_create          pthread_create
#define x264_pthread_join            pthread_join
#define x264_pthread_mutex_t         pthread_mutex_t
#define x264_pthread_mutex_init      pthread_mutex_init
#define x264_pthread_mutex_destroy   pthread_mutex_destroy
#define x264_pthread_mutex_lock      pthread_mutex_lock
#define x264_pthread_mutex_unlock    pthread_mutex_unlock
#define x264_pthread_cond_t          pthread_cond_t
#define x264_pthread_cond_init       pthread_cond_init
#define x264_pthread_cond_destroy    pthread_cond_destroy
#define x264_pthread_cond_broadcast  pthread_cond_broadcast
#define x264_pthread_cond_wait       pthread_cond_wait
#else
#define x264_pthread_mutex_t         int
#define x264_pthread_mutex_init(m,f)
#define x264_pthread_mutex_destroy(m)
#define x264_pthread_mutex_lock(m)
#define x264_pthread_mutex_unlock(m)
#define x264_pthread_cond_t          int
#define x264_pthread_cond_init(c,f)
#define x264_pthread_cond_destroy(c)
#define x264_pthread_cond_broadcast(c)
#define x264_pthread_cond_wait(c,m)
#endif

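This links against pthreadVC2.lib. Then rebuild.

Note that adjusting the project properties means changing them for both projects, libx264 and x264.

An x264 built with these changes can make full use of the CPU when run with --threads n (n ≥ 2), as shown in the figure below.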
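To make the wrapper-macro idea from osdep.h concrete, here is a minimal sketch of the same pattern outside x264. It is not x264 code: the my_* names, counter_t and count_with_workers() are hypothetical, and HAVE_PTHREAD is defined in the file only so that the sketch compiles on its own (in the setup above it comes from the project's preprocessor definitions).

```c
#include <assert.h>
#include <stddef.h>

#define HAVE_PTHREAD 1          /* assumption: normally set by the build */

#if HAVE_PTHREAD
#include <pthread.h>
#define my_thread_t        pthread_t
#define my_thread_create   pthread_create
#define my_thread_join     pthread_join
#define my_mutex_t         pthread_mutex_t
#define my_mutex_init      pthread_mutex_init
#define my_mutex_destroy   pthread_mutex_destroy
#define my_mutex_lock      pthread_mutex_lock
#define my_mutex_unlock    pthread_mutex_unlock
#else
/* Single-threaded fallback: the locking calls compile to nothing. */
#define my_mutex_t         int
#define my_mutex_init(m,a)
#define my_mutex_destroy(m)
#define my_mutex_lock(m)
#define my_mutex_unlock(m)
#endif

#define MAX_WORKERS 16

typedef struct {
    my_mutex_t lock;
    long       total;
    int        per_worker;
} counter_t;

/* Each worker adds per_worker units to the shared total under the lock. */
static void *add_units(void *arg)
{
    counter_t *c = arg;
    for (int i = 0; i < c->per_worker; i++) {
        my_mutex_lock(&c->lock);
        c->total++;
        my_mutex_unlock(&c->lock);
    }
    return NULL;
}

/* Run n_workers workers (threads if available) and return the final total. */
static long count_with_workers(int n_workers, int per_worker)
{
    counter_t c = { .total = 0, .per_worker = per_worker };
    assert(n_workers > 0 && n_workers <= MAX_WORKERS);
    my_mutex_init(&c.lock, NULL);
#if HAVE_PTHREAD
    my_thread_t tid[MAX_WORKERS];
    int started = 0;
    for (int i = 0; i < n_workers; i++)
        if (my_thread_create(&tid[started], NULL, add_units, &c) == 0)
            started++;
    for (int i = 0; i < started; i++)
        my_thread_join(tid[i], NULL);
#else
    for (int i = 0; i < n_workers; i++)
        add_units(&c);
#endif
    my_mutex_destroy(&c.lock);
    return c.total;
}
```

Calling count_with_workers(4, 1000) exercises four workers. With the fallback branch the same source still builds, it just runs the workers one after another, which is essentially what the "not compiled with pthread support" warning is telling you about x264.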

 

2. Basic encoding flow of x264

 (1 ) The main function

Start from main(). It is simple: it reads the parameters and then encodes.

int main( int argc, char **argv )
{
    x264_param_t param;
    cli_opt_t opt;
    int ret;
#ifdef PTW32_STATIC_LIB
    pthread_win32_process_attach_np();
    pthread_win32_thread_attach_np();
#endif
#ifdef _WIN32
    _setmode(_fileno(stdin), _O_BINARY);
    _setmode(_fileno(stdout), _O_BINARY);
#endif
    x264_param_default( &param );
    /* Parse command line */
    if( Parse( argc, argv, &param, &opt ) < 0 )
        return -1;
    /* Control-C handler */
    signal( SIGINT, SigIntHandler );
    ret = Encode( &param, &opt );
#ifdef PTW32_STATIC_LIB
    pthread_win32_thread_detach_np();
    pthread_win32_process_detach_np();
#endif
    return ret;
}

Now look closely at the encoding function static int  Encode( x264_param_t *param, cli_opt_t *opt ). It uses the x264 API, shown below:

/* x264_encoder_open:
 *      create a new encoder handler, all parameters from x264_param_t are copied */
x264_t *x264_encoder_open   ( x264_param_t * );
/* x264_encoder_reconfig:
 *      change encoder options while encoding,
 *      analysis-related parameters from x264_param_t are copied */
int     x264_encoder_reconfig( x264_t *, x264_param_t * );
/* x264_encoder_headers:
 *      return the SPS and PPS that will be used for the whole stream */
int     x264_encoder_headers( x264_t *, x264_nal_t **, int * );
/* x264_encoder_encode:
 *      encode one picture */
int     x264_encoder_encode ( x264_t *, x264_nal_t **, int *, x264_picture_t *, x264_picture_t * );
/* x264_encoder_close:
 *      close an encoder handler */
void    x264_encoder_close  ( x264_t * );


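First, the code calls x264_encoder_open(param) and x264_picture_alloc(&pic, X264_CSP_I420, param->i_width, param->i_height) to initialize the encoder and allocate memory for the input YUV picture. After that come two code blocks, each introduced by a comment. They do the following: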

 /* Encode frames */

Each input frame is encoded as it arrives.

/* Flush delayed frames */

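Encodes the B-frames left over at the end. A B-frame needs the P-frame that follows it as a reference, so the following P (or I) frame is encoded first. When the input pictures run out, the remaining B-frames still have to be encoded after the last P (or I) frame has been encoded.

The final part of Encode() closes the encoder and releases memory.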
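The need for a separate flush phase is easy to see with a toy model. The sketch below is not the x264 API: toy_encoder_t, toy_encode() and the fixed DELAY of 2 frames are assumptions made for illustration. The toy encoder holds back DELAY frames (as an encoder that reorders B-frames must), so the input loop alone never emits the last frames.

```c
#include <assert.h>

#define DELAY 2   /* assumed number of frames the toy encoder holds back */

typedef struct {
    int queue[DELAY];
    int n_queued;
} toy_encoder_t;

/* Remove and return the oldest queued frame. */
static int pop_oldest(toy_encoder_t *e)
{
    int oldest = e->queue[0];
    for (int i = 1; i < e->n_queued; i++)
        e->queue[i - 1] = e->queue[i];
    e->n_queued--;
    return oldest;
}

/* Feed one frame, or pass has_input = 0 to flush.
 * Returns 1 and stores a frame number in *out if a frame was emitted. */
static int toy_encode(toy_encoder_t *e, int has_input, int frame, int *out)
{
    if (!has_input) {
        if (e->n_queued == 0)
            return 0;                      /* nothing left to flush */
        *out = pop_oldest(e);
        return 1;
    }
    if (e->n_queued < DELAY) {             /* still filling the delay queue */
        e->queue[e->n_queued++] = frame;
        return 0;
    }
    *out = pop_oldest(e);                  /* emit oldest, queue the new one */
    e->queue[e->n_queued++] = frame;
    return 1;
}

/* Returns the total number of emitted frames; *n_during_input receives
 * how many of them came out before the flush loop started. */
static int encode_all(int n_frames, int *n_during_input)
{
    toy_encoder_t e = {0};
    int out, emitted = 0;
    assert(n_frames >= 0);
    /* Encode frames */
    for (int i = 0; i < n_frames; i++)
        emitted += toy_encode(&e, 1, i, &out);
    *n_during_input = emitted;
    /* Flush delayed frames */
    while (toy_encode(&e, 0, 0, &out))
        emitted++;
    return emitted;
}
```

With 10 input frames the first loop emits only 8; the flush loop recovers the remaining 2. Skipping the flush would silently truncate the stream by the encoder's delay.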

(2 ) The frame encoding function Encode_frame()

In both encoding blocks above, the main function is

static int  Encode_frame( x264_t *h, hnd_t hout, x264_picture_t *pic )

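This function takes the YUV data of each frame, encodes it, and packs the result into NAL units. The actual bitstream encoding is delegated to the API

int  x264_encoder_encode( x264_t *h,x264_nal_t **pp_nal, int *pi_nal,x264_picture_t *pic_in, x264_picture_t *pic_out ), which is probably the most important function in x264.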

(3 ) Analysis of x264_encoder_encode()

 It starts by updating the reference frames:

static inline int x264_reference_update( x264_t *h )

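 It keeps the required reference frames in h->frames.reference and, subject to the size limit of the reference queue, removes reference frames that are no longer used.

 Next, go through the code block by block, following its comments: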

  /* ------------------- Setup new frame from picture -------------------- */

  /* 1: Copy the picture to a frame and move it to a buffer */

  Copies the YUV data of the input frame into x264_frame_t *fenc, then performs some initialization for the rate control mode.

   /* 2: Place the frame into the queue for its slice type decision */

  Puts fenc into the slice type decision queue. This is also part of rate control.

  /* 3: The picture is analyzed in the lookahead */

  Analyzes the slice type. The actual decision is made in void x264_slicetype_decide( x264_t *h ).

  This will be covered in more detail in a later analysis of rate control.

  /* ------------------- Get frame to be encoded ------------------------- */

  /* 4: get picture to encode */

  Takes out the frame to be encoded, places it in h->fenc, and resets the encoding parameters.

  /* ------------------- Setup frame context ----------------------------- */

  /* 5: Init data dependent of frame type */

  Sets i_nal_type, i_nal_ref_idc and h->sh.i_type according to the frame type. For an IDR frame, the reference frame queue is reset.

  /* ------------------- Init                ----------------------------- */

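  Builds the reference lists for the current frame. Depending on the type of the frame being encoded, the references are stored in h->fref0 and h->fref1 and put in order: h->fref0 by POC from high to low, h->fref1 the other way round.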

  /* ---------------------- Write the bitstream -------------------------- */

 Writes the NAL bitstream.

  /* Write SPS and PPS */

 Writes the parameter sets.

 /* ------------------------ Create slice header  ----------------------- */

 Initializes the slice header parameters.

 /* Write frame */

 Outputs the slice header and slice data.

 Finally the function calls

 static int x264_encoder_frame_end( x264_t *h, x264_t *thread_current,x264_nal_t **pp_nal, int *pi_nal, x264_picture_t *pic_out )

 to encapsulate the NAL units, update the encoder state, and output the encoding statistics for this frame.
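The ordering of the two reference lists mentioned in the Init step can be sketched in a few lines. This is an illustration, not the x264 implementation: build_ref_lists() and the plain int POC arrays are hypothetical, and the split "list 0 takes references before the current frame, list 1 takes references after it" assumes the B-frame case.

```c
#include <assert.h>
#include <stdlib.h>

/* qsort comparator: larger POC first. */
static int cmp_poc_desc(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x < y) - (x > y);
}

/* qsort comparator: smaller POC first. */
static int cmp_poc_asc(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Split ref_poc[0..n) around cur_poc.
 * list0: references before the current frame, POC high to low.
 * list1: references after the current frame, POC low to high.
 * Both output arrays must have room for n entries. */
static void build_ref_lists(const int *ref_poc, int n, int cur_poc,
                            int *list0, int *n0, int *list1, int *n1)
{
    assert(n >= 0);
    *n0 = *n1 = 0;
    for (int i = 0; i < n; i++) {
        if (ref_poc[i] < cur_poc)
            list0[(*n0)++] = ref_poc[i];
        else if (ref_poc[i] > cur_poc)
            list1[(*n1)++] = ref_poc[i];
    }
    qsort(list0, (size_t)*n0, sizeof(int), cmp_poc_desc);
    qsort(list1, (size_t)*n1, sizeof(int), cmp_poc_asc);
}
```

For references with POC {0, 8, 4, 12, 2} and a current POC of 6, list 0 comes out as 4, 2, 0 and list 1 as 8, 12, so in both lists the temporally nearest frame gets the smallest (cheapest to signal) reference index.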

(4 )static void *x264_slices_write( x264_t *h )

x264_encoder_encode() calls this function to encode the slice header and slice data. Its main job is to split off one slice of the slice group; the actual slice encoding happens in

static int x264_slice_write( x264_t *h )

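The code of this function breaks down as follows:

step1. Initialize the NAL and call x264_slice_header_write() to output the slice header bitstream according to the parameters set earlier.

step2. If CABAC is used, initialize its contexts.

step3. Enter the macroblock loop and encode the macroblocks one by one:

The two key functions in macroblock encoding are: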

        x264_macroblock_analyse( h );

        x264_macroblock_encode( h );

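The code before them loads the macroblock data. The code after them entropy-codes the encoded data, writes it to the bitstream following the slice data syntax, updates coded_block_pattern, handles the rate control state, updates the CABAC context data, and so on. Now that the analysis has reached the macroblock level, let us see how this basic coding unit is processed.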

(5 )x264_macroblock_analyse( h )

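This function analyzes the macroblock to decide its partition mode. Intra prediction for I-frames and motion estimation for P/B-frames happen here, luma first and then chroma. Again, go through the implementation step by step.

step1. Prepare for rate control. x264_mb_analyse_init() initializes the model parameters (the decision is still based on Lagrangian rate-distortion optimization, so the lambda coefficients are initialized), sets the cost of every macroblock type to COST_MAX, computes the MV range, and makes the fast Intra macroblock decision.

step2. Depending on h->sh.i_type (I, P, B), compute the rate-distortion cost of each macroblock mode. The cost is computed with SATD, which gives a rough estimate of the size of the coded bitstream and serves as the basis for choosing the macroblock mode.

Take the case h->mb.i_type == I_8x8 as an example: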

            if( h->mb.b_lossless )

                x264_predict_lossless_8x8( h, p_dst, i, i_mode, edge );

            else

                h->predict_8x8[i_mode]( p_dst, edge );

            x264_mb_encode_i8x8( h, i, i_qp );

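predict_8x8[i_mode]( p_dst, edge ) performs intra prediction. x264_mb_encode_i8x8( h, i, i_qp ) performs the DCT and quantization, followed by dequantization and the inverse DCT, so that the result can be used to reconstruct the picture.

For I8x8 and I4x4, 3 or 15 blocks respectively are usually predicted and encoded here, and the last block is left to be predicted and encoded in x264_macroblock_encode( h ). The reason is that earlier blocks serve as the prediction basis for later ones. Specifically, they change the value computed by i_pred_mode = x264_mb_predict_intra4x4_mode( h, 4*idx ).

Inter prediction for P/B-frames happens in the code that follows. The motion estimation algorithms are not described here; an analysis of x264 motion estimation will be added later.

step3. Perform some follow-up computation depending on i_mbrd.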
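Since step2 leans on SATD as its cost measure, a small worked version may help. The following is a plain C sketch of a 4x4 SATD (Hadamard transform of the residual, then a sum of absolute values). satd_4x4() is a hypothetical name, not the optimized x264 routine, and the final halving is a normalization assumed here.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* SATD of one 4x4 block: transform (src - pred) with a 4x4 Hadamard
 * transform and sum the absolute values of the coefficients. */
static int satd_4x4(const uint8_t *src, int src_stride,
                    const uint8_t *pred, int pred_stride)
{
    int d[4][4], t[4][4];
    assert(src && pred);

    /* residual */
    for (int y = 0; y < 4; y++)
        for (int x = 0; x < 4; x++)
            d[y][x] = src[y * src_stride + x] - pred[y * pred_stride + x];

    /* 1-D Hadamard along each row */
    for (int y = 0; y < 4; y++) {
        int s01 = d[y][0] + d[y][1], d01 = d[y][0] - d[y][1];
        int s23 = d[y][2] + d[y][3], d23 = d[y][2] - d[y][3];
        t[y][0] = s01 + s23;
        t[y][1] = s01 - s23;
        t[y][2] = d01 - d23;
        t[y][3] = d01 + d23;
    }

    /* 1-D Hadamard along each column, accumulating absolute values */
    int sum = 0;
    for (int x = 0; x < 4; x++) {
        int s01 = t[0][x] + t[1][x], d01 = t[0][x] - t[1][x];
        int s23 = t[2][x] + t[3][x], d23 = t[2][x] - t[3][x];
        sum += abs(s01 + s23) + abs(s01 - s23)
             + abs(d01 - d23) + abs(d01 + d23);
    }
    return sum / 2;   /* assumed normalization */
}
```

A residual that is constant across the block collapses into a single DC coefficient and scores low, while a single-pixel spike of the same total energy spreads over all 16 coefficients and scores much higher. That is why SATD tracks the eventual bit cost better than a plain sum of absolute differences.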

( 6 ) x264_macroblock_encode( h )

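Once the partition mode is decided, this function predicts and encodes the remaining macroblock partitions for I-frames; for P/B-frames, motion compensation and residual coding mainly happen here.

This concludes the analysis of the basic flow. In the code, macroblock prediction and encoding are spread across different functions because of the requirements of rate-distortion optimization (for P/B-frames). Reference frame management, rate control, inter prediction and multithreaded encoding are all interesting topics to explore further in x264.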
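The Lagrangian decision that drives all of this fits in one line: cost = distortion + lambda * bits, and the candidate with the lowest cost wins. The sketch below shows only that selection step. mode_cand_t, pick_best_mode(), the COST_MAX value and the candidate numbers used afterwards are made up for illustration and are not taken from x264.

```c
#include <assert.h>

#define COST_MAX (1 << 28)   /* assumed "worse than anything" start value */

typedef struct {
    const char *name;
    int distortion;          /* e.g. a SATD value */
    int bits;                /* estimated bits needed for this mode */
} mode_cand_t;

/* Return the index of the candidate with the lowest distortion + lambda*bits,
 * or -1 if there are no candidates. The winning cost goes to *best_cost. */
static int pick_best_mode(const mode_cand_t *cand, int n, int lambda,
                          int *best_cost)
{
    int best = -1, cost_min = COST_MAX;
    assert(n >= 0 && lambda >= 0);
    for (int i = 0; i < n; i++) {
        int cost = cand[i].distortion + lambda * cand[i].bits;
        if (cost < cost_min) {
            cost_min = cost;
            best = i;
        }
    }
    if (best_cost)
        *best_cost = cost_min;
    return best;
}
```

With hypothetical candidates I16x16 (distortion 1000, 20 bits), I8x8 (700, 60 bits) and I4x4 (500, 120 bits), lambda = 1 picks I4x4 at cost 620, while lambda = 10 picks I16x16 at cost 1200. A larger lambda, which goes with a higher QP, shifts the choice toward modes that are cheap to signal.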

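3. Analysis of the multithreading code

(1) Reading the documentation

With the basic architecture of x264 covered, look at where multithreading comes into play. The multithreading document shipped with x264 is required reading for this topic; it is in the doc folder of the x264 source. Its gist is that the current x264 threading model has abandoned slice-based parallel encoding in favor of frame-level and macroblock-level parallelism, because slice-based parallelism has to split each frame into several slices, which adds redundancy and lowers coding efficiency. The original text follows: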

Old threading method: slice-based
application calls x264
x264 runs B-adapt and ratecontrol (serial)
split frame into several slices, and spawn a thread for each slice
wait until all threads are done
deblock and hpel filter (serial)
return to application
In x264cli, there is one additional thread to decode the input.

New threading method: frame-based
application calls x264
x264 runs B-adapt and ratecontrol (serial to the application, but parallel to the other x264 threads)
spawn a thread for this frame
thread runs encode in 1 slice, deblock, hpel filter
meanwhile x264 waits for the oldest thread to finish
return to application, but the rest of the threads continue running in the background
No additional threads are needed to decode the input, unless decoding+B-adapt is slower than slice+deblock+hpel, in which case an additional input thread would allow decoding in parallel to B-adapt.


Penalties for slice-based threading:
Each slice adds some bitrate (or equivalently reduces quality), for a variety of reasons: the slice header costs some bits, cabac contexts are reset, mvs and intra samples can't be predicted across the slice boundary.
In CBR mode, we have to allocate bits between slices before encoding them, which may lead to uneven quality.
Some parts of the encoder are serial, so it doesn't scale well with lots of cpus.

Penalties for frame-base threading:
To allow encoding of multiple frames in parallel, we have to ensure that any given macroblock uses motion vectors only from pieces of the reference frames that have been encoded already. This is usually not noticeable, but can matter for very fast upward motion.
We have to commit to one frame type before starting on the frame. Thus scenecut detection must run during the lowres pre-motion-estimation along with B-adapt, which makes it faster but less accurate than re-encoding the whole frame.
Ratecontrol gets delayed feedback, since it has to plan frame N before frame N-1 finishes.

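In practical terms, B-frames that are not kept as reference frames have no later frame depending on them, so they are especially well suited to being encoded in parallel.

(2) Runtime behavior

First look at where x264_pthread_create is called; only these places actually create threads.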

 x264_pthread_create( &h->thread_handle, NULL, (void*)x264_slices_write, h )

 x264_pthread_create( &look_h->thread_handle, NULL, (void *)x264_lookahead_thread, look_h )

 x264_pthread_create( &h->tid, NULL, (void*)read_frame_thread_int, h->next_args )

 [Figure: "x264 multithreading analysis (quoted)" - fellowher - fellowher's blog]

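 The figure above shows that with --threads 4, x264_slices_write() runs in 4 threads that encode simultaneously, alongside the main thread and one x264_lookahead_thread() thread. The x264_slices_write() threads run at low priority because they call

     if( h->param.i_sync_lookahead )
        x264_lower_thread_priority( 10 );

to lower the priority of the current thread. read_frame_thread_int() reads the stream data from disk. Because disk I/O and memory access run at very different speeds, the reading is handled in a separate thread.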

 

The following code can be found in x264_encoder_open(). It shows that each x264_slices_write() thread and the x264_lookahead_thread() thread is given its own dedicated context, used by that thread alone.

    for( i = 1; i < h->param.i_threads + !!h->param.i_sync_lookahead; i++ )
        CHECKED_MALLOC( h->thread[i], sizeof(x264_t) );

 

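(3) How is the number of encoding threads kept at the specified value?

Experiments with debug prints show that, with --threads 4, the code starts 4 x264_slices_write() threads at the same time. Then, each time a frame is finished (after an earlier thread returns), a new thread is created, so that the number of x264_slices_write() threads stays at 4. The relevant code is: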

 

int     x264_encoder_encode( x264_t *h,x264_nal_t **pp_nal, int *pi_nal,x264_picture_t *pic_in,
                             x264_picture_t *pic_out )
{
...
    if( h->param.i_threads > 1)
    {
        int i = ++h->i_thread_phase;
        int t = h->param.i_threads;
        thread_current = h->thread[ i%t ];
        thread_prev    = h->thread[ (i-1)%t ];
        thread_oldest  = h->thread[ (i+1)%t ];
        x264_thread_sync_context( thread_current, thread_prev );
        x264_thread_sync_ratecontrol( thread_current, thread_prev, thread_oldest );
        h = thread_current;
    }
...
    /* Write frame */
    if( h->param.i_threads > 1 )
    {
        printf("x264_pthread_create\n");
        if( x264_pthread_create( &h->thread_handle, NULL, (void*)x264_slices_write, h ) )
            return -1;
        h->b_thread_active = 1;
    }
    else
        if( (intptr_t)x264_slices_write( h ) )
            return -1;
    return x264_encoder_frame_end( thread_oldest, thread_current, pp_nal, pi_nal, pic_out );
...
}

 

static int x264_encoder_frame_end( x264_t *h, x264_t *thread_current,x264_nal_t **pp_nal, int *pi_nal, x264_picture_t *pic_out )
{
...
    if( h->b_thread_active )
    {
        void *ret = NULL;
        x264_pthread_join( h->thread_handle, &ret );
        if( (intptr_t)ret )
            return (intptr_t)ret;
        h->b_thread_active = 0;
    }
...
}

 

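The two excerpts above show that the h context never holds more than 4 threads. On each call from the main thread, x264_pthread_create() creates an x264_slices_write thread; thread_oldest is then selected and synced by the rate control function. While fewer than 4 threads exist, thread_oldest points to a context whose thread has not been started yet, so its h->b_thread_active is 0, the join code in x264_encoder_frame_end() is skipped, and the main thread goes on creating x264_slices_write threads. Once there are 4 threads, thread_oldest points to the one of the four that is expected to return first. Its h->b_thread_active is 1, so x264_pthread_join() is entered, which blocks the main thread until thread_oldest finishes. Only then can a new thread be created. This mechanism keeps the number of encoding threads at the specified value.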
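The rotation described above (start a thread in slot i%t, then join slot (i+1)%t if it is busy) can be reproduced in a self-contained sketch. Everything here is a simplified stand-in: slot_t, ring_submit() and ring_flush() are hypothetical names, and the worker just doubles a number instead of encoding a frame. Only the index arithmetic follows the excerpt.

```c
#include <assert.h>
#include <pthread.h>

#define N_THREADS 4              /* plays the role of --threads 4 */

typedef struct {
    pthread_t handle;
    int       active;            /* same idea as b_thread_active */
    int       frame_no;          /* input handed to the worker */
    int       result;            /* output produced by the worker */
} slot_t;

typedef struct {
    slot_t slot[N_THREADS];
    int    phase;                /* same idea as i_thread_phase */
} ring_t;

/* Stand-in for the slice-writing thread: "encodes" by doubling. */
static void *worker(void *arg)
{
    slot_t *s = arg;
    s->result = s->frame_no * 2;
    return NULL;
}

/* Start a worker for frame_no, then wait for the oldest slot if it is busy.
 * Returns the oldest slot's result, -1 if that slot had nothing running,
 * or -2 if the thread could not be created. */
static int ring_submit(ring_t *r, int frame_no)
{
    int i = ++r->phase;
    slot_t *cur    = &r->slot[i % N_THREADS];
    slot_t *oldest = &r->slot[(i + 1) % N_THREADS];

    assert(!cur->active);        /* it was joined on the previous call */
    cur->frame_no = frame_no;
    if (pthread_create(&cur->handle, NULL, worker, cur))
        return -2;
    cur->active = 1;

    if (!oldest->active)
        return -1;               /* pipeline still filling up */
    pthread_join(oldest->handle, NULL);
    oldest->active = 0;
    return oldest->result;
}

/* Join whatever is still running; returns how many workers were joined. */
static int ring_flush(ring_t *r)
{
    int joined = 0;
    for (int k = 0; k < N_THREADS; k++) {
        slot_t *s = &r->slot[k];
        if (s->active) {
            pthread_join(s->handle, NULL);
            s->active = 0;
            joined++;
        }
    }
    return joined;
}
```

Starting from a zero-initialized ring_t, the first three calls to ring_submit() return -1 because the slot they would join has never been started. The fourth call returns the result for the first frame, so the caller always runs N_THREADS - 1 frames behind, and ring_flush() collects the rest at the end.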

 

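(4) The role of the x264_lookahead_thread() thread

Before analyzing this thread, look at two important thread control functions:

//Wakes all threads waiting on the condition variable. Does nothing if no thread is waiting.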

#define x264_pthread_cond_broadcast  pthread_cond_broadcast

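//Atomically unlocks the mutex (as if by pthread_mutex_unlock) and waits for the condition variable to be signaled.
//The thread is suspended and uses no CPU time until the condition variable is signaled.
//The application must lock the mutex before calling pthread_cond_wait.
//Before it returns, pthread_cond_wait locks the mutex again (as if by pthread_mutex_lock).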

#define x264_pthread_cond_wait       pthread_cond_wait

 

The following code is where x264_lookahead_thread most often blocks in x264:

**************************Code segment A********************************************

        if( h->lookahead->next.i_size <= h->lookahead->i_slicetype_length )
        {
            while( !h->lookahead->ifbuf.i_size && !h->lookahead->b_exit_thread )
                x264_pthread_cond_wait( &h->lookahead->ifbuf.cv_fill, &h->lookahead->ifbuf.mutex );
            x264_pthread_mutex_unlock( &h->lookahead->ifbuf.mutex );
        }
        else
        {
            x264_pthread_mutex_unlock( &h->lookahead->ifbuf.mutex );
            x264_lookahead_slicetype_decide( h );
        }

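The thread keeps waiting as long as the condition !h->lookahead->ifbuf.i_size && !h->lookahead->b_exit_thread holds. The second term is TRUE during normal encoding, since the thread does not exit for no reason, so what is actually being waited for is ifbuf.i_size becoming nonzero. Looking through the related code:

The ifbuf.i_size condition is satisfied in x264_synch_frame_list_push(), which signals the condition variable after a new input frame has been received.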

    slist->list[ slist->i_size++ ] = frame;
    x264_pthread_cond_broadcast( &slist->cv_fill );

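In code segment A, in the condition if( h->lookahead->next.i_size <= h->lookahead->i_slicetype_length ), i_slicetype_length is the number of frames buffered for the slice type decision. Its value depends on h->frames.i_delay and is determined by the initial settings in the code (40 by default). In other words, 40 frames are buffered in advance for deciding slice types. The implementation of the slice type decision is not analyzed in detail here; the general idea is to choose the frame type by weighing bitrate, GOP structure and distortion. In scenarios such as real-time communication, B-frames are not allowed and that many frames cannot be buffered, so this processing is pointless there.

 

Coming back to the purpose of this code: it blocks the thread while waiting for further input frames, then applies its rules to decide their slice types and so prepares frames for slice encoding.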
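The wait/broadcast handshake between the thread that pushes input frames and the lookahead thread is a classic producer/consumer pattern. The sketch below reproduces it with plain pthreads. sync_list_t, sync_list_push(), lookahead_worker() and run_lookahead_demo() are hypothetical names, the "frames" are just integers, and the fixed capacity replaces whatever back-pressure the real list applies.

```c
#include <assert.h>
#include <pthread.h>

#define LIST_CAP 16

typedef struct {
    int             list[LIST_CAP];
    int             i_size;
    int             b_exit;
    long            consumed_sum;   /* what the consumer has processed */
    pthread_mutex_t mutex;
    pthread_cond_t  cv_fill;
} sync_list_t;

/* Producer side: append a frame and wake the waiting consumer.
 * Returns 1 on success, 0 if the list is full. */
static int sync_list_push(sync_list_t *s, int frame)
{
    int ok = 0;
    pthread_mutex_lock(&s->mutex);
    if (s->i_size < LIST_CAP) {
        s->list[s->i_size++] = frame;
        pthread_cond_broadcast(&s->cv_fill);
        ok = 1;
    }
    pthread_mutex_unlock(&s->mutex);
    return ok;
}

/* Consumer side: sleep while the list is empty and no exit was requested. */
static void *lookahead_worker(void *arg)
{
    sync_list_t *s = arg;
    for (;;) {
        pthread_mutex_lock(&s->mutex);
        while (!s->i_size && !s->b_exit)
            pthread_cond_wait(&s->cv_fill, &s->mutex);
        if (!s->i_size) {               /* empty and asked to exit */
            pthread_mutex_unlock(&s->mutex);
            return NULL;
        }
        s->consumed_sum += s->list[--s->i_size];
        pthread_mutex_unlock(&s->mutex);
    }
}

/* Push frames 1..n, stop the worker, and return the sum it consumed
 * (-1 if the thread could not be created). */
static long run_lookahead_demo(int n)
{
    sync_list_t s = { .i_size = 0 };
    pthread_t tid;
    assert(n >= 0 && n <= LIST_CAP);
    pthread_mutex_init(&s.mutex, NULL);
    pthread_cond_init(&s.cv_fill, NULL);
    if (pthread_create(&tid, NULL, lookahead_worker, &s))
        return -1;
    for (int f = 1; f <= n; f++)
        sync_list_push(&s, f);
    pthread_mutex_lock(&s.mutex);
    s.b_exit = 1;                       /* request exit and wake the worker */
    pthread_cond_broadcast(&s.cv_fill);
    pthread_mutex_unlock(&s.mutex);
    pthread_join(tid, NULL);
    pthread_mutex_destroy(&s.mutex);
    pthread_cond_destroy(&s.cv_fill);
    return s.consumed_sum;
}
```

Note the while loop around pthread_cond_wait(): the predicate is rechecked after every wakeup, which is what makes the exit flag and spurious wakeups safe. The worker also drains the list before honoring b_exit, so no pushed frame is lost.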

 

(5) Macroblock-level parallelism

The x264_frame_t structure has a member x264_pthread_cond_t  cv;. Blocking on it and waking it up are wrapped in the following two functions:

void          x264_frame_cond_broadcast( x264_frame_t *frame, int i_lines_completed );

void          x264_frame_cond_wait( x264_frame_t *frame, int i_lines_completed );

Look at where they are called:

************Code B****************from x264_macroblock_analyse( )->x264_mb_analyse_init()

int thresh = pix_y + h->param.analyse.i_mv_range_thread;
for( i = (h->sh.i_type == SLICE_TYPE_B); i >= 0; i-- )
{
    x264_frame_t **fref = i ? h->fref1 : h->fref0;
    int i_ref = i ? h->i_ref1 : h->i_ref0;
    for( j=0; j<i_ref; j++ )
    {
        x264_frame_cond_wait( fref[j], thresh );
        thread_mvy_range = X264_MIN( thread_mvy_range, fref[j]->i_lines_completed - pix_y );
    }
}

**************************Code C************************************from x264_fdec_filter_row()

if( h->param.i_threads > 1 && h->fdec->b_kept_as_ref )
{
    x264_frame_cond_broadcast( h->fdec, mb_y*16 + (b_end ? 10000 : -(X264_THREAD_HEIGHT <<h->sh.b_mbaff)) );
}

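As the code above shows, each time a row of the picture has been encoded, the value mb_y*16 - X264_THREAD_HEIGHT is used to try to wake up threads blocked in x264_pthread_cond_wait( &frame->cv, &frame->mutex ). A waiting thread stays blocked as long as

mb_y*16 - X264_THREAD_HEIGHT < thresh, where thresh = pix_y + h->param.analyse.i_mv_range_thread;

The latter is a chosen threshold. It ensures that, when a later frame that depends on this frame is being encoded, this frame has already had a certain number of macroblock rows encoded to serve as the basis for the later frame. The situation can be pictured as in the figure below, except that x264 works in units of complete rows.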

[Figure: "x264 multithreading analysis (quoted)" - fellowher - fellowher's blog]
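The row-level handshake between a reference frame that is still being encoded and a frame that depends on it can be sketched as follows. The two helpers are written the way the wrappers are described above (publish a "lines completed" count, wait until it reaches a threshold), but their bodies, the frame_t type, the constants and run_row_sync_demo() are assumptions for illustration, not x264 source.

```c
#include <assert.h>
#include <pthread.h>

#define FRAME_HEIGHT 64
#define ROW_HEIGHT   16      /* one macroblock row */
#define MV_RANGE     16      /* stand-in for i_mv_range_thread */
#define ALL_DONE     10000   /* sentinel published after the last row */

typedef struct {
    pthread_mutex_t mutex;
    pthread_cond_t  cv;
    int             i_lines_completed;
} frame_t;

/* Publish progress and wake every thread waiting on this frame. */
static void frame_cond_broadcast(frame_t *f, int lines)
{
    pthread_mutex_lock(&f->mutex);
    f->i_lines_completed = lines;
    pthread_cond_broadcast(&f->cv);
    pthread_mutex_unlock(&f->mutex);
}

/* Block until at least `lines` are done; return the count actually seen. */
static int frame_cond_wait(frame_t *f, int lines)
{
    pthread_mutex_lock(&f->mutex);
    while (f->i_lines_completed < lines)
        pthread_cond_wait(&f->cv, &f->mutex);
    int seen = f->i_lines_completed;
    pthread_mutex_unlock(&f->mutex);
    return seen;
}

/* Reference frame thread: "encodes" row by row and reports progress. */
static void *encode_reference(void *arg)
{
    frame_t *f = arg;
    for (int y = 0; y < FRAME_HEIGHT; y += ROW_HEIGHT) {
        int done = y + ROW_HEIGHT;
        frame_cond_broadcast(f, done >= FRAME_HEIGHT ? ALL_DONE : done);
    }
    return NULL;
}

/* Dependent frame: before each row, wait until the reference has enough
 * lines available. Returns how often that guarantee was violated
 * (-1 if the thread could not be created). */
static int run_row_sync_demo(void)
{
    frame_t ref = { .i_lines_completed = 0 };
    pthread_t tid;
    int violations = 0;
    pthread_mutex_init(&ref.mutex, NULL);
    pthread_cond_init(&ref.cv, NULL);
    if (pthread_create(&tid, NULL, encode_reference, &ref))
        return -1;
    for (int pix_y = 0; pix_y < FRAME_HEIGHT; pix_y += ROW_HEIGHT) {
        int thresh = pix_y + MV_RANGE;
        int seen = frame_cond_wait(&ref, thresh);
        assert(seen >= thresh);
        if (seen < thresh)
            violations++;
    }
    pthread_join(tid, NULL);
    pthread_mutex_destroy(&ref.mutex);
    pthread_cond_destroy(&ref.cv);
    return violations;
}
```

The dependent frame never reads a reference row that is less than MV_RANGE lines ahead of its own position. The price is the one the threading document names: vertical motion vectors are limited to the part of the reference that is already finished.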

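This is where the analysis ends for now. Further work on inter-frame multithreading and macroblock-level parallel optimization, or trimming the code for a particular application, can start by modifying the code discussed in (4) and (5) above. In the current x264 tests (on a quad-core CPU), the existing code does make good use of multiple cores. Parallel encoding will remain a topic to explore as hardware evolves.

(Reference: http://blog.csdn.net/fanner01/article/details/6325226)

(Please credit the source when reposting.)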