How constant buffer padding works in DX11

Source: Internet | Editor: 程序博客网 | Time: 2024/06/03 00:13

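I wrote some DX11 code today and spent a long time tracking down one bug, so here is a summary.
Constant buffers are a feature introduced with Shader Model 4.0. Their main purpose is to group scattered variables together so that data is transferred more efficiently.
Constant buffers need padding. The idea is similar to struct padding on the CPU, except that a constant buffer is packed against 128-bit (16-byte) boundaries. For example, consider the following layouts: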

    //  2 x 16byte elements
    cbuffer IE
    {
        float4 Val1;
        float2 Val2;  // starts a new vector
        float2 Val3;
    };

    //  3 x 16byte elements
    cbuffer IE
    {
        float2 Val1;
        float4 Val2;  // starts a new vector
        float2 Val3;  // starts a new vector
    };

    //  1 x 16byte elements
    cbuffer IE
    {
        float1 Val1;
        float1 Val2;
        float2 Val3;
    };

    //  1 x 16byte elements
    cbuffer IE
    {
        float1 Val1;
        float2 Val2;
        float1 Val3;
    };

    //  2 x 16byte elements
    cbuffer IE
    {
        float1 Val1;
        float1 Val2;
        float1 Val3;
        float2 Val4;  // starts a new vector
    };

    //  1 x 16byte elements
    cbuffer IE
    {
        float3 Val1;
        float1 Val2;
    };

    //  1 x 16byte elements
    cbuffer IE
    {
        float1 Val1;
        float3 Val2;
    };

    //  2 x 16byte elements
    cbuffer IE
    {
        float1 Val1;
        float1 Val2;
        float3 Val3;  // starts a new vector
    };

    //  3 x 16byte elements
    cbuffer IE
    {
        float1 Val1;
        struct
        {
            float4 SVal1;  // starts a new vector
            float1 SVal2;  // starts a new vector
        } Val2;
    };

    //  3 x 16byte elements
    cbuffer IE
    {
        float1 Val1;
        struct
        {
            float1 SVal1;  // starts a new vector
            float4 SVal2;  // starts a new vector
        } Val2;
    };

    //  3 x 16byte elements
    cbuffer IE
    {
        struct
        {
            float4 SVal1;
            float1 SVal2;  // starts a new vector
        } Val1;
        float1 Val2;  // starts a new vector
    };

In short, each variable is packed as tightly as possible against its neighbor, but no variable may straddle a float4 (16-byte) boundary.
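The rule above can be sketched as a small C++ routine. This is only an illustration under stated assumptions: `countRegisters` is a hypothetical helper, not part of any Direct3D API, and it models only float scalars and vectors (1 to 4 components), not arrays or structs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: given the component counts (1..4) of consecutive
// float-based cbuffer members, return how many 16-byte registers they occupy.
// Arrays and structs are NOT modeled here.
inline std::size_t countRegisters(const std::vector<int>& componentCounts) {
    std::size_t offset = 0;  // running offset, measured in 4-byte components
    for (int n : componentCounts) {
        assert(n >= 1 && n <= 4);
        const std::size_t usedInRegister = offset % 4;
        // A member may not straddle a float4 boundary: if it does not fit in
        // the remainder of the current register, skip to the next register.
        if (usedInRegister + static_cast<std::size_t>(n) > 4) {
            offset += 4 - usedInRegister;
        }
        offset += static_cast<std::size_t>(n);
    }
    return (offset + 3) / 4;  // round up to whole registers
}
```

For instance, `countRegisters({4, 2, 2})` corresponds to the first layout above (float4, float2, float2) and `countRegisters({2, 4, 2})` to the second.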
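The padding above is easy to understand and leaves little room for mistakes, because it is basically similar to padding on the CPU. The next case, however, is the one that bit me...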

    cbuffer ConstantBuffer : register(b0)
    {
        float4 lightPos[8];
        float  lightDensity[8];
    };

    struct VS_Out
    {
        float4 pos : SV_POSITION;
        float  density : DENSITY;
    };

    VS_Out main()
    {
        VS_Out oData;
        oData.density = 0.0f;  // initialize before accumulating
        float4 p = float4( 0.0f , 0.0f , 0.0f , 0.0f );
        for( int i = 0 ; i < 8 ; i++ )
        {
            float4 delta = lightPos[i] - p;
            oData.density += length(delta) * lightDensity[i];
        }
        oData.pos = p;
        return oData;
    }

The code above has no real meaning; it only serves to make the point.
Suppose the CPU side defines what looks like the same constant buffer structure:

    struct ConstantBuffer
    {
        float lightPos[32];
        float lightDensity[8];
    };

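On the surface the two structs look basically the same, but the different padding rules can leave the final data completely scrambled. In Nsight the data does show up as uploaded to the GPU, so the problem is not easy to spot.
The CPU struct is simple: floats are 32-bit aligned with no padding in between, so its size is (32 + 8) * 4 = 160 bytes.
The GPU layout is completely different. Its size is 8*4*4 + (7*4+1)*4 = 128 + 116 = 244 bytes: each of the first seven lightDensity elements occupies a full 16-byte register, and only the last one takes 4 bytes. Rounded up to whole registers this is 256 bytes, which matches the cb0[16] declaration in the assembly.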
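The arithmetic can be checked with a few lines of C++. This is a sketch, not an official formula: `hlslArrayBytes`, `roundUpToRegister` and `hlslArrayElementOffset` are names made up for illustration, and the helpers assume array elements no larger than one 16-byte register.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kRegisterBytes = 16;  // one float4 register

// Bytes occupied by a cbuffer array whose elements fit in one register:
// every element starts on its own register, but the last element only
// takes its own size.
constexpr std::size_t hlslArrayBytes(std::size_t elementBytes, std::size_t count) {
    return count == 0 ? 0 : (count - 1) * kRegisterBytes + elementBytes;
}

// Round a byte count up to a whole number of registers.
constexpr std::size_t roundUpToRegister(std::size_t bytes) {
    return (bytes + kRegisterBytes - 1) / kRegisterBytes * kRegisterBytes;
}

// Byte offset of element `index` relative to the start of such an array.
inline std::size_t hlslArrayElementOffset(std::size_t index, std::size_t count) {
    assert(index < count);
    return index * kRegisterBytes;
}

constexpr std::size_t kLightPosBytes     = hlslArrayBytes(16, 8);  // float4[8]
constexpr std::size_t kLightDensityBytes = hlslArrayBytes(4, 8);   // float[8]
constexpr std::size_t kGpuBytes = kLightPosBytes + kLightDensityBytes;
constexpr std::size_t kCpuBytes = (32 + 8) * 4;  // tightly packed 4-byte floats

static_assert(kLightPosBytes == 128, "float4[8] fills 8 registers");
static_assert(kLightDensityBytes == 116, "7 full registers plus one float");
static_assert(kGpuBytes == 244, "matches 128 + 116");
static_assert(roundUpToRegister(kGpuBytes) == 256, "16 registers, as in cb0[16]");
static_assert(kCpuBytes == 160, "CPU struct size from the text");
```

The gap between 160 and 244 (or 256) bytes is the whole problem: the two sides disagree about where each lightDensity element lives.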
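On the GPU, every element of a float array is expanded to occupy a full float4 register. The loop can then index registers directly, which reduces the ALU work inside the loop and improves efficiency.
The compiled assembly shows this clearly.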

    vs_5_0
    dcl_globalFlags refactoringAllowed
    dcl_constantbuffer cb0[16], dynamicIndexed
    dcl_output_siv o0.xyzw, position
    dcl_output o1.x
    dcl_temps 1
    mov r0.xy, l(0,0,0,0)
    loop
      ige r0.z, r0.y, l(8)
      breakc_nz r0.z
      dp4 r0.z, cb0[r0.y + 0].xyzw, cb0[r0.y + 0].xyzw
      sqrt r0.z, r0.z
      mad r0.x, r0.z, cb0[r0.y + 8].x, r0.x
      iadd r0.y, r0.y, l(1)
    endloop
    mov o1.x, r0.x
    mov o0.xyzw, l(0,0,0,0)
    ret
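To see what the `cb0[r0.y + 8].x` read means for the tightly packed CPU struct, the following C++ sketch maps each shader read back to a CPU float index. The function names are hypothetical and the code assumes 4-byte floats; it only illustrates the offset mismatch and does not touch any Direct3D API.

```cpp
#include <cassert>
#include <cstddef>

static_assert(sizeof(float) == 4, "sketch assumes 4-byte floats");

constexpr std::size_t kRegisterBytes = 16;      // one float4 register
constexpr std::size_t kCpuFloatCount = 32 + 8;  // floats in the packed CPU struct

// Byte offset the shader reads for lightDensity[i]: register cb0[i + 8], .x
inline std::size_t gpuDensityByteOffset(std::size_t i) {
    assert(i < 8);
    return (i + 8) * kRegisterBytes;
}

// Byte offset where the tightly packed CPU struct stores lightDensity[i].
inline std::size_t cpuDensityByteOffset(std::size_t i) {
    assert(i < 8);
    return (32 + i) * sizeof(float);
}

// Index of the CPU float that the shader actually sees as lightDensity[i],
// or -1 when the read lands past the end of the 160-byte CPU struct.
inline long cpuFloatSeenByShader(std::size_t i) {
    const std::size_t floatIndex = gpuDensityByteOffset(i) / sizeof(float);
    return floatIndex < kCpuFloatCount ? static_cast<long>(floatIndex) : -1L;
}
```

Under these assumptions only element 0 lines up (both sides use byte 128). Element 1 is read from byte 144, which is where the CPU struct keeps lightDensity[4], and from element 2 onward the shader reads beyond the 160 bytes the CPU struct contains.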

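If you are not careful here, the CPU data ends up mapped to the wrong shader variables and the shader computes wrong results, which is relatively hard to track down.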
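One possible way to avoid the mismatch is to mirror the GPU layout explicitly on the CPU side. The sketch below is an assumption about how that could look, not the author's actual fix: `Float4`, `PaddedFloat`, `ConstantBufferCPU` and `setLightDensity` are illustrative names, and the compile-time checks guard the offsets the shader expects.

```cpp
#include <cassert>
#include <cstddef>

static_assert(sizeof(float) == 4, "sketch assumes 4-byte floats");

// Illustrative stand-in for a 4-component vector type.
struct Float4 { float x, y, z, w; };

// One element of a float array, padded out to a full 16-byte register.
struct PaddedFloat {
    float value;   // lands in the .x component
    float pad[3];  // unused .yzw components
};

// Hypothetical CPU-side mirror of the cbuffer from the shader above.
struct ConstantBufferCPU {
    Float4      lightPos[8];      // registers cb0[0..7]
    PaddedFloat lightDensity[8];  // registers cb0[8..15]
};

static_assert(sizeof(PaddedFloat) == 16, "one register per array element");
static_assert(offsetof(ConstantBufferCPU, lightDensity) == 128,
              "lightDensity must start at register 8");
static_assert(sizeof(ConstantBufferCPU) == 256,
              "16 registers, a multiple of 16 bytes");

// Write one density value into the slot the shader reads as lightDensity[i].
inline void setLightDensity(ConstantBufferCPU& cb, std::size_t i, float density) {
    assert(i < 8);
    cb.lightDensity[i].value = density;
}
```

The struct is 256 bytes, so it also satisfies the Direct3D 11 requirement that a constant buffer's size be a multiple of 16 bytes. The static_assert lines turn any accidental layout change into a compile error instead of a silent rendering bug.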