In general, C is a lousy language for expressing this kind of parallelism on the SPU. The original loop that ‘inspired’ this nonsense looks something like:
    for (j = 0; j < num_indexes; j += 3) {
        const float *v0, *v1, *v2;
        v0 = (const float *) (vertices + indexes[j+0] * vertex_size);
        v1 = (const float *) (vertices + indexes[j+1] * vertex_size);
        v2 = (const float *) (vertices + indexes[j+2] * vertex_size);
        func(v0, v1, v2);
    }
which is quite clear and straightforward to read, but hides some complexity: the lack of quadword alignment, the address calculation being expressed as three separate multiply-adds, and the separation into three (unpacked) variables which are then repacked inside func().
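To make that hidden arithmetic a little more visible, here is a minimal sketch in portable C (not SPU intrinsics) that computes the three byte offsets of one triangle as a single packed step and checks how far each lands from a 16-byte boundary. The names tri_offsets, triangle_offsets and quadword_misalignment are hypothetical, and the 16-bit index type and byte-valued vertex_size are assumptions, since the original loop does not show its declarations.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container: the three byte offsets of one triangle, kept
 * together the way a single SIMD register would hold them. */
typedef struct {
    uint32_t off[3];
} tri_offsets;

/* Compute indexes[j+k] * vertex_size for k = 0..2 in one place.
 * Assumes 16-bit indexes and a vertex_size measured in bytes. */
static tri_offsets triangle_offsets(const uint16_t *indexes, size_t j,
                                    uint32_t vertex_size)
{
    tri_offsets t;
    for (int k = 0; k < 3; ++k)
        t.off[k] = (uint32_t)indexes[j + k] * vertex_size;
    return t;
}

/* Distance in bytes past the previous quadword (16-byte) boundary;
 * zero means the offset is quadword aligned. */
static uint32_t quadword_misalignment(uint32_t byte_offset)
{
    return byte_offset & 15u;
}
```

With a 24-byte vertex, for example, only every second vertex starts on a quadword boundary, so the alignment of each of v0, v1 and v2 depends on the index value and cannot be known at compile time.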