I’ve been toying with implementing the diamond-square algorithm (for a colourful plasma effect) on the spu. The value of each pixel is calculated by averaging values of the pixels around it.
Originally, I toyed with calculating the average by unpacking the pixel uchars, converting to float, performing the calculations, converting back and repacking. This seemed to work and made the calculation nice and simple. (And the scaling factor provided by cuflt/cfltu meant I could avoid some other computation when randomly-perturbing the result, which was nice). Regardless, I did wonder what it would take to process the uchars directly.
To begin with, it is clear that (a + b + c + d)/4 = (a/4 + b/4 + c/4 + d/4). Division by powers of two are able to be quickly approximated with shifts, and doing so will prevent overflow in the case that (a + b + c + d)>255, so we get
R = (a>>2) + (b>>2) + (c>>2) + (d>>2);
but R will be as much as 3 less than the actual average. To fix this up, the average of the lower two bits can be calculated and added back to the result.
L = (a&3) + (b&3) + (c&3) + (d&3); // sum the lower two bits of each point R += (L>>2); // divide by four and add to the result R += (L>>1)&1; // round up based on the sum of lower bits
R should be an accurate average of the four uchars, without overflow.
Because there’s no overflow, this will work with sixteen uchars at a time – with a little care to deal with the lack of certain byte-centric spu instructions.
qword average4(qword a, qword b, qword c, qword d) { // shift each right by 2 bits, masking shifted-in bits from the result qword au = si_andbi(si_rotqmbii(a, -2), 0x3f); qword bu = si_andbi(si_rotqmbii(b, -2), 0x3f); qword cu = si_andbi(si_rotqmbii(c, -2), 0x3f); qword du = si_andbi(si_rotqmbii(d, -2), 0x3f); // add them all up qword R = si_a(si_a(au,bu), si_a(cu,du)); // add up the lower bits qword L = si_a(si_a(si_andbi(a,3),si_andbi(b,3)), si_a(si_andbi(c,3),si_andbi(d,3))); // shift right 2 bits, again masking out shifted-in high bits R = si_a(R, si_andbi(si_rotqmbii(L,-2), 3)); // shift right and mask for the rounding bit R = si_a(R, si_andbi(si_rotqmbii(L,-1), 1)); return R; }
(24 instructions – 20ish cycles)
While it is fast enough for my present use, I’m sure it’s ripe for improvement.
qword