## Averaging more unsigned chars

Some further thoughts, continuing on from my previous average post

### Alternate methods

sumb (sum bytes in halfwords) was an instruction I had overlooked, and was pointed out to me by ralferoo.  sumb calculates the sums of the four bytes of each word in two quadwords at a time. An add, rotate and shuffle would be all that would be needed to turn the result from two sumb calls into the desired averages.

Unfortunately, the input format I’m using isn’t well suited to sumb and it would appear to require a prohibitive number of shuffles to prepare the data appropriately – that said, there’s at least one place where I shuffle data before calling average4() that may be able to utilise sumb, so I intend to keep it in mind.

The avgb instruction calculates the average of the bytes in two quadwords.  It would be nice to be able to call avgb(avgb(a,b),avgb(c,d)) and to have the final result in 9 cycles, but there’s a fix-up necessary to correct the rounding that takes place in the calculation of the first two averages, and I’ve not yet been able to wrap my head around the correct method to do so.

### Approximating

There are plenty of ways to very quickly get a result that is often — but not always — correct (like avgb).  One of these methods may be suitable for my particular needs, but I won’t know until later.  My goal for now is to attain a result that is as correct as possible and consider ways of speeding it up later, if needed.

I’m annoyed with myself that I missed this one, as I’ve seen it several times recently: rounding can be performed correctly with an addition and truncation. Where I had

```    // add up the lower bits
qword L = si_a(si_a(si_andbi(a,3),si_andbi(b,3)),
si_a(si_andbi(c,3),si_andbi(d,3)));

// shift right 2 bits, again masking out shifted-in high bits
R = si_a(R, si_andbi(si_rotqmbii(L,-2), 3));

// shift right and mask for the rounding bit
R = si_a(R, si_andbi(si_rotqmbii(L,-1), 1));```

adding 2 to each uchar before truncating with rotqmbii means that the last line can be eliminated altogether, so the whole function now looks like:

```qword average4(qword a, qword b, qword c, qword d) {
// shift each right by 2 bits, masking shifted-in bits from the result
qword au = si_andbi(si_rotqmbii(a, -2), 0x3f);
qword bu = si_andbi(si_rotqmbii(b, -2), 0x3f);
qword cu = si_andbi(si_rotqmbii(c, -2), 0x3f);
qword du = si_andbi(si_rotqmbii(d, -2), 0x3f);

qword R = si_a(si_a(au,bu), si_a(cu,du));

// add up the lower bits
qword L = si_a(si_a(si_andbi(a,3),si_andbi(b,3)),
si_a(si_andbi(c,3),si_andbi(d,3)));

L = si_a(L, si_ilh(0x202));

// shift right 2 bits, again masking out shifted-in high bits
R = si_a(R, si_andbi(si_rotqmbii(L,-2), 3));

return R;
}```

The difference is pretty minor — a couple of instructions and (when not inlined) it’s no faster.  For the program it’s used in I’m seeing around a 1.5% runtime reduction.

## Averaging unsigned chars

I’ve been toying with implementing the diamond-square algorithm (for a colourful plasma effect) on the spu. The value of each pixel is calculated by averaging values of the pixels around it.

Originally, I toyed with calculating the average by unpacking the pixel uchars, converting to float, performing the calculations, converting back and repacking. This seemed to work and made the calculation nice and simple. (And the scaling factor provided by cuflt/cfltu meant I could avoid some other computation when randomly-perturbing the result, which was nice). Regardless, I did wonder what it would take to process the uchars directly.

To begin with, it is clear that (a + b + c + d)/4 = (a/4 + b/4 + c/4 + d/4). Division by powers of two are able to be quickly approximated with shifts, and doing so will prevent overflow in the case that (a + b + c + d)>255, so we get

`R = (a>>2) + (b>>2) + (c>>2) + (d>>2);`

but R will be as much as 3 less than the actual average.  To fix this up, the average of the lower two bits can be calculated and added back to the result.

```L = (a&3) + (b&3) + (c&3) + (d&3); // sum the lower two bits of each point
R += (L>>2);   // divide by four and add to the result
R += (L>>1)&1; // round up based on the sum of lower bits```

R should be an accurate average of the four uchars, without overflow.

Because there’s no overflow, this will work with sixteen uchars at a time – with a little care to deal with the lack of certain byte-centric spu instructions.

```qword average4(qword a, qword b, qword c, qword d) {
// shift each right by 2 bits, masking shifted-in bits from the result
qword au = si_andbi(si_rotqmbii(a, -2), 0x3f);
qword bu = si_andbi(si_rotqmbii(b, -2), 0x3f);
qword cu = si_andbi(si_rotqmbii(c, -2), 0x3f);
qword du = si_andbi(si_rotqmbii(d, -2), 0x3f);

qword R = si_a(si_a(au,bu), si_a(cu,du));

// add up the lower bits
qword L = si_a(si_a(si_andbi(a,3),si_andbi(b,3)),
si_a(si_andbi(c,3),si_andbi(d,3)));

// shift right 2 bits, again masking out shifted-in high bits
R = si_a(R, si_andbi(si_rotqmbii(L,-2), 3));

// shift right and mask for the rounding bit
R = si_a(R, si_andbi(si_rotqmbii(L,-1), 1));
return R;
}```

(24 instructions – 20ish cycles)

While it is fast enough for my present use, I’m sure it’s ripe for improvement.

`qword`