Some further thoughts, continuing on from my previous average post…

### Alternate methods

sumb (sum bytes in halfwords) was an instruction I had overlooked, and was pointed out to me by ralferoo. sumb calculates the sums of the four bytes of each word in two quadwords at a time. An add, rotate and shuffle would be all that would be needed to turn the result from two sumb calls into the desired averages.

Unfortunately, the input format I’m using isn’t well suited to sumb and it would appear to require a prohibitive number of shuffles to prepare the data appropriately – that said, there’s at least one place where I shuffle data before calling average4() that may be able to utilise sumb, so I intend to keep it in mind.

The avgb instruction calculates the average of the bytes in two quadwords. It would be nice to be able to call avgb(avgb(a,b),avgb(c,d)) and to have the final result in 9 cycles, but there’s a fix-up necessary to correct the rounding that takes place in the calculation of the first two averages, and I’ve not yet been able to wrap my head around the correct method to do so.

### Approximating

There are plenty of ways to very quickly get a result that is often — but not always — correct (like avgb). One of these methods may be suitable for my particular needs, but I won’t know until later. My goal for now is to attain a result that is as correct as possible and consider ways of speeding it up later, if needed.

### Adding

I’m annoyed with myself that I missed this one, as I’ve seen it several times recently: rounding can be performed correctly with an addition and truncation. Where I had

// add up the lower bits qword L = si_a(si_a(si_andbi(a,3),si_andbi(b,3)), si_a(si_andbi(c,3),si_andbi(d,3))); // shift right 2 bits, again masking out shifted-in high bits R = si_a(R, si_andbi(si_rotqmbii(L,-2), 3)); // shift right and mask for the rounding bit R = si_a(R, si_andbi(si_rotqmbii(L,-1), 1));

adding 2 to each uchar before truncating with rotqmbii means that the last line can be eliminated altogether, so the whole function now looks like:

qword average4(qword a, qword b, qword c, qword d) { // shift each right by 2 bits, masking shifted-in bits from the result qword au = si_andbi(si_rotqmbii(a, -2), 0x3f); qword bu = si_andbi(si_rotqmbii(b, -2), 0x3f); qword cu = si_andbi(si_rotqmbii(c, -2), 0x3f); qword du = si_andbi(si_rotqmbii(d, -2), 0x3f); // add them all up qword R = si_a(si_a(au,bu), si_a(cu,du)); // add up the lower bits qword L = si_a(si_a(si_andbi(a,3),si_andbi(b,3)), si_a(si_andbi(c,3),si_andbi(d,3))); // add 2 L = si_a(L, si_ilh(0x202)); // shift right 2 bits, again masking out shifted-in high bits R = si_a(R, si_andbi(si_rotqmbii(L,-2), 3)); return R; }

The difference is pretty minor — a couple of instructions and (when not inlined) it’s no faster. For the program it’s used in I’m seeing around a 1.5% runtime reduction.