spu_µopt[0] // shifting and arrays

There are a couple of mostly-obvious SPU micro-optimisation tricks that I’ve learned. Here’s one.

qword a[128];
qword f(int i) {
    return a[i>>4];
}

Simple enough code. spu-gcc -O3 produces

f:                                                                                                                                                                                                                                                                                                                          
    ila     $2,a                                                                                                                                                                                                                                                                                                        
    andi    $3,$3,-16                                                                                                                                                                                                                                                                                                   
    lqx     $3,$2,$3                                                                                                                                                                                                                                                                                                    
    bi      $lr

And it’s worse if the shift is not 4 – here’s the result of using 5:

f:
    rotmai  $4,$3,-5
    ila     $2,a
    shli    $3,$4,4
    lqx     $3,$2,$3
    bi      $lr

Which is pretty ugly.

The compiler is doing what it has been asked to do, but doesn’t appear to know that SPU load and store instructions force the four rightmost bits of the calculated address to zero.  Which means the andi masking can be removed from the first case, and in the second the shift right and shift left instructions could be replaced with a single shift right – without changing the result.

How to avoid it?  I’ve used manual pointer arithmetic to get around this compiler feature, e.g.

return *(qword*)((uint)a+i)

Which is pretty ugly.

I don’t know what that sort of thing does for the compiler’s alias analysis. Probably nothing good (judicious use of restrict might help…).  si_lqx() would also work, but my  experience – without in-depth examination – is that measured performance gets worse when using load and store intrinsics. I’m not sure if it’s something to do with scheduling, aliasing or programmer competence.

When would this ever matter? I’ve run into this in some very hot lookup table access. Two or four less cycles can make a big difference. Sometimes. (If it really matters, rewriting the whole thing in assembly can work, too)

It’s a micro-optimisation – measure often, use with care.