Previously, with my SPU programs, I’ve been relying on heavy, gratuitous use of GCC’s --param option to set various inlining thresholds absurdly high. The result has been large programs that take a long time to compile but run quite fast.
The alternative is a little more precision: working out where the compiler isn’t inlining something that would benefit from being inlined (e.g. the handling of software cache hits) and forcing it to do so using always_inline.
The result? Faster compilation, smaller programs and (so far) programs that are as fast or faster. The compiler generally knows what it’s doing when it comes to inlining; there are just some silly little, very hot cache routines that it doesn’t handle well.
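To make the always_inline approach concrete, here is a minimal sketch of a direct-mapped software cache. It is not my actual cache code: the names (cache_lookup, cache_miss, backing_store), the sizes, and the memcpy standing in for a DMA transfer are all made up for illustration. Marking the miss path noinline is also my own assumption about what pairs well with this, not something the approach requires.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sizes, chosen only for illustration. */
#define CACHE_LINES 64u   /* number of direct-mapped lines */
#define LINE_SIZE   128u  /* bytes per line (power of two) */

static uint8_t  backing_store[1u << 16];        /* stand-in for main memory */
static uint8_t  cache_data[CACHE_LINES][LINE_SIZE];
static uint32_t cache_tag[CACHE_LINES];
static uint8_t  cache_valid[CACHE_LINES];
static unsigned miss_count;                     /* counts calls to cache_miss */

/* Cold path: kept out of line so every call site of the hot path stays small.
 * The noinline attribute is an assumption of this sketch. */
__attribute__((noinline))
static uint8_t *cache_miss(uint32_t addr)
{
    assert(addr < sizeof backing_store);
    uint32_t line_addr = addr & ~(LINE_SIZE - 1u);
    uint32_t idx = (addr / LINE_SIZE) % CACHE_LINES;

    /* On a real SPU this would be a DMA transfer; memcpy stands in for it. */
    memcpy(cache_data[idx], &backing_store[line_addr], LINE_SIZE);
    cache_tag[idx] = line_addr;
    cache_valid[idx] = 1;
    miss_count++;
    return &cache_data[idx][addr % LINE_SIZE];
}

/* Hot path: the hit check is tiny and runs constantly, so force it inline
 * rather than raising the global inlining thresholds. */
static inline __attribute__((always_inline))
uint8_t *cache_lookup(uint32_t addr)
{
    uint32_t line_addr = addr & ~(LINE_SIZE - 1u);
    uint32_t idx = (addr / LINE_SIZE) % CACHE_LINES;

    if (__builtin_expect(cache_valid[idx] && cache_tag[idx] == line_addr, 1))
        return &cache_data[idx][addr % LINE_SIZE];  /* hit */
    return cache_miss(addr);                        /* miss: out-of-line call */
}
```

Callers just use cache_lookup(addr) and dereference the result. The hit check gets pasted into each call site, while the bulky miss handling exists once, which is roughly why the programs come out smaller than with the thresholds cranked up across the board.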