Refining the buddhabrot renderer, I’ve added vectorisation to iterate two points at once, which gives (at least) twice the performance. Huzzah.
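For illustration, here is a rough sketch of iterating two points in lock-step. It is plain scalar C standing in for the two lanes of a vec_double2, not the SPU intrinsics code from the renderer. The name iterate_pair, the LANES constant and the escape radius of 2 are assumptions made for this example.

```c
enum { LANES = 2 };  /* two doubles fit in one 128-bit SPU register */

/* Iterate z = z*z + c for two points at once.
 * iters[k] receives the number of iterations lane k completed before
 * |z|^2 exceeded 4, or max_iter if it never escaped.
 * Hypothetical helper: a scalar stand-in for a two-lane SIMD loop. */
static void iterate_pair(const double cr[LANES], const double ci[LANES],
                         int max_iter, int iters[LANES])
{
    double zr[LANES] = { 0.0, 0.0 };
    double zi[LANES] = { 0.0, 0.0 };
    int active[LANES] = { 1, 1 };

    iters[0] = 0;
    iters[1] = 0;

    /* Keep going while at least one lane is still iterating. */
    for (int n = 0; n < max_iter && (active[0] | active[1]); ++n) {
        for (int k = 0; k < LANES; ++k) {
            double r2 = zr[k] * zr[k];
            double i2 = zi[k] * zi[k];

            /* Once a lane escapes it stays masked off. */
            if (r2 + i2 > 4.0)
                active[k] = 0;

            /* In real SIMD code this branch would be a masked select. */
            if (active[k]) {
                zi[k] = 2.0 * zr[k] * zi[k] + ci[k];
                zr[k] = r2 - i2 + cr[k];
                iters[k]++;
            }
        }
    }
}
```

Both lanes pay for the slower point, so the gain depends on how similar the two iteration counts are.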
To begin with, I lifted code from one of the later revisions of Jeremy’s Mandelbrot renderer. This was written for single precision float, whereas I’ve been working in double precision for this buddhabrot code. A few things worth noting about the change from single to double precision:
- Double precision numbers behave differently to single precision on the SPU (see section 9 of the SPU ISA doc) – I was bitten by infs and NaNs.
- When browsing that document, I missed the large “Optional v1.2” for instructions like dfcgt. To be clear, the Cell BE SPU does not support this instruction.
- GCC does include vec_ullong2 spu_cmpgt(vec_double2, vec_double2), but in the absence of dfcgt it takes forty extra instructions to achieve the same result (yeah, that’s what I get for using general intrinsics).
When starting to use double precision, I was expecting much lower performance than single precision on the SPU, but I had not fully understood how much lower – from the Programming Handbook, page 71:
“Although double-precision instructions have 13-clock-cycle latencies, on the Cell/B.E. processor, only the final seven cycles are pipelined. No other instructions are dual-issued with double-precision instructions, and no instructions of any kind are issued for six cycles after a double-precision instruction is issued.”
Ouch. I knew this, but I didn’t know it – a run of spu_timing on the generated assembly really rammed it home.
0 0123456789012 dfs $75,$45,$44
0 ------7890123456789 dfma $46,$59,$47
0 ------4567890123456 dfa $43,$45,$44
0 ------1234567890123 dfa $42,$80,$75
0 ------8901234567890 dfm $32,$46,$46
0 ------5678901234567 frds $40,$43
0 01234 ------23456789 dfm $33,$42,$42
0 012345678901 ------9 dfm $36,$42,$81
(Oh, and I’ve noticed again that dfma and friends use RT as an operand, which presumably makes register scheduling even more fun. The above fragment is from a heavily unrolled inner loop.)
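The spacing in the timing output above (instructions starting at cycles 0, 7, 14, 21 and so on) follows directly from the Handbook's numbers. As a toy cost model, assuming back-to-back double-precision instructions with no dependency stalls and nothing else in the stream, it can be sketched like this. The function names and constants are hypothetical, chosen for this example.

```c
/* Figures quoted from the Programming Handbook. */
enum {
    DP_LATENCY     = 13,  /* cycles from issue to result */
    DP_ISSUE_STALL = 6    /* cycles with no issue after a DP instruction */
};

/* Cycle at which the i-th (0-based) back-to-back DP instruction issues. */
static unsigned dp_issue_cycle(unsigned i)
{
    return i * (DP_ISSUE_STALL + 1);
}

/* Cycles until the result of the last of n DP instructions is available.
 * Toy model only: ignores operand dependencies and other instructions. */
static unsigned dp_chain_cycles(unsigned n)
{
    return n == 0 ? 0 : dp_issue_cycle(n - 1) + DP_LATENCY;
}
```

Under this model, at best one double-precision instruction issues every seven cycles, however the loop is scheduled.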
At some point, I’ll try to measure the practical difference between double and single precision for this program, to see what (if anything) would be lost by switching over to single precision. Or perhaps there’s some other way around the problem – I’ve been considering fixed point or even multi-single precision fp alternatives.