And now I know – adventures in double precision

Refining the buddhabrot renderer, I’ve added vectorisation to iterate two points at once, which gives (at least) twice the performance. Huzzah.

To begin with, I lifted code from one of the later revisions of Jeremy’s Mandelbrot renderer. That code was written for single precision floats, whereas I’ve been working in double precision for this buddhabrot code. A few things worth noting about the change from single to double precision –

  • Double precision numbers behave differently to single precision on the SPU (see section 9 of the SPU ISA doc) – I was bitten by infs and NaNs.
  • When browsing that document, I missed the large “Optional v1.2” for instructions like dfcgt. To be clear, the Cell BE SPU does not support this instruction.
  • GCC does include vec_ullong2 spu_cmpgt(vec_double2, vec_double2), but in the absence of dfcgt it takes forty extra instructions to achieve the same result (yeah, that’s what I get for using general intrinsics).
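
To give a feel for why a double precision compare gets expensive without hardware support, here is a rough sketch in portable scalar C of a greater-than test built from integer operations only. It is not the sequence GCC emits for spu_cmpgt, and it ignores SPU specifics such as denormal handling; the names ordered_key, is_nan and soft_cmpgt are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Map an IEEE-754 double onto a signed integer whose ordering matches the
 * ordering of the doubles (NaNs excluded; +0.0 and -0.0 both map to 0). */
static int64_t ordered_key(double d)
{
    uint64_t bits;
    assert(sizeof d == sizeof bits);
    memcpy(&bits, &d, sizeof bits);              /* well-defined type pun */
    uint64_t magnitude = bits & UINT64_C(0x7fffffffffffffff);
    return (bits >> 63) ? -(int64_t)magnitude : (int64_t)magnitude;
}

/* A NaN has an all-ones exponent and a non-zero fraction. */
static int is_nan(double d)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    return (bits & UINT64_C(0x7fffffffffffffff)) > UINT64_C(0x7ff0000000000000);
}

/* Hypothetical helper: 1 if a > b, 0 otherwise (also 0 if either is NaN). */
int soft_cmpgt(double a, double b)
{
    if (is_nan(a) || is_nan(b))
        return 0;
    return ordered_key(a) > ordered_key(b);
}
```

Even in this scalar form there are masks, a sign test, a negate and NaN checks for each operand before the actual compare; doing the same for both lanes of a vector with 32-bit integer instructions adds up quickly.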

When starting to use double precision, I was expecting much lower performance than single precision on the SPU, but I had not fully understood how much lower – from the Programming Handbook, page 71:

Although double-precision instructions have 13-clock-cycle latencies, on the Cell/B.E. processor, only the final seven cycles are pipelined. No other instructions are dual-issued with double-precision instructions, and no instructions of any kind are issued for six cycles after a double-precision instruction is issued.

Ouch.  I knew this, but I didn’t know it – a run of spu_timing on the generated assembly really rammed it home.

0  0123456789012                                      dfs  $75,$45,$44
0   ------7890123456789                               dfma $46,$59,$47
0          ------4567890123456                        dfa  $43,$45,$44
0                 ------1234567890123                 dfa  $42,$80,$75
0                        ------8901234567890          dfm  $32,$46,$46
0                               ------5678901234567   frds $40,$43
0  01234                               ------23456789 dfm  $33,$42,$42
0  012345678901                               ------9 dfm  $36,$42,$81

(Oh, and I’ve noticed again that dfma and friends use RT as an operand, which presumably makes register scheduling even more fun. The above fragment is from a heavily unrolled inner loop.)
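
The stall pattern in that trace can be put into numbers. The sketch below assumes a run of back-to-back double precision instructions with no dependency stalls, each with a 13-cycle latency and a six-cycle issue stall behind it, as the Handbook describes; the function names are made up for illustration.

```c
#include <assert.h>

enum { DP_LATENCY = 13, DP_STALL = 6 };

/* Cycle on which the n-th (0-based) instruction of the run issues:
 * every issue is followed by DP_STALL cycles in which nothing issues. */
unsigned dp_issue_cycle(unsigned n)
{
    return n * (DP_STALL + 1);
}

/* Cycles until the result of the last of `count` instructions is ready. */
unsigned dp_total_cycles(unsigned count)
{
    assert(count > 0);
    return dp_issue_cycle(count - 1) + DP_LATENCY;
}

/* Best case: every issue is a fused multiply-add on two doubles,
 * i.e. 2 lanes * 2 flops per instruction, one instruction per 7 cycles. */
double dp_peak_flops_per_cycle(void)
{
    return (2.0 * 2.0) / (DP_STALL + 1);
}
```

For the eight instructions in the trace this gives issue cycles 0, 7, 14 and so on up to 49, matching the timing output, and 62 cycles until the last result is available. The best case works out to 4/7 of a flop per cycle, which at an assumed 3.2 GHz clock is roughly 1.8 GFLOPS per SPE.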

At some point, I’ll try to measure the practical difference between double and single precision for this program, to see what (if anything) would be lost by switching over to single precision. Or perhaps there’s some other way around the problem – I’ve been considering fixed point or even multi-single precision fp alternatives.
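
For the multi-single precision idea, one common building block is a "float-float" value: an unevaluated sum of two floats, kept accurate with an error-free addition. The sketch below is portable scalar C and assumes IEEE round-to-nearest single precision with no excess precision, which is not how SPU single precision behaves (it truncates), so it would need adapting before being any use on the SPU. The type and function names are made up for illustration.

```c
#include <assert.h>
#include <float.h>

/* A value stored as hi + lo, where lo holds what hi could not represent. */
typedef struct { float hi, lo; } float2x;

/* Knuth's two-sum: returns s = fl(a + b) and the exact rounding error,
 * so that s.hi + s.lo == a + b exactly (under round-to-nearest). */
float2x two_sum(float a, float b)
{
    assert(FLT_EVAL_METHOD == 0);   /* float ops must really be float */
    float s   = a + b;
    float bb  = s - a;              /* the part of b that made it into s */
    float err = (a - (s - bb)) + (b - bb);
    float2x r = { s, err };
    return r;
}

/* Add two float-float values, then renormalise so hi carries the bulk. */
float2x ff_add(float2x a, float2x b)
{
    float2x s  = two_sum(a.hi, b.hi);
    float   lo = s.lo + a.lo + b.lo;
    return two_sum(s.hi, lo);
}
```

Adding 1e-8f to 1.0f in plain single precision just returns 1.0f; with two_sum the 1e-8f survives in the lo part. The price is several single precision operations per add (and more for a multiply), which would have to be weighed against the seven-cycle issue cost of each double precision instruction.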

Three buddhabrot

I’ve been experimenting with buddhabrot colouring tonight (actually, I think these are nebulabrot, although the colour composition isn’t as nice as I’d like).

Colouring is based on three passes with different parameters, with each hit on a pixel incrementing the colour channel (with saturation).
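
A minimal sketch of the per-hit update, assuming 8 bits per colour channel and assuming that each pass only counts orbits whose iteration count falls inside that pass’s range (inclusive at both ends). Both assumptions, and the function names, are mine for the sake of illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Increment a colour channel, sticking at the maximum instead of wrapping. */
uint8_t saturating_inc(uint8_t value)
{
    return (value == UINT8_MAX) ? value : (uint8_t)(value + 1);
}

/* Does an orbit that escaped after `iterations` steps belong to a pass
 * with the given iteration range? */
int pass_accepts(unsigned iterations, unsigned min_iter, unsigned max_iter)
{
    assert(min_iter <= max_iter);
    return iterations >= min_iter && iterations <= max_iter;
}
```

Without the saturation check a heavily hit pixel wraps from 255 back to 0 and turns dark again.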

Click each one for a 1080p version.

Blue: 312-5,000  Green: 625-10,000  Red: 1250-20,000

Blue: 19-5,000  Green: 39-10,000  Red: 78-20,000

Blue: 10-5,000  Green: 5,000-10,000  Red: 10,000-15,000

CellBE Buddhabrot renderer

For my next TUCS tech talk I’ll be continuing on from the Mandelbrot rendering in the last one (which can be seen here) to something a little more complex.


The Buddhabrot is conceptually not any more complex than the Mandelbrot in terms of its generation – rather than colouring points based on the number of iterations before they ‘escape’, we apply colour to each point reached while iterating escaping starting points.  This has consequences for the drawing of the Buddhabrot – rather than generating one point at a time independently of all other points in the output, iterating a single input point may affect thousands of different output points.  This makes it all trickier when implementing this on the Cell BE – parallel writes by SPEs to shared locations will need some form of synchronisation.  That could be messy, and the process of load/modify/store when expressed in terms of SPU DMA can be quite clumsy.
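
In scalar C, the core of that idea might look like the sketch below. It is not the code from fractal.c or spe-fractal.c: the escape radius of 2, the point type, the mapping to pixels and all names are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { double re, im; } point;

/* Iterate z -> z*z + c from z = 0, recording each point reached in orbit[]
 * (pass NULL to skip recording). Returns the number of points visited if c
 * escapes within max_iter steps, or 0 if it does not escape. */
size_t escaping_orbit(point c, point *orbit, size_t max_iter)
{
    double zr = 0.0, zi = 0.0;
    assert(max_iter > 0);
    for (size_t i = 0; i < max_iter; i++) {
        double t = zr * zr - zi * zi + c.re;
        zi = 2.0 * zr * zi + c.im;
        zr = t;
        if (orbit) {
            orbit[i].re = zr;
            orbit[i].im = zi;
        }
        if (zr * zr + zi * zi > 4.0)      /* |z| > 2: the orbit escapes */
            return i + 1;
    }
    return 0;
}

/* Map a coordinate in [lo, hi) to a pixel index in [0, size), or -1 if the
 * coordinate falls outside the image. */
int to_pixel(double v, double lo, double hi, int size)
{
    assert(size > 0 && hi > lo);
    if (v < lo || v >= hi)
        return -1;
    return (int)((v - lo) / (hi - lo) * size);
}
```

A Mandelbrot renderer would use only the return value to colour the pixel for c. For the Buddhabrot, every entry of orbit[] for an escaping c is mapped through to_pixel and plotted, which is where the scattered writes come from.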

Rather than try to implement a complex locking/synchronisation system, I have tried to apply some ideas from a set of post-it notes by Mike Acton (you can see them here).  This isn’t identical to Mike’s solution, because it’s not the same problem.

To explain – each SPE thread iterates various points on the screen, and generates a list of points to be written.  This list of points is sent via DMA to a buffer set aside for that SPE, and the PPE proceeds through the list, plotting the points to the framebuffer. The advantage of this approach is that there is only one writer to the framebuffer (the PPE), and that each SPE has its own buffers to write its data into. The only synchronisation that is necessary is between each SPE and the PPE to ensure that all data in a buffer is consumed before writing more into it.  This is achieved through the use of interrupt mailboxes (SPE tells PPE that there is data), a fenced DMA to act as sentinel (the PPE spins on the arrival of the sentinel data to ensure that DMA of a buffer has completed – this doesn’t feel like the right way to solve this particular problem, though), and the SPE signal register in OR mode to inform the SPE that a particular buffer has been finished with.  Interrupt mailbox events are aggregated through libspe2’s spe_event_*() functions.
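
The ownership rules can be modelled in a few lines of portable, single-threaded C. This is only a sketch of the handoff, not the real thing: a plain flag stands in for the mailbox, sentinel and signal register, there is no DMA, and the sizes and names are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { BUF_HITS = 256, FB_W = 64, FB_H = 64 };   /* illustrative sizes */

typedef struct { uint16_t x, y; } hit;

typedef struct {
    hit    hits[BUF_HITS];
    size_t count;
    int    published;   /* stand-in for "mailbox sent and sentinel arrived" */
} hit_buffer;

/* Producer (SPE) side: refuse to write while the consumer owns the buffer. */
int buffer_push(hit_buffer *b, uint16_t x, uint16_t y)
{
    if (b->published || b->count == BUF_HITS)
        return 0;
    b->hits[b->count].x = x;
    b->hits[b->count].y = y;
    b->count++;
    return 1;
}

/* Producer hands the buffer over to the consumer. */
void buffer_publish(hit_buffer *b) { b->published = 1; }

/* Consumer (PPE) side: the only code that writes to the framebuffer.
 * Clearing the flag stands in for signalling the SPE that the buffer is free. */
size_t buffer_drain(hit_buffer *b, uint8_t *fb)
{
    if (!b->published)
        return 0;
    size_t plotted = b->count;
    for (size_t i = 0; i < b->count; i++) {
        hit h = b->hits[i];
        assert(h.x < FB_W && h.y < FB_H);
        uint8_t *px = &fb[(size_t)h.y * FB_W + h.x];
        if (*px != UINT8_MAX)
            (*px)++;
    }
    b->count = 0;
    b->published = 0;
    return plotted;
}

/* Two producers hit the same pixel three times in total; the single
 * consumer drains both buffers and the pixel ends up at 3. */
unsigned handoff_demo(void)
{
    uint8_t fb[FB_W * FB_H];
    hit_buffer spe[2];
    memset(fb, 0, sizeof fb);
    memset(spe, 0, sizeof spe);

    buffer_push(&spe[0], 3, 4);
    buffer_push(&spe[0], 3, 4);
    buffer_publish(&spe[0]);
    buffer_push(&spe[1], 3, 4);
    buffer_publish(&spe[1]);

    for (int s = 0; s < 2; s++)
        buffer_drain(&spe[s], fb);
    return fb[4 * FB_W + 3];
}
```

Because each buffer has exactly one producer and the framebuffer has exactly one writer, the read-modify-write on a pixel never races, even when several producers hit the same pixel.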

It’s not an especially complex piece of code – the motivation in its writing is for my own interest and to use for the tech talk. I think it will do nicely for explaining some of the complexities and curiosities of the Cell BE architecture, and the programming of it with the IBM SDK.

There are a few extra features that I’d like to add – particularly better colouring (including saturation which is unfortunately apparent in its absence), and a number of optimisations to the render_fractal() function that I need to lift from my earlier Mandelbrot efforts.

The program includes code by Jeremy Kerr (see the hackfest items at http://ozlabs.org/~jk/diary/tech/cell/) and Mike Acton (framebuffer utilities, from http://cellperformance.beyond3d.com/articles/2007/03/handy-ps3-linux-framebuffer-utilities.html).  My thanks to Jeremy and Mike, and to all those who have offered comments & feedback via Twitter.

[edit: Oh, and it includes cheriff’s fine VNC code ;)]

Read the code: fractal.c and spe-fractal.c, or grab a tarball.  Comments & suggestions most welcome.

Addition: I added pixel value saturation and experimented with some alternative approaches to colouring…

(Click for larger version)