spu
spu_µopt[0] // shifting and arrays
by jonathan on May.07, 2010, under general, programming, spu
There are a couple of mostly-obvious SPU micro-optimisation tricks that I’ve learned. Here’s one.
qword a[128];
qword f(int i) {
    return a[i>>4];
}
Simple enough code. spu-gcc -O3 produces
f:
ila $2,a
andi $3,$3,-16
lqx $3,$2,$3
bi $lr
And it’s worse if the shift is not 4 – here’s the result of using 5:
f:
rotmai $4,$3,-5
ila $2,a
shli $3,$4,4
lqx $3,$2,$3
bi $lr
Which is pretty ugly.
The compiler is doing what it has been asked to do, but doesn’t appear to know that SPU load and store instructions force the four rightmost bits of the calculated address to zero. This means the andi masking can be removed from the first case, and in the second the shift right and shift left instructions could be replaced with a single shift right by one bit, without changing the result.
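To convince myself of the claim above, here is a minimal sketch in portable C that models the address calculation rather than running on an SPU. The function names are made up for illustration, and it assumes a non-negative index and a quadword-aligned table (which a qword array should be).

```c
#include <assert.h>
#include <stdint.h>

/* Model of an SPU quadword load: the low four bits of the computed
   address are forced to zero before memory is accessed. */
static uint32_t lq_address(uint32_t base, uint32_t offset)
{
    return (base + offset) & ~UINT32_C(15);
}

/* a[i >> 4] as compiled: andi, then lqx. */
static uint32_t shift4_with_mask(uint32_t base, uint32_t i)
{
    return lq_address(base, i & ~UINT32_C(15));
}

/* a[i >> 4] with the andi removed. */
static uint32_t shift4_without_mask(uint32_t base, uint32_t i)
{
    return lq_address(base, i);
}

/* a[i >> 5] as compiled: shift right by 5, shift left by 4, then lqx. */
static uint32_t shift5_two_shifts(uint32_t base, uint32_t i)
{
    return lq_address(base, (i >> 5) << 4);
}

/* a[i >> 5] with a single shift right by one. */
static uint32_t shift5_one_shift(uint32_t base, uint32_t i)
{
    return lq_address(base, i >> 1);
}

/* Count the indices in [0, limit) where a short form disagrees with
   what the compiler generated. */
static uint32_t count_mismatches(uint32_t base, uint32_t limit)
{
    uint32_t bad = 0;
    assert((base & 15) == 0); /* the table must be quadword aligned */
    for (uint32_t i = 0; i < limit; ++i) {
        bad += shift4_with_mask(base, i) != shift4_without_mask(base, i);
        bad += shift5_two_shifts(base, i) != shift5_one_shift(base, i);
    }
    return bad;
}
```

Calling count_mismatches() with an aligned base should report no disagreements. With an unaligned base the low bits of base and i can carry into bit 4, which is why the alignment assumption matters.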
How to avoid it? I’ve used manual pointer arithmetic to get around this compiler feature, e.g.
return *(qword*)((uint)a+i);
Which is pretty ugly.
I don’t know what that sort of thing does for the compiler’s alias analysis. Probably nothing good (judicious use of restrict might help…). si_lqx() would also work, but my experience – without in-depth examination – is that measured performance gets worse when using load and store intrinsics. I’m not sure if it’s something to do with scheduling, aliasing or programmer competence.
When would this ever matter? I’ve run into this in some very hot lookup table access. Two or four fewer cycles can make a big difference. Sometimes. (If it really matters, rewriting the whole thing in assembly can work, too.)
It’s a micro-optimisation – measure often, use with care.
Building gdb for Cell/B.E.
by jonathan on Apr.05, 2010, under ps3, spu
Trying to debug a bus error on my PS3, I realised that the version of GDB I have installed doesn’t support debugging of SPU programs. There doesn’t seem to be a Debian packaged version available that does, so I built my own.
Because I found no obvious Google result, I share this with the zero other people that I expect may one day be interested: the key option for configure appears to be
--enable-targets=spu
This information was brought to you via the gdb.spec file, and a post to the gcc-testresults mailing list.
And now I know – adventures in double precision
by jonathan on Sep.21, 2009, under general, spu
Refining the buddhabrot renderer, I’ve added vectorisation to iterate two points at once, which gives (at least) twice the performance. Huzzah.
To begin with, I lifted code from one of the later revisions of Jeremy’s Mandelbrot renderer. This was written for single precision float, whereas I’ve been working in double precision for this buddhabrot code. Worth noting on the change from single to double precision -
- Double precision numbers behave differently to single precision on the SPU (see section 9 of the SPU ISA doc) – I was bitten by infs and NaNs.
- When browsing that document, I missed the large “Optional v1.2” for instructions like dfcgt. To be clear, the Cell BE SPU does not support this instruction.
- GCC does include vec_ullong2 spu_cmpgt(vec_double2, vec_double2), but in the absence of dfcgt it takes forty extra instructions to achieve the same result (yeah, that’s what I get for using general intrinsics)
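To give a feel for why a greater-than compare gets expensive without a dedicated instruction, here is a scalar sketch in portable C that does the comparison on the bit patterns. This is not the sequence GCC generates, and it follows ordinary IEEE 754 rules rather than the SPU's double precision quirks. The names are made up, and it assumes double is a 64-bit IEEE 754 type.

```c
#include <stdint.h>
#include <string.h>

#define SIGN_BIT UINT64_C(0x8000000000000000)
#define INF_BITS UINT64_C(0x7ff0000000000000)

/* Reinterpret a double as its 64-bit pattern. */
static uint64_t bits_of(double d)
{
    uint64_t u;
    memcpy(&u, &d, sizeof u);
    return u;
}

/* Map sign-magnitude bits to an unsigned key that orders like the value:
   negative numbers are inverted, non-negative ones get the top bit set. */
static uint64_t order_key(uint64_t u)
{
    return (u & SIGN_BIT) ? ~u : (u | SIGN_BIT);
}

/* Returns all ones if a > b, otherwise zero (the shape of result that
   a vector compare produces per element). */
static uint64_t cmpgt_mask(double a, double b)
{
    uint64_t ua = bits_of(a), ub = bits_of(b);
    uint64_t ma = ua & ~SIGN_BIT, mb = ub & ~SIGN_BIT;

    if (ma > INF_BITS || mb > INF_BITS)
        return 0;                 /* a NaN never compares greater */
    if ((ma | mb) == 0)
        return 0;                 /* +0 and -0 compare equal */
    return order_key(ua) > order_key(ub) ? ~UINT64_C(0) : 0;
}
```

Even in this simplified form there are two special cases and a sign fix-up before the actual compare, and a vector version has to do all of that with masks and selects instead of branches.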
When starting to use double precision, I was expecting much lower performance than single precision on the SPU, but I had not fully understood how much lower – from the Programming Handbook, page 71:
Although double-precision instructions have 13-clock-cycle latencies, on the Cell/B.E. processor, only the final seven cycles are pipelined. No other instructions are dual-issued with double-precision instructions, and no instructions of any kind are issued for six cycles after a double-precision instruction is issued.
Ouch. I knew this, but I didn’t know it – a run of spu_timing on the generated assembly really rammed it home.
0 0123456789012 dfs $75,$45,$44
0 ------7890123456789 dfma $46,$59,$47
0 ------4567890123456 dfa $43,$45,$44
0 ------1234567890123 dfa $42,$80,$75
0 ------8901234567890 dfm $32,$46,$46
0 ------5678901234567 frds $40,$43
0 01234 ------23456789 dfm $33,$42,$42
0 012345678901 ------9 dfm $36,$42,$81
(Oh, and I’ve noticed again that dfma and friends use RT as an operand, which presumably makes register scheduling even more fun. The above fragment is from a heavily unrolled inner loop.)
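The arithmetic behind that timing output is simple enough to write down. A rough sketch, using only the 13-cycle latency and six-cycle stall from the Handbook quote, and assuming every instruction in the run is double precision and none of them waits on an operand (the helper names are mine):

```c
#include <stdint.h>

enum {
    DP_LATENCY = 13,     /* cycles from issue to result */
    DP_ISSUE_STALL = 6   /* cycles in which nothing issues after a DP instruction */
};

/* Cycle on which the k-th (0-based) instruction of a back-to-back
   double-precision run issues: one instruction every seven cycles. */
static uint32_t dp_issue_cycle(uint32_t k)
{
    return k * (DP_ISSUE_STALL + 1);
}

/* Cycle on which the result of the last of n instructions is available. */
static uint32_t dp_run_cycles(uint32_t n)
{
    return n == 0 ? 0 : dp_issue_cycle(n - 1) + DP_LATENCY;
}
```

For the eight instructions in the fragment this gives issue cycles 0, 7, 14, ... 49, which matches the first digit after the dashes on each line of the spu_timing output, and 62 cycles until the last result is ready.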
At some point, I’ll try to measure the practical difference between double and single precision for this program, to see what (if anything) would be lost by switching over to single precision. Or perhaps there’s some other way around the problem – I’ve been considering fixed point or even multi-single precision fp alternatives.
CellBE Buddhabrot renderer
by jonathan on Sep.10, 2009, under general, ps3, spu
For my next TUCS tech talk I’ll be continuing on from the Mandelbrot rendering in the last one (which can be seen here) to something a little more complex.
The Buddhabrot is conceptually not any more complex than the Mandelbrot in terms of its generation – rather than colouring points based on the number of iterations before they ‘escape’, we apply colour to each point reached while iterating escaping starting points. This has consequences for the drawing of the Buddhabrot – rather than generating one point at a time independently of all other points in the output, iterating a single input point may affect thousands of different output points. This makes it all trickier when implementing this on the Cell BE – parallel writes by SPEs to shared locations will need some form of synchronisation. That could be messy, and the process of load/modify/store when expressed in terms of SPU DMA can be quite clumsy.
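As a minimal scalar sketch of that idea (not the code from the renderer, and with a hypothetical function name), the following iterates one starting point and keeps the orbit only if it escapes. The caller is assumed to supply xs and ys arrays with room for max_iter entries.

```c
#include <stddef.h>

/* Iterate z -> z*z + c from z = 0. If the orbit escapes (|z| > 2) within
   max_iter steps, the visited points are left in xs/ys and their count is
   returned; otherwise 0 is returned and nothing should be drawn. */
static size_t escaping_orbit(double cr, double ci, size_t max_iter,
                             double *xs, double *ys)
{
    double zr = 0.0, zi = 0.0;
    size_t n = 0;

    while (n < max_iter) {
        double next_r = zr * zr - zi * zi + cr;
        double next_i = 2.0 * zr * zi + ci;
        zr = next_r;
        zi = next_i;
        xs[n] = zr;
        ys[n] = zi;
        ++n;
        if (zr * zr + zi * zi > 4.0)
            return n;   /* escaped: every stored point gets plotted */
    }
    return 0;           /* never escaped: discard the orbit */
}
```

Each returned point lands somewhere different in the output image, which is where the scattered writes come from.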
Rather than try to implement a complex locking/synchronisation system, I have tried to apply some ideas from a set of post-it notes by Mike Acton (you can see them here). This isn’t identical to Mike’s solution, because it’s not the same problem.
To explain – each SPE thread iterates various points on the screen, and generates a list of points to be written. This list of points is sent via DMA to a buffer set aside for that SPE, and the PPE proceeds through the list plotting the points to the framebuffer. The advantage of this approach is that there is only one writer to the framebuffer (the PPE), and that each SPE has its own buffers to write its data into. The only synchronisation that is necessary is between each SPE and the PPE to ensure that all data in a buffer is consumed before writing more into it. This is achieved through the use of interrupt mailboxes (SPE tells PPE that there is data), a fenced DMA to act as sentinel (the PPE spins on the arrival of the sentinel data to ensure that DMA of a buffer has completed – this doesn’t feel like the right way to solve this particular problem, though), and the SPE signal register in OR mode to inform the SPE that a particular buffer has been finished with. Interrupt mailbox events are aggregated through libspe2’s spe_event_*() functions.
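The data flow can be modelled in plain C without any of the DMA, mailbox or signal machinery. This is only a sketch of the single-writer idea. The struct layout, sizes, function names and the saturating 8-bit pixel are all assumptions for illustration, not what fractal.c actually does.

```c
#include <stdint.h>

#define FB_WIDTH      64   /* hypothetical framebuffer size */
#define FB_HEIGHT     64
#define BUFFER_POINTS 256  /* hypothetical per-buffer capacity */

/* One output point produced while iterating an escaping orbit. */
struct point { uint16_t x, y; };

/* A buffer owned by one SPE; in the real program it is filled by DMA. */
struct point_buffer {
    uint32_t count;                      /* number of valid entries */
    struct point points[BUFFER_POINTS];
};

/* SPE side stand-in: append a point. Returns 0 when the buffer is full
   and has to be handed to the PPE before more can be written. */
static int spe_emit_point(struct point_buffer *buf, uint16_t x, uint16_t y)
{
    if (buf->count == BUFFER_POINTS)
        return 0;
    buf->points[buf->count].x = x;
    buf->points[buf->count].y = y;
    buf->count++;
    return 1;
}

/* PPE side: the only code that touches the framebuffer. Plots every
   in-range point, empties the buffer, and returns how many were plotted. */
static uint32_t ppe_plot_buffer(uint8_t *fb, struct point_buffer *buf)
{
    uint32_t plotted = 0;
    for (uint32_t i = 0; i < buf->count; ++i) {
        struct point p = buf->points[i];
        if (p.x >= FB_WIDTH || p.y >= FB_HEIGHT)
            continue;                    /* off screen */
        uint8_t *px = &fb[p.y * FB_WIDTH + p.x];
        if (*px < UINT8_MAX)
            ++*px;                       /* saturate rather than wrap */
        ++plotted;
    }
    buf->count = 0;                      /* buffer may now be reused */
    return plotted;
}
```

Because only ppe_plot_buffer() modifies pixels, the read/modify/write on the framebuffer needs no locking. The remaining synchronisation is just the hand-over of each buffer.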
It’s not an especially complex piece of code – the motivation in its writing is for my own interest and to use for the tech talk. I think it will do nicely for explaining some of the complexities and curiosities of the Cell BE architecture, and the programming of it with the IBM SDK.
There are a few extra features that I’d like to add – particularly better colouring (including saturation which is unfortunately apparent in its absence), and a number of optimisations to the render_fractal() function that I need to lift from my earlier Mandelbrot efforts.
The program includes code by Jeremy Kerr (See hackfest items at http://ozlabs.org/~jk/diary/tech/cell/) and Mike Acton (framebuffer utilities, from http://cellperformance.beyond3d.com/articles/2007/03/handy-ps3-linux-framebuffer-utilities.html). My thanks to Jeremy and Mike, and to all those that have offered comments & feedback via twitter.
[edit: Oh, and it includes cheriff's fine VNC code ;)]
Read the code: fractal.c and spe-fractal.c, or grab a tarball. Comments & suggestions most welcome.
Addition: I added pixel value saturation and experimented with some alternative approaches to colouring…
Some recent SPU toolchain patches
by jonathan on Jul.17, 2009, under general, spu
A couple of patches that I’ve noticed on various mailing lists -
libspe – Some small changes, but includes spe_image_open_library() to load an spe image from a ppe shared library.
binutils – “Also, if DLL’s were supported on SPU…” Interesting idea – will have to wait to see what comes of it.
gcc – Support non-constants as the second argument of __builtin_expect. Interesting idea.
(Also, Revital Eres’s function partitioning patch has had some activity, and there’s the odd patch from Alan Modra on overlays and software icache.)
CellBE frameworks
by jonathan on Jun.16, 2009, under general, spu, tumble
celltask “A clean task interface to Cell programmers to start jobs in SPUs by hiding the tedious context/pthread creation, mailbox/signal/interrupt mailbox communication, etc.” (new)
speutils “…instance initiators for various types of posix threads to run the spe programs as well as a instance oriented message passing interface.”
“MARS (Multicore Application Runtime System) is a set of libraries that provides an API to easily manage and create user programs that will be scheduled to run on various microprocessing units of a multicore environment.”
spumedia “… to provide accellerators for the cell broadband engine processor”
spexms “Simple library for creating spe accelerators for the CELL BE and Playstation 3”
(This post brought to you by I’m Putting These All In One Place So I Don’t Have To Search For Them Again. And the letter N.)
Fedora 11, PS3 & SPU programming
by jonathan on Jun.15, 2009, under general, ps3, spu, tumble
Some notes on Fedora 11 on PS3
Intro to Cell, part 6 and part 7 the excellent series from NotZed continues.
2D Polygon rendering demo(3067FPS ver.) on PS3 Linux
YUV to RGB on SPU More in-depth spu programming from NotZed.
Insomniac games Nocturnal Initiative and also #nocturnal on irc.freenode.net (and Insomniac’s R&D page is always a good read)
A new compiler and a visual guide
by jonathan on Jun.05, 2009, under general, ps3, spu
From http://t-platforms.ru/en/cell/cellcompiler.php :
T-Platforms Cell Compiler is a single source compiler that explores the power of Cell/B.E.™ multicore architecture through auto-parallelization and auto-vectorization of source code of applications written in a sequential manner and language (C/C++, Fortran).
They’re looking for beta-testers – I’ve applied.
Visual guide for SPU instructions Jon has put together a fantastic document illustrating the behaviour of most SPU instructions – very enlightening and a wonderful reference.
Mars 1.1.3 has been released. “MARS is a multi-tasking runtime system for multi-core processors. The current implementation is for the Cell BE. The system is co-processor centric, in that the co-processors each run a micro kernel that generally runs independent of the host processor.”
20090526
by jonathan on May.26, 2009, under general, spu, tumble
global namespace afuse + sshfs
a lesson is learned comic by braid artist (care of dinosaur comics)
Julie Cohen publications care of comments from a creativity and copyright presentation via @PeterBlackQUT
they write the write stuff space shuttle programming
spacewalk systems management
-funroll-loops
by jonathan on Jan.16, 2009, under spu
In general, C is a lousy language for expressing this kind of parallelism on the SPU. The original loop that ‘inspired’ this nonsense looks something like :
for (j = 0; j < num_indexes; j += 3) {
    const float *v0, *v1, *v2;
    v0 = (const float *) (vertices + indexes[j+0] * vertex_size);
    v1 = (const float *) (vertices + indexes[j+1] * vertex_size);
    v2 = (const float *) (vertices + indexes[j+2] * vertex_size);
    func(v0, v1, v2);
}
which is quite clear and straightforward to read, but with hidden complexity – the lack of quadword alignment, the way it is expressed as three separate multiply-adds, and the separation into three (unpacked) variables which are repacked inside func().

