{1,2,3,4}

(This is wonderfully obtuse, but amused me :)

Neil Henning (@sheredom) asked:

SPU gurus of twitter unite, want a vector unsigned int with {1, 2, 3, 4} in each slot, without putting it in as elf constant, any ideas?

Interesting question. The SPU ISA generally doesn’t help build vectors with different values in each slot. In this case, only very small values are required in each slot, so it can be done with a neat little trick.

My answer:

    fsmbi r4, 0x7310  # r4 = {0x00ffffff, 0x0000ffff, 0x000000ff, 0x00000000}
    clz r5, r4        # r5 = {8,16,24,32}
    rotmi r6, r5, -3  # r6 = {1,2,3,4}

Instructions are:

  • fsmbi — form select mask byte immediate. Creates a 128 bit mask from a 16 bit value, expanding each bit of input to 8 bits of output.
  • clz — count leading zeroes. Counts the number of leading zeros in each word.
  • rotmi — rotate and mask word immediate (logical shift right by negative immediate). Shifts each word right by the negation of the number of bits specified.

This solution is entirely self-contained, requiring no pre-set state (unlike my first attempt, which utilised the cbd instruction). In terms of raw instruction size, it’s a whole eight bytes smaller than storing the vector in memory and loading it when needed (that being 16+4 bytes), and a little slower than using a load instruction.
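
The same sequence, expressed with the si-style C intrinsics (a minimal sketch, assuming spu_internals.h – the compiler is of course free to schedule it differently):

#include <spu_internals.h>

qword make_1234(void) {
    qword m = si_fsmbi(0x7310); // {0x00ffffff, 0x0000ffff, 0x000000ff, 0x00000000}
    qword z = si_clz(m);        // {8, 16, 24, 32}
    return si_rotmi(z, -3);     // {1, 2, 3, 4}
}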

(On a cursory re-examination of the SPU ISA, fsmbi is the only instruction that will construct a different value in each word of a register. A specific pattern may be generated with cbd/cbx that can be used for this problem, but it depends on the contents of another register which limits its already limited usefulness. Combining fsmbi with other immediate instructions allows for a wide range of values to be constructed independent of register state and without access to storage)

Frame times are like pancakes

Time to generate image data, no DMA: 0.015962s

Time to generate data, with DMA to framebuffer: 0.048529s.  That’s not good.

Total amount of data transferred: one 1080p frame — ~8MB.

Measured average latency of 4*16KB DMA puts: 577µs — ~1000 times slower than expected.  Something’s really not right.

Oh.

First accesses to memory will be more expensive (pages are faulted in on first touch).

Time to generate and DMA first frame: 0.048529s

Time to generate and DMA second frame: 0.016370s

Time to generate and DMA subsequent frames: 0.015944s — slightly faster than with no DMA.

There’s still plenty of work to be done before it’s functional and interesting, but I may yet hit my goal of 60 fps for this project :)

Averaging more unsigned chars

Some further thoughts, continuing on from my previous average post.

Alternate methods

sumb (sum bytes in halfwords) was an instruction I had overlooked, and was pointed out to me by ralferoo.  sumb calculates the sums of the four bytes of each word in two quadwords at a time. An add, rotate and shuffle would be all that would be needed to turn the result from two sumb calls into the desired averages.

Unfortunately, the input format I’m using isn’t well suited to sumb and it would appear to require a prohibitive number of shuffles to prepare the data appropriately – that said, there’s at least one place where I shuffle data before calling average4() that may be able to utilise sumb, so I intend to keep it in mind.

The avgb instruction calculates per-byte averages of two quadwords, rounding up.  It would be nice to be able to call avgb(avgb(a,b),avgb(c,d)) and to have the final result in 9 cycles, but there’s a fix-up necessary to correct the rounding that takes place in the calculation of the first two averages, and I’ve not yet been able to wrap my head around the correct method to do so.

Approximating

There are plenty of ways to very quickly get a result that is often — but not always — correct (like avgb).  One of these methods may be suitable for my particular needs, but I won’t know until later.  My goal for now is to attain a result that is as correct as possible and consider ways of speeding it up later, if needed.

Adding

I’m annoyed with myself that I missed this one, as I’ve seen it several times recently: rounding can be performed correctly with an addition and truncation. Where I had

    // add up the lower bits
    qword L = si_a(si_a(si_andbi(a,3),si_andbi(b,3)),
                   si_a(si_andbi(c,3),si_andbi(d,3)));

    // shift right 2 bits, again masking out shifted-in high bits
    R = si_a(R, si_andbi(si_rotqmbii(L,-2), 3));

    // shift right and mask for the rounding bit
    R = si_a(R, si_andbi(si_rotqmbii(L,-1), 1));

Adding 2 to each uchar of the low-bit sum before truncating with rotqmbii means that the last line can be eliminated altogether, so the whole function now looks like:

qword average4(qword a, qword b, qword c, qword d) {
    // shift each right by 2 bits, masking shifted-in bits from the result
    qword au = si_andbi(si_rotqmbii(a, -2), 0x3f);
    qword bu = si_andbi(si_rotqmbii(b, -2), 0x3f);
    qword cu = si_andbi(si_rotqmbii(c, -2), 0x3f);
    qword du = si_andbi(si_rotqmbii(d, -2), 0x3f);

    // add them all up
    qword R = si_a(si_a(au,bu), si_a(cu,du));

    // add up the lower bits
    qword L = si_a(si_a(si_andbi(a,3),si_andbi(b,3)),
                   si_a(si_andbi(c,3),si_andbi(d,3)));

    // add 2
    L = si_a(L, si_ilh(0x202));

    // shift right 2 bits, again masking out shifted-in high bits
    R = si_a(R, si_andbi(si_rotqmbii(L,-2), 3));

    return R;
}

The difference is pretty minor — a couple of instructions and (when not inlined) it’s no faster.  For the program it’s used in I’m seeing around a 1.5% runtime reduction.
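
For the record, the identity being relied on here is that (s>>2) + ((s>>1)&1) equals (s+2)>>2 for every possible low-bit sum s (0 through 12). A throwaway scalar check of that claim (plain C, nothing SPU-specific):

#include <assert.h>

int main(void) {
    // s = sum of the low two bits of a, b, c and d, so 0 <= s <= 12
    for (int s = 0; s <= 12; ++s)
        assert(((s >> 2) + ((s >> 1) & 1)) == ((s + 2) >> 2));
    return 0;
}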

ioctl() and the spu — a tale of many magics

I’d achieved a working implementation of the small spu project I’ve been messing around with and wanted to move on to start pushing particular pixels (rather than debugging in text mode). My program, up to this point, has been a simple spu-only piece of code (launched via the elfspe loader under Linux), and I was reluctant to tack on some kind of ppu manager just to set up the framebuffer — so, I thought, why not do it from the spu itself? (Why not? Because it’s a stupid idea and so on.)

What’s needed to “set up the framebuffer”?  Magic.  Fortunately, the magic has been well documented by Mike Acton and you can get the source here. The trick becomes merely to get those files compiled for the spu, and the biggest barrier to doing that is the ioctl() system call.

There was no existing way to call ioctl() directly from the spu and several possible approaches were apparent.

Callback handler in libspe

libspe provides a number of callback functions for spu programs, allowing them some degree of access to and control of the wider system.  These are built on the __send_to_ppe() function found in newlib. __send_to_ppe() packs up a function’s arguments and stops the spu, signaling to libspe which of several handlers should be utilised.

My first partially-working attempt to access ioctl() used this approach, and once I’d worked out how to debug my way into libspe and back to the spu the solution was quite straightforward.  In particular, libspe mmaps the spu’s local store, making it possible to pass a pointer to the spu’s local store directly to ioctl().

The third argument to ioctl() can be a pointer or a value and so some decoding of the ‘request’ argument would be required to handle particular cases correctly. While that’s probably not too difficult (probably — I’m not clear on the details), it’s beyond what I was interested in pursuing. For syscalls, though, there is a ‘better’ way…

Stopcode 0x2104

Alongside __send_to_ppe(), newlib provides a similar __linux_syscall() function.  I couldn’t make much sense of it when I first looked, as the 0x2104 stopcode it uses is not intercepted by libspe. Instead, it is magic and is intercepted directly by the kernel for the express purpose of handling syscalls. Hooray?

Almost. If the third argument to ioctl() is a pointer, it needs to contain a valid ea. Unfortunately, there appears to be no sane way for a spu to mmap its own local store, meaning an lsa cannot be converted to a valid ea without some external assistance, and so we must consider other options…

__ea

Named address spaces are an extension to the C language that provide a way to access eas from a spu in a somewhat transparent fashion.  On the spu, these work by qualifying pointers with __ea to indicate that they contain an ea, and the compiler handles the intervening magic (gcc-4.5 and the various versions of gcc and xlc that have shipped with the IBM SDK have support for this extension).
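
A minimal usage sketch (my own example, not from this program; assumes gcc’s __ea support with the default 32-bit effective addresses):

void poke_main_memory(unsigned int ea) {
    // the qualifier marks the pointer as holding an effective address; the
    // compiler generates the software-cache accesses behind the scenes
    __ea unsigned int* p = (__ea unsigned int*)ea;
    *p = 0xdeadbeef;
}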

While this method might work, I didn’t want to give up a chunk of my local store for the software cache that the implementation in gcc uses — I have a hunch that I’ll be close to running out of space once I add some extra DMA buffers to this program. (And I’m not sure what’s needed to get __ea storage working correctly without a separate ppu context.)

A more convoluted method

In the absence of a neater approach, here’s how I got it working:

  1. mmap some scratch space in main memory (using mmap_eaddr() from newlib)
  2. call ioctl() with the address of that scratch space
  3. DMA the data from the scratch space to local store

Something like:

// 1. mmap some anonymous scratch space in main memory (via newlib's mmap_eaddr)
void* scratch = (void*)(uint32_t)mmap_eaddr(0ULL, size, PROT_READ|PROT_WRITE,
                                            MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);

// 2. call ioctl() with the (main memory) address of that scratch space
ioctl(fb_fd, FBIOGET_VBLANK, scratch);

// 3. DMA the result from the scratch space into local store, and wait for it
struct fb_vblank vblank;
spu_mfcdma32(&vblank, (uint32_t)scratch, ROUNDUP16(sizeof(vblank)),
             0, MFC_GET_CMD);
mfc_write_tag_mask(1);
spu_mfcstat(MFC_TAG_UPDATE_ALL);

And so, ugly as it is, I can now call ioctl() from spu code.  Lather, rinse & repeat as desired.

Of course, there were a few other problems to solve: because I was compiling cp_{fb,vt}.c for the spu but using the (ppu) system headers to get the fb defns, the compiler sees the prototypes for the system mmap, munmap, printf and snprintf functions. Extra wrappers are needed for each of those (and so I now know about the v*printf family).
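
For illustration, a wrapper of the kind I mean might look something like this for mmap (hypothetical, not the code I actually used; it assumes newlib’s mmap_eaddr prototype, and the printf/snprintf wrappers are where the v*printf family comes in):

#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>

// newlib's spu mmap_eaddr (assumed prototype)
extern unsigned long long mmap_eaddr(unsigned long long start, size_t length,
                                     int prot, int flags, int fd, off_t offset);

void* mmap(void* addr, size_t length, int prot, int flags, int fd, off_t offset) {
    // forward to mmap_eaddr, truncating the 64-bit ea as elsewhere
    return (void*)(uint32_t)mmap_eaddr((uintptr_t)addr, length,
                                       prot, flags, fd, offset);
}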

Once all that is done, it works. I can push pixels to the screen from the spu.

Flipping and vsync are interesting, and slightly awkward, cases as they are passed the address of a value, not the value itself (for no readily apparent reason), so the values are allocated early on and referenced as needed.

The implementation can be found here. It’s functional. :)  There are many opportunities to tidy it up and simplify it. Right now it’s a fairly direct and naive port from the original.

My thanks to Mike Acton for the original setup code, and to Ken Werner for helping me get my head around the problem and its possible solutions.

Averaging unsigned chars

I’ve been toying with implementing the diamond-square algorithm (for a colourful plasma effect) on the spu. The value of each pixel is calculated by averaging values of the pixels around it.

Originally, I toyed with calculating the average by unpacking the pixel uchars, converting to float, performing the calculations, converting back and repacking. This seemed to work and made the calculation nice and simple. (And the scaling factor provided by cuflt/cfltu meant I could avoid some other computation when randomly-perturbing the result, which was nice). Regardless, I did wonder what it would take to process the uchars directly.

To begin with, it is clear that (a + b + c + d)/4 = (a/4 + b/4 + c/4 + d/4). Division by a power of two can be quickly approximated with a shift, and doing so will prevent overflow in the case that (a + b + c + d)>255, so we get

R = (a>>2) + (b>>2) + (c>>2) + (d>>2);

but R will be as much as 3 less than the actual average.  To fix this up, the average of the lower two bits can be calculated and added back to the result.

L = (a&3) + (b&3) + (c&3) + (d&3); // sum the lower two bits of each point
R += (L>>2);   // divide by four and add to the result
R += (L>>1)&1; // round up based on the sum of lower bits

R should be an accurate average of the four uchars, without overflow.
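
Putting those pieces together, the scalar version is something like this (a plain C restatement of the above, just for reference):

unsigned char average4_scalar(unsigned char a, unsigned char b,
                              unsigned char c, unsigned char d) {
    unsigned char R = (a >> 2) + (b >> 2) + (c >> 2) + (d >> 2);
    unsigned char L = (a & 3) + (b & 3) + (c & 3) + (d & 3);  // 0..12
    R += L >> 2;        // whole part of the low-bit contribution
    R += (L >> 1) & 1;  // round to nearest
    return R;
}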

Because there’s no overflow, this will work with sixteen uchars at a time – with a little care to deal with the lack of certain byte-centric spu instructions.

qword average4(qword a, qword b, qword c, qword d) {
    // shift each right by 2 bits, masking shifted-in bits from the result
    qword au = si_andbi(si_rotqmbii(a, -2), 0x3f);
    qword bu = si_andbi(si_rotqmbii(b, -2), 0x3f);
    qword cu = si_andbi(si_rotqmbii(c, -2), 0x3f);
    qword du = si_andbi(si_rotqmbii(d, -2), 0x3f);

    // add them all up
    qword R = si_a(si_a(au,bu), si_a(cu,du));

    // add up the lower bits
    qword L = si_a(si_a(si_andbi(a,3),si_andbi(b,3)),
                   si_a(si_andbi(c,3),si_andbi(d,3)));

    // shift right 2 bits, again masking out shifted-in high bits
    R = si_a(R, si_andbi(si_rotqmbii(L,-2), 3));

    // shift right and mask for the rounding bit
    R = si_a(R, si_andbi(si_rotqmbii(L,-1), 1));
    return R;
}

(24 instructions – 20ish cycles)

While it is fast enough for my present use, I’m sure it’s ripe for improvement.

spu_µopt[0] // shifting and arrays

There are a couple of mostly-obvious SPU micro-optimisation tricks that I’ve learned. Here’s one.

qword a[128];
qword f(int i) {
    return a[i>>4];
}

Simple enough code. spu-gcc -O3 produces

f:
    ila     $2,a
    andi    $3,$3,-16
    lqx     $3,$2,$3
    bi      $lr

And it’s worse if the shift is not 4 – here’s the result of using 5:

f:
    rotmai  $4,$3,-5
    ila     $2,a
    shli    $3,$4,4
    lqx     $3,$2,$3
    bi      $lr

Which is pretty ugly.

The compiler is doing what it has been asked to do, but doesn’t appear to know that SPU load and store instructions force the four rightmost bits of the calculated address to zero.  Which means the andi masking can be removed from the first case, and in the second the shift right and shift left instructions could be replaced with a single shift right – without changing the result.

How to avoid it?  I’ve used manual pointer arithmetic to get around this compiler feature, e.g.

return *(qword*)((uint)a + i);

Which is pretty ugly.

I don’t know what that sort of thing does for the compiler’s alias analysis. Probably nothing good (judicious use of restrict might help…).  si_lqx() would also work, but my  experience – without in-depth examination – is that measured performance gets worse when using load and store intrinsics. I’m not sure if it’s something to do with scheduling, aliasing or programmer competence.
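
For reference, the si_lqx() form of the same lookup is something like this (my sketch, assuming the array a from above; the hardware again drops the low four bits of the computed address):

qword f_lqx(int i) {
    return si_lqx(si_from_ptr(a), si_from_int(i));
}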

When would this ever matter? I’ve run into this in some very hot lookup table access. Two or four fewer cycles can make a big difference. Sometimes. (If it really matters, rewriting the whole thing in assembly can work, too.)

It’s a micro-optimisation – measure often, use with care.

Building gdb for Cell/B.E.

Trying to debug a bus error on my PS3, I realised that the version of GDB I have installed doesn’t support debugging of SPU programs. There doesn’t seem to be a Debian packaged version available that does, so I built my own.

Because I found no obvious Google result, I share this with the zero other people that I expect may one day be interested: the key option for configure appears to be

--enable-targets=spu

This information was brought to you via the gdb.spec file, and a post to the gcc-testresults mailing list.

And now I know – adventures in double precision

Refining the buddhabrot renderer, I’ve added vectorisation to iterate two points at once, which gives (at least) twice the performance. Huzzah.

To begin with, I lifted code from one of the later revisions of Jeremy’s Mandelbrot renderer. This was written for single precision float, whereas I’ve been working in double precision for this buddhabrot code.  Worth noting on the change from single to double precision –

  • Double precision numbers behave differently to single precision on the SPU (see section 9 of the SPU ISA doc) – I was bitten by infs and NaNs.
  • When browsing that document, I missed the large “Optional v1.2” for instructions like dfcgt. To be clear, the Cell BE SPU does not support this instruction.
  • GCC does include vec_ullong2 spu_cmpgt(vec_double2, vec_double2), but in the absence of dfcgt it takes forty extra instructions to achieve the same result (yeah, that’s what I get for using general intrinsics).

When starting to use double precision, I was expecting much lower performance than single precision on the SPU, but I had not fully understood how much lower – from the Programming Handbook, page 71:

Although double-precision instructions have 13-clock-cycle latencies, on the Cell/B.E. processor, only the final seven cycles are pipelined. No other instructions are dual-issued with double-precision instructions, and no instructions of any kind are issued for six cycles after a double-precision instruction is issued.

Ouch.  I knew this, but I didn’t know it – a run of spu_timing on the generated assembly really rammed it home.

0  0123456789012                                      dfs  $75,$45,$44
0   ------7890123456789                               dfma $46,$59,$47
0          ------4567890123456                        dfa  $43,$45,$44
0                 ------1234567890123                 dfa  $42,$80,$75
0                        ------8901234567890          dfm  $32,$46,$46
0                               ------5678901234567   frds $40,$43
0  01234                               ------23456789 dfm  $33,$42,$42
0  012345678901                               ------9 dfm  $36,$42,$81

(Oh, and I’ve noticed again that dfma and friends use RT as an operand, which presumably makes register scheduling even more fun. The above fragment is from a heavily unrolled inner loop.)

At some point, I’ll try to measure the practical difference between double and single precision for this program, to see what (if anything) would be lost by switching over to single precision. Or perhaps there’s some other way around the problem – I’ve been considering fixed point or even multi-single precision fp alternatives.

CellBE Buddhabrot renderer

For my next TUCS tech talk I’ll be continuing on from the Mandelbrot rendering in the last one (which can be seen here) to something a little more complex.

The Buddhabrot is conceptually not any more complex than the Mandelbrot in terms of its generation – rather than colouring points based on the number of iterations before they ‘escape’, we apply colour to each point reached while iterating escaping starting points.  This has consequences for the drawing of the Buddhabrot – rather than generating one point at a time independently of all other points in the output, iterating a single input point may affect thousands of different output points.  This makes things trickier when implementing it on the Cell BE – parallel writes by SPEs to shared locations will need some form of synchronisation.  That could be messy, and the load/modify/store process, when expressed in terms of SPU DMA, can be quite clumsy.

Rather than try to implement a complex locking/synchronisation system, I have tried to apply some ideas from a set of post-it notes by Mike Acton (you can see them here).  This isn’t identical to Mike’s solution, because it’s not the same problem.

To explain – each SPE thread iterates various points on the screen, and generates a list of output points to be written.  This list is sent via DMA to a per-SPE buffer in main memory for the PPE, which proceeds through the list plotting the points to the framebuffer.  The advantage of this approach is that there is only one writer to the framebuffer (the PPE), and that each SPE has its own buffers to write its data into.  The only synchronisation necessary is between each SPE and the PPE, to ensure that all data in a buffer has been consumed before more is written into it.

That synchronisation is achieved through the use of interrupt mailboxes (the SPE tells the PPE that there is data), a fenced DMA to act as a sentinel (the PPE spins on the arrival of the sentinel data to ensure that DMA of a buffer has completed – this doesn’t feel like the right way to solve this particular problem, though), and the SPE signal register in OR mode to inform the SPE that a particular buffer has been finished with.  Interrupt mailbox events are aggregated through libspe2’s spe_event_*() functions.
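
To make the flow a little more concrete, the SPE side of that scheme looks roughly like this (a sketch only: the function name, buffer addresses, sizes, the single tag and the sentinel value are my assumptions, not lifted from the actual code):

#include <stdint.h>
#include <spu_mfcio.h>

static uint32_t sentinel __attribute__((aligned(16))) = 1;

void send_buffer(void* points_ls, uint32_t buffer_ea, uint32_t size,
                 uint32_t sentinel_ea, uint32_t buffer_index) {
    const unsigned int tag = 0;

    // DMA the list of plotted points out to this SPE's buffer in main memory
    spu_mfcdma32(points_ls, buffer_ea, size, tag, MFC_PUT_CMD);

    // fenced put of the sentinel word: ordered after the data put in the same
    // tag group, so its arrival tells the PPE the buffer contents are complete
    spu_mfcdma32(&sentinel, sentinel_ea, sizeof(sentinel), tag, MFC_PUTF_CMD);

    // interrupt mailbox: tell the PPE which buffer is ready for plotting
    spu_write_out_intr_mbox(buffer_index);

    // before this buffer is reused, wait for the PPE to raise our signal
    // (signal notification register 1, configured in OR mode)
    spu_read_signal1();
}

On the PPE side, the corresponding steps are to wait on the mailbox event, spin until the sentinel arrives, consume the buffer, and then write to that SPE’s signal notification register.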

It’s not an especially complex piece of code – I wrote it for my own interest and to use for the tech talk. I think it will do nicely for explaining some of the complexities and curiosities of the Cell BE architecture, and the programming of it with the IBM SDK.

There are a few extra features that I’d like to add – particularly better colouring (including pixel value saturation, the lack of which is unfortunately apparent), and a number of optimisations to the render_fractal() function that I need to lift from my earlier Mandelbrot efforts.

The program includes code by Jeremy Kerr (See hackfest items at http://ozlabs.org/~jk/diary/tech/cell/) and Mike Acton (framebuffer utilities, from http://cellperformance.beyond3d.com/articles/2007/03/handy-ps3-linux-framebuffer-utilities.html).  My thanks to Jeremy and Mike, and to all those that have offered comments & feedback via twitter.

[edit: Oh, and it includes cheriff’s fine VNC code ;)]

Read the code: fractal.c and spe-fractal.c, or grab a tarball.  Comments & suggestions most welcome.

Addition: I added pixel value saturation and experimented with some alternative approaches to colouring…

Some recent SPU toolchain patches

A couple of patches that I’ve noticed on various mailing lists –

libspe – Some small changes, but includes spe_image_open_library() to load an spe image from a ppe shared library.

binutils – “Also, if DLL’s were supported on SPU…”  Interesting idea – will have to wait to see what comes of it.

gcc – Support non-constants as the second argument of __builtin_expect.  Interesting idea.

(Also, Revital Eres’s function partitioning patch has had some activity, and there’s the odd patch from Alan Modra on overlays and software icache.)