Diamond-Square — Randomness, continuity and animation

Randomness is nice, except when you have too much.

It’s easy to blast random pixels all over the screen, but it’s much nicer when there’s some continuity in the randomness.

For diamond-square interpolation, I was looking for a way to generate (and use) random seed values to produce continuous-looking results within and between calculated tiles, as well as varying these seed values over time in a somewhat appealing fashion.

To calculate the values for a given tile requires the corner values of that tile along with the corner values of the eight surrounding tiles — that is, all the blue pixels in the picture are needed to calculate the pixels in the centre tile.

The corner values from one tile affect the calculation of the adjacent tiles.  For this reason, the corner values of each tile are state that must remain constant for the entire frame.

To render a full 1920×1080 frame, 15×9 tiles are required, with 16×10 corner values.  To calculate the tiles around the frame border, further un-rendered tile data is required to correctly seed each of these border tiles.  The result of this is that 18×12 random pixel values are needed to define the full set of tiles for a 1080p frame, which means the seeds for the tiles of a frame can be stored in 18×12×4 = 864 bytes.  Not unreasonable.

When rendering each tile, these seed values are first placed in the appropriate 16 locations (taking into account the compressed outer regions), and the interpolation takes place.  (Interpolation is performed on the seed values and, later, using only values calculated during a previous iteration — which means there’s no requirement to zero memory between tiles.)
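Roughly, the per-tile seeding bookkeeping looks something like the sketch below.  The names and the place_corner() helper are stand-ins (the real destinations depend on the compressed tile layout), but the indexing shows where the 16 values come from: a tile at grid position (tx, ty) takes a 4×4 block out of the frame’s 18×12 seed array.

#include <stdint.h>

#define SEED_W 18
#define SEED_H 12

extern uint32_t frame_seeds[SEED_H][SEED_W];           /* 864 bytes of per-frame seeds */
extern void place_corner(int cx, int cy, uint32_t v);  /* hypothetical: maps (0..3, 0..3)
                                                          into the tile's working buffer */

static void seed_tile(int tx, int ty)
{
    /* the tile's own corners plus those of its eight neighbours:
       a 4x4 block of seed values, 16 in all */
    for (int cy = 0; cy < 4; ++cy)
        for (int cx = 0; cx < 4; ++cx)
            place_corner(cx, cy, frame_seeds[ty + cy][tx + cx]);
}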

The result with just this looks quite reasonable (imho) — there’s decent interpolation of random values across the screen.  These random seed values can also be permuted without too much difficulty between frames — I’ve tried some trig-function based methods to give a sense of continuous, directional changes.  It works, but I’m sure there are better options.
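For what it’s worth, the sort of thing I mean by a trig-function based method is sketched below: a couple of out-of-phase sine terms per seed, so the whole field appears to drift in one direction over time.  The constants and the single-channel treatment are purely illustrative, not what the program actually does.

#include <math.h>
#include <stdint.h>

/* Illustrative only: perturb one seed channel with two out-of-phase sine
   terms so frame-to-frame changes are continuous and feel directional. */
static uint8_t animate_seed(uint8_t base, int x, int y, float t)
{
    float drift = 24.0f * sinf(0.35f * t + 0.80f * (float)x + 0.30f * (float)y)
                + 16.0f * sinf(0.21f * t - 0.50f * (float)y);
    int v = (int)base + (int)drift;

    if (v < 0)   v = 0;      /* keep the result in uchar range */
    if (v > 255) v = 255;
    return (uint8_t)v;
}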

As for the random perturbing of interpolated values, I’ve tried a few methods and not achieved anything that I’ve been all that happy with.  The problem I’ve had is finding/generating a suitably random value of appropriate scale, and applying it to the pixel value fast enough.

(As I reach the end of this, I have the impression that what I’ve put together is almost, but not entirely unlike, Perlin noise —  you could possibly say that it has a really clumsy interpolation function.  I have a much better understanding of Perlin noise than when I began, and it really looks like I should have tried that.  Learning is Good.)

At some stage in the not too distant future, I hope to perform some screencaps and post some actual images and maybe even a video.  And fwiw I’ll post the source code, too.

Diamond-Square — DMA lists and dirty tricks

There’s a hole in my dataset, dear MFC, dear MFC…

There’s a block of local store that holds a tile of pixels that needs to be DMAd out to main memory.  Performing 128 separate smaller transfers per tile is unwieldy, so why not try DMA lists?  The problem here is that the pixel data for each line does not match up directly with the next — between each line is stored the extra-tile information that was needed for the calculation.  When using DMA lists to transfer data to main memory, the local store source for each transfer in the list starts where the previous one ended.  This means that between each line’s pixels there are (148-128)×4 = 80 bytes of information that we don’t want to see on the screen.
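To make that concrete, a put-list for one of the “easy” tiles (the overdraw case covered below) looks roughly like the sketch that follows: one list element per line, each 148×4 = 592 bytes, with the 80 spill bytes landing just past the right edge of the tile where a later tile will overwrite them.  The names and the exact structure are assumptions for the sketch, not the actual code.

#include <spu_mfcio.h>
#include <stdint.h>

#define TILE_LINES 128
#define LS_LINE_W  148                   /* words per line in local store */
#define FB_PITCH   (1920 * 4)            /* bytes per framebuffer line    */

static mfc_list_element_t put_list[TILE_LINES] __attribute__((aligned(16)));

static void put_tile(void *ls_pixels, uint32_t fb_tile_ea, uint32_t tag)
{
    for (int i = 0; i < TILE_LINES; ++i) {
        put_list[i].notify = 0;
        put_list[i].size   = LS_LINE_W * 4;            /* 512 pixel bytes + 80 spill */
        put_list[i].eal    = fb_tile_ea + i * FB_PITCH;
    }
    /* the local store source of each element starts where the previous one
       ended -- exactly the property discussed above */
    mfc_putl(ls_pixels, 0, put_list, sizeof(put_list), tag, 0, 0);
}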

There’s a lot of things that are “best-practice” for SPU DMA, particularly relating to size and alignment of transfers, and I’m typically a religious adherent.  In this case, I did not comply with best practice and still met my time budget and only feel a small amount of shame :P

Overall, less time is spent performing DMA than is spent calculating pixels, so further optimising DMA for this particular program is unnecessary.

When transferring tiles from the SPU to the framebuffer, there are three cases of particular interest:

  1. Tiles on the right edge of the screen
  2. Tiles on the lower edge of the screen
  3. The Other Tiles
Tile data in local store

3. The Other Tiles

These are the easy ones.  They will never need to be trimmed to fit the screen edges, and, if drawn in the right order, they have the wonderful characteristic that the extra data can be DMAd to the framebuffer and will be overwritten by pixel data from a later tile.  There’s no special case needed: just draw each pixel line, and all the data between pixel lines, to the screen.

Tile data drawn to screen

(For the diagrams, the amount of overdraw is far greater than the actual tile part — this is a consequence of the small size.  It’s a much smaller percentage for 128 pixel tiles.  I’ll post some actual screengrabs here sometime…)

1. Tiles on the right edge of the screen

Overdraw isn’t the answer in this case.  It is not possible to overdraw on either the left or right of the rightmost tile in a way that will be correct when the screen is finished.  Instead, the extra information (including any portion that may not fit onto the visible screen) must be dealt with in some other way.

My solution, ugly as it is, is to write each surplus portion of a tile to a scratch location in memory — every one of them to the same location.  It works :|

2. Tiles on the lower edge of the screen

These tiles are really just like the others, except they’ll stop a few lines short.  They’re still fully calculated, but only the visible lines are transferred.

(In hindsight, increasing the spacing between lines would help reduce the alignment and size problem here.  Adding an extra 48 bytes to each tile would allow every transfer to be optimally aligned and sized.  And would probably make no measurable difference to the total runtime.  Heck, there’s probably enough time to repack the data in local store before performing the DMA. There’s not that much…)

Diamond-Square — Tiled rendering and space

I started with the assumption of 128×128 pixel diamond-square tiles — they’re a power of two in size (which makes them simpler to calculate), 512 bytes wide, 128 lines tall and 64KB in total, which means two will fit in local store.

So, questions: What points outside this area are required to calculate all the points within?  How much space is required to render one of these tiles?

(There’s a smaller example pictured, again with a different colour for each iteration of the calculation)

If the tile is to be 4×4, the centre area is 5×5, overlapping the neighbouring tiles.  Additionally, calculating a 4×4 tile requires fully calculating a 7×7 area, not counting the sparsely calculated pixels around the outside.

Scaling this up to 128×128 pixel tiles, it should be clear that at least a 131×131 pixel area is required, not counting those extras around the outside.  How to deal with those?

One thing that is clear is that they’re sparsely filled, and that calculated values only appear on certain lines.  In fact, the number of lines required around the outside of a tile is 1 + log₂ n, where n is the tile width.  For the 4×4 tile pictured, three extra lines are needed on each side of the tile.  For a tile 128 pixels across, 8 extra lines are needed.

So all the values needed to calculate a given tile can be stored in (129+8+8)×(129+8+8) = 145×145 pixels.  Rounding up to get quadword-aligned storage gives a total of 148×145×4 bytes — 85,840 bytes per tile.  171,680 bytes for two of them is a big chunk of local store, but it’s probably better to be using it for something.
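As a sanity check on those numbers, the per-tile footprint for an n-pixel tile (n a power of two) works out as in this throwaway helper; it’s only here to show the arithmetic, it’s not part of the real code.

#include <stdint.h>

/* n+1 corner-to-corner points per side, plus (1 + log2(n)) sparse lines on
   each side, with each line rounded out to whole quadwords. */
static uint32_t tile_footprint_bytes(uint32_t n)
{
    uint32_t extra = 1;                        /* 1 + log2(n)     */
    for (uint32_t w = n; w > 1; w >>= 1)
        ++extra;

    uint32_t lines = (n + 1) + 2 * extra;      /* 145 for n = 128 */
    uint32_t words = ((lines + 3) / 4) * 4;    /* 148 for n = 128 */
    return words * lines * 4;                  /* 85,840 bytes    */
}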

Based on that, it’s not particularly difficult to generate the data for a tile.  Start by placing the tile’s starting values (more on those in another post) in the corners of the squares (the blue ones in the picture), and then calculate the remaining points.  There are some special cases required for each of the sides, but it’s nothing particularly tricky, just keeping track of row/column indexes for particular iterations.

The outer points make up only a very small percentage of the total points that need to be calculated — more than 50% of the points are calculated in the last diamond and square iterations, and due to the proximity and alignment of the points accessed, their calculation can be usefully optimised.  (I might write a post about that sometime…)

Diamond-Square — Data dependency and SPU

Working out a suitable way to divide a 1080p screen into pieces that a single SPU could work with effectively for the diamond-square problem took some trial and error.  First, some fundamentals.

DMA

The SPU interacts with the framebuffer through the use of DMA operations, which have the following characteristics:

  • larger transfers are more efficient than smaller ones
  • there are alignment constraints
  • there can be a tradeoff between performing several small transfers to cope with DMA size/alignment constraints, and fetching a larger aligned area, modifying it and writing it back
  • modifying a region of memory requires both a DMA get and a put operation — extra work, complexity (buffer management) and time; a minimal sketch follows this list
  • a single SPU can have a maximum of 16 transfers in-flight at a given time, and will block if any more are issued
  • DMA lists can be used to effectively issue more than 16 transfers at a time, but there are greater restrictions on the placement of data in local store
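The get-modify-put case, in its simplest fully blocking form, looks something like the sketch below.  The buffer, chunk size and tag are placeholders; real code would double buffer and overlap the waits with computation.

#include <spu_mfcio.h>
#include <stdint.h>

static uint8_t chunk[16384] __attribute__((aligned(128)));  /* 16KB: the largest single transfer */

/* Simplest possible get-modify-put: one tag, blocking waits. */
static void get_modify_put(uint64_t ea, uint32_t tag)
{
    mfc_get(chunk, ea, sizeof(chunk), tag, 0, 0);
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();        /* block until the get completes */

    chunk[0] ^= 0xff;                 /* ...modify the region... */

    mfc_put(chunk, ea, sizeof(chunk), tag, 0, 0);
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();        /* block until the put completes */
}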

Data relationships

To calculate each point in diamond-square requires the values from four others.

When calculating the centre point of a square, the four corners are located on two different lines and the centre point is written to the centre line between those two.  (Reading from two lines, writing to one)

When calculating the centre point of a diamond, the four corners are located on three different input lines and the centre point is located on the same centre line as two of the points (Reading from three lines, writing to one of those)

For a given iteration of diamond-square, all input data has been generated in a previous iteration — that is, no point depends on another point calculated in the same iteration of the algorithm.  This gives some flexibility as to the ordering of point calculation.

First attempt — whole lines

My first thought was to optimise for DMA throughput: large transfers are much more efficient than smaller ones, and the independence of points in a single iteration means that points can be calculated on a line by line basis.  Pseudo this:

square(line_number, square_size)
    top    = dma_get(line_number)                   // line holding the top corners
    centre = dma_get(line_number + square_size/2)   // line the centre values are written to
    bottom = dma_get(line_number + square_size)     // line holding the bottom corners
    for(i = 0; i < line_length; ++i)
        calculate_square_centre_value(top, centre, bottom, i, square_size)
    dma_put(centre)

Whole lines are fetched, which is reasonably efficient (1920×4 bytes = 7680, 7680%128 = 0, so it’s a fine size for efficient DMA), and there’s plenty of scope for optimisation:

  • lines may be prefetched
  • lines may be reused between iterations: the bottom line for one row of squares is the top line of the next.  For diamonds, two lines are reused from one row to the next

I spent quite a bit of time on this approach, trying to get it working to my satisfaction, but in the end I discarded it.  The problem for me became one of complexity — the prefetches, special cases (sides, corners) and line reuse were all written and applied for the general-case implementation (i.e. parameterised by square_size, like the pseudocode above).  This made it extraordinarily cumbersome to write optimised cases for the smallest square sizes — where most of the pixels are calculated and thus where most of the runtime is spent.

In summary, I was optimising the wrong things.

There were also aspects of the implementation that I had not fully considered with this method, and failed to make progress with, particularly proper handling of the side and corner special cases, and animation.

(In hindsight, I suspect I could have simplified this approach, solved the other problems and met the time budget — but probably only knowing what I know from my next attempt.)

Second attempt — tiles — problems

Tiling was an approach I had considered and discarded early on.  Where whole lines are a clear win for DMA efficiency, tiles are much less so — the largest square tiles I could envision were 128×128 pixels, being 64KB (twice for a double buffered implementation), and the 512B line length is not a great performer.  Additionally, the limit on the number of in-flight DMA transfers makes this method seem more risky — writing 128 lines per tile requires writing lines as the calculation of each is completed and hoping that computation is slower than DMA, which is probable but not assured.

DMA lists are promising — except that diamond-square requires 2ⁿ+1 pixel squares, so it’s actually a 129×129 pixel tile (for calculation), and DMA lists still need to transfer those little extra inner parts between transfers.

On top of that, recall that diamond-square requires a large number of points from outside a tile to be calculated to be able to complete the points within a tile — if stored sparsely, 8 times as much space as for the target tile.

So I didn’t start with this idea, but after discarding the previous approach I was able to work out ways around these problems and put together a basic, functional implementation in the time I spent flying and waiting in airports on my way home from GCAP.  I’ll expand on the implementation in a later post.

Diamond-Square — What can you do in 25 cycles?

or “How I got 1080p 60Hz diamond-square plasma from a single SPU”

I am primarily writing this — and subsequent posts — as a record and exercise for myself, and in case there’s anyone who attended the TUCS talk I gave in October and is still interested in the problem and my solution.

I started with a simple (naive) equation:
3.192×10⁹ cycles / (1920×1080 pixels × 60 frames per second) = 25.66 cycles per pixel

That is, to update a 1080p screen at 60Hz, there’s an upper limit of 25 cycles per pixel.

So, what can you do in 25 cycles?

Starting with the simple and obvious: On an SPU, writing to local store takes 6 cycles — that’s a quarter of the time budget gone, right there. DMAing that pixel to the framebuffer in system memory will take hundreds of cycles. Approaching this problem one pixel at a time is clearly not the way to solve it.

Slightly less simple: A write to local store will be six cycles, and it can be pipelined to give four pixels per cycle. A single SPU can do DMA at over 16GB/sec. This should be easy — and it is if all you want to do is write stuff to memory. Throw in some computation and data dependencies and things get a lot more interesting.

So, I chose a problem that is one of my old favourites from Amiga demo days — plasma. There’s a number of ways you can achieve the effect. I chose diamond-square interpolation, which turned out to have some interesting problems to implement for SPU¹.

The Diamond-Square Algorithm

The images here show the process of interpolation used for diamond-square (in these diagrams, each interpolation iteration is displayed with a different colour, not the expected calculated colour):

  • Starting with one or more squares, average the corners of each square (with a random perturbation) to calculate its centre point
  • For the diamonds then created, average their corner points (again, with a random perturbation) to calculate their centres
  • Lather, rinse, repeat until all the gaps are filled.

Seems simple enough, but there’s some complexity in the implementation.
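For reference, the plain scalar shape of the algorithm on a (2ⁿ+1)-point grid is something like the sketch below.  The grid size, the rand_offset() perturbation and the edge handling are simplifications for illustration, not what the SPU version does (in particular, the real code fetches the missing edge values from outside the tile rather than averaging fewer points).

#include <stdint.h>
#include <stdlib.h>

#define N 129                            /* (2^7)+1 points per side */

static uint8_t grid[N][N];               /* corner seeds assumed to be placed already */

static uint8_t clamp_u8(int v) { return (uint8_t)(v < 0 ? 0 : v > 255 ? 255 : v); }

/* stand-in for the random perturbation, scaled down with the step size */
static int rand_offset(int half) { return (rand() % (2 * half + 1)) - half; }

static void diamond_square(void)
{
    for (int step = N - 1; step > 1; step /= 2) {
        int half = step / 2;

        /* square pass: the centre of each square is the rounded average of
           its four corners, plus a perturbation */
        for (int y = 0; y + step < N; y += step)
            for (int x = 0; x + step < N; x += step)
                grid[y + half][x + half] = clamp_u8(
                    (grid[y][x] + grid[y][x + step] +
                     grid[y + step][x] + grid[y + step][x + step] + 2) / 4
                    + rand_offset(half));

        /* diamond pass: each edge midpoint from its four cardinal
           neighbours at distance 'half'; points on the outer edge only
           have three neighbours in range here */
        for (int y = 0; y < N; y += half) {
            for (int x = ((y / half) % 2 == 0) ? half : 0; x < N; x += step) {
                int sum = 0, count = 0;
                if (x - half >= 0) { sum += grid[y][x - half]; ++count; }
                if (x + half < N)  { sum += grid[y][x + half]; ++count; }
                if (y - half >= 0) { sum += grid[y - half][x]; ++count; }
                if (y + half < N)  { sum += grid[y + half][x]; ++count; }
                grid[y][x] = clamp_u8(sum / count + rand_offset(half));
            }
        }
    }
}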

Data points and space requirements

For a 9×9 starting grid with 9 points defined, only the innermost 3×3 square is fully filled in. The remaining un-coloured (white) points cannot be calculated without having more starting values available (or using some other special case method to appropriately generate the missing information).

Additionally, it can be seen that generating a 3×3 completed pixel region requires the full 9×9 area to start with. Not particularly space efficient.

Data dependency

Calculating a point requires the four points around it to have been calculated, and each of those requires the four surrounding it.  This also has consequences when considering how to divide the problem when trying to calculate a large bitmap using a single SPU and its 256KB local store.

Animation

Producing the same image 60 times a second is not very interesting, so there should be some way to cause it to change over time, in some continuous fashion, where each frame relates to the last in a visually pleasing way. Doing this in practice becomes a question of where the source “random” values come from, and how they can be varied over time.  Additionally, the use and generation of source values relates to how the output space is divided.

I’ve written previously about some of the issues I encountered when writing this (fast averaging, system calls, poorly performing first frame), and I intend to write about the three areas mentioned above (space requirements, data dependency and animation).

¹ Additionally, there are better ways to create the plasma effect. If I did it all again, I’d start with Perlin noise, which appears to still have some interesting challenges but is overall better suited to the hardware.


Frame times are like pancakes

Time to generate image data, no DMA: 0.015962s

Time to generate data, with DMA to framebuffer: 0.048529s.  That’s not good.

Total amount of data transferred: one 1080p frame — ~8MB.

Measured average latency of 4×16KB DMA puts: 577µs — ~1000 times slower than expected.  Something’s really not right.

Oh.

First accesses to memory will be more expensive.

Time to generate and DMA first frame: 0.048529s

Time to generate and DMA second frame: 0.016370s

Time to generate and DMA subsequent frames: 0.015944s — slightly faster than with no DMA.
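That first-frame cost could probably be hidden by touching the whole framebuffer once before the frame loop starts.  Something along these lines, where fb_ea and fb_size are whatever the framebuffer setup reports (a sketch of the idea, not something the program currently does):

#include <spu_mfcio.h>
#include <stdint.h>

static uint8_t junk[16384] __attribute__((aligned(128)));

/* Touch every part of the framebuffer with a throwaway put so the cost of
   first access is paid before the first real frame is generated. */
static void warm_framebuffer(uint64_t fb_ea, uint32_t fb_size, uint32_t tag)
{
    for (uint32_t off = 0; off < fb_size; off += sizeof(junk)) {
        mfc_put(junk, fb_ea + off, sizeof(junk), tag, 0, 0);
        mfc_write_tag_mask(1 << tag);
        mfc_read_tag_status_all();
    }
}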

There’s still plenty of work to be done before it’s functional and interesting, but I may yet hit my goal of 60 fps for this project :)

Averaging more unsigned chars

Some further thoughts, continuing on from my previous average post

Alternate methods

sumb (sum bytes into halfwords) was an instruction I had overlooked; it was pointed out to me by ralferoo.  sumb calculates the sums of the four bytes of each word, two quadwords at a time. An add, rotate and shuffle would be all that would be needed to turn the result from two sumb calls into the desired averages.

Unfortunately, the input format I’m using isn’t well suited to sumb and it would appear to require a prohibitive number of shuffles to prepare the data appropriately – that said, there’s at least one place where I shuffle data before calling average4() that may be able to utilise sumb, so I intend to keep it in mind.

The avgb instruction calculates the average of the bytes in two quadwords.  It would be nice to be able to call avgb(avgb(a,b),avgb(c,d)) and to have the final result in 9 cycles, but there’s a fix-up necessary to correct the rounding that takes place in the calculation of the first two averages, and I’ve not yet been able to wrap my head around the correct method to do so.
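To make the rounding problem concrete: avgb computes (a+b+1)>>1 per byte, so for the inputs 0, 1, 0, 0 the true rounded average is 0, but avgb(avgb(0,1), avgb(0,0)) = avgb(1, 0) = 1.  Each inner average rounds up before the errors have a chance to cancel, which can leave the nested result one higher than the correct value.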

Approximating

There are plenty of ways to very quickly get a result that is often — but not always — correct (like avgb).  One of these methods may be suitable for my particular needs, but I won’t know until later.  My goal for now is to attain a result that is as correct as possible and consider ways of speeding it up later, if needed.

Adding

I’m annoyed with myself that I missed this one, as I’ve seen it several times recently: rounding can be performed correctly with an addition and truncation. Where I had

    // add up the lower bits
    qword L = si_a(si_a(si_andbi(a,3),si_andbi(b,3)),
                   si_a(si_andbi(c,3),si_andbi(d,3)));

    // shift right 2 bits, again masking out shifted-in high bits
    R = si_a(R, si_andbi(si_rotqmbii(L,-2), 3));

    // shift right and mask for the rounding bit
    R = si_a(R, si_andbi(si_rotqmbii(L,-1), 1));

adding 2 to each byte of the lower-bit sum before truncating with rotqmbii means that the last line can be eliminated altogether, so the whole function now looks like:

qword average4(qword a, qword b, qword c, qword d) {
    // shift each right by 2 bits, masking shifted-in bits from the result
    qword au = si_andbi(si_rotqmbii(a, -2), 0x3f);
    qword bu = si_andbi(si_rotqmbii(b, -2), 0x3f);
    qword cu = si_andbi(si_rotqmbii(c, -2), 0x3f);
    qword du = si_andbi(si_rotqmbii(d, -2), 0x3f);

    // add them all up
    qword R = si_a(si_a(au,bu), si_a(cu,du));

    // add up the lower bits
    qword L = si_a(si_a(si_andbi(a,3),si_andbi(b,3)),
                   si_a(si_andbi(c,3),si_andbi(d,3)));

    // add 2
    L = si_a(L, si_ilh(0x202));

    // shift right 2 bits, again masking out shifted-in high bits
    R = si_a(R, si_andbi(si_rotqmbii(L,-2), 3));

    return R;
}

The difference is pretty minor — a couple of instructions and (when not inlined) it’s no faster.  For the program it’s used in I’m seeing around a 1.5% runtime reduction.

ioctl() and the spu — a tale of many magics

I’d achieved a working implementation of the small spu project I’ve been messing around with and wanted to move on to start pushing particular pixels (rather than debugging in text mode). My program, up to this point, has been a simple spu-only piece of code (launched via the elfspe loader under Linux), and I was reluctant to tack on some kind of ppu manager just to set up the framebuffer — so, I thought, why not do it from the spu itself? (Why not? Because it’s a stupid idea and so on.)

What’s needed to “set up the framebuffer”?  Magic.  Fortunately, the magic has been well documented by Mike Acton and you can get the source here. The trick becomes merely to get those files compiled for the spu, and the biggest barrier to doing that is the ioctl() system call.

There was no existing way to call ioctl() directly from the spu and several possible approaches were apparent.

Callback handler in libspe

libspe provides a number of callback functions for spu programs, allowing them some degree of access to and control of the wider system.  These are built on the __send_to_ppe() function found in newlib. __send_to_ppe() packs up a function’s arguments and stops the spu, signaling to libspe which of several handlers should be utilised.

My first partially-working attempt to access ioctl() used this approach, and once I’d worked out how to debug my way into libspe and back to the spu the solution was quite straightforward.  Particularly, libspe mmaps the spu’s local store, making it possible to pass a pointer to the spu’s local store directly to ioctl().

The third argument to ioctl() can be a pointer or a value and so some decoding of the ‘request’ argument would be required to handle particular cases correctly. While that’s probably not too difficult (probably — I’m not clear on the details), it’s beyond what I was interested in pursuing. For syscalls, though, there is a ‘better’ way…

Stopcode 0x2104

Alongside __send_to_ppe(), newlib provides a similar __linux_syscall() function.  I couldn’t make much sense of it when I first looked at it, as the 0x2104 stopcode it uses is not intercepted by libspe. Instead, it is magic and is intercepted directly by the kernel for the express purpose of handling syscalls. Hooray?

Almost. If the third argument to ioctl() is a pointer, it needs to contain a valid ea. Unfortunately, there appears to be no sane way for a spu to mmap its own local store, meaning an lsa cannot be converted to a valid ea without some external assistance, and so we must consider other options…

__ea

Named address spaces are an extension to the C language that provide a way to access eas from a spu in a somewhat transparent fashion.  On the spu, these work by qualifying pointers with __ea to indicate that they contain an ea, and the compiler handles the intervening magic (gcc-4.5 and the various versions of gcc and xlc that have shipped with the IBM SDK have support for this extension).

While this method might work, I didn’t want to give up a chunk of my local store for the software cache that the implementation in gcc uses — I have a hunch that I’ll almost run out of space when I add some extra DMA buffers to this program. (And I’m not sure what’s needed to get __ea storage working correctly without a separate ppu context.)

A more convoluted method

In the absence of a neater approach, here’s how I got it working:

  1. mmap some scratch space in main memory (using mmap_eaddr() from newlib)
  2. call ioctl() with the address of that scratch space
  3. DMA the data from the scratch space to local store

Something like:

// 1. mmap some scratch space in main memory that the kernel can write to
void* scratch = (void*)(uint32_t)mmap_eaddr(0ULL, size, PROT_READ|PROT_WRITE,
                                            MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
// 2. have ioctl() fill the scratch space
ioctl(fb_fd, FBIOGET_VBLANK, scratch);
// 3. DMA the result from the scratch space into local store
struct fb_vblank vblank;
spu_mfcdma32(&vblank, (uint32_t)scratch, ROUNDUP16(sizeof(vblank)),
             0, MFC_GET_CMD);
mfc_write_tag_mask(1);                  // wait on tag 0
spu_mfcstat(MFC_TAG_UPDATE_ALL);

And so, ugly as it is, I can now call ioctl() from spu code.  Lather, rinse & repeat as desired.

Of course, there were a few other problems to solve: because I was compiling cp_{fb,vt}.c for the spu but using the (ppu) system headers to get the fb defns, the compiler sees the prototypes for the system mmap, munmap, printf and snprintf functions. Extra wrappers are needed for each of those (and so I now know about the v*printf family).
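The wrappers themselves aren’t very interesting: the varargs ones boil down to forwarding the arguments through the v* form, which is how the v*printf family came up.  Something like this hypothetical shape, give or take:

#include <stdarg.h>
#include <stdio.h>

/* Hypothetical wrapper shape: accept the varargs call made by the fb code
   and forward it through the v* variant. */
int wrapped_printf(const char *fmt, ...)
{
    va_list args;
    int written;

    va_start(args, fmt);
    written = vprintf(fmt, args);
    va_end(args);
    return written;
}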

Once all that is done, it works. I can push pixels to the screen from the spu.

Flipping and vsync are an interesting, and slightly awkward, case: they are passed the address of a value, not the value itself (for no readily apparent reason), so the values are allocated early on and referenced as needed.

The implementation can be found here. It’s functional. :)  There are many opportunities to tidy it up and simplify it. Right now it’s a fairly direct and naive port from the original.

My thanks to Mike Acton for the original setup code, and to Ken Werner for helping me get my head around the problem and its possible solutions.

Averaging unsigned chars

I’ve been toying with implementing the diamond-square algorithm (for a colourful plasma effect) on the spu. The value of each pixel is calculated by averaging values of the pixels around it.

Originally, I toyed with calculating the average by unpacking the pixel uchars, converting to float, performing the calculations, converting back and repacking. This seemed to work and made the calculation nice and simple. (And the scaling factor provided by cuflt/cfltu meant I could avoid some other computation when randomly-perturbing the result, which was nice). Regardless, I did wonder what it would take to process the uchars directly.

To begin with, it is clear that (a + b + c + d)/4 = a/4 + b/4 + c/4 + d/4.  Division by a power of two can be quickly approximated with a shift, and doing so will prevent overflow in the case that (a + b + c + d) > 255, so we get

R = (a>>2) + (b>>2) + (c>>2) + (d>>2);

but R will be as much as 3 less than the actual average.  To fix this up, the average of the lower two bits can be calculated and added back to the result.

L = (a&3) + (b&3) + (c&3) + (d&3); // sum the lower two bits of each point
R += (L>>2);   // divide by four and add to the result
R += (L>>1)&1; // round up based on the sum of lower bits

R should be an accurate average of the four uchars, without overflow.

Because there’s no overflow, this will work with sixteen uchars at a time – with a little care to deal with the lack of certain byte-centric spu instructions.

qword average4(qword a, qword b, qword c, qword d) {
    // shift each right by 2 bits, masking shifted-in bits from the result
    qword au = si_andbi(si_rotqmbii(a, -2), 0x3f);
    qword bu = si_andbi(si_rotqmbii(b, -2), 0x3f);
    qword cu = si_andbi(si_rotqmbii(c, -2), 0x3f);
    qword du = si_andbi(si_rotqmbii(d, -2), 0x3f);

    // add them all up
    qword R = si_a(si_a(au,bu), si_a(cu,du));

    // add up the lower bits
    qword L = si_a(si_a(si_andbi(a,3),si_andbi(b,3)),
                   si_a(si_andbi(c,3),si_andbi(d,3)));

    // shift right 2 bits, again masking out shifted-in high bits
    R = si_a(R, si_andbi(si_rotqmbii(L,-2), 3));

    // shift right and mask for the rounding bit
    R = si_a(R, si_andbi(si_rotqmbii(L,-1), 1));
    return R;
}

(24 instructions – 20ish cycles)

While it is fast enough for my present use, I’m sure it’s ripe for improvement.
