To work out a suitable way to divide a 1080p screen into pieces that a SPU could work effectively with for the diamond-square problem took some trial and error. First, some fundamentals.

### DMA

The SPU interacts with the framebuffer through the use of DMA operations, which have the following characteristics:

- larger transfers are more efficient than smaller ones
- there are alignment constraints
- there can be a tradeoff between performing several small transfers to cope with DMA size/alignment constraints, and fetching a larger aligned area, modifying it and writing it back
- modifying a region of memory requires a DMA get and put operation — extra work, complexity (buffer management and time)
- a single SPU can have a maximum of 16 transfers in-flight at a given time, and will block if any more are issued
- DMA lists can be used to effectively issue more than 16 transfers at a time, but there are greater restrictions on the placement of data in local store

### Data relationships

To calculate each point in diamond-square required the values from four others.

When calculating the centre point of a square, the four corners are located on two different lines and the centre point is written to the centre line between those two. (Reading from two lines, writing to one)

When calculating the centre point of a diamond, the four corners are located on three different input lines and the centre point is located on the same centre line as two of the points (Reading from three lines, writing to one of those)

For a given iteration of diamond-square, all input data has been generated in a previous iteration — that is, no point depends on another point calculated in the same iteration of the algorithm. This gives some flexibility as to the ordering of point calculation.

## First attempt — whole lines

My first thought was to optimise for DMA throughput: large transfers are much more efficient than smaller ones, and the independence of points in a single iteration means that points can be calculated on a line by line basis. Pseudo this:

square(line_number, square_size) top = dma_get(line_number) centre = dma_get(line_number + square_size/2) bottom = dma_get(line_number + square_size) for(i = 0; i < line_length; ++i) calculate_square_centre_value(top, centre, bottom, i, square_size) dma_put(centre)

Whole lines are fetched, which is reasonably efficient (1920×4 bytes = 7680, 7680%128 = 0, so it’s a fine size for efficient DMA), and there’s plenty of scope for optimisation:

- lines may be prefetched
- lines may be reused between iterations: the bottom line for one row of squares is the top line of the next. For diamonds, two lines are reused from one row to the next

I spent quite a bit of time on this approach, to try and get it working to my satisfaction, but in the end I discarded it. The problem for me became one of complexity — the prefetches, special cases (sides, corners) and line reuse were all written and applied for the general-case implementation (i.e. parameterised by square_size like the pseudo code above). This made it extraordinarily cumbersome to write optimised cases for the smallest square sizes — where most of the pixels are calculated and thus most runtime is spent.

In summary, I was optimising the wrong things.

There were also aspects of the implementation that I had not fully considered with this method, and failed to make progress with, particularly proper handling of the side and corner special cases, and animation.

(In hindsight, I suspect I could have simplified this approach, solved the other problems and met the time budget — but probably only knowing what I know from my next attempt.)

## Second attempt — tiles — problems

Tiling was an approach I had considered and discarded early on. Where whole lines are a clear win for DMA efficiency, tiles are much less so — the largest square tiles I could envision were 128×128 pixels, being 64KB (twice for a double buffered implementation), and the 512B line length is not a great performer. Additionally, the limit on the number of in-flight DMA transfers makes this method seem more risky — writing 128 lines per tile requires writing lines as the calculation of each is completed and hoping that computation is slower than DMA, which is probable but not assured.

DMA lists are promising — except that diamond-square requires 2^{n}+1 pixel squares, so it’s actually a 129×129 pixel tile (for calculation), and DMA lists still need to transfer those little extra inner parts between transfers.

On top of that, recall that diamond-square requires a large number of points from outside a tile to be calculated to be able to complete the points within a tile — if stored sparsely, 8 times as much space as for the target tile.

So I didn’t start with this idea, but after discarding the previous approach I was able to work out ways around these problems and implement a basic and functional implementation in the time I spent flying and waiting in airports on my way home from GCAP. I’ll expand on the implementation in a later post.