Diamond-Square — What can you do in 25 cycles?

or “How I got 1080p 60Hz diamond-square plasma from a single SPU”

I am primarily writing this — and subsequent posts — as a record and exercise for myself, and in case there’s anyone who attended the TUCS talk I gave in October that is still interested in the problem and my solution.

I started with a simple (naive) equation:
3.192×109 cycles / (1920×1080 pixels × 60 frames per second) = 25.66 cycles per pixel

That is, to update a 1080p screen at 60Hz, there’s an upper limit of 25 cycles per pixel.

So, what can you do in 25 cycles?

Starting with the simple and obvious: On an SPU, writing to local store takes 6 cycles — that’s a quarter of the time budget gone, right there. DMAing that pixel to the framebuffer in system memory will take hundreds of cycles. Approaching this problem one pixel at a time is clearly not the way to solve it.

Slightly less simple: A write to local store will be six cycles, and it can be pipelined to give four pixels per cycle. A single SPU can do DMA at over 16GB/sec. This should be easy — and it is if all you want to do is write stuff to memory. Throw in some computation and data dependencies and things get a lot more interesting.

So, I chose a problem that is one of my old favorites from Amiga demo days — plasma. There’s a number of ways you can achieve the effect. I chose diamond-square interpolation, which turned out to have some interesting problems to implement for SPU1.

The Diamond-Square Algorithm

The images here show the process of interpolation used for diamond-square (in these diagrams, each interpolation iteration is displayed with a different colour, not the expected calculated colour):

  • Starting with one or more squares, average the corners (with a random perturbation) of the square to calculate the centre point
  • For the diamonds then created, average the corner points (again, with a random perturbation) to calculate their centres
  • Later, rinse, repeat until all the gaps are filled.

Seems simple enough, but there’s some complexity in the implementation.

Data points and space requirements

For a 9×9 starting grid with 9 points defined, only the innermost 3×3 square is fully filled in. The remaining un-coloured (white) points cannot be calculated without having more starting values available (or using some other special case method to appropriately generate the missing information).

Additionally, it can be seen that generate a 3×3 completed pixel region requires the full 9×9 are to start with. Not particularly space efficient.

Data dependency

To calculate a point requires the points around it to be fully calculated, and each of those requires the four surrounding it.  This also has consequences when considering how to divide the problem when trying to calculate a large bitmap using a single SPU and its 256KB local store.


Producing the same image 60 times a second is not very interesting, so there should be some way to cause it to change over time, in some continuous fashion, where each frame relates to the last in a visually pleasing fashion. Doing this in practice becomes a question of where the source “random” values come from, and how they can be varied over time.  Additionally, the use and generation of source values relates to how the output space is divided.

I’ve written previously about some of the issues I encountered when writing this (fast averaging, system calls, poorly performing first frame), and I intend to write about the three areas mentioned above (space requirements, data dependency and animation).

1Additionally, there’s better ways to create the plasma effect. I did it all again, I’d start with Perlin noise, which appears to still have some interesting challenges but is overall better suited to the hardware.


Leave a Reply

Your email address will not be published. Required fields are marked *