hbr – Page 9 – hint for Branch (r-form)

Averaging unsigned chars

I’ve been toying with implementing the diamond-square algorithm (for a colourful plasma effect) on the spu. The value of each pixel is calculated by averaging values of the pixels around it.

Originally, I toyed with calculating the average by unpacking the pixel uchars, converting to float, performing the calculations, converting back and repacking. This seemed to work and made the calculation nice and simple. (And the scaling factor provided by cuflt/cfltu meant I could avoid some other computation when randomly-perturbing the result, which was nice). Regardless, I did wonder what it would take to process the uchars directly.

To begin with, it is clear that (a + b + c + d)/4 = (a/4 + b/4 + c/4 + d/4). Division by powers of two are able to be quickly approximated with shifts, and doing so will prevent overflow in the case that (a + b + c + d)>255, so we get

R = (a>>2) + (b>>2) + (c>>2) + (d>>2);

but R will be as much as 3 less than the actual average. To fix this up, the average of the lower two bits can be calculated and added back to the result.

L = (a&3) + (b&3) + (c&3) + (d&3); // sum the lower two bits of each point
R += (L>>2);   // divide by four and add to the result
R += (L>>1)&1; // round up based on the sum of lower bits

R should be an accurate average of the four uchars, without overflow.

Because there’s no overflow, this will work with sixteen uchars at a time – with a little care to deal with the lack of certain byte-centric spu instructions.

qword average4(qword a, qword b, qword c, qword d) {
    // shift each right by 2 bits, masking shifted-in bits from the result
    qword au = si_andbi(si_rotqmbii(a, -2), 0x3f);
    qword bu = si_andbi(si_rotqmbii(b, -2), 0x3f);
    qword cu = si_andbi(si_rotqmbii(c, -2), 0x3f);
    qword du = si_andbi(si_rotqmbii(d, -2), 0x3f);

    // add them all up
    qword R = si_a(si_a(au,bu), si_a(cu,du));

    // add up the lower bits
    qword L = si_a(si_a(si_andbi(a,3),si_andbi(b,3)),
                   si_a(si_andbi(c,3),si_andbi(d,3)));

    // shift right 2 bits, again masking out shifted-in high bits
    R = si_a(R, si_andbi(si_rotqmbii(L,-2), 3));

    // shift right and mask for the rounding bit
    R = si_a(R, si_andbi(si_rotqmbii(L,-1), 1));
    return R;
}

(24 instructions – 20ish cycles)

While it is fast enough for my present use, I’m sure it’s ripe for improvement.

qword

“Layers are Leaking”

Andrew Richards (@codeandrew), the CEO of Codeplay posted an article “Layers are Leaking”, which inspired some thoughts – tangential and perpendicular – beyond what I could reasonably fit into a stream of tweets (not that that stopped me trying).

From the article:

If you look at any computer system architecture diagram (and there are many) you will almost always see a series of “layers” on top of each other. Each layer is based on a layer underneath. It is a very common pattern in computers, and I believe that it is fundamental to the success of the development of computer technology. What layers allow you to do, is let individual people concentrate on only one part of a problem.

This brought to mind the first chapter of Bell & Newell’s Computer Structures book from 1971, which illustrates this kind of layering idea. Additionally, I was reminded of Jeannette Wing’s excellent Computational Thinking article which (amongst other things) makes the case that this type of thinking, associated with computer science, is something that is worth teaching and applying more widely.

But, recently, the layers have started leaking.

I don’t really agree that this is a recent thing (layers leak, abstractions change – some leaks are more obvious than others), but the article refers to the significant changes that have come about in the eternal quest for faster computation, and the difficulties involved in replacing established abstractions. Renegotiating the interface between two layers is difficult when the people responsible for each layer don’t speak the same technical language. What’s worse is trying to change the interface to one layer without adequately considering the consequences to other layers in the system. I think time-consuming discussion and debate is inevitable, and it is not inherently a reason to avoid changing an otherwise broken abstraction.

A number of the problems (or – dare I cliché it – opportunities) that are appearing with regard to increasingly parallel computation are not new. Consider Peter Denning’s recent article. I find it interesting that a number of the suggested solutions involve changes to deeply ingrained abstractions (like programming languages) for which there would be (is) significant resistance to change.

I read an interview with Fred Brooks recently, and the following quote stood out:

Software is not the exception; hardware is the exception. No technology in history has had the kind of rapid cost/performance gains that computer hardware has enjoyed. Progress in software is more like progress in automobiles or airplanes: We see steady gains, but they’re incremental.

Which echoes something that Andrew said in another article (“Why Intel Larrabee Really Stumbled” – google cache is here)

Infinite Truth: Software takes longer to develop than Hardware

A lot of time has been spent on steady, incremental gains for software that attempts to work around the limitations of many common layer abstractions – examples that spring to mind include compiling general purpose programs for small register sets, converting scalar programs to utilise vector instructions, and trying to make parallelism out of inherently single threaded code. We can’t necessarily redesign our systems to make these problems go away, but I think we can strive to address these problems in the best way possible – which will likely involve reshuffling some layers in the process. It’s likely to be messy, time consuming, and require a lot of hard work.

Sounds like fun :)

My thanks to Andrew for his thought-provoking article.

Additional:
Stephen Hill (@self_shadow) tweeted links to wikipedia on Leaky abstractions, and a related article by Joel Spolsky.

But, recently, the layers have started leaking.

spu_µopt[0] // shifting and arrays

There are a couple of mostly-obvious SPU micro-optimisation tricks that I’ve learned. Here’s one.

qword a[128];
qword f(int i) {
    return a[i>>4];
}

Simple enough code. spu-gcc -O3 produces

f:                                                                                                                                                                                                                                                                                                                          
    ila     $2,a                                                                                                                                                                                                                                                                                                        
    andi    $3,$3,-16                                                                                                                                                                                                                                                                                                   
    lqx     $3,$2,$3                                                                                                                                                                                                                                                                                                    
    bi      $lr

And it’s worse if the shift is not 4 – here’s the result of using 5:

f:
    rotmai  $4,$3,-5
    ila     $2,a
    shli    $3,$4,4
    lqx     $3,$2,$3
    bi      $lr

Which is pretty ugly.

The compiler is doing what it has been asked to do, but doesn’t appear to know that SPU load and store instructions force the four rightmost bits of the calculated address to zero. Which means the andi masking can be removed from the first case, and in the second the shift right and shift left instructions could be replaced with a single shift right – without changing the result.

How to avoid it? I’ve used manual pointer arithmetic to get around this compiler feature, e.g.

return *(qword*)((uint)a+i)

Which is pretty ugly.

I don’t know what that sort of thing does for the compiler’s alias analysis. Probably nothing good (judicious use of restrict might help…). si_lqx() would also work, but my experience – without in-depth examination – is that measured performance gets worse when using load and store intrinsics. I’m not sure if it’s something to do with scheduling, aliasing or programmer competence.

When would this ever matter? I’ve run into this in some very hot lookup table access. Two or four less cycles can make a big difference. Sometimes. (If it really matters, rewriting the whole thing in assembly can work, too)

It’s a micro-optimisation – measure often, use with care.

Building gdb for Cell/B.E.

Trying to debug a bus error on my PS3, I realised that the version of GDB I have installed doesn’t support debugging of SPU programs. There doesn’t seem to be a Debian packaged version available that does, so I built my own.

Because I found no obvious google result, I share this with the zero other people that I expect may one day be interested : the key option for configure appears to be

--enable-targets=spu

This information was brought to you via the gdb.spec file, and a post to the gcc-testresults mailing list.

20100319 photographs

On Australia Day this year I climbed to the summit of Cradle Mountain with some friends. Here are some of the photos that I took before the clouds rolled in and the rain came down.

It’s a great place to take a long walk :)

Photos taken on 20100126. Click any image for a giant version.

“Wombat Pool”

The mountain. You could see the top at this stage in the day.

All photos are licensed under a Creative Commons Attribution-Noncommercial-Share Alike 2.5 Australia License.

“premature optimization”

Here is an except from Donald Knuth’s 1974 paper “Structured Programming with go to Statements”. There is a well known quote at the end of the second paragraph :

The improvement in speed from Example 2 to Example 2a is only about 12%, and many people would pronounce that insignificant. The conventional wisdom shared by many of today’s software engineers calls for ignoring efficiency in the small; but I believe this is simply an overreaction to the abuses they see being practiced by penny-wise-and-pound-foolish programmers, who can’t debug or maintain their “optimized” programs. In established engineering disciplines a 12% improvement, easily obtained, is never considered marginal; and I believe the same viewpoint should prevail in software engineering. Of course I wouldn’t bother making such optimizations on a one-shot job, but when it’s a question of preparing quality programs, I don’t want to restrict myself to tools that deny me such efficiencies.

There is no doubt that the grail of efficiency leads to abuse. Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.

Yet we should no pass up our opportunities in that critical 3%. A good programmer will not be lulled into complacency by such reasoning, he will be wise to look carefully at the critical code; but only after that code has been identified. It is often a mistake to make a priori judgments about what parts of a program are really critical, since the universal experience of programmers who have been using measurement tools has been that their intuitive guesses fail. After working with such tools for seven years, I’ve become convinced that all compilers written from now on should be designed to provide all programmers with feedback indicating what parts of their programs are costing the most; indeed, this feedback should be supplied automatically unless it has been specifically turned off.

The quote to which I am referring – that “premature optimization is the root of all evil” – is, from time to time, used fait accompli to assert that any time spent optimising is time wasted – something that is clearly not the case when considering the quote in context.

There are a lot of other good ideas in Knuth’s article, many still quite relevant to software engineering. The full article (which can be obtained via citeseerx) is, primarily, a contribution to a discussion of the time on the use of go to statements – another topic often reduced to a simple, extreme position, attributed to Edsger Dijkstra. Knuth’s paper makes it clear that Dijkstra did not hold so tightly to that position himself.

This is the sort of paper that I find to be a fascinating read, with both historical and technical value.

The final page of the article contains the following quote :

A disturbingly large percentage of the students ran into situations that require go to‘s, and sure enough, it was often because while didn’t work well to their plan, but almost invariably because their plan was poorly thought out.

Which brought to mind another quote that I had seen recently, this one from George Orwell :

A man may take to drink because he feels himself to be a failure, and then fail all the more completely because he drinks. It is rather the same thing that is happening to the English language. It becomes ugly and inaccurate because our thoughts are foolish, but the slovenliness of our language makes it easier for us to have foolish thoughts.

It seems quite valid to say that there is a connection between our language and our expression (in terms of ideas, spoken word, programs, and more).

When trying to find a source for Orwell’s words (which you can find here), I was heartened to find these sentences follow those quoted above :

The point is that the process is reversible. Modern English, especially written English, is full of bad habits which spread by imitation and which can be avoided if one is willing to take the necessary trouble.

Which I believe is also applicable to the expression and optimisation of computer programs.

Some Linux on PS3 updates

I’ve updated the Linux distros page with links to articles about installing Ubuntu 9.10 [Joep Cremers Weblog] and Fedora 12 [Geoff Levand, Ken Werner] on the Playstation 3.

In mosty unrelated news,

IBM will not be releasing a previously planned successor to the PowerXCell 8i [heise online, Playstation University]
(neither article conveys particularly clear information to me – not least because I have only a machine translation of the first one)
Playstation 3 firmware version 3.10 seems to have added ABC iView and BBC iPlayer support to the XMB (the presence of which *may* be related to your PSN region, geographical restrictions to accessing video content still seem to apply). If you can’t see it in the TV column of your XMB, try rebooting…

PS3, Fedora 11 and OpenCL

Ken Werner has written a nice guide to installing Fedora 11 on the PS3. (And I’ve added the link to the Linux distro page).

Inside Ken’s page is some details on installing and running demos from the recent “OpenCL Development Kit for Linux on Power”, which is something I’m now even more keen to try :)

(It seems a little odd that the only place that OpenCL SPE acceleration is mentioned is in one of the FAQs…)

20091025 photographs

Over my back fence is a little patch of greenery that I’m quite certain doesn’t look as consistently green as this any more.

View Larger Map

The were some heavy winds across the state a couple of weeks back – we experienced a small amount of superficial damage to the house, but the trees down the hill (leading down to Fern Glade) suffered a great deal. There are a lot of broken branches, some trees snapped in half, while some of the largest have been tipped over.

Last week I wandered down the hill with the rest of the family to take a closer look at the extent of what had happened. Unfortunately, I don’t have any ‘before’ shots to give a better context for the photos, but in part it feels like a sizable section has been clear-felled by the wind.

(Photos were taken on 20091017)

Click images for larger version.

All photos are licensed under a Creative Commons Attribution-Noncommercial-Share Alike 2.5 Australia License.

sixaxisd and N900? Perhaps…

My previous post was mentioned over at maemo.org, with some speculation as to whether sixaxisd might work on the Nokia N900.

My initial thought was that with a small modification to sixpair, it would be possible to feed it an arbitrary BD address, so I started poking around in the source – and found that the feature is already there :D

You need the BD address of the device that you want to pair with the sixaxis – get this with hciconfig – run it (on the N900 – or whatever system) and you’ll hopefully see something like this (I don’t have an N900) :

# hciconfig
hci0:   Type: USB
        BD Address: 00:1A:80:2C:BE:D0 ACL MTU: 384:8 SCO MTU: 64:8
        DOWN
        RX bytes:87170 acl:1503 sco:0 events:62 errors:0
        TX bytes:574 acl:22 sco:0 commands:21 errors:0

Plug your sixaxis into some other machine via USB, grab (and compile) sixpair.c and run it with the BD address

wget http://www.pabr.org/sixlinux/sixpair.c
gcc sixpair.c -lusb -o sixpair
./sixpair 00:1A:80:2C:BF:D3 # use your own BD addr here

And that should be about it – unplug the sixaxis, continue with the instructions from the previous post and when you press the PS button, everything should work as if by magic.

If it does not, get more magic. Either way, leave a comment and let me know how it goes :)