Assembly Primer Part 2 — Memory Organisation — SPU

These are my notes for where I can see SPU varying from ia32, as presented in the video Part 2 — Virtual Memory Organization.

(I didn’t notice any significant differences between the presented information for ia32 and PPC — apart from what was noted for the first presentation — so there’s no separate post for that arch).

To compile SimpleDemo.c to examine on the SPU, you’ll need to add the -mstdmain option to spu-gcc (or spu-elf-gcc) so that the program will correctly receive the command line options.

If you examine the /proc/$PID/maps file when running a standalone SPU program, you’ll see something like this:

00100000-00120000 r-xp 00000000 00:00 0       [vdso]
0fd70000-0fd90000 r-xp 00000000 fe:02 1590608 /lib/libgcc_s.so.1
0fd90000-0fda0000 rw-p 00010000 fe:02 1590608 /lib/libgcc_s.so.1
0fdb0000-0fdd0000 r-xp 00000000 fe:02 292441  /lib/libpthread-2.11.2.so
0fdd0000-0fde0000 rw-p 00010000 fe:02 292441  /lib/libpthread-2.11.2.so
0fdf0000-0fe00000 r-xp 00000000 fe:02 292418  /lib/librt-2.11.2.so
0fe00000-0fe10000 rw-p 00000000 fe:02 292418  /lib/librt-2.11.2.so
0fe20000-0ff90000 r-xp 00000000 fe:02 292437  /lib/libc-2.11.2.so
0ff90000-0ffa0000 rw-p 00160000 fe:02 292437  /lib/libc-2.11.2.so
0ffa0000-0ffb0000 rw-p 00000000 00:00 0
0ffc0000-0ffe0000 r-xp 00000000 fe:02 1590211 /usr/lib/libspe2.so.2.2.80
0ffe0000-0fff0000 rw-p 00010000 fe:02 1590211 /usr/lib/libspe2.so.2.2.80
10000000-10010000 r-xp 00000000 fe:02 1821445 /usr/bin/elfspe
10010000-10020000 rw-p 00000000 fe:02 1821445 /usr/bin/elfspe
10020000-10050000 rwxp 00000000 00:00 0       [heap]
f7f60000-f7f70000 rw-p 00000000 00:00 0
f7f70000-f7fb0000 rw-s 00000000 00:13 9086    /spu/spethread-2971-268566640/mem
f7fb0000-f7fc0000 rw-p 00000000 fe:02 1463963 /home/jonathan/AssemblyLanguagePrimer/SimpleDemoSPU
f7fc0000-f7fe0000 r-xp 00000000 fe:02 292430  /lib/ld-2.11.2.so
f7fe0000-f7ff0000 rw-p 00020000 fe:02 292430  /lib/ld-2.11.2.so
ffea0000-ffff0000 rw-p 00000000 00:00 0       [stack]

This is the mapping information for the elfspe loader running the SPU program.

(The SPU’s local store is mapped into elfspe’s address space at 0xf7f70000.  This is with randomize_va_space set to zero, so it should always be in that location. This is possibly useful…)

There is no equivalent of this for the SPU program itself, as there is no virtual memory mapping required (or possible) within the local store.  The SPU’s memory state may be examined externally through the spufs interface (in this case, the file /spu/spethread-2971-268566640/mem from the above listing may be used to access the current SPU LS state). Or, of course, using gdb.
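
As an aside, here's a minimal sketch (an assumption of mine, not code from the original program) of peeking at local store from the PPU side via that spufs mem file (the path is the one from the listing above and changes with every run):

/* Sketch: read a word of SPU local store via the spufs mem file. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* this path changes with each run; taken from the listing above */
    int fd = open("/spu/spethread-2971-268566640/mem", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    uint32_t word;
    /* read four bytes from local store offset 0x80 (an arbitrary example) */
    if (pread(fd, &word, sizeof word, 0x80) == sizeof word)
        printf("LS[0x80] = 0x%08x\n", word);

    close(fd);
    return 0;
}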

Previous assembly primer notes…

Part 1 — System Organization — PPC
Part 1 — System Organization — SPU

Assembly Primer Part 1 — System Organization — SPU

The platform I’m using is Debian Sid on a PS3 (3.15 OtherOS) with the spu-gcc toolchain.

These are my notes for where I can see the SPU varying from the ia32, as presented in the video Part 1 — System Organization.  Let me know if I’ve missed something important, obvious or got something wrong.

For reference, I’m using the SPU ABI and ISA docs.

General Purpose Registers

  • 128 × 128-bit registers, treated as different data types depending on the instruction used.
    • r0 (LR) — Return Address / Link Register
    • r1 (SP) — Stack pointer information.
      • Word 0 — current stack pointer (always 16-byte aligned, grows down)
      • Word 1 — bytes of available stack space
    • r2 — Environment pointer (for languages that use one)
    • r3–r74 — First 72 qwords of a function’s argument list and its return value
    • r75–r79 — Scratch registers
    • r80–r127 — Local variable registers.  Preserved across function calls.
  • FPSCR — Floating-Point Status and Control Register
  • Channels — Used for various DMA operations, access to the decrementer, mailboxes and signalling.
  • SRR0 — Used to store the address of next instruction upon interrupt
  • LSLR — Local Store Limit Register.  0x0003ffff == 2^18-1 == 262143

Memory model

  • .text at address 0
  • Bottom of stack at 0x3ffff, effectively lower if using -mstdmain.  (at least, afaict — could look more closely at how -mstdmain actually works…)

Assembly Primer Part 1 — System Organization — PPC

I found the videos introducing assembly language here to be of interest as my own understanding and experience with assembly is quite limited.  The videos are focussed on the ia32 architecture and reverse engineering, particularly for security exploits, and while these aspects don’t really excite me, the videos are well made and clear and the “Assembly Primer for Hackers” videos cover general assembly programming and details of machine architecture that are more broadly applicable.

I thought it would be interesting to work from these videos and make some notes on applying the concepts to the Cell BE’s PPU and SPU.

The platform I’m using is Debian Sid on a PS3 (3.15 OtherOS) with the standard system toolchain.

These are my notes for where I can see the PPU varying from the ia32, as presented in the video Part 1 — System Organization.  Let me know if I’ve missed something important, obvious or got something wrong.

For reference, I’m using the PPC Architecture Books (also found in the Cell SDK) and documentation for the PPC64 ABI.

Registers

Branch Processor

  • CR Condition Register — 32-bit. Provides a mechanism for testing (and branching). Eight 4-bit fields.
    • CR0–CR1 — Volatile condition code register fields
    • CR2–CR4 — Nonvolatile condition code register fields
    • CR5–CR7 — Volatile condition code register fields
  • LR Link Register (volatile) — 64-bit.  Can be used to provide branch target address for Branch Conditional to Link Register instruction
  • CTR Count Register (volatile) — 64-bit.  Can be used to hold a loop count that can be decremented during execution of Branch instructions containing appropriately coded BO field.  Also can be used to provide branch target address for the Branch Conditional to Count Register instruction.

Fixed-Pt Processing

  • GPR0–GPR31 — 64-bit General Purpose registers. Byte, halfword, word or doubleword, depending on instruction flags.  Supports byte, halfword, word, doubleword operand fetches and stores to storage.
    • r0 — Volatile register used in function prologs
    • r1 — Stack frame pointer
    • r2 — TOC pointer
    • r3 — Volatile parameter and return value register
    • r4–r10 — Volatile registers used for function parameters
    • r11 — Volatile register used in calls by pointer and as an environment pointer for languages which require one
    • r12 — Volatile register used for exception handling and glink code
    • r13 — Reserved for use as system thread ID
    • r14–r31 — Nonvolatile registers used for local variables
  • XER Fixed-Point Exception Register (volatile) — 64-bit.  Book I, p 32 has the details on this one.
  • MSR Machine State Register — 64-bit.  Defines the state of the processor. See Book III, p 10.

Float-Pt Processing

  • FPR0–FPR31 Floating-Point Registers — 64-bit. Single or double precision, depending on instruction flags.  Supports word and doubleword operand fetches and stores to storage.
    • f0 — Volatile scratch register
    • f1–f4 — Volatile floating point parameter and return value registers
    • f5–f13 — Volatile floating point parameter registers
    • f14–f31 — Nonvolatile registers
  • FPSCR Floating-Point Status and Control Register (volatile) — 32-bit. Status and control bits. See Book I, pp 87–89.

VMX

  • v0–v1 — Volatile scratch registers
  • v2–v13 — Volatile vector parameter registers
  • v14–v19 — Volatile scratch registers
  • v20–v31 — Non-volatile registers
  • vrsave — Non-volatile 32-bit register

Privileged

  • SRR0 & SRR1 Machine Status Save/Restore registers — 64-bit. Used to store machine state when an interrupt occurs.
  • DAR Data Address Register — 64-bit. Set by various interrupts to effective address associated with interrupt.
  • DSISR Data Storage Interrupt Status Register — 32-bit. Set to indicate the cause of various interrupts.
  • SPRG0–SPRG3 Software-use SPRs — 64-bit.  For use by privileged software.
  • CTRL Control Register — 32-bit. Controls an external I/O pin.
  • PVR Processor Version Register — 32-bit. Read-only.
  • PIR Processor Identification Register — 32-bit.

(there’s some hypervisor regs not listed here, and probably others…)

Virtual Memory Model

  • Default/standard location for .text appears to be 0x10000000
  • stack starts at 0xffffffff

Frame times are like pancakes

Time to generate image data, no DMA: 0.015962s

Time to generate data, with DMA to framebuffer: 0.048529s.  That’s not good.

Total amount of data transferred: one 1080p frame — ~8MB.

Measured average latency of 4*16KB DMA puts: 577µs — ~1000 times slower than expected.  Something’s really not right.

Oh.

First accesses to memory will be more expensive.

Time to generate and DMA first frame: 0.048529s

Time to generate and DMA second frame: 0.016370s

Time to generate and DMA subsequent frames: 0.015944s — slightly faster than with no DMA.
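
For what it's worth, timings like these are easy to take with the SPU decrementer. Here's a minimal sketch (not the code used here) that assumes the PS3's 79.8MHz timebase:

#include <spu_mfcio.h>

/* Sketch: time a chunk of work with the SPU decrementer, which counts
   down at the timebase rate (assumed to be 79.8MHz on the PS3). */
static double seconds_for(void (*work)(void))
{
    const double timebase_hz = 79800000.0;
    spu_write_decrementer(0xffffffff);              /* start from a large count */
    unsigned int before = spu_read_decrementer();
    work();
    unsigned int after = spu_read_decrementer();
    return (double)(before - after) / timebase_hz;  /* elapsed ticks to seconds */
}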

There’s still plenty of work to be done before it’s functional and interesting, but I may yet hit my goal of 60 fps for this project :)

Averaging more unsigned chars

Some further thoughts, continuing on from my previous average post

Alternate methods

sumb (sum bytes in halfwords) was an instruction I had overlooked, and was pointed out to me by ralferoo.  sumb calculates the sums of the four bytes of each word in two quadwords at a time. An add, rotate and shuffle would be all that would be needed to turn the result from two sumb calls into the desired averages.

Unfortunately, the input format I’m using isn’t well suited to sumb and it would appear to require a prohibitive number of shuffles to prepare the data appropriately – that said, there’s at least one place where I shuffle data before calling average4() that may be able to utilise sumb, so I intend to keep it in mind.

The avgb instruction calculates the average of the bytes in two quadwords.  It would be nice to be able to call avgb(avgb(a,b),avgb(c,d)) and to have the final result in 9 cycles, but there’s a fix-up necessary to correct the rounding that takes place in the calculation of the first two averages, and I’ve not yet been able to wrap my head around the correct method to do so.

Approximating

There are plenty of ways to very quickly get a result that is often — but not always — correct (like avgb).  One of these methods may be suitable for my particular needs, but I won’t know until later.  My goal for now is to attain a result that is as correct as possible and consider ways of speeding it up later, if needed.

Adding

I’m annoyed with myself that I missed this one, as I’ve seen it several times recently: rounding can be performed correctly with an addition and truncation. Where I had

    // add up the lower bits
    qword L = si_a(si_a(si_andbi(a,3),si_andbi(b,3)),
                   si_a(si_andbi(c,3),si_andbi(d,3)));

    // shift right 2 bits, again masking out shifted-in high bits
    R = si_a(R, si_andbi(si_rotqmbii(L,-2), 3));

    // shift right and mask for the rounding bit
    R = si_a(R, si_andbi(si_rotqmbii(L,-1), 1));

adding 2 to each uchar before truncating with rotqmbii means that the last line can be eliminated altogether, so the whole function now looks like:

qword average4(qword a, qword b, qword c, qword d) {
    // shift each right by 2 bits, masking shifted-in bits from the result
    qword au = si_andbi(si_rotqmbii(a, -2), 0x3f);
    qword bu = si_andbi(si_rotqmbii(b, -2), 0x3f);
    qword cu = si_andbi(si_rotqmbii(c, -2), 0x3f);
    qword du = si_andbi(si_rotqmbii(d, -2), 0x3f);

    // add them all up
    qword R = si_a(si_a(au,bu), si_a(cu,du));

    // add up the lower bits
    qword L = si_a(si_a(si_andbi(a,3),si_andbi(b,3)),
                   si_a(si_andbi(c,3),si_andbi(d,3)));

    // add 2
    L = si_a(L, si_ilh(0x202));

    // shift right 2 bits, again masking out shifted-in high bits
    R = si_a(R, si_andbi(si_rotqmbii(L,-2), 3));

    return R;
}

The difference is pretty minor — a couple of instructions and (when not inlined) it’s no faster.  For the program it’s used in I’m seeing around a 1.5% runtime reduction.

ioctl() and the spu — a tale of many magics

I’d achieved a working implementation of the small spu project I’ve been messing around with and wanted to move on to start pushing particular pixels (rather than debugging in text mode). My program, up to this point, has been a simple spu-only piece of code (launched via the elfspe loader under Linux), and I was reluctant to tack on some kind of ppu manager just to set up the framebuffer — so, I thought, why not do it from the spu itself? (Why not? Because it’s a stupid idea and so on.)

What’s needed to “set up the framebuffer”?  Magic.  Fortunately, the magic has been well documented by Mike Acton and you can get the source here. The trick becomes merely to get those files compiled for the spu, and the biggest barrier to doing that is the ioctl() system call.

There was no existing way to call ioctl() directly from the spu and several possible approaches were apparent.

Callback handler in libspe

libspe provides a number of callback functions for spu programs, allowing them some degree of access to and control of the wider system.  These are built on the __send_to_ppe() function found in newlib. __send_to_ppe() packs up a function’s arguments and stops the spu, signaling to libspe which of several handlers should be utilised.

My first partially-working attempt to access ioctl() used this approach, and once I’d worked out how to debug my way into libspe and back to the spu, the solution was quite straightforward.  In particular, libspe mmaps the spu’s local store, making it possible to pass a pointer to the spu’s local store directly to ioctl().

The third argument to ioctl() can be a pointer or a value and so some decoding of the ‘request’ argument would be required to handle particular cases correctly. While that’s probably not too difficult (probably — I’m not clear on the details), it’s beyond what I was interested in pursuing. For syscalls, though, there is a ‘better’ way…

Stopcode 0x2104

Alongside __send_to_ppe(), newlib provides a similar __linux_syscall() function.  I couldn’t make much sense of it when I first looked at it as the 0x2104 stopcode it uses is not intercepted by libspe. Instead, it is magic and is intercepted directly by the kernel for the express purpose of handling syscalls. Hooray?

Almost. If the third argument to ioctl() is a pointer, it needs to contain a valid ea. Unfortunately, there appears to be no sane way for a spu to mmap its own local store, meaning an lsa cannot be converted to a valid ea without some external assistance, and so we must consider other options…

__ea

Named address spaces are an extension to the C language that provide a way to access eas from a spu in a somewhat transparent fashion.  On the spu, these work by qualifying pointers with __ea to indicate that they contain an ea, and the compiler handles the intervening magic (gcc-4.5 and the various versions of gcc and xlc that have shipped with the IBM SDK have support for this extension).
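
A minimal sketch of what this looks like, assuming the gcc flavour of the extension (the names are illustrative, not from my program):

// Sketch: __ea marks storage that lives in the effective address space;
// plain loads and stores through such a pointer are turned into DMA /
// software-cache accesses by the compiler.
extern char __ea *remote_buffer;  // pointer held in LS, pointing at main memory

void poke_main_memory(void)
{
    remote_buffer[0] = 42;        // the compiler generates the transfer
}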

While this method might work, I didn’t want to give up a chunk of my local store for the software cache that the implementation in gcc uses — I have a hunch that I’ll almost run out of space when I add some extra DMA buffers to this program. (And I’m not sure what’s needed to get __ea storage working correctly without a separate ppu context.)

A more convoluted method

In the absence of a neater approach, here’s how I got it working:

  1. mmap some scratch space in main memory (using mmap_eaddr() from newlib)
  2. call ioctl() with the address of that scratch space
  3. DMA the data from the scratch space to local store

Something like:

// 1. mmap some scratch space in main memory
void* scratch = (void*)(uint32_t)mmap_eaddr(0ULL, size, PROT_READ|PROT_WRITE,
                                            MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
// 2. have the kernel write its result into that scratch space
ioctl(fb_fd, FBIOGET_VBLANK, scratch);

// 3. DMA the data from the scratch space back into local store
struct fb_vblank vblank;
spu_mfcdma32(&vblank, (uint32_t)scratch, ROUNDUP16(sizeof(vblank)),
             0, MFC_GET_CMD);
mfc_write_tag_mask(1);           // tag 0 was used for the transfer
spu_mfcstat(MFC_TAG_UPDATE_ALL); // wait for it to complete

And so, ugly as it is, I can now call ioctl() from spu code.  Lather, rinse & repeat as desired.

Of course, there were a few other problems to solve: because I was compiling cp_{fb,vt}.c for the spu but using the (ppu) system headers to get the fb defns, the compiler saw the prototypes for the system mmap, munmap, printf and snprintf functions. Extra wrappers were needed for each of those (and so I now know about the v*printf family).
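
For illustration, a wrapper along these lines would do for printf. This is a sketch with assumed names, not the actual code from my port:

#include <stdarg.h>
#include <stdio.h>

// Sketch: the ppu headers declare printf, so the spu build supplies its own
// wrapper that forwards to the spu-side vprintf from newlib.
int wrapped_printf(const char *fmt, ...)
{
    va_list ap;
    va_start(ap, fmt);
    int r = vprintf(fmt, ap);
    va_end(ap);
    return r;
}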

Once all that is done, it works. I can push pixels to the screen from the spu.

Flipping and vsync are interesting, and slightly awkward, cases as the ioctls are passed the address of a value, not the value itself (for no readily apparent reason), so the values are allocated early on and referenced as needed.

The implementation can be found here. It’s functional. :)  There are many opportunities to tidy it up and simplify it. Right now it’s a fairly direct and naive port from the original.

My thanks to Mike Acton for the original setup code, and to Ken Werner for helping me get my head around the problem and its possible solutions.

Averaging unsigned chars

I’ve been toying with implementing the diamond-square algorithm (for a colourful plasma effect) on the spu. The value of each pixel is calculated by averaging values of the pixels around it.

Originally, I toyed with calculating the average by unpacking the pixel uchars, converting to float, performing the calculations, converting back and repacking. This seemed to work and made the calculation nice and simple. (And the scaling factor provided by cuflt/cfltu meant I could avoid some other computation when randomly-perturbing the result, which was nice). Regardless, I did wonder what it would take to process the uchars directly.

To begin with, it is clear that (a + b + c + d)/4 = (a/4 + b/4 + c/4 + d/4). Division by a power of two can be quickly approximated with a shift, and doing so will prevent overflow in the case that (a + b + c + d) > 255, so we get

R = (a>>2) + (b>>2) + (c>>2) + (d>>2);

but R will be as much as 3 less than the actual average.  To fix this up, the average of the lower two bits can be calculated and added back to the result.

L = (a&3) + (b&3) + (c&3) + (d&3); // sum the lower two bits of each point
R += (L>>2);   // divide by four and add to the result
R += (L>>1)&1; // round up based on the sum of lower bits

R should be an accurate average of the four uchars, without overflow.
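
For example, with a = b = c = 1 and d = 0: the shifted sum R starts at 0, L is 3, L>>2 contributes nothing, and (L>>1)&1 contributes the rounding bit, giving 1 (the nearest integer to 3/4).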

Because there’s no overflow, this will work with sixteen uchars at a time – with a little care to deal with the lack of certain byte-centric spu instructions.

qword average4(qword a, qword b, qword c, qword d) {
    // shift each right by 2 bits, masking shifted-in bits from the result
    qword au = si_andbi(si_rotqmbii(a, -2), 0x3f);
    qword bu = si_andbi(si_rotqmbii(b, -2), 0x3f);
    qword cu = si_andbi(si_rotqmbii(c, -2), 0x3f);
    qword du = si_andbi(si_rotqmbii(d, -2), 0x3f);

    // add them all up
    qword R = si_a(si_a(au,bu), si_a(cu,du));

    // add up the lower bits
    qword L = si_a(si_a(si_andbi(a,3),si_andbi(b,3)),
                   si_a(si_andbi(c,3),si_andbi(d,3)));

    // shift right 2 bits, again masking out shifted-in high bits
    R = si_a(R, si_andbi(si_rotqmbii(L,-2), 3));

    // shift right and mask for the rounding bit
    R = si_a(R, si_andbi(si_rotqmbii(L,-1), 1));
    return R;
}

(24 instructions – 20ish cycles)
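
To convince myself the packed version is right, a scalar cross-check along these lines is handy. This is a sketch only (it assumes spu-gcc, and uses a union to pull the bytes back out of average4()'s result), not code from the original program:

#include <assert.h>
#include <spu_intrinsics.h>

union bytes16 { qword q; unsigned char b[16]; };

// expected value: round-to-nearest average, ties rounded up
static unsigned char avg_scalar(unsigned a, unsigned b, unsigned c, unsigned d)
{
    return (unsigned char)((a + b + c + d + 2) >> 2);
}

static void check_average4(qword a, qword b, qword c, qword d)
{
    union bytes16 ua, ub, uc, ud, r;
    int i;
    ua.q = a; ub.q = b; uc.q = c; ud.q = d;
    r.q = average4(a, b, c, d);
    for (i = 0; i < 16; ++i)
        assert(r.b[i] == avg_scalar(ua.b[i], ub.b[i], uc.b[i], ud.b[i]));
}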

While it is fast enough for my present use, I’m sure it’s ripe for improvement.


“Layers are Leaking”

Andrew Richards (@codeandrew), the CEO of Codeplay, posted an article “Layers are Leaking”, which inspired some thoughts – tangential and perpendicular – beyond what I could reasonably fit into a stream of tweets (not that that stopped me trying).

From the article:

If you look at any computer system architecture diagram (and there are many) you will almost always see a series of “layers” on top of each other. Each layer is based on a layer underneath. It is a very common pattern in computers, and I believe that it is fundamental to the success of the development of computer technology. What layers allow you to do, is let individual people concentrate on only one part of a problem.

This brought to mind the first chapter of Bell & Newell’s Computer Structures book from 1971, which illustrates this kind of layering idea.  Additionally, I was reminded of Jeannette Wing’s excellent Computational Thinking article which (amongst other things) makes the case that this type of thinking, associated with computer science, is something that is worth teaching and applying more widely.

But, recently, the layers have started leaking.

I don’t really agree that this is a recent thing (layers leak, abstractions change – some leaks are more obvious than others), but the article refers to the significant changes that have come about in the eternal quest for faster computation, and the difficulties involved in replacing established abstractions. Renegotiating the interface between two layers is difficult when the people responsible for each layer don’t speak the same technical language. What’s worse is trying to change the interface to one layer without adequately considering the consequences to other layers in the system. I think time-consuming discussion and debate is inevitable, and it is not inherently a reason to avoid changing an otherwise broken abstraction.

A number of the problems (or – dare I cliché it – opportunities) that are appearing with regard to increasingly parallel computation are not new. Consider Peter Denning’s recent article. I find it interesting that a number of the suggested solutions involve changes to deeply ingrained abstractions (like programming languages) for which there would be (is) significant resistance to change.

I read an interview with Fred Brooks recently, and the following quote stood out:

Software is not the exception; hardware is the exception. No technology in history has had the kind of rapid cost/performance gains that computer hardware has enjoyed. Progress in software is more like progress in automobiles or airplanes: We see steady gains, but they’re incremental.

Which echoes something that Andrew said in another article (“Why Intel Larrabee Really Stumbled” – google cache is here)

Infinite Truth: Software takes longer to develop than Hardware

A lot of time has been spent on steady, incremental gains for software that attempts to work around the limitations of many common layer abstractions – examples that spring to mind include compiling general purpose programs for small register sets, converting scalar programs to utilise vector instructions, and trying to make parallelism out of inherently single threaded code. We can’t necessarily redesign our systems to make these problems go away, but I think we can strive to address these problems in the best way possible – which will likely involve reshuffling some layers in the process. It’s likely to be messy, time consuming, and require a lot of hard work.

Sounds like fun :)

My thanks to Andrew for his thought-provoking article.

Additional:
Stephen Hill (@self_shadow) tweeted links to wikipedia on Leaky abstractions, and a related article by Joel Spolsky.


spu_µopt[0] // shifting and arrays

There are a couple of mostly-obvious SPU micro-optimisation tricks that I’ve learned. Here’s one.

qword a[128];
qword f(int i) {
    return a[i>>4];
}

Simple enough code. spu-gcc -O3 produces

f:
    ila     $2,a
    andi    $3,$3,-16
    lqx     $3,$2,$3
    bi      $lr

And it’s worse if the shift is not 4 – here’s the result of using 5:

f:
    rotmai  $4,$3,-5
    ila     $2,a
    shli    $3,$4,4
    lqx     $3,$2,$3
    bi      $lr

Which is pretty ugly.

The compiler is doing what it has been asked to do, but doesn’t appear to know that SPU load and store instructions force the four rightmost bits of the calculated address to zero.  Which means the andi masking can be removed from the first case, and in the second the shift right and shift left instructions could be replaced with a single shift right – without changing the result.

How to avoid it?  I’ve used manual pointer arithmetic to get around this compiler feature, e.g.

return *(qword*)((uint)a+i);

Which is pretty ugly.
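
In context, the shift-by-4 version of f() ends up looking something like this (a sketch; it leans on a being 16-byte aligned, as qword arrays are, and on i being non-negative):

qword f(int i) {
    // the quadword load drops the low four bits of the address by itself,
    // so no masking or shifting of i is needed here
    return *(qword*)((unsigned int)a + i);
}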

I don’t know what that sort of thing does for the compiler’s alias analysis. Probably nothing good (judicious use of restrict might help…).  si_lqx() would also work, but my experience – without in-depth examination – is that measured performance gets worse when using load and store intrinsics. I’m not sure if it’s something to do with scheduling, aliasing or programmer competence.

When would this ever matter? I’ve run into this in some very hot lookup table access. Two or four fewer cycles can make a big difference. Sometimes. (If it really matters, rewriting the whole thing in assembly can work, too)

It’s a micro-optimisation – measure often, use with care.

“premature optimization”

Here is an excerpt from Donald Knuth’s 1974 paper “Structured Programming with go to Statements”.  There is a well known quote at the end of the second paragraph:

The improvement in speed from Example 2 to Example 2a is only about 12%, and many people would pronounce that insignificant.  The conventional wisdom shared by many of today’s software engineers calls for ignoring efficiency in the small; but I believe this is simply an overreaction to the abuses they see being practiced by penny-wise-and-pound-foolish programmers, who can’t debug or maintain their “optimized” programs. In established engineering disciplines a 12% improvement, easily obtained, is never considered marginal; and I believe the same viewpoint should prevail in software engineering. Of course I wouldn’t bother making such optimizations on a one-shot job, but when it’s a question of preparing quality programs, I don’t want to restrict myself to tools that deny me such efficiencies.

There is no doubt that the grail of efficiency leads to abuse. Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.

Yet we should not pass up our opportunities in that critical 3%. A good programmer will not be lulled into complacency by such reasoning, he will be wise to look carefully at the critical code; but only after that code has been identified. It is often a mistake to make a priori judgments about what parts of a program are really critical, since the universal experience of programmers who have been using measurement tools has been that their intuitive guesses fail. After working with such tools for seven years, I’ve become convinced that all compilers written from now on should be designed to provide all programmers with feedback indicating what parts of their programs are costing the most; indeed, this feedback should be supplied automatically unless it has been specifically turned off.

The quote to which I am referring – that “premature optimization is the root of all evil” – is, from time to time, wielded as a fait accompli to assert that any time spent optimising is time wasted – something that is clearly not the case when the quote is considered in context.

There are a lot of other good ideas in Knuth’s article, many still quite relevant to software engineering. The full article (which can be obtained via citeseerx) is, primarily, a contribution to the debate of the time on the use of go to statements – another topic often reduced to a simple, extreme position attributed to Edsger Dijkstra.  Knuth’s paper makes it clear that Dijkstra did not hold so tightly to that position himself.

This is the sort of paper that I find to be a fascinating read, with both historical and technical value.

The final page of the article contains the following quote:

A disturbingly large percentage of the students ran into situations that require go to‘s, and sure enough, it was often because while didn’t work well to their plan, but almost invariably because their plan was poorly thought out.

Which brought to mind another quote that I had seen recently, this one from George Orwell:

A man may take to drink because he feels himself to be a failure, and then fail all the more completely because he drinks. It is rather the same thing that is happening to the English language. It becomes ugly and inaccurate because our thoughts are foolish, but the slovenliness of our language makes it easier for us to have foolish thoughts.

It seems quite valid to say that there is a connection between our language and our expression (in terms of ideas, spoken word, programs, and more).

When trying to find a source for Orwell’s words (which you can find here), I was heartened to find these sentences follow those quoted above:

The point is that the process is reversible.  Modern English, especially written English, is full of bad habits which spread by imitation and which can be avoided if one is willing to take the necessary trouble.

Which I believe is also applicable to the expression and optimisation of computer programs.