Bugs of the day, Tuesday 2013-03-19

Changed virtual function prototype results in correct function not being called

This one wasn’t a bug that landed immediately from my change, but happened when a subsequent code branch was integrated. I’m going to claim this as my bug as it could have been prevented (or at least made more obvious) with some extra though.

The problem:

We have a virtual function, something like

class Base {
public:
  virtual void Foo(int some_param)  {  }
};

Which I modified a few weeks back by adding an extra parameter, so foo() became

  virtual void foo(int some_param, WhatsitType some_other_param) { }

and I added the same parameter to every one of the classes that inherited from class Base. There were quite a few.

Sometime later,  code was integrated from another branch where the old prototype was still in use, including something like this:

class NewInheritor : public Base {
public:
  virtual void Foo(int some_param) { // oh noes - it's the old prototype!
    DoAmazingThings();
  }
};

The new code compiled without warning, but at runtime, NewInheritor::Foo() wasn’t being called, so the amazing things were failing to be done. Debugging this was somewhat inefficient — the process for calling Foo() methods includes a number of steps, and I failed to notice the missing parameter for quite a while.

The fix:

Update NewInheritor::Foo() to match the parent class.

Preventing these in future:

Remember that virtual inheritance is a horribly fragile beast.

One thing that may have worked is to make Foo() pure (compile error) — or undefined (link error) — in the base class (i.e. void Foo() = 0;). This isn’t a great match in this case as inheriting classes don’t need to have their own implementation of Foo() to be valid, and requiring it makes for some additional tedious boilerplate code in every inheritor (of which there are many).

The override keyword will help detect this at compile time, but it relies on the implementer of NewInheritor to have used it :\ The action here is to recommend that people use override on every virtual function override.

In this case, what I do know about this code is that it is never valid to call Foo() on a class that doesn’t have its own implementation. We can put an assert() in Base::Foo() and let the writers of inheriting types (or me when next I get to debug one of these) have a more obvious clue about what’s happening. Adequately functional.

memcpy() into an uninitialized pointer variable

The problem:

I had changed the type of a variable from being a static array to being a pointer, to match a change I’d made to a particular function. What I hadn’t paid attention to was that this array was being used as the destination address for a call to memcpy(). An uninitialized pointer on the stack is not a good place to try to copy things.

The call to memcpy() did not cause the program to crash. The program did crash shortly thereafter, though, which distracted me for a short while from the cause.

The fix:

Explicitly allocate some space for the copy.

Preventing in the future:

The pointer on the stack had not been set. A null pointer would have caused the crash to happen at the call to memcpy(), making the cause more obvious.

Bad loop counting

The problem:

int max = m_Max;
int i = 0;
for( ; i < max; ++max )
  if( a[i] == bar )
    break;

++max? Yeesh.

m_Max is initialized to zero, which means that this was a no-op rather than an infinite loop.

The fix:

Increment the counter, not the upper bound.

Preventing in the future:

Don’t be a goose.

Failing to store updated state

The problem:

Immediately following the code above, in the case where we failed to find the desired bar in the array, increase the size of the array. And then fail to store the new max size into m_Max.

I even got this wrong twice: my first attempt was to replace a use of max+1 with ++max.

The fix:

Write the necessary state back to the object: ++m_Max.

Preventing in the future:

Pay attention to where the state should go. Don’t be a goose.

Stop the bugs

I like to try and fix as many of the bugs that I write as early as possible, before they’re deployed to the rest of the team — I hate inconveniencing others with my own defect-laden code. As such, my preference is to do what I can to catch the bugs I write as early as I can.

I rely very heavily on the compiler to catch particular classes of bugs for me, to the point that (unlike my early computer science lecturers) I don’t look at a compile error to be a bug: the compiler is very good at reminding me about parts of the code I’m writing that I haven’t finished changing yet, and does so quickly and with minimal fuss. If I change the assumptions or guarantees of a particular function, deliberately changing the function so that callers will no longer compile is — for some projects — not a terrible way to start tracing through parts of a codebase that might be affected. This can include changing arguments to a different order, more args, fewer args, changing the function name, #if 0/#endif around the function, etc.

There are various scenarios where this may not be so useful — very widely used functions, virtual functions, changes to a library used by a number of disparate programs, etc — but I find that it can be more efficient than grepping through a large number of files, particularly when there’s some ambiguity in the symbol name you’re searching for.

Some other ways to have the compiler error for you include leaving symbols undefined so that the compiler error list becomes something of a todo list, making class members private, removing methods from a base class. There are probably others I use. Many of these changes can be applied as a short-lived change to learn more about potential consequences of a change, and are typically easily revertable (assuming the use of any basic version control system).

Compiler warnings are also well worth considering and, if you operate with warnings-as-errors, impossible to avoid. I’ve found plenty of bugs that would have be horrible to diagnose from their symptoms thanks to the warnings of a good compiler.

Compile-time/static asserts are also a good thing. You can never have too many of those.

After the compile-time stuff, next is linker errors. If it all compiles but won’t link, the undefined symbol list can again become that todo list of functions that you haven’t yet got around to implementing. The downside with link errors is that for a larger program it takes a lot longer to compile and attempt linking the whole thing before you find out about the work yet to be done. There are other tricks that can be done with the linker to deliberately break the build, but that’s beyond what I want to write about right now.

When it all compiles and links, finding the bugs is a matter for the runtime. You need to run the code and find out if it will work or not (for the purpose of this article, I’m going to treat “running a test suite” as “running the code”). Any defects present at this stage typically take more time to identify, and more time to fix — even if it is something blatantly obvious, the edit, compile, link, run cycle can be much longer than just edit, compile.

It’s at this point I get to my intended purpose of this blog post: I want to write fewer bugs. Various classes of bugs are prevented for me by the compiler and linker, but if it’s something that gets all the way to failing at runtime (whatever that means), I’m left disappointed. What to be done about it?

What I’m going to try today and, hopefully, continue to do after today, is keep track of bugs I wrote that survived to the point of being run. I’ll give some thought to how they got there and consider if there was something I could have done (and can do in the future) to prevent that kind of bug. And maybe I’ll find out if I keep making the same kinds mistakes over and over again, or if I can learn to write fewer bugs.

I shall attempt to begin with today’s bugs in another post.

Some notes on memcpy

Recently I encountered a bug due to memcpy() being used like this:

    memcpy( elem, elem + 1, n * sizeof(Element) );

What’s the problem?

The prototype for memcpy is defined something like:

    void* memcpy(void* dest, const void* source, size_t size);

memcpy is wonderfully simple – use it to copy size bytes from source to dest. But the definition of memcpy from the standard goes something like this:

“The memcpy() function shall copy n bytes from the object pointed to by s2 into the object pointed to by s1. If copying takes place between objects that overlap, the behavior is undefined.” [source]

If you notice the example at the top of the email, the copy is from elem + 1 to elem with a size of (potentially) more than one Element. It’s quite possible that objects overlap. And they did.

The magic word in the definition is undefined. When you do something that is undefined in C or C++, anything can happen. The compiler can format your hard drive, email a resignation letter to your manager, there could be nose goblins. Or your data could become corrupt. Relying on undefined behavior is a bad idea.

Nose goblins notwithstanding, what can actually go wrong for memcpy? If memcpy is implemented as walking forward through the source range and writing to the destination, the above case should be OK, right?

Well, maybe.

But if it starts at the end of the source array and works backwards, the destination range will end up with corrupt data.

Probably.

Even if memcpy walks forward, data could be corrupted if the source has a lower address than the destination (and they overlap). There are plenty of ways to write a fast memcpy implementation that can corrupt data for overlapping ranges. And implementations that will cause problems exist and are widely used: e.g. http://lwn.net/Articles/414467/ Even better are implementations that will step forward through memory for small values of size, and backwards for larger ones.

Conclusion: don’t use memcpy for overlapping ranges. Instead, use memmove which is well defined if the ranges overlap:

“The memmove() function shall copy n bytes from the object pointed to by s2 into the object pointed to by s1. Copying takes place as if the n bytes from the object pointed to by s2 are first copied into a temporary array of n bytes that does not overlap the objects pointed to by s1 and s2, and then the n bytes from the temporary array are copied into the object pointed to by s1.” — [source]

(With thanks to Andreas for his help in finding the source of the bug that inspired this piece)

Assembly Primer Part 7 — Working with Strings — ARM

These are my notes for where I can see ARM varying from IA32, as presented in the video Part 7 — Working with Strings.

I’ve not remotely attempted to implement anything approximating optimal string operations for this part — I’m just working my way through the examples and finding obvious mappings to the ARM arch (or, at least what seem to be obvious). When I do something particularly stupid, leave a comment and let me know :)

Working with Strings

.data
     HelloWorldString:
        .asciz "Hello World of Assembly!"
    H3110:
        .asciz "H3110"

.bss
    .lcomm Destination, 100
    .lcomm DestinationUsingRep, 100
    .lcomm DestinationUsingStos, 100

Here’s the storage that the provided example StringBasics.s uses. No changes are required to compile this for ARM.

1. Simple copying using movsb, movsw, movsl

    @movl $HelloWorldString, %esi
    movw r0, #:lower16:HelloWorldString
    movt r0, #:upper16:HelloWorldString

    @movl $Destination, %edi
    movw r1, #:lower16:Destination
    movt r1, #:upper16:Destination

    @movsb
    ldrb r2, [r0], #1
    strb r2, [r1], #1

    @movsw
    ldrh r3, [r0], #2
    strh r3, [r1], #2

    @movsl
    ldr r4, [r0], #4
    str r4, [r1], #4

More visible complexity than IA32, but not too bad overall.

IA32’s movs instructions implicitly take their source and destination addresses from %esi and %edi, and increment/decrement both. Because of ARM’s load/store architecture, separate load and store instructions are required in each case, but there is support for indexing of these registers:

ARM addressing modes

According to ARM A8.5, memory access instructions commonly support three addressing modes:

  • Offset addressing — An offset is applied to an address from a base register and the result is used to perform the memory access. It’s the form of addressing I’ve used in previous parts and looks like [rN, offset]
  • Pre-indexed addressing — An offset is applied to an address from a base register, the result is used to perform the memory access and also written back into the base register. It looks like [rN, offset]!
  • Post-indexed addressing — An address is used as-is from a base register for memory access. The offset is applied and the result is stored back to the base register. It looks like [rN], offset and is what I’ve used in the example above.

2. Setting / Clearing the DF flag

ARM doesn’t have a DF flag (to the best of my understanding). It could perhaps be simulated through the use of two instructions and conditional execution to select the right direction. I’ll look further into conditional execution of instructions on ARM in a later post.

3. Using Rep

ARM also doesn’t appear to have an instruction quite like IA32’s rep instruction. A conditional branch and a decrement will be the long-form equivalent. As branches are part of a later section, I’ll skip them for now.

    @movl $HelloWorldString, %esi
    movw r0, #:lower16:HelloWorldString
    movt r0, #:upper16:HelloWorldString

    @movl $DestinationUsingRep, %edi
    movw r1, #:lower16:DestinationUsingRep
    movt r1, #:upper16:DestinationUsingRep

    @movl $25, %ecx # set the string length in ECX
    @cld # clear the DF
    @rep movsb
    @std

    ldm r0!, {r2,r3,r4,r5,r6,r7}
    ldrb r8, [r0,#0]
    stm r1!, {r2,r3,r4,r5,r6,r7}
    strb r8, [r1,#0]

To avoid conditional branches, I’ll start with the assumption that the string length is known (25 bytes). One approach would be using multiple load instructions, but the load multiple (ldm) instruction makes it somewhat easier for us — one instruction to fetch 24 bytes, and a load register byte (ldrb) for the last one. Using the ! after the source-address register indicates that it should be updated with the address of the next byte after those that have been read.

The storing of the data back to memory is done analogously. Store multiple (stm) writes 6 registers×4 bytes = 24 bytes (with the ! to have the destination address updated). The final byte is written using strb.

4. Loading string from memory into EAX register

    @cld
    @leal HelloWorldString, %esi
    movw r0, #:lower16:HelloWorldString
    movt r0, #:upper16:HelloWorldString

    @lodsb
    ldrb r1, [r0, #0]

    @movb $0, %al
    mov r1, #0

    @dec %esi  @ unneeded. equiv: sub r0, r0, #1
    @lodsw
    ldrh r1, [r0, #0]

    @movw $0, %ax
    mov r1, #0

    @subl $2, %esi # Make ESI point back to the original string. unneeded. equiv: sub r0, r0, #2
    @lodsl
    ldr r1, [r0, #0]

In this section, we are shown how the IA32 lodsb, lodsw and lodsl instructions work. Again, they have implicitly assigned register usage, which isn’t how ARM operates.

So, instead of a simple, no-operand instruction like lodsb, we have a ldrb r1, [r0, #0] loading a byte from the address in r0 into r1. Because I didn’t use post indexed addressing, there’s no need to dec or subl the address after the load. If I were to do so, it could look like this:

    ldrb r1, [r0], #1
    sub r0, r0, #1

    ldrh r1, [r0], #2
    sub r0, r0, #2

    ldr r1, [r0], #4

If you trace through it in gdb, look at how the value in r0 changes after each instruction.

5. Storing strings from EAX to memory

    @leal DestinationUsingStos, %edi
    movw r0, #:lower16:DestinationUsingStos
    movt r0, #:upper16:DestinationUsingStos

    @stosb
    strb r1, [r0], #1
    @stosw
    strh r1, [r0], #2
    @stosl
    str r1, [r0], #4

Same kind of thing as for the loads. Writes the letters in r1 (being “Hell” — leftovers from the previous section) into DestinationUsingStos (the result being “HHeHell”). String processing on little endian architectures has its appeal.

6. Comparing Strings

    @cld
    @leal HelloWorldString, %esi
    movw r0, #:lower16:HelloWorldString
    movt r0, #:upper16:HelloWorldString
    @leal H3110, %edi
    movw r1, #:lower16:H3110
    movt r1, #:upper16:H3110

    @cmpsb
    ldrb r2, [r0,#0]
    ldrb r3, [r1,#0]
    cmp r2, r3

    @dec %esi
    @dec %edi
    @not needed because of the addressing mode used

    @cmpsw
    ldrh r2, [r0,#0]
    ldrh r3, [r1,#0]
    cmp r2, r3

    @subl $2, %esi
    @subl $2, %edi
    @not needed because of the addressing mode used
    @cmpsl
    ldr r2, [r0,#0]
    ldr r3, [r1,#0]
    cmp r2, r3

Where IA32’s cmps instructions implicitly load through the pointers in %edi and %esi, explicit loads are needed for ARM. The compare then works in pretty much the same way as for IA32, setting condition code flags in the current program status register (cpsr). If you run the above code, and check the status registers before and after execution of the cmp instructions, you’ll see the zero flag set and unset in the same way as is demonstrated in the video.

The condition code flags are:

  • bit 31 — negative (N)
  • bit 30 — zero (Z)
  • bit 29 — carry (C)
  • bit 28 — overflow (V)

There’s other flags in that register — all the details are on page B1-16 and B1-17 in the ARM Architecture Reference Manual.

And with that, I think we’ve made it (finally) to the end of this part for ARM.

Other assembly primer notes are linked here.

{1,2,3,4}

(This is wonderfully obtuse, but amused me :)

Neil Henning (@sheredom) asked:

SPU gurus of twitter unite, want a vector unsigned int with {1, 2, 3, 4} in each slot, without putting it in as elf constant, any ideas?

Interesting question. The SPU ISA generally doesn’t help build vectors with different values in each slot. In this case, there are only very small values required in each register, so it can be done with a neat little trick.

My answer:

    fsmbi r4, 0x7310  # r4 = {0x00ffffff, 0x0000ffff, 0x000000ff, 0x00000000}
    clz r5, r4        # r5 = {8,16,24,32}
    rotmi r6, r5, -3  # r6 = {1,2,3,4}

Instructions are:

  • fsmbi — form select mask byte immediate. Creates a 128 bit mask from a 16 bit value, expanding each bit of input to 8 bits of output.
  • clz — count leading zeroes. Counts the number of leading zeros in each word.
  • rotmi — rotate and mask word immediate (logical shift right by negative immediate). Shifts each word right by the negation of number of bits specified.

This solution is entirely self contained, required no pre-set state (unlike my first attempt utilising the cbd instruction). In terms of raw instruction size, it’s a whole eight bytes smaller than storing the vector in memory and loading it when needed (that being 16+4 bytes), and a little slower than using a load instruction.

(On a cursory re-examination of the SPU ISA, fsmbi is the only instruction that will construct a different value in each word of a register. A specific pattern may be generated with cbd/cbx that can be used for this problem, but it depends on the contents of another register which limits its already limited usefulness. Combining fsmbi with other immediate instructions allows for a wide range of values to be constructed independent of register state and without access to storage)

Assembly Primer Parts 6 — Moving Data — ARM

My notes for where ARM differs from IA32 in the Assembly Primer video Part 6 — Moving Data.

(There is no separate part 5 post for ARM — apart from the instructions, it’s identical to IA32. There’s even support for the .bss section, unlike SPU and PPC)

Moving Data

We’ll look at MovDemo.s for ARM. First, the storage:

.data

    HelloWorld:
        .ascii "Hello World!"

    ByteLocation:
        .byte 10

    Int32:
        .int 2
    Int16:
        .short 3
    Float:
        .float 10.23

    IntegerArray:
        .int 10,20,30,40,50

It’s the same as for IA32, PPC and SPU. Like the first two, ARM will cope with the unnatural alignment.

1. Immediate value to register

.text
.globl _start
_start:
    @movl $10, %eax

    mov r0, #10

Move the value 10 into register r0.

Something to note: the ARM assembly syntax has some slightly differences. Where others use # to mark the start of a comment, ARM has @ (although # works at the start of a line). Literal values are prefixed with #, which confuses the default syntax highlighting in vim.

2. Immediate value to memory

    @movw $50, Int16

    mov r1, #50
    movw r0, #:lower16:Int16
    movt r0, #:upper16:Int16
    strh r1, [r0, #0]

We need to load the immediate value in a register (r1), the address in a register (r0) and then perform the write. To quote the Architecture Reference Manual:

The ARM architecture … incorporates … a load/store architecture, where data processing operations only operate on register contents, not directly on memory contents.

which is like PPC and SPU, and unlike IA32 — and so we’ll see similarly verbose alternatives to the IA32 examples from the video.

I’m using movw, movt sequence to load the address, rather than ldr (as mentioned in the previous installment).

strh is, in this case, Store Register Halfword (immediate) — writes the value in r1 to the address computed from the sum of the contents of r0 and the immediate value of 0.

3. Register to register

    @movl %eax, %ebx

    mov r1,r0

mov (Move) copies the value from r0 to r1.

4. Memory to register

    @movl Int32, %eax

    movw r0, #:lower16:Int32
    movt r0, #:upper16:Int32
    ldr r1, [r0, #0]

Load the address into r0, load from the address r0+0. Here ldr is Load Register (immediate).

5. Register to memory

    @movb $3, %al
    @movb %al, ByteLocation

    mov r0, #3
    movw r1, #:lower16:ByteLocation
    movt r1, #:upper16:ByteLocation
    strb r0, [r1, #0]

Once again the same kind of thing — load 3 into r0, the address of ByteLocation into r1, perform the store.

6. Register to indexed memory location

    @movl $0, %ecx
    @movl $2, %edi
    @movl $22, IntegerArray(%ecx, %edi, 4)

    movw r0, #:lower16:IntegerArray
    movt r0, #:upper16:IntegerArray
    mov r1, #2
    mov r2, #22
    str r2, [r0, r1, lsl #2]

A little more interesting — here str is Store Register (register) which accepts two registers and an optional shift operation and amount. Here lsl is logical shift left, effectively multiplying r1 by 4 — the size of the array elements.

(GCC puts asl here. Presumably identical to logical shift left, but there’s no mention of asl in the Architecture Reference Manual. Update: ASL is referenced in the list of errors here as an obsolete name for LSL)

Two source registers and a shift is still shy of IA32’s support for an calculating an address from a base address, two registers and a multiply.

7. Indirect addressing

    @movl $Int32, %eax
    @movl (%eax), %ebx

    movw r0, #:lower16:Int32
    movt r0, #:upper16:Int32
    ldr r1, [r0, #0]

    @movl $9, (%eax)

    mov r2, #9
    str r2, [r0, #0]

More of the same.

Concluding thoughts

In addition to the cases above, ARM has a number of other interesting addressing modes that I shall consider in more detail in the future — logical operations, auto-{increment, decrement} and multiples. Combined with conditional execution, there are some very interesting possibilities.

Other assembly primer notes are linked here.

Assembly Primer Part 4 — Hello World — ARM

On to Assembly Primer — Part 4. This is where we start writing a small assembly program for the platform. In this case, I don’t know the language and I don’t know the ABI. Learning these from scratch ranges from interesting to tedious :)

Regarding the language (available instructions, mnemonics and assembly syntax): I’m using the ARM Architecture Reference Manual as my reference for the architecture (odd, I know). It’s very long and the documentation for each instruction is extensive — which is good because there are a lot of instructions, and many of them do a lot of things at once.

Regarding the ABI (particularly things like argument passing, return values and system calls): there’s the Procedure Call Standard for the ARM Architecture, and there are a few other references I’ve found, such as the Debian ARM EABI Port wiki page.

“EABI is the new “Embedded” ABI by ARM ltd. EABI is actually a family of ABI’s and one of the “subABIs” is GNU EABI, for Linux.”

– from Debian ARM EABI Port

System Calls

To perform a system call using the GNU EABI:

  • put the system call number in r7
  • put the arguments in r0-r6 (64bit arguments must be aligned to an even numbered register i.e. in r0+r1, r2+r3, or r4+r5)
  • issue the Supervisor Call instruction with a zero operand — svc #0

(Supervisor Call was previously named Software Interruptswi)

Just Exit

Based on the above, it’s not difficult to reimplement JustExit.s (original) for ARM.

.text

.globl _start

_start:
        mov r7, #1
        mov r0, #0
        svc #0

mov here is Move (Immediate) which puts the #-prefixed literal into the named register.

Hello World

Likewise, the conversion of HelloWorldProgram.s (original) is not difficult:

.data 

HelloWorldString:
      .ascii "Hello World\n"

.text 

.globl _start 

_start:
      # Load all the arguments for write () 

      mov r7, #4
      mov r0, #1
      ldr r1,=HelloWorldString
      mov r2, #12
      svc #0

      # Need to exit the program 

      mov r7, #1
      mov r0, #0
      svc #0

This includes the load register pseudo-instruction, ldr — the compiler stores the address of HelloWorldString into the literal pool, a portion of memory located in the program text, and the 32bit address is loaded from the literal pool (more details).

When compiling a similar C program with -mcpu=cortex-a8, I notice that the compiler generates Move (immediate) and Move Topmovw and movt — instructions to load the address directly from the instruction stream, which is presumably more efficient on that architecture.

Assembly Primer Parts 1, 2 and 3 — ARM

I had started a series of posts on assembly programming for the Cell BE PPU and SPU, based on the assembly primer video series from securitytube.net. I have recently acquired a Nokia N900, and so thought I might take the opportunity to continue the series with a look at the ARM processor as well.

Wikipedia lists the N900’s processor as a Texas Instruments OMAP3430, 600MHz ARMv7 Cortex-A8. I’m not at all familiar with the processor family, so I’ll be attempting to find out what all of this means as I go :P

I’ve set up a cross compiler on my desktop machine using Gentoo’s neat crossdev tool (built using crossdev -t arm-linux-gnueabi). The toolchain builds a functional Hello, World!

(I note that scratchbox appears to be the standard tool/environment used to build apps for Maemo — I may take a closer look at that at a later date)

I have whatever the latest public ‘stable’ Maemo 5 release is on the N900 (PR 1.3, I think),  with an apt-get install openssh gdb — thus far, enough to “debug” a functional Hello, World!

What follows are some details of the Cortex-A8 architecture present in the N900, particularly in how it differs from IA32, as presented in the videos Part 1 — System Organisation, Part 2 — Virtual Memory Organization and Part 3 — GDB Usage Primer. I’ve packed them all into this post because gdb usage and Linux system usage are largely the same on ARM as they are on PPC and IA32.

Most of the following information comes from the ARM Architecture Reference Manual.

(The number of possible configurations of ARM hardware makes it interesting at times to work out exactly which features are present in my particular processor. From what I can tell, the N900’s Cortex-A8 is ARMv7-A and includes VFPv3 half, single and double precision float support, and NEON (aka Advanced SIMD). I expect I’ll find out more when I actually start to try and program the thing. As to which gcc -march, -mcpu or -mfpu options are most correct for the N900 — I have no idea.)

1. Registers

Integer

There are sixteen 32bit ARM core registers, R0 to R15, where R0–R12 are for general use. R13 contains the stack pointer (SP), R14 is the link register (LR), and R15 is the program counter (PC).

The current program status register (CSPR) contains various status and control bits.

VFPv3 (Floating point) & NEON (Advanced SIMD)

There are thrirty two doubleword (64bit) registers, that can be referenced in a number of ways.

NEON instructions can access these as thirty two doubleword registers (D0–D31) or as sixteen quadword registers (Q0–Q15), able to be used interchangeably.

VFP instructions can view the same registers as 32 doubleword registers (again, D0–D31) or as 32 single word registers (S0–S31). The single word view is packed into the first 16 doubleword registers.

Something like this pic (click to embiggen):

VFP in this core (apparently) supports single and double precision floating point data types and arithmetic, as well as half precision (possibly in two different formats…).

NEON instructions support accessing values in extension registers as

  • 8, 16, 32 or 64bit integer, signed or unsigned,
  • 16 or 32bit floating point values, and
  • 8 or 16bit polynomial values.

There’s also a floating point status and control register (FPSCR).

2. Virtual Memory Organisation

On this platform, program text appears to be loaded at 0x8000.

After an echo 0 > /proc/sys/kernel/randomize_va_space, the top of the stack appears to be 0xbf000000.

3. SimpleDemo

Compared to the video, there are only a couple of small differences when running SimpleDemo in gdb on ARM.

Obviously, the disassembly is not the same as for IA32. Rather than the call instructions noted in the video, you’ll see bl (Branch with Link) for the various functions called.

Where the return address is stored on the stack for IA32, the link register (lr in info registers output) stores the return address for the current function, although lr will be pushed to the stack before another function is called.

(From a cursory googling, it seems that to correctly displaying all VFP/NEON registers requires gdb-7.2 — I’m running the 6.8-based build from the maemo repo. crossdev will build me a gdb I can run on my desktop PC — crossdev -t arm-linux-gnueabi –ex-gdb — but I believe I still need to build a newer gdbserver to run on the N900.)

Other assembly primer notes are linked here.

Proposed updates for praypal.org.au 2011/02/11

Diamond-Square — Randomness, continuity and animation

Randomness is nice, except when you have too much.

It’s easy to blast random pixels all over the screen, but it’s much nicer when there’s some continuity in the randomness.

For diamond-square interpolation, I was looking for a way to generate (and use) random seed values to produce continuous-looking results within and between calculated tiles, as well as varying these seed values over time in a somewhat appealing fashion.

To calculate the values for a given tile requires the corner values of that tile along with the corner values of the eight surrounding tiles — that is, all the blue pixels in the picture are needed to calculate the pixels required to calculate the pixels in the centre tile.

The corner values from one tile effects the calculation of the tiles adjacent.  For this reason, the corner values of each tile are state that must remain constant for the entire frame.

To render a full 1920×1080 frame, 15×9 tiles are required, with 16×10 corner values.  To calculate the tiles around the frame border, further un-rendered tile data is required to correctly seed each of these border tiles.  The result of this is that 18×12 random pixel values are needed to define the full set of tiles for a 1080p frame, which means the seeds for the tiles of a frame can be stored in 18x12x4 = 864 bytes.  Not unreasonable.

When rendering each tile, these seed values are first placed in the appropriate 16 locations (taking into account the compressed outer regions), and the interpolation takes place.  (Interpolation is performed on the seed values and, later, using only values calculated during a previous iteration — which means there’s no requirement to zero memory between tiles)

The result with just this looks quite reasonable (imho) — there’s decent interpolation of random values across the screen.  These random seed values can also be permuted without too much difficulty between frames — I’ve tried some trig-function based methods to give a sense of continuous, directional changes.  It works but I’m sure there’s better options.

As for the random perturbing of interpolated values, I’ve tried a few methods and not achieved anything that I’ve been all that happy with.  The problem I’ve had is finding/generating a suitably random value of appropriate scale, and applying it to the pixel value fast enough.

(As I reach the end of this, I have the impression that what I’ve put together is almost, but not entirely unlike, Perlin noise —  you could possibly say that it has a really clumsy interpolation function.  I have a much better understanding of Perlin noise than when I began, and it really looks like I should have tried that.  Learning is Good.)

At some stage in the not too distant future, I hope to perform some screencaps and post some actual images and maybe even a video.  And fwiw I’ll post the source code, too.

Diamond-Square — DMA lists and dirty tricks

There’s a hole in my dataset, dear MFC, dear MFC…

There’s a block of local store that holds a tile of pixels that need to be DMAd out to main memory.  128 lines makes performing many smaller transfers more complex , so we’ll try DMA lists why not.  The problem here is that the pixel data for each line does not match up directly with the next — between each is stored the extra lines of extra-tile information that was needed for the calculation.  When using DMA lists to transfer data to many memory, the source for each DMA transfer starts where the previous one ended. This means that between each DMA there’s (148-128)×4=80 bytes of information that we don’t want to see on the screen.

There’s a lot of things that are “best-practice” for SPU DMA, particularly relating to size and alignment of transfers, and I’m typically a religious adherent.  In this case, I did not comply with best practice and still met my time budget and only feel a small amount of shame :P

Overall, less time is spent performing DMA than is spent calculating pixels, so further optimising DMA for this particular program is unnecessary.

When transferring tiles from the SPU to the framebuffer, there’s three cases of particular interest:

  1. Tiles on the right edge of the screen
  2. Tiles on the lower edge of the screen
  3. The Other Tiles
Tile data in local store

3. The Other Tiles

These are the easy ones.  They will never need to be trimmed to fit the screen edges, and if drawn in the right order have the wonderful characteristic that the extra data needed can be DMAd to the framebuffer and will be overwritten by pixel data from a later tile.  There’s no special case needed, just draw each pixel line and — all data between pixel lines — to the screen.

Tile data drawn to screen

(For the diagrams, the amount of overdraw is far greater than the actual tile part — this is a consequence of the small size.  It’s a much smaller percentage for 128 pixel tiles.  I’ll post some actual screengrabs here sometime…)

1. Tiles on the right edge of the screen

Overdraw isn’t the answer in this case. It is not possible to overdraw on either the left or right of the rightmost tile in a way that will be correct when the screen is finished.  Instead, the extra information (including any portion that may not fit onto the visible screen) must be dealt with some other way.

My solution, ugly as it is, is to write each surplus portion of a tile to a scratch location in memory — every one of them to the same location.  It works :|

2. Tiles on the lower edge of the screen

These tiles are really just like the others, except they’ll stop a few lines short.  They’re still fully calculated, but only the visible lines are transferred.

(In hindsight, increasing the spacing between lines would help reduce the alignment and size problem here.  Adding an extra 48 bytes to each tile would allow every transfer to be optimally aligned and sized.  And would probably make no measurable difference to the total runtime.  Heck, there’s probably enough time to repack the data in local store before performing the DMA. There’s not that much…)