Assembly Primer Part 7 — Working with Strings — ARM

These are my notes for where I can see ARM varying from IA32, as presented in the video Part 7 — Working with Strings.

I’ve not remotely attempted to implement anything approximating optimal string operations for this part — I’m just working my way through the examples and finding obvious mappings to the ARM arch (or, at least what seem to be obvious). When I do something particularly stupid, leave a comment and let me know :)

Working with Strings

.data
     HelloWorldString:
        .asciz "Hello World of Assembly!"
    H3110:
        .asciz "H3110"

.bss
    .lcomm Destination, 100
    .lcomm DestinationUsingRep, 100
    .lcomm DestinationUsingStos, 100

Here’s the storage that the provided example StringBasics.s uses. No changes are required to compile this for ARM.

1. Simple copying using movsb, movsw, movsl

    @movl $HelloWorldString, %esi
    movw r0, #:lower16:HelloWorldString
    movt r0, #:upper16:HelloWorldString

    @movl $Destination, %edi
    movw r1, #:lower16:Destination
    movt r1, #:upper16:Destination

    @movsb
    ldrb r2, [r0], #1
    strb r2, [r1], #1

    @movsw
    ldrh r3, [r0], #2
    strh r3, [r1], #2

    @movsl
    ldr r4, [r0], #4
    str r4, [r1], #4

More visible complexity than IA32, but not too bad overall.

IA32’s movs instructions implicitly take their source and destination addresses from %esi and %edi, and increment/decrement both. Because of ARM’s load/store architecture, separate load and store instructions are required in each case, but there is support for indexing of these registers:

ARM addressing modes

According to ARM A8.5, memory access instructions commonly support three addressing modes:

  • Offset addressing — An offset is applied to an address from a base register and the result is used to perform the memory access. It’s the form of addressing I’ve used in previous parts and looks like [rN, offset]
  • Pre-indexed addressing — An offset is applied to an address from a base register, the result is used to perform the memory access and also written back into the base register. It looks like [rN, offset]!
  • Post-indexed addressing — An address is used as-is from a base register for memory access. The offset is applied and the result is stored back to the base register. It looks like [rN], offset and is what I’ve used in the example above.

2. Setting / Clearing the DF flag

ARM doesn’t have a DF flag (to the best of my understanding). It could perhaps be simulated through the use of two instructions and conditional execution to select the right direction. I’ll look further into conditional execution of instructions on ARM in a later post.

3. Using Rep

ARM also doesn’t appear to have an instruction quite like IA32’s rep instruction. A conditional branch and a decrement will be the long-form equivalent. As branches are part of a later section, I’ll skip them for now.

    @movl $HelloWorldString, %esi
    movw r0, #:lower16:HelloWorldString
    movt r0, #:upper16:HelloWorldString

    @movl $DestinationUsingRep, %edi
    movw r1, #:lower16:DestinationUsingRep
    movt r1, #:upper16:DestinationUsingRep

    @movl $25, %ecx # set the string length in ECX
    @cld # clear the DF
    @rep movsb
    @std

    ldm r0!, {r2,r3,r4,r5,r6,r7}
    ldrb r8, [r0,#0]
    stm r1!, {r2,r3,r4,r5,r6,r7}
    strb r8, [r1,#0]

To avoid conditional branches, I’ll start with the assumption that the string length is known (25 bytes). One approach would be using multiple load instructions, but the load multiple (ldm) instruction makes it somewhat easier for us — one instruction to fetch 24 bytes, and a load register byte (ldrb) for the last one. Using the ! after the source-address register indicates that it should be updated with the address of the next byte after those that have been read.

The storing of the data back to memory is done analogously. Store multiple (stm) writes 6 registers×4 bytes = 24 bytes (with the ! to have the destination address updated). The final byte is written using strb.

4. Loading string from memory into EAX register

    @cld
    @leal HelloWorldString, %esi
    movw r0, #:lower16:HelloWorldString
    movt r0, #:upper16:HelloWorldString

    @lodsb
    ldrb r1, [r0, #0]

    @movb $0, %al
    mov r1, #0

    @dec %esi  @ unneeded. equiv: sub r0, r0, #1
    @lodsw
    ldrh r1, [r0, #0]

    @movw $0, %ax
    mov r1, #0

    @subl $2, %esi # Make ESI point back to the original string. unneeded. equiv: sub r0, r0, #2
    @lodsl
    ldr r1, [r0, #0]

In this section, we are shown how the IA32 lodsb, lodsw and lodsl instructions work. Again, they have implicitly assigned register usage, which isn’t how ARM operates.

So, instead of a simple, no-operand instruction like lodsb, we have a ldrb r1, [r0, #0] loading a byte from the address in r0 into r1. Because I didn’t use post indexed addressing, there’s no need to dec or subl the address after the load. If I were to do so, it could look like this:

    ldrb r1, [r0], #1
    sub r0, r0, #1

    ldrh r1, [r0], #2
    sub r0, r0, #2

    ldr r1, [r0], #4

If you trace through it in gdb, look at how the value in r0 changes after each instruction.

5. Storing strings from EAX to memory

    @leal DestinationUsingStos, %edi
    movw r0, #:lower16:DestinationUsingStos
    movt r0, #:upper16:DestinationUsingStos

    @stosb
    strb r1, [r0], #1
    @stosw
    strh r1, [r0], #2
    @stosl
    str r1, [r0], #4

Same kind of thing as for the loads. Writes the letters in r1 (being “Hell” — leftovers from the previous section) into DestinationUsingStos (the result being “HHeHell”). String processing on little endian architectures has its appeal.

6. Comparing Strings

    @cld
    @leal HelloWorldString, %esi
    movw r0, #:lower16:HelloWorldString
    movt r0, #:upper16:HelloWorldString
    @leal H3110, %edi
    movw r1, #:lower16:H3110
    movt r1, #:upper16:H3110

    @cmpsb
    ldrb r2, [r0,#0]
    ldrb r3, [r1,#0]
    cmp r2, r3

    @dec %esi
    @dec %edi
    @not needed because of the addressing mode used

    @cmpsw
    ldrh r2, [r0,#0]
    ldrh r3, [r1,#0]
    cmp r2, r3

    @subl $2, %esi
    @subl $2, %edi
    @not needed because of the addressing mode used
    @cmpsl
    ldr r2, [r0,#0]
    ldr r3, [r1,#0]
    cmp r2, r3

Where IA32’s cmps instructions implicitly load through the pointers in %edi and %esi, explicit loads are needed for ARM. The compare then works in pretty much the same way as for IA32, setting condition code flags in the current program status register (cpsr). If you run the above code, and check the status registers before and after execution of the cmp instructions, you’ll see the zero flag set and unset in the same way as is demonstrated in the video.

The condition code flags are:

  • bit 31 — negative (N)
  • bit 30 — zero (Z)
  • bit 29 — carry (C)
  • bit 28 — overflow (V)

There’s other flags in that register — all the details are on page B1-16 and B1-17 in the ARM Architecture Reference Manual.

And with that, I think we’ve made it (finally) to the end of this part for ARM.

Other assembly primer notes are linked here.

{1,2,3,4}

(This is wonderfully obtuse, but amused me :)

Neil Henning (@sheredom) asked:

SPU gurus of twitter unite, want a vector unsigned int with {1, 2, 3, 4} in each slot, without putting it in as elf constant, any ideas?

Interesting question. The SPU ISA generally doesn’t help build vectors with different values in each slot. In this case, there are only very small values required in each register, so it can be done with a neat little trick.

My answer:

    fsmbi r4, 0x7310  # r4 = {0x00ffffff, 0x0000ffff, 0x000000ff, 0x00000000}
    clz r5, r4        # r5 = {8,16,24,32}
    rotmi r6, r5, -3  # r6 = {1,2,3,4}

Instructions are:

  • fsmbi — form select mask byte immediate. Creates a 128 bit mask from a 16 bit value, expanding each bit of input to 8 bits of output.
  • clz — count leading zeroes. Counts the number of leading zeros in each word.
  • rotmi — rotate and mask word immediate (logical shift right by negative immediate). Shifts each word right by the negation of number of bits specified.

This solution is entirely self contained, required no pre-set state (unlike my first attempt utilising the cbd instruction). In terms of raw instruction size, it’s a whole eight bytes smaller than storing the vector in memory and loading it when needed (that being 16+4 bytes), and a little slower than using a load instruction.

(On a cursory re-examination of the SPU ISA, fsmbi is the only instruction that will construct a different value in each word of a register. A specific pattern may be generated with cbd/cbx that can be used for this problem, but it depends on the contents of another register which limits its already limited usefulness. Combining fsmbi with other immediate instructions allows for a wide range of values to be constructed independent of register state and without access to storage)

Assembly Primer Parts 6 — Moving Data — ARM

My notes for where ARM differs from IA32 in the Assembly Primer video Part 6 — Moving Data.

(There is no separate part 5 post for ARM — apart from the instructions, it’s identical to IA32. There’s even support for the .bss section, unlike SPU and PPC)

Moving Data

We’ll look at MovDemo.s for ARM. First, the storage:

.data

    HelloWorld:
        .ascii "Hello World!"

    ByteLocation:
        .byte 10

    Int32:
        .int 2
    Int16:
        .short 3
    Float:
        .float 10.23

    IntegerArray:
        .int 10,20,30,40,50

It’s the same as for IA32, PPC and SPU. Like the first two, ARM will cope with the unnatural alignment.

1. Immediate value to register

.text
.globl _start
_start:
    @movl $10, %eax

    mov r0, #10

Move the value 10 into register r0.

Something to note: the ARM assembly syntax has some slightly differences. Where others use # to mark the start of a comment, ARM has @ (although # works at the start of a line). Literal values are prefixed with #, which confuses the default syntax highlighting in vim.

2. Immediate value to memory

    @movw $50, Int16

    mov r1, #50
    movw r0, #:lower16:Int16
    movt r0, #:upper16:Int16
    strh r1, [r0, #0]

We need to load the immediate value in a register (r1), the address in a register (r0) and then perform the write. To quote the Architecture Reference Manual:

The ARM architecture … incorporates … a load/store architecture, where data processing operations only operate on register contents, not directly on memory contents.

which is like PPC and SPU, and unlike IA32 — and so we’ll see similarly verbose alternatives to the IA32 examples from the video.

I’m using movw, movt sequence to load the address, rather than ldr (as mentioned in the previous installment).

strh is, in this case, Store Register Halfword (immediate) — writes the value in r1 to the address computed from the sum of the contents of r0 and the immediate value of 0.

3. Register to register

    @movl %eax, %ebx

    mov r1,r0

mov (Move) copies the value from r0 to r1.

4. Memory to register

    @movl Int32, %eax

    movw r0, #:lower16:Int32
    movt r0, #:upper16:Int32
    ldr r1, [r0, #0]

Load the address into r0, load from the address r0+0. Here ldr is Load Register (immediate).

5. Register to memory

    @movb $3, %al
    @movb %al, ByteLocation

    mov r0, #3
    movw r1, #:lower16:ByteLocation
    movt r1, #:upper16:ByteLocation
    strb r0, [r1, #0]

Once again the same kind of thing — load 3 into r0, the address of ByteLocation into r1, perform the store.

6. Register to indexed memory location

    @movl $0, %ecx
    @movl $2, %edi
    @movl $22, IntegerArray(%ecx, %edi, 4)

    movw r0, #:lower16:IntegerArray
    movt r0, #:upper16:IntegerArray
    mov r1, #2
    mov r2, #22
    str r2, [r0, r1, lsl #2]

A little more interesting — here str is Store Register (register) which accepts two registers and an optional shift operation and amount. Here lsl is logical shift left, effectively multiplying r1 by 4 — the size of the array elements.

(GCC puts asl here. Presumably identical to logical shift left, but there’s no mention of asl in the Architecture Reference Manual. Update: ASL is referenced in the list of errors here as an obsolete name for LSL)

Two source registers and a shift is still shy of IA32’s support for an calculating an address from a base address, two registers and a multiply.

7. Indirect addressing

    @movl $Int32, %eax
    @movl (%eax), %ebx

    movw r0, #:lower16:Int32
    movt r0, #:upper16:Int32
    ldr r1, [r0, #0]

    @movl $9, (%eax)

    mov r2, #9
    str r2, [r0, #0]

More of the same.

Concluding thoughts

In addition to the cases above, ARM has a number of other interesting addressing modes that I shall consider in more detail in the future — logical operations, auto-{increment, decrement} and multiples. Combined with conditional execution, there are some very interesting possibilities.

Other assembly primer notes are linked here.

Assembly Primer Part 4 — Hello World — ARM

On to Assembly Primer — Part 4. This is where we start writing a small assembly program for the platform. In this case, I don’t know the language and I don’t know the ABI. Learning these from scratch ranges from interesting to tedious :)

Regarding the language (available instructions, mnemonics and assembly syntax): I’m using the ARM Architecture Reference Manual as my reference for the architecture (odd, I know). It’s very long and the documentation for each instruction is extensive — which is good because there are a lot of instructions, and many of them do a lot of things at once.

Regarding the ABI (particularly things like argument passing, return values and system calls): there’s the Procedure Call Standard for the ARM Architecture, and there are a few other references I’ve found, such as the Debian ARM EABI Port wiki page.

“EABI is the new “Embedded” ABI by ARM ltd. EABI is actually a family of ABI’s and one of the “subABIs” is GNU EABI, for Linux.”

– from Debian ARM EABI Port

System Calls

To perform a system call using the GNU EABI:

  • put the system call number in r7
  • put the arguments in r0-r6 (64bit arguments must be aligned to an even numbered register i.e. in r0+r1, r2+r3, or r4+r5)
  • issue the Supervisor Call instruction with a zero operand — svc #0

(Supervisor Call was previously named Software Interruptswi)

Just Exit

Based on the above, it’s not difficult to reimplement JustExit.s (original) for ARM.

.text

.globl _start

_start:
        mov r7, #1
        mov r0, #0
        svc #0

mov here is Move (Immediate) which puts the #-prefixed literal into the named register.

Hello World

Likewise, the conversion of HelloWorldProgram.s (original) is not difficult:

.data 

HelloWorldString:
      .ascii "Hello World\n"

.text 

.globl _start 

_start:
      # Load all the arguments for write () 

      mov r7, #4
      mov r0, #1
      ldr r1,=HelloWorldString
      mov r2, #12
      svc #0

      # Need to exit the program 

      mov r7, #1
      mov r0, #0
      svc #0

This includes the load register pseudo-instruction, ldr — the compiler stores the address of HelloWorldString into the literal pool, a portion of memory located in the program text, and the 32bit address is loaded from the literal pool (more details).

When compiling a similar C program with -mcpu=cortex-a8, I notice that the compiler generates Move (immediate) and Move Topmovw and movt — instructions to load the address directly from the instruction stream, which is presumably more efficient on that architecture.

Assembly Primer Parts 1, 2 and 3 — ARM

I had started a series of posts on assembly programming for the Cell BE PPU and SPU, based on the assembly primer video series from securitytube.net. I have recently acquired a Nokia N900, and so thought I might take the opportunity to continue the series with a look at the ARM processor as well.

Wikipedia lists the N900’s processor as a Texas Instruments OMAP3430, 600MHz ARMv7 Cortex-A8. I’m not at all familiar with the processor family, so I’ll be attempting to find out what all of this means as I go :P

I’ve set up a cross compiler on my desktop machine using Gentoo’s neat crossdev tool (built using crossdev -t arm-linux-gnueabi). The toolchain builds a functional Hello, World!

(I note that scratchbox appears to be the standard tool/environment used to build apps for Maemo — I may take a closer look at that at a later date)

I have whatever the latest public ‘stable’ Maemo 5 release is on the N900 (PR 1.3, I think),  with an apt-get install openssh gdb — thus far, enough to “debug” a functional Hello, World!

What follows are some details of the Cortex-A8 architecture present in the N900, particularly in how it differs from IA32, as presented in the videos Part 1 — System Organisation, Part 2 — Virtual Memory Organization and Part 3 — GDB Usage Primer. I’ve packed them all into this post because gdb usage and Linux system usage are largely the same on ARM as they are on PPC and IA32.

Most of the following information comes from the ARM Architecture Reference Manual.

(The number of possible configurations of ARM hardware makes it interesting at times to work out exactly which features are present in my particular processor. From what I can tell, the N900’s Cortex-A8 is ARMv7-A and includes VFPv3 half, single and double precision float support, and NEON (aka Advanced SIMD). I expect I’ll find out more when I actually start to try and program the thing. As to which gcc -march, -mcpu or -mfpu options are most correct for the N900 — I have no idea.)

1. Registers

Integer

There are sixteen 32bit ARM core registers, R0 to R15, where R0–R12 are for general use. R13 contains the stack pointer (SP), R14 is the link register (LR), and R15 is the program counter (PC).

The current program status register (CSPR) contains various status and control bits.

VFPv3 (Floating point) & NEON (Advanced SIMD)

There are thrirty two doubleword (64bit) registers, that can be referenced in a number of ways.

NEON instructions can access these as thirty two doubleword registers (D0–D31) or as sixteen quadword registers (Q0–Q15), able to be used interchangeably.

VFP instructions can view the same registers as 32 doubleword registers (again, D0–D31) or as 32 single word registers (S0–S31). The single word view is packed into the first 16 doubleword registers.

Something like this pic (click to embiggen):

VFP in this core (apparently) supports single and double precision floating point data types and arithmetic, as well as half precision (possibly in two different formats…).

NEON instructions support accessing values in extension registers as

  • 8, 16, 32 or 64bit integer, signed or unsigned,
  • 16 or 32bit floating point values, and
  • 8 or 16bit polynomial values.

There’s also a floating point status and control register (FPSCR).

2. Virtual Memory Organisation

On this platform, program text appears to be loaded at 0x8000.

After an echo 0 > /proc/sys/kernel/randomize_va_space, the top of the stack appears to be 0xbf000000.

3. SimpleDemo

Compared to the video, there are only a couple of small differences when running SimpleDemo in gdb on ARM.

Obviously, the disassembly is not the same as for IA32. Rather than the call instructions noted in the video, you’ll see bl (Branch with Link) for the various functions called.

Where the return address is stored on the stack for IA32, the link register (lr in info registers output) stores the return address for the current function, although lr will be pushed to the stack before another function is called.

(From a cursory googling, it seems that to correctly displaying all VFP/NEON registers requires gdb-7.2 — I’m running the 6.8-based build from the maemo repo. crossdev will build me a gdb I can run on my desktop PC — crossdev -t arm-linux-gnueabi –ex-gdb — but I believe I still need to build a newer gdbserver to run on the N900.)

Other assembly primer notes are linked here.

Proposed updates for praypal.org.au 2011/02/11

Assembly Primer Part 7 — Working with Strings — SPU

These are my notes for where I can see SPU varying from ia32, as presented in the video Part 7 — Working with Strings.

The ia32 instructions covered in this video (MOVSx, LODSx, STOSx, CMPSx, CLD, STD) clearly highlight many of the differences between that arch and SPU:

  • Implied operand registers e.g. MOVSx using %esi and %edi as source and destination addresses.
  • Side effects on operands e.g. incrementing addresses while performing reads/writes
  • Side effects in the FLAGS register e.g. direction flag, zero flag
  • Support for any alignment of data.

I’ve made less effort to re-implement the full functionality of the ia32 instructions for this part.  There’s a couple of cases that might be interesting to attempt to do so, but a fully general case can probably just be lifted from newlib’s memcpy() implementation for SPU.

Again,  this is the quick SPU instruction reference I use, and I still regularly refer to the SPU ISA doc.

Working with Strings

Starting with StringBasics.s, firth there’s some storage:

 1 .data
 2     .align 4
 3     HelloWorldString:
 4         .asciz "Hello World of Assembly!"
 5     .align 4
 6     H3110:
 7         .asciz "H3110"
 8     .align 4
 9     shuf_AABABCDghijklmnop:
 10         .int 0x00000100,0x01020317,0x18191a1b,0x1c1d1e1f
 11 #.bss
 12     .comm Destination, 100, 16
 13     .comm DestinationUsingRep, 100, 16
 14     .comm DestinationUsingStos, 100, 16

Notable changes from the original:

  • Added .align 4 before each label to provide 16 byte alignment
  • There’s a shuffle pattern added that I’ll write more about later
  • .bss is commented out because spu-as doesn’t like it (I’m still not clear on why this is the case)
  • spu-as doesn’t support alignment for .lcomm but does for .comm (why?), so .lcomm has been replaced with .comm and the trailing “, 16” added to each case to provide 16 byte alignment.
    (learned: .align doesn’t have an effect on .comm/.lcomm, and .comm’s alignment isn’t power of 2, à la .align — there’s nothing hard about assembly programming, really :\)

1. Simple copying using movsb, movsw, movsl

 26         #movl $HelloWorldString, %esi
 27         #movl $Destination, %edi
 28
 29         ila $5,HelloWorldString
 30         ila $6,Destination
 31
 32         #reverse order of instructions to avoid even sillier alignment hassle
 33         #movsw
 34         lqd $7,0($5)
 35         lqd $8,0($6)
 36         rotqby $10,$7,$9
 37         cwd $11,0($6)
 38         shufb $12,$10,$8,$11
 39         stqd $12,0($6)
 40         ai $5,$5,4
 41         ai $6,$6,4
 42
 43         #movsl
 44         lqd $7,0($5)
 45         lqd $8,0($6)
 46         ai $9,$5,-2     # rotate needed to get val into pref slot
 47         rotqby $10,$7,$9
 48         chd $11,0($6)
 49         shufb $12,$10,$8,$11 # shuffle byte into dest
 50         stqd $12,0($6)
 51         ai $5,$5,2
 52         ai $6,$6,2
 53
 54         #movsb
 55         lqd $7,0($5)
 56         lqd $8,0($6)
 57         ai $9,$5,-3     # rotate needed to get val into pref slot
 58         rotqby $10,$7,$9
 59         cbd $11,0($6)
 60         shufb $12,$10,$8,$11 # shuffle byte into dest
 61         stqd $12,0($6)
 62         ai $5,$5,1
 63         ai $6,$6,1

Of the examples in this part, this is my biggest attempt at a “complete” implementation of the ia32 instructions, and even then it’s built on the assumption of natural alignment (of words and halfwords) not present in the ia32 code.  Lots of effort to achieve some simple tasks.

Making the extra assumption that the alignment of source and destination match, the three MOVS instructions can be combined and simplified to something like:

 66         lqa $13,HelloWorldString
 67         lqa $14,Destination
 68         fsmbi $15,0x01ff
 69         selb $16,$13,$14,$15 # copy only desired bytes into destination vector
 70         stqa $14,Destination

With the further assumption that the destination was able to be trashed entirely, this could be reduced to just a load and a store.

2. Setting / clearing the DF flag

There is no DF flag.  It could be simulated through the use of an offset stored in another register that is added to addresses using lqx & stqx instructions, which would achieve the same kind of functionality.

3. Using Rep

REP is a fascinating little instruction (modifier? Appears to be a single byte in disassembly), but there’s no direct equivalent for the SPU.  It could be mimicked in a general way on SPU using branches, but branches are a topic of later parts in the series so I’ll avoid that for now.

Instead, because the length and alignments of source and destination are known, it can be (effectively) unrolled and branchless :)

 85         #movl $HelloWorldString, %esi
 86         #movl $DestinationUsingRep, %edi
 87         #movl $25, %ecx # set the string length in ECX
 88         #cld # clear the DF
 89         #rep movsb
 91
 95         lqa $20,HelloWorldString
 96         lqa $21,HelloWorldString+16
 97         lqa $22,DestinationUsingRep+16
 98         fsmbi $23,0x007f
 99         selb $22,$21,$22,$23
100         stqa $20,DestinationUsingRep
101         stqa $22,DestinationUsingRep+16

A simple load and store for the first quadword, and a merge of the second with its destination.  Again, making further assumptions about the destination memory would remove the need for lines 97–99.

4. Loading string from memory into EAX register

106         #leal HelloWorldString, %esi
107         #lodsb
112         ila $24,HelloWorldString
113         lqa $25,HelloWorldString
114         ai $26,$24,-3
115         rotqby $27,$25,$26
116
117         #dec %esi
118         #lodsw
120         ai $28,$24,-2
121         rotqby $29,$25,$28
122
125         #subl $2, %esi # Make ESI point back to the original string
126         #lodsl
128         rotqby $30,$25,$24

Similar to 1. — loading data and rotating into the preferred slot of a register.  Assumptions about source offset and that the loaded data doesn’t span a quadword boundary makes this much simpler than it would otherwise be.

5. Storing strings from EAX to memory

  8     .align 4
  9     shuf_AABABCDghijklmnop:
 10         .int 0x00000100,0x01020317,0x18191a1b,0x1c1d1e1f

133         #leal DestinationUsingStos, %edi
134         #stosb
135         #stosw
136         #stosl
137
141         lqa $31,DestinationUsingStos
142         lqa $32,shuf_AABABCDghijklmnop
143         shufb $33,$30,$31,$32
144         stqa $33,DestinationUsingStos

Rather than more repetitive merging and messing about, I chose to combine the three stores and get the same effect from a single shuffle — which includes merging with the contents of the destination quadword.

(The shuffle name reveals the intended function: capital letters refer to bytes from the first register, lower-case from the second.  I picked up this scheme from Insomniac Games’s R&D pages.)

6. Comparing strings

149         #cld
150         #leal HelloWorldString, %esi
151         #leal H3110, %edi
152         #cmpsb
153
154         lqa $34,HelloWorldString
155         lqa $35,H3110
156         ceqb $36,$34,$35

There’s no byte-subtraction instruction for SPU, but ceqb will compare the bytes in two registers for equality.  That should be enough to work out if two strings match, but doesn’t give the kind of ordering that you’ll get from strcmp().   Getting the result from ceqb and making some kind of use of it may require shifts, rotates or other shenanigans.

Jaymin Kessler has a post on scanning through pixel values that is also relevant for a number of string manipulation problems.

Previous assembly primer notes…

Part 1 — System Organization — PPCSPU
Part 2 — Memory Organisation — SPU
Part 3 — GDB Usage Primer — PPC & SPU
Part 4 — Hello World — PPCSPU
Part 5 — Data Types — PPC & SPU
Part 6 — Moving Data — PPCSPU

Assembly Primer Part 6 — Moving Data — PPC

These are my notes for where I can see PPC varying from ia32, as presented in the video Part 6 — Moving Data.

There are notable differences between PPC and ia32 when moving/copying data around, although not as much as is the case with SPU — PPC copes with non-natural alignments (although I’m not sure what performance penalties there are on Cell or modern ia32 arches for doing so — at the very least, the cost of the occasional extra cache line), but doesn’t have the full range of mov instructions supported by ia32.

(When approaching this part, I wrote the SPU version first because I’ve had a lot more experience with that arch and I though it would be quicker. I was wrong.)

Moving Data

So, let’s look at MovDemo.s for PPC, piece by piece.

First, the storage:

# Demo program to show how to use Data types and MOVx instructions 

.data
    HelloWorld:
        .ascii "Hello World!"

    ByteLocation:
        .byte 10

    Int32:
        .int 2
    Int16:
        .short 3
    Float:
        .float 10.23

    IntegerArray:
        .int 10,20,30,40,50

Same as for ia32 and SPU.  PPC will cope with the lack of alignment.

1. Immediate value to register

.text
    .globl _start
    _start:
        #movl $10, %eax
        li 0,10

Simple enough to load a small constant into a register.  Like SPU, there’s extra work required if trying to load more complex values.

To load a full 32-bits of immediate data into a register requires two half-word load instructions for the upper and lower parts (as will be seen for loading addresses).  Loading 64-bit values appears to take five instructions (four immediate loads and a rotate in the middle).  The joys of 32-bit, fixed-length instructions :)

Aside:  li and lis are extended mnemonics that generate addi and addsi instructions — and if the second operand is zero, these instructions use the value zero, not the value in gpr0.  Special case ftw.

Speculating: On SPU, loads from local store take 6 cycles, so it will often be quicker to load a value than to generate it.  On PPC, it would seem that even five instructions will complete much faster than a (potential) L2 cache miss.

2. Immediate value to memory

#movw $50, Int16

li 1,50
lis 2,Int16@ha
addi 2,2,Int16@l
sth 1,0(2)

There’s no instruction to write an immediate value directly to memory — the source for a write must be a register, so we load that first.

The address is loaded in the following two instructions — @l is the lower 16 bits of the address.  @ha refers to the upper 16 bits of the address, where the a indicates the value is “adjusted so that adding the low 16 bits will perform the correct calculation of the address accounting for signed arithmetic” (from here, where these suffixes and are documented).

The halfword is then written to the address stored in gpr2.

3. Register to register

#movl %eax, %ebx
ori 3,0,0

Like SPU, register copy can be done with Or Immediate against zero.

4. Memory to register

#movl Int32, %eax

lis 4,Int32@ha
addi 4,4,Int32@l
lwz 5,0(4)

Easy enough — load the address, load from the address.

5. Register to memory

#movb $3, %al
#movb %al, ByteLocation

li 6,3
lis 7,ByteLocation@ha
addi 7,7,ByteLocation@l
stb 6,0(7)

Again, load the address, store a byte to the address.

6. Register to indexed memory location

#movl $0, %ecx
#movl $2, %edi
#movl $22, IntegerArray(%ecx,%edi , 4)

li 7,2
slwi 8,7,2  # extended mnemonic - rlwinm 8,7,2,0,31-2
lis 9,IntegerArray@ha
addi 9,9,IntegerArray@l
lwzx 10,9,8

Load an element offset, shift to get the byte offset, load the address and use the x-form Load Word to fetch from the (base address + offset).

(The z in that mnemonic refers to zeroing of the upper 32 bits of the 64-bit register.  There appears to be an algebraic load that does sign extension)

7. Indirect addressing

#movl $Int32, %eax
#movl (%eax), %ebx

lis 11,Int32@ha
addi 11,11,Int32@l
lwz 12,0(11)

#movl $9, (%eax)

li 13,9
stw 13,0(11)

More of the same kind of thing because that’s how PPC does loads and stores.

Concluding thoughts

Reasonably straightforward, a bit more limited than ia32 in addressing modes but nothing too surprising.  Particularly compared to SPU.

PPC does appear to have some more interesting load and store instructions that I haven’t tried here — updating index on load/store stands out as something I’d like to take a closer look at.  The PPC rotate and mask instructions look like some mind-bending fun to play with, but that’s something for another time.

Previous assembly primer notes…

Part 1 — System Organization — PPCSPU
Part 2 — Memory Organisation — SPU
Part 3 — GDB Usage Primer — PPC & SPU
Part 4 — Hello World — PPCSPU
Part 5 — Data Types — PPC & SPU

Assembly Primer Part 6 — Moving Data — SPU

These are my notes for where I can see SPU varying from ia32, as presented in the video Part 6 — Moving Data.

SPU and ia32 differ significantly when it comes to moving/copying data around, in terms of the ways things can be copied, the alignment of data in memory and the vector nature of SPU registers.

This is the quick SPU instruction reference I use.  The SPU ISA doc is worth having nearby if trying to do silly tricks with SPU instructions.

Moving Data

Lets consider what MovDemo.s might look like for SPU, piece by piece.

First, the storage:

# Demo program to show how to use Data types and MOVx instructions 

.data
    HelloWorld:
        .ascii "Hello World!"

    ByteLocation:
        .byte 10

    Int32:
        .int 2
    Int16:
        .short 3
    Float:
        .float 10.23

    IntegerArray:
        .int 10,20,30,40,50

Icky.  Not naturally aligned for size, and crossing qword boundaries.  This makes the following code particularly messy, because the SPU really doesn’t like unaligned data.  Oh well, let’s begin.

1. Immediate value to register

    .align 3  # ensure code is aligned after awkwardly arranged data
.text
    .globl _start
    _start:
        #movl $10, %eax
        il $5, 10

Well, that was easy enough.

It does get a little trickier if trying to load more complex values.  To load a full 32-bits of immediate data into a register requires two half-word load  instructions for the upper and lower parts.

Loading arbitrary values spanning multiple words is more complex, and is often able to be done simply by storing the constant in .rodata and loading it when needed, although fsmbi can be useful on occasion.

2. Immediate value to local store

#movw $50, Int16

# Write first byte
ila $6,Int16        # load the address of the target
lqa $7,Int16        # load the qword
il $8,0             # load the upper byte of the constant into preferred slot
cbd $9,0($6)        # insertion mask for higher byte
shufb $10,$8,$7,$9  # shuffle the upper byte into the qword
stqa $10,Int16      # write the word back
# Write second byte
il $11,1            # load 1 for address offsetting
cbd $12,1($6)       # insertion mask for lower byte
il $13,50           # load the lower byte of the constant into preferred slot
lqx $14,$6,$11      # load Int16+1
shufb $15,$13,$14,$12 # shuffle lower byte into qword
stqx $15,$6,$11     # write back qword

Writing two bytes to a non-aligned memory location is a messy business.

  • All loads and stores transfer 16 bytes of data between registers and 16-byte aligned memory locations.
  • The chd instruction, used to generate masks to be used to shuffle halfwords into a quadword assumes that the halfword is aligned in the first place.  As it is not, and could be crossing a quadword boundary, I write it here as two bytes.

The lqa instruction (a-form) is used for the first load, to fetch from an absolute local store address.  lqx (x-form) is used for the second load, because the address needs to be offset by one — on first glance, lqa and lqd appeared to be better choices, but these do not store all of the lower bits of the immediate part to be used, which would have prevented the one-byte offset.  So x-form is used as the zeroing of lower bits happens after the addition of the preferred word slots of two registers.

I suspect that this can be done better.

(It’s worth taking a look at Jaymin’s post on write-combining which looks more deeply at this kind of problem)

3. Register to registers

#movl %eax, %ebx
ori $16, $5, 0

Using the Or Immediate instruction to perform a bitwise Or against zero to perform the copy.  This can be achieved using the odd pipeline with a quadword shift or rotate immediate instruction.

Copying smaller portions of a register will require extra instruction(s) for masking or rotation, depending on what exactly needs to be copied and to where.

4. Local store to register

#movl Int32, %eax

ila $16,Int32       # load the address
lqa $17,Int32       # load the vector containing the first part
lqa $18,Int32+3     # load the vector containing the second part
fsmbi $19,0xf000    # create a mask for merging them
rotqby $20,$17,$16  # rotate the first part into position
rotqby $21,$18,$16  # rotate the second part into position
rotqby $22,$19,$16  # rotate the select mask into position
selb $23,$20,$21,$22 # select things into the right place.

This one is a different class of fun – loading four bytes that span two quadwords.  Using fsmbi to generate a mask that is used to combine bytes from the two quadwords.

Again, I suspect there’s a better way to do it.

5. Register to local store

#movb $3, %al
#movb %al, ByteLocation

ila $24,ByteLocation
lqa $25, ByteLocation
il $26,3
cbd $27, 0($24)
shufb $28, $26, $25, $27
stqa $28, ByteLocation

Essentially the same problem as 2. on the SPU, but a little simpler because it’s only a single byte.

6. Register to indexed memory location

#movl $0, %ecx
#movl $2, %edi
#movl $22, IntegerArray(%ecx,%edi , 4)
il $29,2                # load the index
ila $30,IntegerArray    # load the address of the array
shli $31,$29,2          # shift two left to get byte offset
lqx $32,$30,$31         # load from the sum of the two addresses
# and then write the data to memory...

This is an attempt at a comparable indexed access to local store.  All I’ve done here is the address calculation and load — writing the value is a mess because it’s not aligned and spans two quadwords, so something like that done in 2. would be required.

7. Indirect addressing

movl $Int32, %eax
movl (%eax), %ebx

movl $9, (%eax)

I’ve done these before (essentially) in 2. and 4.

Concluding thoughts

Align your data for the SPU.  This would have all been much, much simpler (and not much of a challenge) if the data was aligned and variables were in preferred slots.  I suspect I’ll simplify the later parts of the series for SPU by aligning the data first.

I found some useful fragments amongst the Introduction to SPU Optimizations presentations from Naughty Dog — they’re a very good read: Part 1 & Part 2.

Previous assembly primer notes…

Part 1 — System Organization — PPCSPU
Part 2 — Memory Organisation — SPU
Part 3 — GDB Usage Primer — PPC & SPU
Part 4 — Hello World — PPCSPU
Part 5 — Data Types — PPC & SPU

Assembly Primer Part 5 — Data Types

These are my notes for where I can see both PPC and SPU varying from ia32, as presented in the video Part 5 — Data Types.  There’s not a lot to be said about this one, so there’s just the one post for both PPC and SPU architectures.

The main problem with assembling the provided VariableDemo.s is that gas doesn’t seem to like the .bss section for either PPC or SPU, producing an error.  To be able to assemble this file on these architectures, I removed the .bss line and (obviously) removed (replaced) the ia32 asm instructions.  objdump shows that “.comm LargeBuffer 10000” is placed in .bss, as intended.

(At this point, I’m quite out of my depth as to why this difference between the architectures exists — if someone can enlighten me, that’d be great :)

I was interested to see that gdb has no problem accessing the unaligned variables on the SPU.  It’s worth noting that the assembler is quite happy to let you place data wherever you like (with great power comes great etc.).  And I think I need to take a closer look at the .align directive.

Previous assembly primer notes…

Part 1 — System Organization — PPC — SPU
Part 2 — Memory Organisation — SPU
Part 3 — GDB Usage Primer — PPC & SPU
Part 4 — Hello World — PPCSPU

Assembly Primer Part 4 — Hello World — SPU

These are my notes for where I can see SPU varying from ia32, as presented in the video Part 4 — Hello World.

I’ve written about syscalls on SPU before, here.  System calls can be performed using appropriately packed data and stopcode 0x2104, which is intercepted by the kernel.

The JustExit.s example in the video uses the exit syscall, which is explicitly excluded from being called by the SPU.  For the sake of the example, we can use the time syscall instead, and so a simple syscall looks something like this:

.data
syscall_time:
    .quad 13 # syscall number and return value
    .quad 0  # parameters
    .quad 0
    .quad 0
    .quad 0
    .quad 0
    .quad 0

.text
.globl _start
_start:
    stop 0x2104
    .int syscall_time

syscall_time is the structure used by the syscall (see struct spu_syscall_block and __linux_syscall()) with 13 in the first unsigned long long. (“Obviously”, .quad is 8 bytes :\).  If there were arguments to pass to the syscall, they would be placed in the six following .quads.

The address of the syscall block must follow directly after the stop instruction.  (I did wonder if there would be some trick to mixing the address with the program code — as you can see, no trick needed)

The syscall’s return value is placed in the first 8 bytes of the syscall block.

While it’s possible to use the write syscall, it’s rather painful as it requires a valid char* ea to be passed to be written, which is not readily accessible from the SPU.  The alternative is to use the __send_to_ppe() function — write() is one of the POSIX1 functions handled by newlib+libspe.  Of course, it has a slightly different calling mechanism to to __linux_syscall(), uses the JSRE_POSIX1 stopcode of 0x2101.  This works:

.data

HelloWorldString:
    .ascii "Hello world\n"

send_to_ppe_write:
    .int 1 # stdout
    .int 0 # pad
    .int 0 # pad
    .int 0 # pad
    .int HelloWorldString # char* in local store
    .int 0 # pad
    .int 0 # pad
    .int 0 # pad
    .int 12 # length
    .int 0 # pad
    .int 0 # pad
    .int 0 # pad

.text
.globl _start

_start:
    stop 0x2101
    .int send_to_ppe_write+0x1b000000

Although I’m sure it can be expressed more elegantly.

The magic number 0x1b000000 added to the address of send_to_ppe_write is derived from the “combined” variable from __send_to_ppe() with the values from jsre.h.

Alternatively…

I just realised that this works: if /proc/sys/kernel/randomize_va_space is zero, the address of the mapped SPU LS (from /proc/$PID/maps, as seen here) of 0xf7f70000 can be used as an offset to anything in local store, so the syscall will work with offset pointers:

.data

HelloWorldString:
    .ascii "Hello World\n"

syscall_write:
    .quad 4 # write
    .quad 1 # stdout
    .int  0 # gas doesn't seem to like doing the arithmetic in a .quad
    .int 0xf7f70000 + HelloWorldString # mapped address of string
    .quad 12 # length
    .quad 0
    .quad 0
    .quad 0

.text

.globl _start

_start:
    stop 0x2104
    .int syscall_write

Hideous :)

Previous assembly primer notes…

Part 1 — System Organization — PPC — SPU
Part 2 — Memory Organisation — SPU
Part 3 — GDB Usage Primer — PPC & SPU