Assembly Primer Part 7 — Working with Strings — ARM

These are my notes for where I can see ARM varying from IA32, as presented in the video Part 7 — Working with Strings.

I’ve not remotely attempted to implement anything approximating optimal string operations for this part — I’m just working my way through the examples and finding obvious mappings to the ARM arch (or, at least what seem to be obvious). When I do something particularly stupid, leave a comment and let me know :)

Working with Strings

        .asciz "Hello World of Assembly!"
        .asciz "H3110"

    .lcomm Destination, 100
    .lcomm DestinationUsingRep, 100
    .lcomm DestinationUsingStos, 100

Here’s the storage that the provided example StringBasics.s uses. No changes are required to compile this for ARM.

1. Simple copying using movsb, movsw, movsl

    @movl $HelloWorldString, %esi
    movw r0, #:lower16:HelloWorldString
    movt r0, #:upper16:HelloWorldString

    @movl $Destination, %edi
    movw r1, #:lower16:Destination
    movt r1, #:upper16:Destination

    ldrb r2, [r0], #1
    strb r2, [r1], #1

    ldrh r3, [r0], #2
    strh r3, [r1], #2

    ldr r4, [r0], #4
    str r4, [r1], #4

More visible complexity than IA32, but not too bad overall.

IA32’s movs instructions implicitly take their source and destination addresses from %esi and %edi, and increment/decrement both. Because of ARM’s load/store architecture, separate load and store instructions are required in each case, but there is support for indexing of these registers:

ARM addressing modes

According to ARM A8.5, memory access instructions commonly support three addressing modes:

  • Offset addressing — An offset is applied to an address from a base register and the result is used to perform the memory access. It’s the form of addressing I’ve used in previous parts and looks like [rN, offset]
  • Pre-indexed addressing — An offset is applied to an address from a base register, the result is used to perform the memory access and also written back into the base register. It looks like [rN, offset]!
  • Post-indexed addressing — An address is used as-is from a base register for memory access. The offset is applied and the result is stored back to the base register. It looks like [rN], offset and is what I’ve used in the example above.

2. Setting / Clearing the DF flag

ARM doesn’t have a DF flag (to the best of my understanding). It could perhaps be simulated through the use of two instructions and conditional execution to select the right direction. I’ll look further into conditional execution of instructions on ARM in a later post.

3. Using Rep

ARM also doesn’t appear to have an instruction quite like IA32’s rep instruction. A conditional branch and a decrement will be the long-form equivalent. As branches are part of a later section, I’ll skip them for now.

    @movl $HelloWorldString, %esi
    movw r0, #:lower16:HelloWorldString
    movt r0, #:upper16:HelloWorldString

    @movl $DestinationUsingRep, %edi
    movw r1, #:lower16:DestinationUsingRep
    movt r1, #:upper16:DestinationUsingRep

    @movl $25, %ecx # set the string length in ECX
    @cld # clear the DF
    @rep movsb

    ldm r0!, {r2,r3,r4,r5,r6,r7}
    ldrb r8, [r0,#0]
    stm r1!, {r2,r3,r4,r5,r6,r7}
    strb r8, [r1,#0]

To avoid conditional branches, I’ll start with the assumption that the string length is known (25 bytes). One approach would be using multiple load instructions, but the load multiple (ldm) instruction makes it somewhat easier for us — one instruction to fetch 24 bytes, and a load register byte (ldrb) for the last one. Using the ! after the source-address register indicates that it should be updated with the address of the next byte after those that have been read.

The storing of the data back to memory is done analogously. Store multiple (stm) writes 6 registers×4 bytes = 24 bytes (with the ! to have the destination address updated). The final byte is written using strb.

4. Loading string from memory into EAX register

    @leal HelloWorldString, %esi
    movw r0, #:lower16:HelloWorldString
    movt r0, #:upper16:HelloWorldString

    ldrb r1, [r0, #0]

    @movb $0, %al
    mov r1, #0

    @dec %esi  @ unneeded. equiv: sub r0, r0, #1
    ldrh r1, [r0, #0]

    @movw $0, %ax
    mov r1, #0

    @subl $2, %esi # Make ESI point back to the original string. unneeded. equiv: sub r0, r0, #2
    ldr r1, [r0, #0]

In this section, we are shown how the IA32 lodsb, lodsw and lodsl instructions work. Again, they have implicitly assigned register usage, which isn’t how ARM operates.

So, instead of a simple, no-operand instruction like lodsb, we have a ldrb r1, [r0, #0] loading a byte from the address in r0 into r1. Because I didn’t use post indexed addressing, there’s no need to dec or subl the address after the load. If I were to do so, it could look like this:

    ldrb r1, [r0], #1
    sub r0, r0, #1

    ldrh r1, [r0], #2
    sub r0, r0, #2

    ldr r1, [r0], #4

If you trace through it in gdb, look at how the value in r0 changes after each instruction.

5. Storing strings from EAX to memory

    @leal DestinationUsingStos, %edi
    movw r0, #:lower16:DestinationUsingStos
    movt r0, #:upper16:DestinationUsingStos

    strb r1, [r0], #1
    strh r1, [r0], #2
    str r1, [r0], #4

Same kind of thing as for the loads. Writes the letters in r1 (being “Hell” — leftovers from the previous section) into DestinationUsingStos (the result being “HHeHell”). String processing on little endian architectures has its appeal.

6. Comparing Strings

    @leal HelloWorldString, %esi
    movw r0, #:lower16:HelloWorldString
    movt r0, #:upper16:HelloWorldString
    @leal H3110, %edi
    movw r1, #:lower16:H3110
    movt r1, #:upper16:H3110

    ldrb r2, [r0,#0]
    ldrb r3, [r1,#0]
    cmp r2, r3

    @dec %esi
    @dec %edi
    @not needed because of the addressing mode used

    ldrh r2, [r0,#0]
    ldrh r3, [r1,#0]
    cmp r2, r3

    @subl $2, %esi
    @subl $2, %edi
    @not needed because of the addressing mode used
    ldr r2, [r0,#0]
    ldr r3, [r1,#0]
    cmp r2, r3

Where IA32’s cmps instructions implicitly load through the pointers in %edi and %esi, explicit loads are needed for ARM. The compare then works in pretty much the same way as for IA32, setting condition code flags in the current program status register (cpsr). If you run the above code, and check the status registers before and after execution of the cmp instructions, you’ll see the zero flag set and unset in the same way as is demonstrated in the video.

The condition code flags are:

  • bit 31 — negative (N)
  • bit 30 — zero (Z)
  • bit 29 — carry (C)
  • bit 28 — overflow (V)

There’s other flags in that register — all the details are on page B1-16 and B1-17 in the ARM Architecture Reference Manual.

And with that, I think we’ve made it (finally) to the end of this part for ARM.

Other assembly primer notes are linked here.

Assembly Primer Parts 6 — Moving Data — ARM

My notes for where ARM differs from IA32 in the Assembly Primer video Part 6 — Moving Data.

(There is no separate part 5 post for ARM — apart from the instructions, it’s identical to IA32. There’s even support for the .bss section, unlike SPU and PPC)

Moving Data

We’ll look at MovDemo.s for ARM. First, the storage:


        .ascii "Hello World!"

        .byte 10

        .int 2
        .short 3
        .float 10.23

        .int 10,20,30,40,50

It’s the same as for IA32, PPC and SPU. Like the first two, ARM will cope with the unnatural alignment.

1. Immediate value to register

.globl _start
    @movl $10, %eax

    mov r0, #10

Move the value 10 into register r0.

Something to note: the ARM assembly syntax has some slightly differences. Where others use # to mark the start of a comment, ARM has @ (although # works at the start of a line). Literal values are prefixed with #, which confuses the default syntax highlighting in vim.

2. Immediate value to memory

    @movw $50, Int16

    mov r1, #50
    movw r0, #:lower16:Int16
    movt r0, #:upper16:Int16
    strh r1, [r0, #0]

We need to load the immediate value in a register (r1), the address in a register (r0) and then perform the write. To quote the Architecture Reference Manual:

The ARM architecture … incorporates … a load/store architecture, where data processing operations only operate on register contents, not directly on memory contents.

which is like PPC and SPU, and unlike IA32 — and so we’ll see similarly verbose alternatives to the IA32 examples from the video.

I’m using movw, movt sequence to load the address, rather than ldr (as mentioned in the previous installment).

strh is, in this case, Store Register Halfword (immediate) — writes the value in r1 to the address computed from the sum of the contents of r0 and the immediate value of 0.

3. Register to register

    @movl %eax, %ebx

    mov r1,r0

mov (Move) copies the value from r0 to r1.

4. Memory to register

    @movl Int32, %eax

    movw r0, #:lower16:Int32
    movt r0, #:upper16:Int32
    ldr r1, [r0, #0]

Load the address into r0, load from the address r0+0. Here ldr is Load Register (immediate).

5. Register to memory

    @movb $3, %al
    @movb %al, ByteLocation

    mov r0, #3
    movw r1, #:lower16:ByteLocation
    movt r1, #:upper16:ByteLocation
    strb r0, [r1, #0]

Once again the same kind of thing — load 3 into r0, the address of ByteLocation into r1, perform the store.

6. Register to indexed memory location

    @movl $0, %ecx
    @movl $2, %edi
    @movl $22, IntegerArray(%ecx, %edi, 4)

    movw r0, #:lower16:IntegerArray
    movt r0, #:upper16:IntegerArray
    mov r1, #2
    mov r2, #22
    str r2, [r0, r1, lsl #2]

A little more interesting — here str is Store Register (register) which accepts two registers and an optional shift operation and amount. Here lsl is logical shift left, effectively multiplying r1 by 4 — the size of the array elements.

(GCC puts asl here. Presumably identical to logical shift left, but there’s no mention of asl in the Architecture Reference Manual. Update: ASL is referenced in the list of errors here as an obsolete name for LSL)

Two source registers and a shift is still shy of IA32’s support for an calculating an address from a base address, two registers and a multiply.

7. Indirect addressing

    @movl $Int32, %eax
    @movl (%eax), %ebx

    movw r0, #:lower16:Int32
    movt r0, #:upper16:Int32
    ldr r1, [r0, #0]

    @movl $9, (%eax)

    mov r2, #9
    str r2, [r0, #0]

More of the same.

Concluding thoughts

In addition to the cases above, ARM has a number of other interesting addressing modes that I shall consider in more detail in the future — logical operations, auto-{increment, decrement} and multiples. Combined with conditional execution, there are some very interesting possibilities.

Other assembly primer notes are linked here.

Assembly Primer Part 4 — Hello World — ARM

On to Assembly Primer — Part 4. This is where we start writing a small assembly program for the platform. In this case, I don’t know the language and I don’t know the ABI. Learning these from scratch ranges from interesting to tedious :)

Regarding the language (available instructions, mnemonics and assembly syntax): I’m using the ARM Architecture Reference Manual as my reference for the architecture (odd, I know). It’s very long and the documentation for each instruction is extensive — which is good because there are a lot of instructions, and many of them do a lot of things at once.

Regarding the ABI (particularly things like argument passing, return values and system calls): there’s the Procedure Call Standard for the ARM Architecture, and there are a few other references I’ve found, such as the Debian ARM EABI Port wiki page.

“EABI is the new “Embedded” ABI by ARM ltd. EABI is actually a family of ABI’s and one of the “subABIs” is GNU EABI, for Linux.”

– from Debian ARM EABI Port

System Calls

To perform a system call using the GNU EABI:

  • put the system call number in r7
  • put the arguments in r0-r6 (64bit arguments must be aligned to an even numbered register i.e. in r0+r1, r2+r3, or r4+r5)
  • issue the Supervisor Call instruction with a zero operand — svc #0

(Supervisor Call was previously named Software Interruptswi)

Just Exit

Based on the above, it’s not difficult to reimplement JustExit.s (original) for ARM.


.globl _start

        mov r7, #1
        mov r0, #0
        svc #0

mov here is Move (Immediate) which puts the #-prefixed literal into the named register.

Hello World

Likewise, the conversion of HelloWorldProgram.s (original) is not difficult:


      .ascii "Hello World\n"


.globl _start 

      # Load all the arguments for write () 

      mov r7, #4
      mov r0, #1
      ldr r1,=HelloWorldString
      mov r2, #12
      svc #0

      # Need to exit the program 

      mov r7, #1
      mov r0, #0
      svc #0

This includes the load register pseudo-instruction, ldr — the compiler stores the address of HelloWorldString into the literal pool, a portion of memory located in the program text, and the 32bit address is loaded from the literal pool (more details).

When compiling a similar C program with -mcpu=cortex-a8, I notice that the compiler generates Move (immediate) and Move Topmovw and movt — instructions to load the address directly from the instruction stream, which is presumably more efficient on that architecture.

Assembly Primer Parts 1, 2 and 3 — ARM

I had started a series of posts on assembly programming for the Cell BE PPU and SPU, based on the assembly primer video series from I have recently acquired a Nokia N900, and so thought I might take the opportunity to continue the series with a look at the ARM processor as well.

Wikipedia lists the N900’s processor as a Texas Instruments OMAP3430, 600MHz ARMv7 Cortex-A8. I’m not at all familiar with the processor family, so I’ll be attempting to find out what all of this means as I go :P

I’ve set up a cross compiler on my desktop machine using Gentoo’s neat crossdev tool (built using crossdev -t arm-linux-gnueabi). The toolchain builds a functional Hello, World!

(I note that scratchbox appears to be the standard tool/environment used to build apps for Maemo — I may take a closer look at that at a later date)

I have whatever the latest public ‘stable’ Maemo 5 release is on the N900 (PR 1.3, I think),  with an apt-get install openssh gdb — thus far, enough to “debug” a functional Hello, World!

What follows are some details of the Cortex-A8 architecture present in the N900, particularly in how it differs from IA32, as presented in the videos Part 1 — System Organisation, Part 2 — Virtual Memory Organization and Part 3 — GDB Usage Primer. I’ve packed them all into this post because gdb usage and Linux system usage are largely the same on ARM as they are on PPC and IA32.

Most of the following information comes from the ARM Architecture Reference Manual.

(The number of possible configurations of ARM hardware makes it interesting at times to work out exactly which features are present in my particular processor. From what I can tell, the N900’s Cortex-A8 is ARMv7-A and includes VFPv3 half, single and double precision float support, and NEON (aka Advanced SIMD). I expect I’ll find out more when I actually start to try and program the thing. As to which gcc -march, -mcpu or -mfpu options are most correct for the N900 — I have no idea.)

1. Registers


There are sixteen 32bit ARM core registers, R0 to R15, where R0–R12 are for general use. R13 contains the stack pointer (SP), R14 is the link register (LR), and R15 is the program counter (PC).

The current program status register (CSPR) contains various status and control bits.

VFPv3 (Floating point) & NEON (Advanced SIMD)

There are thrirty two doubleword (64bit) registers, that can be referenced in a number of ways.

NEON instructions can access these as thirty two doubleword registers (D0–D31) or as sixteen quadword registers (Q0–Q15), able to be used interchangeably.

VFP instructions can view the same registers as 32 doubleword registers (again, D0–D31) or as 32 single word registers (S0–S31). The single word view is packed into the first 16 doubleword registers.

Something like this pic (click to embiggen):

VFP in this core (apparently) supports single and double precision floating point data types and arithmetic, as well as half precision (possibly in two different formats…).

NEON instructions support accessing values in extension registers as

  • 8, 16, 32 or 64bit integer, signed or unsigned,
  • 16 or 32bit floating point values, and
  • 8 or 16bit polynomial values.

There’s also a floating point status and control register (FPSCR).

2. Virtual Memory Organisation

On this platform, program text appears to be loaded at 0x8000.

After an echo 0 > /proc/sys/kernel/randomize_va_space, the top of the stack appears to be 0xbf000000.

3. SimpleDemo

Compared to the video, there are only a couple of small differences when running SimpleDemo in gdb on ARM.

Obviously, the disassembly is not the same as for IA32. Rather than the call instructions noted in the video, you’ll see bl (Branch with Link) for the various functions called.

Where the return address is stored on the stack for IA32, the link register (lr in info registers output) stores the return address for the current function, although lr will be pushed to the stack before another function is called.

(From a cursory googling, it seems that to correctly displaying all VFP/NEON registers requires gdb-7.2 — I’m running the 6.8-based build from the maemo repo. crossdev will build me a gdb I can run on my desktop PC — crossdev -t arm-linux-gnueabi –ex-gdb — but I believe I still need to build a newer gdbserver to run on the N900.)

Other assembly primer notes are linked here.

Proposed updates for 2011/02/11