Assembly Primer Part 6 — Moving Data

These are my notes for where I can see PPC varying from ia32, as presented in the video Part 6 — Moving Data.

There are notable differences between PPC and ia32 when moving/copying data around, although not as many as with SPU. PPC copes with non-natural alignments (although I’m not sure what the performance penalties are on Cell or modern ia32 arches for doing so; at the very least, the cost of the occasional extra cache line), but doesn’t have the full range of mov instructions supported by ia32.

(When approaching this part, I wrote the SPU version first because I’ve had a lot more experience with that arch and I thought it would be quicker. I was wrong.)

Moving Data

So, let’s look at MovDemo.s for PPC, piece by piece.

First, the storage:

# Demo program to show how to use Data types and MOVx instructions 

.data
    HelloWorld:
        .ascii "Hello World!"

    ByteLocation:
        .byte 10

    Int32:
        .int 2
    Int16:
        .short 3
    Float:
        .float 10.23

    IntegerArray:
        .int 10,20,30,40,50

Same as for ia32 and SPU. PPC will cope with the lack of alignment.
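To see what "lack of alignment" means here, this is a rough Python sketch that packs the same data big-endian and reports where each label ends up. It assumes the assembler inserts no padding between these directives and that the section starts at offset 0; the `SECTION` table and `layout` helper are my own names, not anything from the toolchain.

```python
import struct

# Data section from MovDemo.s, packed big-endian (as on PPC).
# Assumption: no alignment padding is inserted between the directives.
SECTION = [
    ("HelloWorld", b"Hello World!"),
    ("ByteLocation", struct.pack(">B", 10)),
    ("Int32", struct.pack(">i", 2)),
    ("Int16", struct.pack(">h", 3)),
    ("Float", struct.pack(">f", 10.23)),
    ("IntegerArray", struct.pack(">5i", 10, 20, 30, 40, 50)),
]


def layout(section):
    """Return {label: byte offset} for labels laid out back to back."""
    offsets, position = {}, 0
    for label, payload in section:
        offsets[label] = position
        position += len(payload)
    return offsets


offsets = layout(SECTION)
for label, offset in offsets.items():
    print(f"{label:13s} offset {offset:2d}  4-byte aligned: {offset % 4 == 0}")
```

Under those assumptions the 12-byte string plus the single byte push Int32 to offset 13 and IntegerArray to offset 23, so none of the word-sized values sit on a 4-byte boundary.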

1. Immediate value to register

.text
    .globl _start
    _start:
        #movl $10, %eax
        li 0,10

Simple enough to load a small constant into a register. Like SPU, there’s extra work required if trying to load more complex values.

To load a full 32 bits of immediate data into a register requires two instructions, one for the upper half-word and one for the lower (as will be seen for loading addresses). Loading 64-bit values appears to take five instructions (four immediate loads and a shift/rotate in the middle). The joys of 32-bit, fixed-length instructions :)
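As a sanity check on the five-instruction count, here is a small Python model of one way to build a 64-bit constant: lis, ori, a 32-bit left shift, oris, ori. This is my assumption about the sequence rather than something taken from compiler output, and the helper names simply mirror the mnemonics.

```python
MASK64 = (1 << 64) - 1


def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def lis(imm16):
    # addis rD,0,imm16: sign-extended immediate shifted left by 16 bits
    return (sign_extend(imm16, 16) << 16) & MASK64


def ori(reg, imm16):
    # OR in an unsigned 16-bit immediate at bits 0..15
    return reg | (imm16 & 0xFFFF)


def oris(reg, imm16):
    # OR in an unsigned 16-bit immediate at bits 16..31
    return reg | ((imm16 & 0xFFFF) << 16)


def sldi(reg, shift):
    # shift left doubleword; bits shifted past bit 63 are lost
    return (reg << shift) & MASK64


def load_imm64(value):
    """Five steps: two halves, shift up 32, two more halves."""
    reg = lis(value >> 48)
    reg = ori(reg, value >> 32)
    reg = sldi(reg, 32)  # also discards the sign extension left by lis
    reg = oris(reg, value >> 16)
    reg = ori(reg, value)
    return reg


built = load_imm64(0xFEDCBA9876543210)
print(hex(built))
```

The shift in the middle is what makes the sign extension from lis harmless: whatever ended up in the top 32 bits is pushed out before the low half is ORed in.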

Aside: li and lis are extended mnemonics that generate addi and addis instructions. If the register operand (RA) of these instructions is 0, they use the value zero, not the value in gpr0. Special case ftw.
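The special case is easier to see with a toy register file. This is a rough Python model of addi/addis with li and lis layered on top; the `regs` list and the function names are just illustration, and the values are made up.

```python
MASK64 = (1 << 64) - 1


def sign_extend16(value):
    """Treat the low 16 bits of value as a signed immediate."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def addi(regs, rt, ra, simm):
    """addi rt,ra,simm: ra == 0 means the literal value 0, not gpr0."""
    base = 0 if ra == 0 else regs[ra]
    regs[rt] = (base + sign_extend16(simm)) & MASK64


def addis(regs, rt, ra, simm):
    """addis rt,ra,simm: same special case, immediate shifted left 16."""
    base = 0 if ra == 0 else regs[ra]
    regs[rt] = (base + (sign_extend16(simm) << 16)) & MASK64


def li(regs, rt, simm):
    addi(regs, rt, 0, simm)


def lis(regs, rt, simm):
    addis(regs, rt, 0, simm)


regs = [0] * 32
regs[0] = 1234          # gpr0 holds something non-zero
regs[4] = 1234
li(regs, 3, 10)         # gpr0's contents are ignored
addi(regs, 5, 4, 10)    # any other register is read normally
lis(regs, 6, 1)
li(regs, 7, -1)         # immediate is sign-extended
print(regs[3], regs[5], hex(regs[6]), hex(regs[7]))
```

So li 3,10 leaves 10 in gpr3 regardless of what gpr0 holds, while the same add against gpr4 picks up its contents.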

Speculating: On SPU, loads from local store take 6 cycles, so it will often be quicker to load a value than to generate it. On PPC, it would seem that even five instructions will complete much faster than a (potential) L2 cache miss.

2. Immediate value to memory

#movw $50, Int16

li 1,50
lis 2,Int16@ha
addi 2,2,Int16@l
sth 1,0(2)

There’s no instruction to write an immediate value directly to memory — the source for a write must be a register, so we load that first.

The address is loaded by the lis/addi pair: @l is the lower 16 bits of the address, and @ha refers to the upper 16 bits of the address, where the a indicates the value is “adjusted so that adding the low 16 bits will perform the correct calculation of the address accounting for signed arithmetic” (from here, where these suffixes are documented).
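The adjustment makes more sense with numbers. Below is a small Python sketch of the @l, @h and @ha calculations and the lis/addi pair that consumes them, in a 32-bit view. The address 0x10018004 is a made-up example chosen so that bit 15 of the low half is set; the function names are mine.

```python
def sign_extend16(value):
    """Treat the low 16 bits of value as a signed immediate."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def at_l(addr):
    # @l: low 16 bits
    return addr & 0xFFFF


def at_h(addr):
    # @h: plain high 16 bits, no adjustment
    return (addr >> 16) & 0xFFFF


def at_ha(addr):
    # @ha: high 16 bits, bumped by one when the low half will be
    # treated as negative by addi
    return ((addr + 0x8000) >> 16) & 0xFFFF


def lis_addi(high, low):
    """Model lis r,high ; addi r,r,low in a 32-bit view."""
    reg = (high & 0xFFFF) << 16
    return (reg + sign_extend16(low)) & 0xFFFFFFFF


addr = 0x10018004  # hypothetical address with bit 15 set
good = lis_addi(at_ha(addr), at_l(addr))
bad = lis_addi(at_h(addr), at_l(addr))
print(hex(at_ha(addr)), hex(at_l(addr)), hex(good), hex(bad))
```

Because addi sign-extends its immediate, the low half 0x8004 is added as a negative number. @ha compensates by handing lis 0x1002 instead of 0x1001, whereas the unadjusted @h lands 0x10000 short.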

The halfword is then written to the address stored in gpr2.

3. Register to register

#movl %eax, %ebx
ori 3,0,0

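Like SPU, register copy can be done with Or Immediate against zero. (The mr extended mnemonic, which assembles to an or of the source register with itself, is the more usual spelling.)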

4. Memory to register

#movl Int32, %eax

lis 4,Int32@ha
addi 4,4,Int32@l
lwz 5,0(4)

Easy enough — load the address, load from the address.

5. Register to memory

#movb $3, %al
#movb %al, ByteLocation

li 6,3
lis 7,ByteLocation@ha
addi 7,7,ByteLocation@l
stb 6,0(7)

Again, load the address, store a byte to the address.

6. Register to indexed memory location

#movl $0, %ecx
#movl $2, %edi
#movl $22, IntegerArray(%ecx,%edi , 4)

li 7,2
slwi 8,7,2  # extended mnemonic - rlwinm 8,7,2,0,31-2
lis 9,IntegerArray@ha
addi 9,9,IntegerArray@l
lwzx 10,9,8

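Load an element index, shift to get the byte offset, load the address and use the x-form Load Word to fetch from the (base address + offset). Note that the ia32 original stores 22 to that location, while this snippet reads from it; the matching store would be the x-form Store Word (stwx) with the same base and offset registers.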
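To convince myself that the slwi comment is right, here is a rough Python model of rlwinm (rotate left, then AND with a mask running from bit MB to bit ME, with bit 0 as the most significant bit), with slwi defined on top of it and used to index a packed copy of IntegerArray. The base address of 0 is an assumption for the sketch.

```python
import struct


def rotl32(value, n):
    """Rotate a 32-bit value left by n bits."""
    value &= 0xFFFFFFFF
    n %= 32
    return ((value << n) | (value >> (32 - n))) & 0xFFFFFFFF


def mask(mb, me):
    """Ones from bit mb through bit me, IBM numbering (bit 0 is the MSB)."""
    bits, i = 0, mb
    while True:
        bits |= 1 << (31 - i)
        if i == me:
            return bits
        i = (i + 1) % 32  # wraps around when mb > me


def rlwinm(rs, sh, mb, me):
    return rotl32(rs, sh) & mask(mb, me)


def slwi(rs, n):
    # extended mnemonic: slwi ra,rs,n == rlwinm ra,rs,n,0,31-n
    return rlwinm(rs, n, 0, 31 - n)


def lwzx(memory, base, index):
    """Load a big-endian word from base + index."""
    return struct.unpack_from(">I", memory, base + index)[0]


memory = struct.pack(">5i", 10, 20, 30, 40, 50)  # IntegerArray
byte_offset = slwi(2, 2)                          # element 2 -> byte 8
loaded = lwzx(memory, 0, byte_offset)             # base 0 is hypothetical
print(byte_offset, loaded)
```

The mask is what turns the rotate into a shift: the bits that wrap around into the bottom of the word are cleared, so slwi(0xC0000001, 2) comes out as 4 rather than 7.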

(The z in that mnemonic refers to zeroing of the upper 32 bits of the 64-bit register. On 64-bit implementations there are also algebraic loads, lwa and lwax, that sign-extend instead.)
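The difference only shows up for negative values. Here is a minimal Python sketch of the two behaviours on a 64-bit register, with `lwz` and `lwa` as stand-in function names for the zero-extending and algebraic loads.

```python
import struct

MASK64 = (1 << 64) - 1


def lwz(memory, ea):
    """Load word and zero: upper 32 bits of the result are cleared."""
    return struct.unpack_from(">I", memory, ea)[0]


def lwa(memory, ea):
    """Load word algebraic: bit 31 of the word is copied upward."""
    return struct.unpack_from(">i", memory, ea)[0] & MASK64


memory = struct.pack(">i", -2)   # the word 0xFFFFFFFE
zeroed = lwz(memory, 0)
extended = lwa(memory, 0)
print(hex(zeroed), hex(extended))
```

For a positive word such as the 2 stored at Int32, both forms give the same register contents.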

7. Indirect addressing

#movl $Int32, %eax
#movl (%eax), %ebx

lis 11,Int32@ha
addi 11,11,Int32@l
lwz 12,0(11)

#movl $9, (%eax)

li 13,9
stw 13,0(11)

More of the same kind of thing because that’s how PPC does loads and stores.
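Every one of these loads and stores is the same d(rA) form: the effective address is the base register plus a sign-extended 16-bit displacement. Here is a rough Python model of that, using a bytearray as memory. The register numbers follow the listing, but the memory contents and the base address of 0 are made up for the sketch.

```python
import struct


def sign_extend16(value):
    """Treat the low 16 bits of value as a signed displacement."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def effective_address(regs, ra, d):
    # d(rA): rA == 0 means a base of zero rather than gpr0
    base = 0 if ra == 0 else regs[ra]
    return base + sign_extend16(d)


def stw(memory, regs, rs, d, ra):
    """Store the low 32 bits of regs[rs] big-endian at d(rA)."""
    ea = effective_address(regs, ra, d)
    struct.pack_into(">I", memory, ea, regs[rs] & 0xFFFFFFFF)


def lwz(memory, regs, rt, d, ra):
    """Load a big-endian word from d(rA), zero-extended."""
    ea = effective_address(regs, ra, d)
    regs[rt] = struct.unpack_from(">I", memory, ea)[0]


memory = bytearray(struct.pack(">5i", 10, 20, 30, 40, 50))
regs = [0] * 32
regs[11] = 0                    # pointer register (hypothetical address 0)
regs[13] = 9
stw(memory, regs, 13, 0, 11)    # like: stw 13,0(11)
lwz(memory, regs, 12, 0, 11)    # like: lwz 12,0(11)
lwz(memory, regs, 14, 4, 11)    # a non-zero displacement reaches the next word
print(regs[12], regs[14])
```

The displacement is the only addressing help the instruction gives; anything more elaborate has to be computed into a register first, as in the indexed example.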

Concluding thoughts

Reasonably straightforward, a bit more limited than ia32 in addressing modes but nothing too surprising. Particularly compared to SPU.

PPC does appear to have some more interesting load and store instructions that I haven’t tried here. The load/store with update forms, which write the effective address back to the base register, stand out as something I’d like to take a closer look at. The PPC rotate and mask instructions look like some mind-bending fun to play with, but that’s something for another time.

Previous assembly primer notes…

Part 1 — System Organization — PPC — SPU
Part 2 — Memory Organisation — SPU
Part 3 — GDB Usage Primer — PPC & SPU
Part 4 — Hello World — PPC — SPU
Part 5 — Data Types — PPC & SPU

Assembly Primer Part 6 — Moving Data — PPC