Assembly Primer Part 6 — Moving Data — PPC

These are my notes for where I can see PPC varying from ia32, as presented in the video Part 6 — Moving Data.

There are notable differences between PPC and ia32 when moving/copying data around, although fewer than with SPU. PPC copes with non-natural alignments, but doesn’t have the full range of mov instructions supported by ia32. (I’m not sure what performance penalties there are on Cell or modern ia32 arches for unaligned accesses; at the very least, there’s the cost of the occasional extra cache line.)

(When approaching this part, I wrote the SPU version first because I’ve had a lot more experience with that arch and I thought it would be quicker. I was wrong.)

Moving Data

So, let’s look at MovDemo.s for PPC, piece by piece.

First, the storage:

# Demo program to show how to use Data types and MOVx instructions 

.data
    HelloWorld:
        .ascii "Hello World!"

    ByteLocation:
        .byte 10

    Int32:
        .int 2
    Int16:
        .short 3
    Float:
        .float 10.23

    IntegerArray:
        .int 10,20,30,40,50

Same as for ia32 and SPU.  PPC will cope with the lack of alignment.

1. Immediate value to register

.text
    .globl _start
    _start:
        #movl $10, %eax
        li 0,10

Simple enough to load a small constant into a register.  Like SPU, there’s extra work required if trying to load more complex values.

To load a full 32 bits of immediate data into a register requires two instructions, one supplying the upper half-word and one the lower (as will be seen for loading addresses).  Loading 64-bit values appears to take five instructions (four immediate loads and a rotate in the middle).  The joys of 32-bit, fixed-length instructions :)
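As a rough check of those instruction counts, here is a small Python model (not PPC code) of the sequences I have in mind: lis + ori for 32 bits, and lis, ori, sldi 32, oris, ori for 64 bits. The Python function names are mine and only mimic the instructions they are named after. The exact 64-bit sequence is an assumption about what an assembler or compiler would typically produce.

```python
MASK64 = (1 << 64) - 1

def sign_extend16(value):
    """Interpret the low 16 bits of value as a signed halfword."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value

def lis(imm16):
    """lis rD,imm: sign-extended immediate shifted left 16 bits."""
    return (sign_extend16(imm16) << 16) & MASK64

def ori(reg, imm16):
    """ori rA,rS,imm: OR in a zero-extended 16-bit immediate."""
    return reg | (imm16 & 0xFFFF)

def oris(reg, imm16):
    """oris rA,rS,imm: OR in the immediate shifted left 16 bits."""
    return reg | ((imm16 & 0xFFFF) << 16)

def sldi(reg, n):
    """sldi rA,rS,n: shift left doubleword, dropping bits shifted out."""
    return (reg << n) & MASK64

def load_imm32(value):
    """Two instructions: lis for the upper half, ori for the lower half."""
    reg = lis(value >> 16)
    reg = ori(reg, value)
    return reg & 0xFFFFFFFF  # only the low word is of interest here

def load_imm64(value):
    """Five instructions: lis, ori, sldi 32, oris, ori."""
    reg = lis(value >> 48)
    reg = ori(reg, value >> 32)
    reg = sldi(reg, 32)
    reg = oris(reg, value >> 16)
    reg = ori(reg, value)
    return reg
```

The shift in the middle also pushes out any sign-extension bits left behind by lis, which is why the model works for constants with the top bit set.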

Aside:  li and lis are extended mnemonics that generate addi and addis instructions.  If the source register operand of those instructions is 0, they use the value zero, not the value in gpr0.  Special case ftw.
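A minimal Python sketch of that special case, assuming a register file modelled as a plain list of 64-bit values (the helper names mirror the mnemonics but are otherwise my own):

```python
MASK64 = (1 << 64) - 1

def sign_extend16(value):
    """Interpret the low 16 bits of value as a signed halfword."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value

def addi(regs, rd, ra, simm):
    """addi rD,rA,SIMM: a register operand of 0 means the value zero, not gpr0."""
    base = 0 if ra == 0 else regs[ra]
    regs[rd] = (base + sign_extend16(simm)) & MASK64

def addis(regs, rd, ra, simm):
    """addis rD,rA,SIMM: as addi, with the immediate shifted left 16 bits."""
    base = 0 if ra == 0 else regs[ra]
    regs[rd] = (base + (sign_extend16(simm) << 16)) & MASK64

def li(regs, rd, simm):
    """li rD,SIMM is addi rD,0,SIMM."""
    addi(regs, rd, 0, simm)

def lis(regs, rd, simm):
    """lis rD,SIMM is addis rD,0,SIMM."""
    addis(regs, rd, 0, simm)

regs = [0] * 32
regs[0] = 999          # gpr0 holds something non-zero
li(regs, 3, 10)        # still loads exactly 10
lis(regs, 4, 1)        # loads 0x10000
addi(regs, 5, 3, 5)    # a non-zero register number is read as usual: 10 + 5
```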

Speculating: On SPU, loads from local store take 6 cycles, so it will often be quicker to load a value than to generate it.  On PPC, it would seem that even five instructions will complete much faster than a (potential) L2 cache miss.

2. Immediate value to memory

#movw $50, Int16

li 1,50
lis 2,Int16@ha
addi 2,2,Int16@l
sth 1,0(2)

There’s no instruction to write an immediate value directly to memory — the source for a write must be a register, so we load that first.

The address is loaded by the next two instructions.  @l is the lower 16 bits of the address.  @ha refers to the upper 16 bits of the address, where the a indicates the value is “adjusted so that adding the low 16 bits will perform the correct calculation of the address accounting for signed arithmetic” (from here, where these suffixes are documented).
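The adjustment is easier to see with numbers. The Python below is only a model of the arithmetic as I understand it: addi sign-extends its 16-bit immediate, so when bit 15 of the address is set the low half acts as a negative number and the high half has to be bumped by one to compensate. The function names are mine, and 0x10018004 is a made-up address chosen to have bit 15 set.

```python
def sign_extend16(value):
    """Interpret the low 16 bits of value as a signed halfword."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value

def lo(addr):
    """sym@l: the low 16 bits of the address."""
    return addr & 0xFFFF

def hi(addr):
    """sym@h: the plain (unadjusted) high 16 bits."""
    return (addr >> 16) & 0xFFFF

def ha(addr):
    """sym@ha: high 16 bits, adjusted for the sign extension of @l."""
    return ((addr + 0x8000) >> 16) & 0xFFFF

def lis_addi(upper, lower):
    """Model of: lis rD,upper ; addi rD,rD,lower (32-bit result)."""
    return ((upper << 16) + sign_extend16(lower)) & 0xFFFFFFFF

addr = 0x10018004                         # hypothetical address, bit 15 set
adjusted = lis_addi(ha(addr), lo(addr))   # 0x10020000 - 0x7FFC
naive = lis_addi(hi(addr), lo(addr))      # 0x10010000 - 0x7FFC, off by 0x10000
```

When bit 15 of the address is clear, @h and @ha give the same value and the adjustment makes no difference.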

The halfword is then written to the address stored in gpr2.
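To make the effect of sth concrete, here is a small Python model (again not PPC code; the function name and the bytearray standing in for memory are mine). It assumes big-endian byte order, as on Cell, and shows that only the low 16 bits of the register reach memory.

```python
def sth(mem, ea, reg):
    """Model of sth rS,d(rA): store the low halfword of reg at address ea."""
    mem[ea:ea + 2] = (reg & 0xFFFF).to_bytes(2, "big")

# Int16 initially holds 3, as in the .data section above.
memory = bytearray((3).to_bytes(2, "big"))
sth(memory, 0, 50)   # li 1,50 ; sth 1,0(2)
```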

3. Register to register

#movl %eax, %ebx
ori 3,0,0

Like SPU, register copy can be done with Or Immediate against zero.

4. Memory to register

#movl Int32, %eax

lis 4,Int32@ha
addi 4,4,Int32@l
lwz 5,0(4)

Easy enough — load the address, load from the address.

5. Register to memory

#movb $3, %al
#movb %al, ByteLocation

li 6,3
lis 7,ByteLocation@ha
addi 7,7,ByteLocation@l
stb 6,0(7)

Again, load the address, store a byte to the address.

6. Register to indexed memory location

#movl $0, %ecx
#movl $2, %edi
#movl $22, IntegerArray(%ecx,%edi , 4)

li 7,2
slwi 8,7,2  # extended mnemonic - rlwinm 8,7,2,0,31-2
lis 9,IntegerArray@ha
addi 9,9,IntegerArray@l
lwzx 10,9,8

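Load an element index, shift to get the byte offset, load the base address and use the x-form Load Word to fetch from (base address + offset).  Note that the ia32 original stores 22 into the array element, whereas this sequence loads the element; the matching indexed store would be stwx, with the value to store in a register.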
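The slwi/rlwinm equivalence and the address calculation can be sketched in Python. This is a 32-bit model under my own assumptions: IBM bit numbering (bit 0 is the most significant bit), big-endian memory, and the array placed at address 0 of a bytes object purely for illustration.

```python
def rotl32(x, n):
    """Rotate a 32-bit value left by n bits."""
    n %= 32
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF

def mask32(mb, me):
    """Ones from bit mb through bit me inclusive, where bit 0 is the MSB."""
    bits = 0
    i = mb
    while True:
        bits |= 1 << (31 - i)
        if i == me:
            break
        i = (i + 1) % 32   # masks may wrap around when mb > me
    return bits

def rlwinm(rs, sh, mb, me):
    """Rotate Left Word Immediate then AND with Mask."""
    return rotl32(rs, sh) & mask32(mb, me)

def slwi(rs, n):
    """slwi rA,rS,n is rlwinm rA,rS,n,0,31-n."""
    return rlwinm(rs, n, 0, 31 - n)

def lwzx(mem, base, offset):
    """Load the word at effective address base + offset (big-endian)."""
    ea = base + offset
    return int.from_bytes(mem[ea:ea + 4], "big")

# IntegerArray from the .data section, laid out big-endian at address 0.
INTEGER_ARRAY = b"".join(v.to_bytes(4, "big") for v in (10, 20, 30, 40, 50))

index = 2
offset = slwi(index, 2)                  # element index to byte offset
value = lwzx(INTEGER_ARRAY, 0, offset)   # fetches the third element
```

The mask is what turns the rotate into a shift: it clears the bits that wrapped around from the top of the word.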

(The z in that mnemonic refers to zeroing of the upper 32 bits of the 64-bit register.  There appears to be an algebraic load, lwa/lwax, that does sign extension instead.)
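The difference between the zeroing and algebraic forms can be modelled in Python like so. This is a sketch assuming 64-bit registers and big-endian memory; I'm taking lwa to be the algebraic counterpart of lwz, and the function names simply borrow the mnemonics.

```python
MASK64 = (1 << 64) - 1

def lwz(mem, ea):
    """Load word and zero: the upper 32 bits of the register are cleared."""
    return int.from_bytes(mem[ea:ea + 4], "big")

def lwa(mem, ea):
    """Load word algebraic: the word is sign-extended to 64 bits."""
    return int.from_bytes(mem[ea:ea + 4], "big", signed=True) & MASK64

# A word holding -2, as the assembler would emit for ".int -2".
memory = (-2).to_bytes(4, "big", signed=True)

zero_extended = lwz(memory, 0)   # upper half is all zeros
sign_extended = lwa(memory, 0)   # upper half is all ones
```

For non-negative words the two loads give the same register contents, so the choice only matters for signed 32-bit data used in 64-bit arithmetic.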

7. Indirect addressing

#movl $Int32, %eax
#movl (%eax), %ebx

lis 11,Int32@ha
addi 11,11,Int32@l
lwz 12,0(11)

#movl $9, (%eax)

li 13,9
stw 13,0(11)

More of the same kind of thing because that’s how PPC does loads and stores.

Concluding thoughts

Reasonably straightforward, a bit more limited than ia32 in addressing modes but nothing too surprising, particularly compared to SPU.

PPC does appear to have some more interesting load and store instructions that I haven’t tried here — updating index on load/store stands out as something I’d like to take a closer look at.  The PPC rotate and mask instructions look like some mind-bending fun to play with, but that’s something for another time.

Previous assembly primer notes…

Part 1 — System Organization — PPC — SPU
Part 2 — Memory Organisation — SPU
Part 3 — GDB Usage Primer — PPC & SPU
Part 4 — Hello World — PPC — SPU
Part 5 — Data Types — PPC & SPU

Assembly Primer Part 5 — Data Types

These are my notes for where I can see both PPC and SPU varying from ia32, as presented in the video Part 5 — Data Types.  There’s not a lot to be said about this one, so there’s just the one post for both PPC and SPU architectures.

The main problem with assembling the provided VariableDemo.s is that gas doesn’t seem to like the .bss section for either PPC or SPU, producing an error.  To be able to assemble this file on these architectures, I removed the .bss line and (obviously) removed (replaced) the ia32 asm instructions.  objdump shows that “.comm LargeBuffer 10000” is placed in .bss, as intended.

(At this point, I’m quite out of my depth as to why this difference between the architectures exists — if someone can enlighten me, that’d be great :)

I was interested to see that gdb has no problem accessing the unaligned variables on the SPU.  It’s worth noting that the assembler is quite happy to let you place data wherever you like (with great power comes great etc.).  And I think I need to take a closer look at the .align directive.

Previous assembly primer notes…

Part 1 — System Organization — PPC — SPU
Part 2 — Memory Organisation — SPU
Part 3 — GDB Usage Primer — PPC & SPU
Part 4 — Hello World — PPC — SPU

Assembly Primer Part 4 — Hello World — PPC

These are my notes for where I can see PPC varying from ia32, as presented in the video Part 4 — Hello World.  Let me know if I’ve missed something important, obvious or got something wrong.

http://www.ibm.com/developerworks/library/l-ppc/ gives a good starting overview of PPC asm, including syscalls.  The syscall number goes into gpr0 and the args in gpr3 and following, so JustExit.s becomes:

.text
.globl _start

_start:
    li 0,1 # load 1 into reg 0
    li 3,0 # load 0 into reg 3
    sc     # system call

Simple enough.

Modifying the provided HelloWorldProgram.s example (and using the example from the above link) yields

.data

HelloWorldString:
    .ascii "Hello World\n"

.text

.globl _start

_start:
    # Load all the arguments for write ()

    li   0, 4  # syscall number of 4 (write)
    li   3, 1  # filenumber 1 (stdout)
    lis  4, HelloWorldString@ha   # load upper 16 bits of addr
    addi 4, 4, HelloWorldString@l # add lower 16 bits of addr
    li   5, 12 # length of string
    sc

    # exit the program
    li 0,1
    li 3,0
    sc

There’s some subtlety in the @ha and @l high and low parts of addresses that I don’t yet have my head around fully, but I’ll be coming back to this in a later part.

Previous assembly primer notes…

Part 1 — System Organization — PPC — SPU
Part 2 — Memory Organisation — SPU
Part 3 — GDB Usage Primer — PPC & SPU

Assembly Primer Part 3 — GDB Usage Primer

These are my notes for where I can see both PPC and SPU varying from ia32, as presented in the video Part 3 — GDB Usage Primer.  The usage of gdb is effectively the same for all three architectures — I’ve noted here some of the differences in the program being debugged.

In the ia32 disassembly of SimpleDemo.c, the call instruction is generated for function calls.

When compiled for PPC, I see bl — branch to address offset from bl instruction, placing the address of the following instruction in the link register (lr).

When compiled for SPU, I see brsl — branch to address offset from brsl instruction, placing the address of the following instruction into the specified register (typically r0, used as link register).

Neither PPC nor SPU pass args on the stack (at least not for two scalar args as for the add function in SimpleDemo.c).  Those values can still be seen as being present on the stack when examining it in gdb.  The reason appears to be that when compiled with no optimisation, a number of registers are pushed to the stack that are not needed.  Compiling at -O1 eliminates the superfluous pushes, so the args are no longer visible there, being present in the appropriate registers when the function is called.

(This document on calling conventions from Intel seems to say that args get passed to functions in regs where possible on ia32 as well… I can see it happening for amd64, not ia32)

As noted above, PPC and SPU store the function return address in the link register (lr or r0), not on the stack.

All three architectures appear to put the return value in a register (eax or r3).

Previous assembly primer notes…

Part 1 — System Organization — PPC — SPU
Part 2 — Memory Organisation — SPU

Assembly Primer Part 1 — System Organization — PPC

I found the videos introducing assembly language here to be of interest as my own understanding and experience with assembly is quite limited.  The videos are focussed on the ia32 architecture and reverse engineering, particularly for security exploits, and while these aspects don’t really excite me, the videos are well made and clear and the “Assembly Primer for Hackers” videos cover general assembly programming and details of machine architecture that are more broadly applicable.

I thought it would be interesting to work from these videos and make some notes on applying the concepts to the Cell BE’s PPU and SPU.

The platform I’m using is Debian Sid on a PS3 (3.15 OtherOS) with the standard system toolchain.

These are my notes for where I can see the PPU varying from the ia32, as presented in the video Part 1 — System Organization.  Let me know if I’ve missed something important, obvious or got something wrong.

For reference, I’m using the PPC Architecture Books (also found in the Cell SDK) and documentation for the PPC64 ABI.

Registers

Branch Processor

  • CR Condition Register — 32-bit. Provides a mechanism for testing (and branching). Eight 4-bit fields.
    • CR0–CR1 — Volatile condition code register fields
    • CR2–CR4 — Nonvolatile condition code register fields
    • CR5–CR7 — Volatile condition code register fields
  • LR Link Register (volatile) — 64-bit.  Can be used to provide branch target address for Branch Conditional to Link Register instruction
  • CTR Count Register (volatile) — 64-bit.  Can be used to hold a loop count that can be decremented during execution of Branch instructions containing appropriately coded BO field.  Also can be used to provide branch target address for the Branch Conditional to Count Register instruction.
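As an illustration of the "eight 4-bit fields" layout, here is a small Python sketch that pulls a field out of a 32-bit CR value. It assumes IBM bit numbering (CR0 is the most significant nibble) and the LT, GT, EQ, SO meaning that the bits have when a field is set by an integer compare. The helper names are my own.

```python
CR_BIT_NAMES = ("LT", "GT", "EQ", "SO")

def cr_field(cr, n):
    """Return the 4-bit field CRn from a 32-bit CR value (CR0 is the top nibble)."""
    shift = 28 - 4 * n
    return (cr >> shift) & 0xF

def decode_field(nibble):
    """List the names of the bits set in a 4-bit CR field."""
    return [name for i, name in enumerate(CR_BIT_NAMES) if nibble & (8 >> i)]

cr = 0x20000008                        # example value: CR0 has EQ set, CR7 has LT set
cr0 = decode_field(cr_field(cr, 0))
cr7 = decode_field(cr_field(cr, 7))
```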

Fixed-Pt Processing

  • GPR0–GPR31 — 64-bit General Purpose registers. Byte, halfword, word or doubleword, depending on instruction flags.  Supports byte, halfword, word, doubleword operand fetches and stores to storage.
    • r0 — Volatile register used in function prologs
    • r1 — Stack frame pointer
    • r2 — TOC pointer
    • r3 — Volatile parameter and return value register
    • r4–r10 — Volatile registers used for function parameters
    • r11 — Volatile register used in calls by pointer and as an environment pointer for languages which require one
    • r12 — Volatile register used for exception handling and glink code
    • r13 — Reserved for use as system thread ID
    • r14–r31 — Nonvolatile registers used for local variables
  • XER Fixed-Point Exception Register (volatile) — 64-bit.  Book I, p 32 has the details on this one.
  • MSR Machine State Register — 64-bit.  Defines the state of the processor. See Book III, p 10.

Float-Pt Processing

  • FPR0–FPR31 Floating-Point Registers — 64-bit. Single or double precision, depending on instruction flags.  Supports word and doubleword operand fetches and stores to storage.
    • f0 — Volatile scratch register
    • f1–f4 — Volatile floating point parameter and return value registers
    • f5–f13 — Volatile floating point parameter registers
    • f14–f31 — Nonvolatile registers
  • FPSCR Floating-Point Status and Control Register (volatile) — 32-bit. Status and control bits. See Book I, pp 87–89.

VMX

  • v0–v1 — Volatile scratch registers
  • v2–v13 — Volatile vector parameters registers
  • v14–v19 — Volatile scratch registers
  • v20–v31 — Non-volatile registers
  • vrsave — Non-volatile 32-bit register

Privileged

  • SRR0 & SRR1 Machine Status Save/Restore registers — 64-bit. Used to store machine state when an interrupt occurs.
  • DAR Data Address Register — 64-bit. Set by various interrupts to effective address associated with interrupt.
  • DSISR Data Storage Interrupt Status Register — 32-bit. Set to indicate the cause of various interrupts.
  • SPRG0–SPRG3 Software-use SPRs — 64-bit.  For use by privileged software.
  • CTRL Control Register — 32-bit. Controls an external I/O pin.
  • PVR Processor Version Register — 32-bit. Read-only.
  • PIR Processor Identification Register — 32-bit.

(there’s some hypervisor regs not listed here, and probably others…)

Virtual Memory Model

  • Default/standard location for .text appears to be 0x1000000
  • stack starts at 0xffffffff