Assembly Primer Part 6 — Moving Data — SPU

These are my notes for where I can see SPU varying from ia32, as presented in the video Part 6 — Moving Data.

SPU and ia32 differ significantly when it comes to moving/copying data around, in terms of the ways things can be copied, the alignment of data in memory and the vector nature of SPU registers.

This is the quick SPU instruction reference I use.  The SPU ISA doc is worth having nearby if trying to do silly tricks with SPU instructions.

Moving Data

Let's consider what MovDemo.s might look like for SPU, piece by piece.

First, the storage:

# Demo program to show how to use Data types and MOVx instructions 

.data
    HelloWorld:
        .ascii "Hello World!"

    ByteLocation:
        .byte 10

    Int32:
        .int 2
    Int16:
        .short 3
    Float:
        .float 10.23

    IntegerArray:
        .int 10,20,30,40,50

Icky.  Not naturally aligned for size, and crossing qword boundaries.  This makes the following code particularly messy, because the SPU really doesn’t like unaligned data.  Oh well, let’s begin.
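To make the misalignment concrete, here is a small Python sketch (a host-side model, not SPU code) that works out where each label lands. It assumes the .data section starts on a 16-byte boundary and that the assembler inserts no padding between items; the `LAYOUT` table and `describe` helper are my own names for illustration.

```python
# Hypothetical model of the .data layout above: (label, element size, count).
# Assumes the section starts on a 16-byte boundary and no padding is inserted.
LAYOUT = [
    ("HelloWorld", 1, 12),
    ("ByteLocation", 1, 1),
    ("Int32", 4, 1),
    ("Int16", 2, 1),
    ("Float", 4, 1),
    ("IntegerArray", 4, 5),
]

QWORD = 16  # SPU loads and stores always move one 16-byte quadword


def describe(layout):
    """Return {label: (offset, naturally_aligned, crosses_qword)}."""
    result = {}
    offset = 0
    for label, size, count in layout:
        aligned = offset % size == 0
        # Does any single element straddle a 16-byte boundary?
        crosses = any(
            (offset + i * size) // QWORD != (offset + i * size + size - 1) // QWORD
            for i in range(count)
        )
        result[label] = (offset, aligned, crosses)
        offset += size * count
    return result


for label, (offset, aligned, crosses) in describe(LAYOUT).items():
    print(f"{label:13} offset {offset:2}  aligned={aligned}  crosses_qword={crosses}")
```

Under those assumptions Int32 sits at offset 13 and straddles the first quadword boundary, which is what makes section 4 awkward.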

1. Immediate value to register

    .align 3  # ensure code is aligned after awkwardly arranged data
.text
    .globl _start
    _start:
        #movl $10, %eax
        il $5, 10

Well, that was easy enough.

It does get a little trickier when loading more complex values.  Loading a full 32 bits of immediate data into a register requires two halfword load instructions, one for the upper half (ilhu) and one for the lower half (iohl).

Loading arbitrary values that span multiple words is more complex.  It is often simplest to store the constant in .rodata and load it when needed, although fsmbi can be useful on occasion.
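As a rough illustration of what those instructions compute, the following Python sketch models ilhu, iohl and fsmbi on the host. It is a model under simplifying assumptions, not SPU code: ilhu and iohl are shown for a single 32-bit word slot (the real instructions act on all four slots), and the Python function names merely mirror the mnemonics.

```python
MASK32 = 0xFFFFFFFF


def ilhu(imm16):
    """Model of ilhu for one word slot: the immediate fills the upper halfword."""
    return (imm16 & 0xFFFF) << 16


def iohl(word, imm16):
    """Model of iohl for one word slot: OR the immediate into the lower halfword."""
    return (word | (imm16 & 0xFFFF)) & MASK32


def fsmbi(imm16):
    """Model of fsmbi: each immediate bit, most significant first, becomes a 0x00 or 0xFF byte."""
    return bytes(0xFF if imm16 & (0x8000 >> i) else 0x00 for i in range(16))


# Build a 32-bit constant in two steps, upper half first.
value = iohl(ilhu(0x1234), 0x5678)
print(hex(value))

# A 16-byte mask selecting the first four bytes of a quadword.
print(fsmbi(0xF000).hex())
```

The fsmbi case is the same immediate used later for the select mask in section 4.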

2. Immediate value to local store

#movw $50, Int16

# Write first byte
ila $6,Int16        # load the address of the target
lqa $7,Int16        # load the qword
il $8,0             # load the upper byte of the constant into preferred slot
cbd $9,0($6)        # insertion mask for higher byte
shufb $10,$8,$7,$9  # shuffle the upper byte into the qword
stqa $10,Int16      # write the qword back
# Write second byte
il $11,1            # load 1 for address offsetting
cbd $12,1($6)       # insertion mask for lower byte
il $13,50           # load the lower byte of the constant into preferred slot
lqx $14,$6,$11      # load Int16+1
shufb $15,$13,$14,$12 # shuffle lower byte into qword
stqx $15,$6,$11     # write back qword

Writing two bytes to a non-aligned memory location is a messy business.

  • All loads and stores transfer 16 bytes of data between registers and 16-byte aligned memory locations.
  • The chd instruction, which generates a mask for shuffling a halfword into a quadword, assumes that the halfword is aligned in the first place.  As it is not, and could be crossing a quadword boundary, I write it here as two bytes.

The lqa instruction (a-form) is used for the first load, to fetch from an absolute local store address.  lqx (x-form) is used for the second load, because the address needs to be offset by one.  At first glance lqa and lqd looked like simpler choices, but neither keeps all of the low bits of its immediate: lqd encodes its offset in units of 16 bytes, so a one-byte offset cannot be expressed, and lqa encodes its address in units of 4 bytes.  With x-form, the preferred word slots of the two registers are added first and the low four bits are zeroed only afterwards, so the one-byte offset carries into the next quadword when it needs to.
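The load, shuffle and store sequence can be modelled on the host to check the reasoning. The Python sketch below is an approximation, not SPU code: local store is a bytearray, the helper names mirror the mnemonics but are my own functions, shufb handles only plain byte selection (its special mask codes are left out), and the local store limit mask is ignored.

```python
QWORD = 16


def lq(ls, addr):
    """Quadword load: the low four address bits are ignored."""
    base = addr & ~0xF
    return bytes(ls[base:base + QWORD])


def stq(ls, addr, qword):
    """Quadword store: the low four address bits are ignored."""
    base = addr & ~0xF
    ls[base:base + QWORD] = qword


def il(value):
    """il for a small non-negative value: the same 32-bit word in all four slots."""
    return (value & 0xFFFFFFFF).to_bytes(4, "big") * 4


def cbd(addr):
    """Shuffle mask that inserts byte 3 of shufb's first operand at addr & 15."""
    mask = [0x10 + i for i in range(QWORD)]  # 0x10..0x1F pick the second operand
    mask[addr & 0xF] = 0x03                  # byte 3: low byte of the preferred slot
    return bytes(mask)


def shufb(a, b, mask):
    """Simplified shufb: each mask byte indexes into the 32 bytes of a followed by b."""
    src = a + b
    return bytes(src[m & 0x1F] for m in mask)


def store_halfword_unaligned(ls, addr, value):
    """Write a big-endian halfword as two separate read-modify-write byte inserts."""
    for offset, byte in ((0, value >> 8), (1, value & 0xFF)):
        target = addr + offset
        merged = shufb(il(byte), lq(ls, target), cbd(target))
        stq(ls, target, merged)


local_store = bytearray(range(48))
store_halfword_unaligned(local_store, 17, 50)  # Int16 assumed to sit at offset 17
print(local_store[16:20].hex())
```

Because each byte gets its own load and store, the same routine also works when the halfword starts at offset 15 of a quadword and its second byte falls in the next one.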

I suspect that this can be done better.

(It’s worth taking a look at Jaymin’s post on write-combining, which looks more deeply at this kind of problem.)

3. Register to register

#movl %eax, %ebx
ori $16, $5, 0

This uses the Or Immediate instruction to perform a bitwise Or against zero, which copies the whole register.  The same copy can be done on the odd pipeline with a quadword shift or rotate immediate instruction, using a count of zero.

Copying smaller portions of a register will require extra instruction(s) for masking or rotation, depending on what exactly needs to be copied and to where.

4. Local store to register

#movl Int32, %eax

ila $16,Int32       # load the address
lqa $17,Int32       # load the vector containing the first part
lqa $18,Int32+3     # load the vector containing the second part
fsmbi $19,0xf000    # create a mask for merging them
rotqby $20,$17,$16  # rotate the first part into position
rotqby $21,$18,$16  # rotate the second part into position
rotqby $22,$19,$16  # rotate the select mask into position
selb $23,$20,$21,$22 # select things into the right place.

This one is a different class of fun: loading four bytes that span two quadwords.  fsmbi generates a mask, which is rotated along with the data and then used by selb to combine bytes from the two quadwords.
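To convince myself the rotate-and-select approach holds for any offset, here is a host-side Python model of the sequence. It is a sketch under assumptions: local store is a bytearray, the helper functions are my own stand-ins for lqa, fsmbi, rotqby and selb, and the local store limit mask is ignored.

```python
QWORD = 16


def lq(ls, addr):
    """Quadword load: the low four address bits are ignored."""
    base = addr & ~0xF
    return bytes(ls[base:base + QWORD])


def fsmbi(imm16):
    """Each immediate bit, most significant first, becomes a 0x00 or 0xFF byte."""
    return bytes(0xFF if imm16 & (0x8000 >> i) else 0x00 for i in range(16))


def rotqby(qword, count):
    """Rotate the quadword left by the low four bits of count, in bytes."""
    n = count & 0xF
    return qword[n:] + qword[:n]


def selb(a, b, mask):
    """Bitwise select: mask bit 0 takes the bit from a, mask bit 1 takes it from b."""
    return bytes((x & ~m & 0xFF) | (y & m) for x, y, m in zip(a, b, mask))


def load_word_unaligned(ls, addr):
    """Load a big-endian 32-bit word from any byte address into the preferred slot."""
    first = lq(ls, addr)        # quadword holding the first byte
    second = lq(ls, addr + 3)   # quadword holding the last byte
    mask = fsmbi(0xF000)
    merged = selb(rotqby(first, addr), rotqby(second, addr), rotqby(mask, addr))
    return int.from_bytes(merged[:4], "big")


local_store = bytearray(range(48))
print(hex(load_word_unaligned(local_store, 13)))  # Int32 assumed to sit at offset 13
```

When the word does not cross a boundary, both loads fetch the same quadword, so the select is harmless whichever side the mask picks.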

Again, I suspect there’s a better way to do it.

5. Register to local store

#movb $3, %al
#movb %al, ByteLocation

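ila $24,ByteLocation        # load the address of the target
lqa $25,ByteLocation        # load the qword containing the byte
il $26,3                    # load the value into the preferred slot
cbd $27,0($24)              # insertion mask for the byte
shufb $28,$26,$25,$27       # shuffle the byte into the qword
stqa $28,ByteLocation       # write the qword back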

Essentially the same problem as 2. on the SPU, but a little simpler because it’s only a single byte.

6. Register to indexed memory location

#movl $0, %ecx
#movl $2, %edi
#movl $22, IntegerArray(%ecx,%edi,4)
il $29,2                # load the index
ila $30,IntegerArray    # load the address of the array
shli $31,$29,2          # shift two left to get byte offset
lqx $32,$30,$31         # load the qword at base address + byte offset
# and then write the data to memory...

This is an attempt at a comparable indexed access to local store.  All I’ve done here is the address calculation and load.  Writing the value is messier because the element is not aligned and spans two quadwords, so something like the byte-wise insertion done in 2. would be required.
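The address arithmetic can be checked with a short host-side Python model. This is only a sketch: the base address of 23 for IntegerArray is an assumption (a 16-byte aligned .data section with no padding), `element_access` is a hypothetical helper, and the local store limit mask applied by the real lqx is left out.

```python
def shli(value, count):
    """Model of shli for one 32-bit word slot."""
    return (value << count) & 0xFFFFFFFF


def lqx_address(ra, rb):
    """Model of the lqx effective address: add first, then clear the low four bits."""
    return (ra + rb) & 0xFFFFFFFF & ~0xF


def element_access(base, index):
    """Return (quadword loaded, byte offset within it, whether a second quadword is needed)."""
    byte_offset = shli(index, 2)        # index * 4
    element = base + byte_offset        # address of the 32-bit element
    loaded = lqx_address(base, byte_offset)
    needs_second = ((element + 3) & ~0xF) != loaded
    return loaded, element & 0xF, needs_second


ARRAY_BASE = 23  # assumed offset of IntegerArray
print(element_access(ARRAY_BASE, 2))
```

With those assumptions, element 2 starts at the last byte of its quadword, so a full read or write touches two quadwords.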

7. Indirect addressing

#movl $Int32, %eax
#movl (%eax), %ebx

#movl $9, (%eax)

I’ve done these before (essentially) in 2. and 4.

Concluding thoughts

Align your data for the SPU.  This would all have been much, much simpler (and not much of a challenge) if the data were aligned and the variables were in preferred slots.  I suspect I’ll simplify the later parts of the series for SPU by aligning the data first.

I found some useful fragments amongst the Introduction to SPU Optimizations presentations from Naughty Dog — they’re a very good read: Part 1 & Part 2.

Previous assembly primer notes…

Part 1 — System Organization — PPC & SPU
Part 2 — Memory Organisation — SPU
Part 3 — GDB Usage Primer — PPC & SPU
Part 4 — Hello World — PPC & SPU
Part 5 — Data Types — PPC & SPU