These are my notes for where I can see SPU varying from ia32, as presented in the video Part 6 — Moving Data.
SPU and ia32 differ significantly when it comes to moving/copying data around, in terms of the ways things can be copied, the alignment of data in memory and the vector nature of SPU registers.
This is the quick SPU instruction reference I use. The SPU ISA doc is worth having nearby if trying to do silly tricks with SPU instructions.
Moving Data
Let's consider what MovDemo.s might look like for SPU, piece by piece.
First, the storage:
    # Demo program to show how to use Data types and MOVx instructions
    .data
    HelloWorld:
        .ascii "Hello World!"
    ByteLocation:
        .byte 10
    Int32:
        .int 2
    Int16:
        .short 3
    Float:
        .float 10.23
    IntegerArray:
        .int 10,20,30,40,50
Icky. Not naturally aligned for size, and crossing qword boundaries. This makes the following code particularly messy, because the SPU really doesn’t like unaligned data. Oh well, let’s begin.
1. Immediate value to register
    .align 3    # ensure code is aligned after awkwardly arranged data
    .text
    .globl _start
    _start:
        #movl $10, %eax
        il $5, 10
Well, that was easy enough.
It does get a little trickier if trying to load more complex values. Loading a full 32 bits of immediate data into a register takes two instructions: ilhu to load the upper halfword and iohl to OR in the lower halfword.
Loading arbitrary values spanning multiple words is more complex. Often the simplest approach is to store the constant in .rodata and load it when needed, although fsmbi can be useful on occasion.
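As a rough sketch of those two cases, the following loads a 32-bit constant with ilhu and iohl, then a whole quadword constant from .rodata. The registers ($40, $41), the constant values and the label QwordConstant are made up for illustration and are not part of MovDemo.s.

```asm
    .section .rodata
    .balign 16                  # start on a quadword boundary
QwordConstant:                  # hypothetical label
    .int 0x00010203, 0x04050607, 0x08090a0b, 0x0c0d0e0f

    .text
    # 32-bit immediate 0x12345678 in two steps
    ilhu $40, 0x1234            # upper halfword of every word slot = 0x1234, lower = 0
    iohl $40, 0x5678            # OR 0x5678 into the lower halfword of every word slot

    # quadword constant fetched from local store
    lqa  $41, QwordConstant     # aligned, so a single load is enough
```

Note that both il-style instructions write the same value into all four word slots, not just the preferred one.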
2. Immediate value to local store
    #movw $50, Int16
    # Write first byte
    ila $6,Int16            # load the address of the target
    lqa $7,Int16            # load the qword
    il $8,0                 # load the upper byte of the constant into preferred slot
    cbd $9,0($6)            # insertion mask for higher byte
    shufb $10,$8,$7,$9      # shuffle the upper byte into the qword
    stqa $10,Int16          # write the qword back
    # Write second byte
    il $11,1                # load 1 for address offsetting
    cbd $12,1($6)           # insertion mask for lower byte
    il $13,50               # load the lower byte of the constant into preferred slot
    lqx $14,$6,$11          # load Int16+1
    shufb $15,$13,$14,$12   # shuffle lower byte into qword
    stqx $15,$6,$11         # write back qword
Writing two bytes to a non-aligned memory location is a messy business.
- All loads and stores transfer 16 bytes of data between registers and 16-byte aligned memory locations.
- The chd instruction, which generates the mask used to shuffle a halfword into a quadword, assumes that the halfword is aligned in the first place. As it is not, and could be crossing a quadword boundary, I write it here as two bytes.
The lqa instruction (a-form) is used for the first load, to fetch from an absolute local store address. lqx (x-form) is used for the second load, because the address needs to be offset by one. At first glance, lqa and lqd appeared to be better choices, but neither encodes the low-order bits of its immediate (a-form drops the low two bits of the address, d-form the low four bits of the offset), which would have lost the one-byte offset. With x-form, the zeroing of the lower bits happens after the preferred word slots of the two registers are added, so a carry into the next quadword is preserved.
I suspect that this can be done better.
(It’s worth taking a look at Jaymin’s post on write-combining which looks more deeply at this kind of problem)
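For comparison, here is a sketch of the same store when the halfword is naturally aligned, so that chd applies and a single read-modify-write is enough. AlignedInt16 is a hypothetical label, not part of MovDemo.s, and the register numbers are arbitrary.

```asm
    .data
    .balign 16
AlignedInt16:                   # hypothetical, naturally aligned halfword
    .short 3

    .text
    ila   $40, AlignedInt16     # address of the target
    lqa   $41, AlignedInt16     # quadword containing it
    il    $42, 50               # value; bytes 2-3 of the preferred word slot = 0x0032
    chd   $43, 0($40)           # halfword insertion mask for that address
    shufb $44, $42, $41, $43    # merge the halfword into the quadword
    stqa  $44, AlignedInt16     # write the quadword back
```

Six instructions instead of twelve, and only one load and one store.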
3. Register to register
    #movl %eax, %ebx
    ori $16, $5, 0
This uses the Or Immediate instruction to perform a bitwise Or against zero, which copies the whole register. The same copy can also be done on the odd pipeline with a quadword shift or rotate immediate instruction.
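A small sketch of the copy on each pipeline. The destination registers $40 to $42 are arbitrary choices for illustration.

```asm
    ori      $40, $5, 0         # even pipeline: $40 = $5 | 0
    shlqbyi  $41, $5, 0         # odd pipeline: quadword shift left by zero bytes
    rotqbyi  $42, $5, 0         # odd pipeline: quadword rotate by zero bytes
```

Which one to pick depends on which pipeline has a free slot in the surrounding code.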
Copying smaller portions of a register will require extra instruction(s) for masking or rotation, depending on what exactly needs to be copied and to where.
4. Local store to register
    #movl Int32, %eax
    ila $16,Int32           # load the address
    lqa $17,Int32           # load the vector containing the first part
    lqa $18,Int32+3         # load the vector containing the second part
    fsmbi $19,0xf000        # create a mask for merging them
    rotqby $20,$17,$16      # rotate the first part into position
    rotqby $21,$18,$16      # rotate the second part into position
    rotqby $22,$19,$16      # rotate the select mask into position
    selb $23,$20,$21,$22    # select things into the right place.
This one is a different class of fun: loading four bytes that span two quadwords. fsmbi generates a mask that is used to combine bytes from the two quadwords.
Again, I suspect there’s a better way to do it.
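If the word were at least 4-byte aligned it could never straddle two quadwords, and the load would reduce to one lqa plus a rotate. A sketch, assuming a hypothetical AlignedInt32 that is not part of MovDemo.s:

```asm
    .data
    .balign 4
AlignedInt32:                   # hypothetical, naturally aligned word
    .int 2

    .text
    ila    $40, AlignedInt32    # address; its low 4 bits give the byte position
    lqa    $41, AlignedInt32    # quadword containing the word
    rotqby $42, $41, $40        # rotate left so the word lands in the preferred slot
```

With 16-byte alignment the word is already in the preferred slot after the lqa and the rotate can be dropped as well.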
5. Register to local store
    #movb $3, %al
    #movb %al, ByteLocation
    ila $24,ByteLocation
    lqa $25, ByteLocation
    il $26,3
    cbd $27, 0($24)
    shufb $28, $26, $25, $27
    stqa $28, ByteLocation
Essentially the same problem as 2. on the SPU, but a little simpler because it’s only a single byte.
6. Register to indexed memory location
    #movl $0, %ecx
    #movl $2, %edi
    #movl $22, IntegerArray(%ecx,%edi,4)
    il $29,2                # load the index
    ila $30,IntegerArray    # load the address of the array
    shli $31,$29,2          # shift two left to get byte offset
    lqx $32,$30,$31         # load from the sum of the two addresses
    # and then write the data to memory...
This is an attempt at a comparable indexed access to local store. All I’ve done here is the address calculation and load — writing the value is a mess because it’s not aligned and spans two quadwords, so something like that done in 2. would be required.
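For what it's worth, here is a sketch of the complete store under the assumption that the array is aligned, so that no element spans two quadwords. AlignedArray is a hypothetical stand-in for IntegerArray, and the register numbers are arbitrary. cwx is the x-form word insertion mask generator, which takes its address from the sum of two registers just as lqx and stqx do.

```asm
    .data
    .balign 16
AlignedArray:                   # hypothetical aligned copy of IntegerArray
    .int 10,20,30,40,50

    .text
    il    $40, 2                # index
    ila   $41, AlignedArray     # base address
    shli  $42, $40, 2           # byte offset = index * 4
    il    $43, 22               # value to store, in the preferred word slot
    lqx   $44, $41, $42         # quadword containing the element
    cwx   $45, $41, $42         # word insertion mask for base + offset
    shufb $46, $43, $44, $45    # merge the word into the quadword
    stqx  $46, $41, $42         # write the quadword back
```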
7. Indirect addressing
    movl $Int32, %eax
    movl (%eax), %ebx
    movl $9, (%eax)
I’ve done these before (essentially) in 2. and 4.
Concluding thoughts
Align your data for the SPU. This would have all been much, much simpler (and not much of a challenge) if the data was aligned and variables were in preferred slots. I suspect I’ll simplify the later parts of the series for SPU by aligning the data first.
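One possible layout along those lines, as a sketch only: each variable gets its own quadword, and the scalars are padded so that they sit in the preferred slot for their size (byte 3 for a byte, bytes 2-3 for a halfword, bytes 0-3 for a word). The padding via .skip is my assumption about how one might arrange this, not something taken from the video or from MovDemo.s.

```asm
    .data
    .balign 16
HelloWorld:
    .ascii "Hello World!"

    .balign 16
    .skip 3                     # pad so the byte sits in byte 3 of its quadword
ByteLocation:
    .byte 10

    .balign 16
Int32:
    .int 2                      # bytes 0-3: already the preferred word slot

    .balign 16
    .skip 2                     # pad so the halfword sits in bytes 2-3
Int16:
    .short 3

    .balign 16
Float:
    .float 10.23                # bytes 0-3, same as a word

    .balign 16
IntegerArray:
    .int 10,20,30,40,50         # elements are 4-byte aligned, none spans a quadword
```

This trades local store space for simpler code, which may or may not be acceptable depending on how tight memory is.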
I found some useful fragments amongst the Introduction to SPU Optimizations presentations from Naughty Dog — they’re a very good read: Part 1 & Part 2.
Previous assembly primer notes…
Part 1 — System Organization — PPC — SPU
Part 2 — Memory Organisation — SPU
Part 3 — GDB Usage Primer — PPC & SPU
Part 4 — Hello World — PPC — SPU
Part 5 — Data Types — PPC & SPU