These are my notes for where I can see SPU varying from ia32, as presented in the video Part 7 — Working with Strings.
The ia32 instructions covered in this video (MOVSx, LODSx, STOSx, CMPSx, CLD, STD) clearly highlight many of the differences between that arch and SPU:
- Implied operand registers e.g. MOVSx using %esi and %edi as source and destination addresses.
- Side effects on operands e.g. incrementing addresses while performing reads/writes
- Side effects in the FLAGS register e.g. direction flag, zero flag
- Support for any alignment of data.
I’ve made less effort to re-implement the full functionality of the ia32 instructions for this part. There are a couple of cases where it might be interesting to try, but a fully general version can probably just be lifted from newlib’s memcpy() implementation for SPU.
Again, this is the quick SPU instruction reference I use, and I still regularly refer to the SPU ISA doc.
Working with Strings
Starting with StringBasics.s, first there’s some storage:

    .data
    .align 4
    HelloWorldString:
    .asciz "Hello World of Assembly!"
    .align 4
    H3110:
    .asciz "H3110"
    .align 4
    shuf_AABABCDghijklmnop:
    .int 0x00000100,0x01020317,0x18191a1b,0x1c1d1e1f
    #.bss
    .comm Destination, 100, 16
    .comm DestinationUsingRep, 100, 16
    .comm DestinationUsingStos, 100, 16
Notable changes from the original:
- Added .align 4 before each label to provide 16 byte alignment
- There’s a shuffle pattern added that I’ll write more about later
- .bss is commented out because spu-as doesn’t like it (I’m still not clear on why this is the case)
- spu-as doesn’t support alignment for .lcomm but does for .comm (why?), so .lcomm has been replaced with .comm and the trailing “, 16” added to each case to provide 16 byte alignment.
(learned: .align doesn’t have an effect on .comm/.lcomm, and .comm’s alignment argument is a byte count rather than a power-of-2 exponent à la .align; there’s nothing hard about assembly programming, really :\)
1. Simple copying using movsb, movsw, movsl
    #movl $HelloWorldString, %esi
    #movl $Destination, %edi

    ila $5,HelloWorldString
    ila $6,Destination

    #reverse order of instructions to avoid even sillier alignment hassle
    #movsl
    lqd $7,0($5)
    lqd $8,0($6)
    rotqby $10,$7,$5 # a word's preferred slot is bytes 0-3, so rotate by the address itself
    cwd $11,0($6)
    shufb $12,$10,$8,$11 # shuffle word into dest
    stqd $12,0($6)
    ai $5,$5,4
    ai $6,$6,4

    #movsw
    lqd $7,0($5)
    lqd $8,0($6)
    ai $9,$5,-2 # rotate needed to get val into pref slot
    rotqby $10,$7,$9
    chd $11,0($6)
    shufb $12,$10,$8,$11 # shuffle halfword into dest
    stqd $12,0($6)
    ai $5,$5,2
    ai $6,$6,2

    #movsb
    lqd $7,0($5)
    lqd $8,0($6)
    ai $9,$5,-3 # rotate needed to get val into pref slot
    rotqby $10,$7,$9
    cbd $11,0($6)
    shufb $12,$10,$8,$11 # shuffle byte into dest
    stqd $12,0($6)
    ai $5,$5,1
    ai $6,$6,1
Of the examples in this part, this is my biggest attempt at a “complete” implementation of the ia32 instructions, and even then it’s built on the assumption of natural alignment (of words and halfwords) not present in the ia32 code. Lots of effort to achieve some simple tasks.
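To make the store half of that sequence easier to follow, here is a rough Python model (not SPU code) of what cbd/chd/cwd and shufb do when a value is merged into a destination quadword. The function names are my own, and the shufb model is simplified: it only handles selector bytes 0x00 to 0x1f, which is all the generated masks contain.

```python
def insertion_mask(addr, size):
    """Model of cbd (size=1), chd (size=2) and cwd (size=4).

    The mask keeps every destination byte (selectors 0x10-0x1f) except
    for the naturally aligned slot that addr falls in, which is taken
    from the preferred slot of the first shufb operand.
    """
    mask = bytearray(range(0x10, 0x20))
    start = (addr & 0xF) & ~(size - 1)
    for i in range(size):
        mask[start + i] = 4 - size + i
    return bytes(mask)


def shufb(a, b, pattern):
    """Simplified shufb: selectors 0x00-0x0f pick from a, 0x10-0x1f from b."""
    src = a + b
    return bytes(src[sel & 0x1F] for sel in pattern)


# A halfword "He" sitting in the preferred halfword slot (bytes 2-3).
value = b"\x00\x00He" + bytes(12)
dest = b"." * 16

# Insert it at an address whose low four bits are 6 (0x26 is arbitrary).
merged = shufb(value, dest, insertion_mask(0x26, 2))
print(merged)
```

The model also shows where the natural-alignment assumption comes from: the mask generators clear the low address bits, so a halfword or word can never straddle its aligned slot.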
Making the extra assumption that the alignment of source and destination match, the three MOVS instructions can be combined and simplified to something like:
    lqa $13,HelloWorldString
    lqa $14,Destination
    fsmbi $15,0x01ff
    selb $16,$13,$14,$15 # copy only desired bytes into destination vector
    stqa $16,Destination # store the merged vector
With the further assumption that the destination was able to be trashed entirely, this could be reduced to just a load and a store.
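As a sanity check on the mask constant, the following Python sketch models fsmbi and selb as I understand them from the ISA document. It is a model for illustration only; the sample destination contents are made up.

```python
def fsmbi(imm16):
    """Expand each bit of a 16-bit immediate into a byte (MSB -> byte 0)."""
    return bytes(0xFF if imm16 & (0x8000 >> i) else 0x00 for i in range(16))


def selb(a, b, mask):
    """Bitwise select: take bits from b where the mask is 1, else from a."""
    return bytes((x & ~m & 0xFF) | (y & m) for x, y, m in zip(a, b, mask))


source = b"Hello World of A"      # first quadword of HelloWorldString
destination = b"-" * 16           # stand-in for existing destination bytes

mask = fsmbi(0x01FF)              # 0x00 for bytes 0-6, 0xff for bytes 7-15
merged = selb(source, destination, mask)
print(merged)
```

Seven bytes come from the source (4 + 2 + 1, matching the three MOVS instructions) and the remaining nine are preserved from the destination.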
2. Setting / clearing the DF flag
There is no DF flag. It could be simulated through the use of an offset stored in another register that is added to addresses using lqx & stqx instructions, which would achieve the same kind of functionality.
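A minimal sketch of that idea, modelled in Python at byte granularity: the step argument plays the role of the extra register, and the function name is hypothetical. Real SPU code would work on quadwords through lqx/stqx and need the merging shown earlier, which this model ignores.

```python
def copy_with_direction(memory, src, dst, count, step):
    """Copy count bytes, moving both addresses by step after each byte.

    step stands in for the direction flag: +1 behaves like CLD,
    -1 behaves like STD.
    """
    for _ in range(count):
        memory[dst] = memory[src]
        src += step
        dst += step
    return src, dst


forward = bytearray(b"abcd....")
fwd_ptrs = copy_with_direction(forward, 0, 4, 4, +1)

backward = bytearray(b"abcd....")
bwd_ptrs = copy_with_direction(backward, 3, 7, 2, -1)

print(forward, fwd_ptrs)
print(backward, bwd_ptrs)
```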
3. Using Rep
REP is a fascinating little instruction (modifier? Appears to be a single byte in disassembly), but there’s no direct equivalent for the SPU. It could be mimicked in a general way on SPU using branches, but branches are a topic of later parts in the series so I’ll avoid that for now.
Instead, because the length and alignments of source and destination are known, it can be (effectively) unrolled and branchless :)
    85 #movl $HelloWorldString, %esi
    86 #movl $DestinationUsingRep, %edi
    87 #movl $25, %ecx # set the string length in ECX
    88 #cld # clear the DF
    89 #rep movsb

    95 lqa $20,HelloWorldString
    96 lqa $21,HelloWorldString+16
    97 lqa $22,DestinationUsingRep+16
    98 fsmbi $23,0x007f
    99 selb $22,$21,$22,$23
    100 stqa $20,DestinationUsingRep
    101 stqa $22,DestinationUsingRep+16
A simple load and store for the first quadword, and a merge of the second with its destination. Again, making further assumptions about the destination memory would remove the need for lines 97–99.
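The same pattern generalises to any known length: whole quadwords are copied directly and the final partial quadword is merged with a mask. The Python model below sketches this under the assumption that both buffers are 16-byte aligned; copy_prefix is a hypothetical helper, and the loop stands in for what would be written out by hand on SPU. The fsmbi immediate for keeping the first n bytes of the tail works out to 0xFFFF >> n.

```python
QUAD = 16


def fsmbi(imm16):
    """Expand each bit of a 16-bit immediate into a byte (MSB -> byte 0)."""
    return bytes(0xFF if imm16 & (0x8000 >> i) else 0x00 for i in range(QUAD))


def selb(a, b, mask):
    """Bitwise select: take bits from b where the mask is 1, else from a."""
    return bytes((x & ~m & 0xFF) | (y & m) for x, y, m in zip(a, b, mask))


def copy_prefix(src, dst, length):
    """Copy length bytes from src over dst, one quadword at a time."""
    out = bytearray(dst)
    full, tail = divmod(length, QUAD)
    for q in range(full):                       # plain load + store
        out[q * QUAD:(q + 1) * QUAD] = src[q * QUAD:(q + 1) * QUAD]
    if tail:                                    # merge the partial quadword
        lo = full * QUAD
        mask = fsmbi(0xFFFF >> tail)
        out[lo:lo + QUAD] = selb(src[lo:lo + QUAD], dst[lo:lo + QUAD], mask)
    return bytes(out)


string = b"Hello World of Assembly!\x00"        # 25 bytes including the NUL
src = string.ljust(32, b"\x00")
dst = b"#" * 32
result = copy_prefix(src, dst, len(string))
print(result)
```

For the 25-byte string this gives one full quadword plus a 9-byte tail, and 0xFFFF >> 9 is the 0x007f used in the listing.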
4. Loading string from memory into EAX register
    #leal HelloWorldString, %esi
    #lodsb
    ila $24,HelloWorldString
    lqa $25,HelloWorldString
    ai $26,$24,-3
    rotqby $27,$25,$26

    #dec %esi
    #lodsw
    ai $28,$24,-2
    rotqby $29,$25,$28

    #subl $2, %esi # Make ESI point back to the original string
    #lodsl
    rotqby $30,$25,$24

Similar to 1.: loading data and rotating it into the preferred slot of a register. Assumptions about the source offset, and that the loaded data doesn’t span a quadword boundary, make this much simpler than it would otherwise be.
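The rotate amounts can be checked with a small Python model of rotqby, assuming (as in the ISA document) that it rotates left by the low four bits of the count. The helper load_to_preferred_slot is my own naming.

```python
def rotqby(quad, count):
    """Rotate a 16-byte quadword left by the low four bits of count."""
    n = count & 0xF
    return quad[n:] + quad[:n]


def load_to_preferred_slot(quad, addr, size):
    """Rotate so the element at addr lands in its preferred slot.

    Bytes use byte 3, halfwords bytes 2-3 and words bytes 0-3, hence
    the rotate counts of addr-3, addr-2 and addr respectively.
    """
    return rotqby(quad, addr - (4 - size))


quad = b"Hello World of A"         # first quadword of HelloWorldString

as_byte = load_to_preferred_slot(quad, 0, 1)
as_half = load_to_preferred_slot(quad, 0, 2)
as_word = load_to_preferred_slot(quad, 0, 4)
print(as_byte[3:4], as_half[2:4], as_word[0:4])
```

Note that the rotate leaves the rest of the register full of neighbouring string bytes, whereas the ia32 loads only replace the low part of EAX.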
5. Storing strings from EAX to memory
    .align 4
    shuf_AABABCDghijklmnop:
    .int 0x00000100,0x01020317,0x18191a1b,0x1c1d1e1f

    #leal DestinationUsingStos, %edi
    #stosb
    #stosw
    #stosl

    lqa $31,DestinationUsingStos
    lqa $32,shuf_AABABCDghijklmnop
    shufb $33,$30,$31,$32
    stqa $33,DestinationUsingStos
Rather than more repetitive merging and messing about, I chose to combine the three stores and get the same effect from a single shuffle — which includes merging with the contents of the destination quadword.
(The shuffle name reveals the intended function: capital letters refer to bytes from the first register, lower-case from the second. I picked up this scheme from Insomniac Games’s R&D pages.)
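To see what the pattern does, here is a Python model that unpacks the four .int values as big-endian bytes, spells the pattern out with the letter scheme, and applies it. The shufb model includes the three special selector ranges as I read them in the ISA document; describe is a hypothetical helper, and the destination contents are made up.

```python
import struct

PATTERN_WORDS = (0x00000100, 0x01020317, 0x18191A1B, 0x1C1D1E1F)
pattern = struct.pack(">4I", *PATTERN_WORDS)    # SPU is big-endian


def shufb(a, b, pattern):
    """Pick each result byte from a (0x00-0x0f) or b (0x10-0x1f)."""
    src = a + b
    out = bytearray()
    for sel in pattern:
        if sel & 0xC0 == 0x80:
            out.append(0x00)
        elif sel & 0xE0 == 0xC0:
            out.append(0xFF)
        elif sel & 0xE0 == 0xE0:
            out.append(0x80)
        else:
            out.append(src[sel & 0x1F])
    return bytes(out)


def describe(pattern):
    """Spell a pattern as letters: A-P first register, a-p second."""
    return "".join(
        chr((ord("A") if sel < 16 else ord("a")) + (sel & 0xF))
        for sel in pattern
    )


eax = b"Hell" + bytes(12)          # result of the earlier lodsl model
dest = b"." * 16
stored = shufb(eax, dest, pattern)
print(describe(pattern), stored)
```

The seven leading bytes are the stosb, stosw and stosl results back to back (H, He, Hell). The spelled-out pattern is AABABCDhijklmnop, so the label in the listing appears to carry one extra letter (the g); the pattern bytes themselves are what matter.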
6. Comparing strings
    #cld
    #leal HelloWorldString, %esi
    #leal H3110, %edi
    #cmpsb

    lqa $34,HelloWorldString
    lqa $35,H3110
    ceqb $36,$34,$35
There’s no byte-subtraction instruction for SPU, but ceqb will compare the bytes in two registers for equality. That should be enough to work out if two strings match, but doesn’t give the kind of ordering that you’ll get from strcmp(). Getting the result from ceqb and making some kind of use of it may require shifts, rotates or other shenanigans.
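One possible way to use the ceqb result, sketched as a Python model: gather one bit per byte into a 16-bit value (the job I would expect gbb to do) and count leading set bits to find the first mismatch (roughly what clz on the inverted value would give). This is only an assumption about how it could be done, not something taken from the video, and the function names are mine.

```python
def ceqb(a, b):
    """Per-byte equality: 0xff where the bytes match, 0x00 otherwise."""
    return bytes(0xFF if x == y else 0x00 for x, y in zip(a, b))


def gather_bits(mask):
    """Collect the low bit of each byte; byte 0 ends up most significant."""
    bits = 0
    for m in mask:
        bits = (bits << 1) | (m & 1)
    return bits


def first_mismatch(a, b):
    """Index of the first differing byte, or None if all 16 match."""
    differing = ~gather_bits(ceqb(a, b)) & 0xFFFF
    if differing == 0:
        return None
    return 16 - differing.bit_length()


hello = b"Hello World of A"                 # first quadword of HelloWorldString
h3110 = b"H3110".ljust(16, b"\x00")         # H3110 plus NUL and alignment padding

print(ceqb(hello, h3110).hex(), first_mismatch(hello, h3110))
```

Only the leading H matches here, so the first mismatch is at index 1. Ordering the two strings strcmp() style would still need a comparison of the bytes at that index.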
Jaymin Kessler has a post on scanning through pixel values that is also relevant for a number of string manipulation problems.
Previous assembly primer notes…
Part 1 — System Organization — PPC — SPU
Part 2 — Memory Organisation — SPU
Part 3 — GDB Usage Primer — PPC & SPU
Part 4 — Hello World — PPC — SPU
Part 5 — Data Types — PPC & SPU
Part 6 — Moving Data — PPC — SPU