{"id":701,"date":"2010-12-01T15:26:32","date_gmt":"2010-12-01T05:26:32","guid":{"rendered":"http:\/\/brnz.org\/hbr\/?p=701"},"modified":"2015-01-25T09:39:42","modified_gmt":"2015-01-24T23:39:42","slug":"assembly-primer-part-6-moving-data-spu","status":"publish","type":"post","link":"https:\/\/brnz.org\/hbr\/?p=701","title":{"rendered":"Assembly Primer Part 6 &#8212; Moving Data &#8212; SPU"},"content":{"rendered":"<p>These are my notes for where I can see SPU varying from ia32, as presented in the video <a href=\"http:\/\/securitytube.net\/Assembly-Primer-for-Hackers-(Part-6)-Moving-Data-video.aspx\">Part 6 \u2014 Moving Data<\/a>.<\/p>\n<p>SPU and ia32 differ significantly when it comes to moving\/copying data around, in terms of the ways things can be copied, the alignment of data in memory and the vector nature of SPU registers.<\/p>\n<p><a href=\"http:\/\/brnz.org\/cell\/doku.php?id=spuinstructions\">This is the quick SPU instruction reference I use<\/a>.\u00a0 The <a href=\"https:\/\/www-01.ibm.com\/chips\/techlib\/techlib.nsf\/techdocs\/76CA6C7304210F3987257060006F2C44\">SPU ISA doc<\/a> is worth having nearby if trying to do silly tricks with SPU instructions.<\/p>\n<h2>Moving Data<\/h2>\n<p>Let&#8217;s consider what <a href=\"http:\/\/code.securitytube.net\/MovDemo.s\">MovDemo.s<\/a> might look like for SPU, piece by piece.<\/p>\n<p>First, the storage:<\/p>\n<pre escaped=\"true\"># Demo program to show how to use Data types and MOVx instructions \r\n\r\n.data\r\n    HelloWorld:\r\n        .ascii \"Hello World!\"\r\n\r\n    ByteLocation:\r\n        .byte 10\r\n\r\n    Int32:\r\n        .int 2\r\n    Int16:\r\n        .short 3\r\n    Float:\r\n        .float 10.23\r\n\r\n    IntegerArray:\r\n        .int 10,20,30,40,50\r\n<\/pre>\n<p>Icky.\u00a0 Not naturally aligned for size, and crossing qword boundaries.\u00a0 This makes the following code particularly messy, because the SPU <em>really<\/em> doesn&#8217;t like unaligned data.\u00a0 Oh well, let&#8217;s 
begin.<\/p>\n<h3>1. Immediate value to register<\/h3>\n<pre escaped=\"true\">    .align 3  # ensure code is aligned after awkwardly arranged data\r\n.text\r\n    .globl _start\r\n    _start:\r\n        #movl $10, %eax\r\n        il $5, 10\r\n<\/pre>\n<p>Well, that was easy enough.<\/p>\n<p>It does get a little trickier when loading more complex values.\u00a0 Loading a full 32 bits of immediate data into a register requires two halfword instructions, one for the upper part and one for the lower (for example <em>ilhu<\/em> followed by <em>iohl<\/em>).<\/p>\n<p>Loading arbitrary values spanning multiple words is more complex.\u00a0 It is often simplest to store the constant in .rodata and load it when needed, although <em>fsmbi<\/em> can be useful on occasion.<\/p>\n<h3>2. Immediate value to local store<\/h3>\n<pre escaped=\"true\">#movw $50, Int16\r\n\r\n# Write first byte\r\nila $6,Int16        # load the address of the target\r\nlqa $7,Int16        # load the qword\r\nil $8,0             # load the upper byte of the constant into preferred slot\r\ncbd $9,0($6)        # insertion mask for higher byte\r\nshufb $10,$8,$7,$9  # shuffle the upper byte into the qword\r\nstqa $10,Int16      # write the qword back\r\n# Write second byte\r\nil $11,1            # load 1 for address offsetting\r\ncbd $12,1($6)       # insertion mask for lower byte\r\nil $13,50           # load the lower byte of the constant into preferred slot\r\nlqx $14,$6,$11      # load Int16+1\r\nshufb $15,$13,$14,$12 # shuffle lower byte into qword\r\nstqx $15,$6,$11     # write back qword\r\n<\/pre>\n<p>Writing two bytes to a non-aligned memory location is a messy business.<\/p>\n<ul>\n<li>All loads and stores transfer 16 bytes of data between registers and 16-byte aligned memory locations.<\/li>\n<li>The <em>chd<\/em> instruction, which generates masks for shuffling halfwords into a quadword, assumes that the halfword is aligned in the first place.\u00a0 As it is not, and could be crossing a quadword boundary, I write it here as two 
bytes.<\/li>\n<\/ul>\n<p>The <em>lqa<\/em> instruction (a-form) is used for the first load, to fetch from an absolute local store address.\u00a0 <em>lqx<\/em> (x-form) is used for the second load, because the address needs to be offset by one byte.\u00a0 At first glance, <em>lqa<\/em> and <em>lqd<\/em> appeared to be better choices, but neither encodes the low-order bits of its immediate operand, so a one-byte offset cannot be expressed.\u00a0 The x-form works because the low bits are zeroed only after the preferred word slots of the two registers have been added.<\/p>\n<p>I suspect that this can be done better.<\/p>\n<p>(It&#8217;s worth taking a look at <a href=\"http:\/\/6cycles.maisonikkoku.com\/6Cycles\/6cycles\/Entries\/2010\/11\/27_n00b_tip__Combining_writes.html\">Jaymin&#8217;s post on write-combining<\/a>, which looks more deeply at this kind of problem.)<\/p>\n<h3>3. Register to register<\/h3>\n<pre escaped=\"true\">#movl %eax, %ebx\r\nori $16, $5, 0\r\n<\/pre>\n<p>The Or Immediate instruction performs a bitwise Or against zero to make the copy.\u00a0 The same can be achieved on the odd pipeline with a quadword shift or rotate immediate instruction.<\/p>\n<p>Copying smaller portions of a register will require extra instruction(s) for masking or rotation, depending on what exactly needs to be copied and to where.<\/p>\n<h3>4. 
Local store to register<\/h3>\n<pre escaped=\"true\">#movl Int32, %eax\r\n\r\nila $16,Int32       # load the address\r\nlqa $17,Int32       # load the vector containing the first part\r\nlqa $18,Int32+3     # load the vector containing the second part\r\nfsmbi $19,0xf000    # create a mask for merging them\r\nrotqby $20,$17,$16  # rotate the first part into position\r\nrotqby $21,$18,$16  # rotate the second part into position\r\nrotqby $22,$19,$16  # rotate the select mask into position\r\nselb $23,$20,$21,$22 # select things into the right place.\r\n<\/pre>\n<p>This one is a different class of fun &#8211; loading four bytes that span two quadwords.\u00a0 <em>fsmbi<\/em> generates a mask that is used to combine bytes from the two quadwords.<\/p>\n<p>Again, I suspect there&#8217;s a better way to do it.<\/p>\n<h3>5. Register to local store<\/h3>\n<pre escaped=\"true\">#movb $3, %al\r\n#movb %al, ByteLocation\r\n\r\nila $24,ByteLocation       # load the address of the target\r\nlqa $25, ByteLocation      # load the qword containing the byte\r\nil $26,3                   # load the value into the preferred slot\r\ncbd $27, 0($24)            # insertion mask for the byte\r\nshufb $28, $26, $25, $27   # shuffle the byte into the qword\r\nstqa $28, ByteLocation     # write the qword back\r\n<\/pre>\n<p>Essentially the same problem as 2. on the SPU, but a little simpler because it&#8217;s only a single byte.<\/p>\n<h3>6. Register to indexed memory location<\/h3>\n<pre escaped=\"true\">#movl $0, %ecx\r\n#movl $2, %edi\r\n#movl $22, IntegerArray(%ecx,%edi , 4)\r\nil $29,2                # load the index\r\nila $30,IntegerArray    # load the address of the array\r\nshli $31,$29,2          # shift left by two to get the byte offset\r\nlqx $32,$30,$31         # load from the sum of the base address and byte offset\r\n# and then write the data to memory...\r\n<\/pre>\n<p>This is an attempt at a comparable indexed access to local store.\u00a0 All I&#8217;ve done here is the address calculation and load &#8212; writing the value is a mess because it&#8217;s not aligned and spans two quadwords, so something like the approach in 2. would be required.<\/p>\n<h3>7. 
Indirect addressing<\/h3>\n<pre escaped=\"true\">movl $Int32, %eax\r\nmovl (%eax), %ebx\r\n\r\nmovl $9, (%eax)\r\n<\/pre>\n<p>I&#8217;ve done these before (essentially) in 2. and 4.<\/p>\n<h2>Concluding thoughts<\/h2>\n<p>Align your data for the SPU.\u00a0 This would have all been much, much simpler (and not much of a challenge) if the data was aligned and variables were in preferred slots.\u00a0 I suspect I&#8217;ll simplify the later parts of the series for SPU by aligning the data first.<\/p>\n<p>I found some useful fragments amongst the Introduction to SPU Optimizations presentations from Naughty Dog &#8212; they&#8217;re a very good read: <a href=\"http:\/\/www.naughtydog.com\/docs\/gdc2010\/intro-spu-optimizations-part-1.pdf\">Part 1<\/a> &amp; <a href=\"http:\/\/www.naughtydog.com\/docs\/gdc2010\/intro-spu-optimizations-part-2.pdf\">Part 2<\/a>.<\/p>\n<h3>Previous assembly primer notes\u2026<\/h3>\n<p>Part 1 \u2014 System Organization \u2014 <a href=\"?p=631\">PPC<\/a> \u2014 <a href=\"?p=632\">SPU<\/a><br \/>\nPart 2 \u2014 Memory Organisation \u2014 <a href=\"?p=633\">SPU<\/a><br \/>\nPart 3 \u2014 GDB Usage Primer \u2014 <a href=\"?p=634\">PPC &amp; SPU<\/a><br \/>\nPart 4 \u2014 Hello World \u2014 <a href=\"https:\/\/brnz.org\/hbr\/?p=635\">PPC<\/a> \u2014 <a href=\"?p=634\">SPU<\/a><br \/>\nPart 5 &#8212; Data Types &#8212; <a href=\"https:\/\/brnz.org\/hbr\/?p=685\">PPC &amp; SPU<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>These are my notes for where I can see SPU varying from ia32, as presented in the video Part 6 \u2014 Moving Data. 
SPU and ia32 differ significantly when it comes to moving\/copying data around, in terms of the ways things can be copied, the alignment of data in memory and the vector nature of &hellip; <a href=\"https:\/\/brnz.org\/hbr\/?p=701\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Assembly Primer Part 6 &#8212; Moving Data &#8212; SPU&#8221;<\/span><\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[5,26],"tags":[38,40],"_links":{"self":[{"href":"https:\/\/brnz.org\/hbr\/index.php?rest_route=\/wp\/v2\/posts\/701"}],"collection":[{"href":"https:\/\/brnz.org\/hbr\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/brnz.org\/hbr\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/brnz.org\/hbr\/index.php?rest_route=\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/brnz.org\/hbr\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=701"}],"version-history":[{"count":26,"href":"https:\/\/brnz.org\/hbr\/index.php?rest_route=\/wp\/v2\/posts\/701\/revisions"}],"predecessor-version":[{"id":1427,"href":"https:\/\/brnz.org\/hbr\/index.php?rest_route=\/wp\/v2\/posts\/701\/revisions\/1427"}],"wp:attachment":[{"href":"https:\/\/brnz.org\/hbr\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=701"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/brnz.org\/hbr\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=701"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/brnz.org\/hbr\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=701"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}