{"id":793,"date":"2010-12-03T14:13:05","date_gmt":"2010-12-03T04:13:05","guid":{"rendered":"http:\/\/brnz.org\/hbr\/?p=793"},"modified":"2010-12-03T14:27:55","modified_gmt":"2010-12-03T04:27:55","slug":"assembly-primer-part-7-working-with-strings-spu","status":"publish","type":"post","link":"https:\/\/brnz.org\/hbr\/?p=793","title":{"rendered":"Assembly Primer Part 7 \u2014 Working with Strings \u2014 SPU"},"content":{"rendered":"<p>These are my notes for where I can see SPU varying from ia32, as presented in the video <a href=\"http:\/\/securitytube.net\/Assembly-Primer-for-Hackers-%28Part-7%29-Working-with-Strings-video.aspx\">Part 7 \u2014 Working with Strings<\/a>.<\/p>\n<p>The ia32 instructions covered in this video (MOVSx, LODSx, STOSx, CMPSx, CLD, STD) clearly highlight many of the differences between that arch and SPU:<\/p>\n<ul>\n<li>Implied operand registers e.g. MOVSx using %esi and %edi as source and destination addresses.<\/li>\n<li>Side effects on operands e.g. incrementing addresses while performing reads\/writes<\/li>\n<li>Side effects in the FLAGS register e.g. direction flag, zero flag<\/li>\n<li>Support for any alignment of data.<\/li>\n<\/ul>\n<p>I&#8217;ve made less effort to re-implement the full functionality of the ia32 instructions for this part.\u00a0 There&#8217;s a couple of cases that might be interesting to attempt to do so, but a fully general case can probably just be lifted from <a href=\"http:\/\/sourceware.org\/cgi-bin\/cvsweb.cgi\/src\/newlib\/libc\/machine\/spu\/memcpy.c?rev=1.4&amp;content-type=text\/x-cvsweb-markup&amp;cvsroot=src\">newlib&#8217;s memcpy() implementation for SPU<\/a>.<\/p>\n<p>Again,\u00a0 <a href=\"http:\/\/brnz.org\/cell\/doku.php?id=spuinstructions\">this is the quick SPU instruction reference I use<\/a>, and I still regularly refer to the <a href=\"https:\/\/www-01.ibm.com\/chips\/techlib\/techlib.nsf\/techdocs\/76CA6C7304210F3987257060006F2C44\">SPU ISA doc<\/a>.<\/p>\n<h2>Working with Strings<\/h2>\n<p>Starting with <a href=\"http:\/\/code.securitytube.net\/StringBasics.s\">StringBasics.s<\/a>, firth there&#8217;s some storage:<\/p>\n<pre style=\"padding-left: 30px;\"> 1 .data\r\n 2\u00a0\u00a0\u00a0\u00a0 .align 4\r\n 3\u00a0\u00a0\u00a0\u00a0 HelloWorldString:\r\n 4\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 .asciz \"Hello World of Assembly!\"\r\n 5\u00a0\u00a0\u00a0\u00a0 .align 4\r\n 6\u00a0\u00a0\u00a0\u00a0 H3110:\r\n 7\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 .asciz \"H3110\"\r\n 8\u00a0\u00a0\u00a0\u00a0 .align 4\r\n 9\u00a0\u00a0\u00a0\u00a0 shuf_AABABCDghijklmnop:\r\n 10\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 .int 0x00000100,0x01020317,0x18191a1b,0x1c1d1e1f\r\n 11 #.bss\r\n 12\u00a0\u00a0\u00a0\u00a0 .comm Destination, 100, 16\r\n 13\u00a0\u00a0\u00a0\u00a0 .comm DestinationUsingRep, 100, 16\r\n 14\u00a0\u00a0\u00a0\u00a0 .comm DestinationUsingStos, 100, 16\r\n<\/pre>\n<p>Notable changes from the original:<\/p>\n<ul>\n<li>Added .align 4 before each label to provide 16 byte alignment<\/li>\n<li>There&#8217;s a shuffle pattern added that I&#8217;ll write more about later<\/li>\n<li>.bss is commented out because spu-as doesn&#8217;t like it (I&#8217;m still not clear on why this is the case)<\/li>\n<li>spu-as doesn&#8217;t support alignment for .lcomm but does for .comm (why?), so .lcomm has been replaced with .comm and the trailing &#8220;, 16&#8221; added to each case to provide 16 byte alignment.<br \/>\n(learned: .align doesn&#8217;t have an effect on .comm\/.lcomm, and .comm&#8217;s alignment isn&#8217;t power of 2, \u00e0 la .align &#8212; there&#8217;s nothing hard about assembly programming, really :\\)<\/li>\n<\/ul>\n<h3>1. Simple copying using movsb, movsw, movsl<\/h3>\n<pre style=\"padding-left: 30px;\"> 26\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 #movl $HelloWorldString, %esi\r\n 27\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 #movl $Destination, %edi\r\n 28\r\n 29\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 ila $5,HelloWorldString\r\n 30\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 ila $6,Destination\r\n 31\r\n 32\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 #reverse order of instructions to avoid even sillier alignment hassle\r\n 33\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 #movsw\r\n 34\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 lqd $7,0($5)\r\n 35\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 lqd $8,0($6)\r\n 36\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 rotqby $10,$7,$9\r\n 37\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 cwd $11,0($6)\r\n 38\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 shufb $12,$10,$8,$11\r\n 39\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 stqd $12,0($6)\r\n 40\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 ai $5,$5,4\r\n 41\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 ai $6,$6,4\r\n 42\r\n 43\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 #movsl\r\n 44\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 lqd $7,0($5)\r\n 45\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 lqd $8,0($6)\r\n 46\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 ai $9,$5,-2\u00a0\u00a0\u00a0\u00a0 # rotate needed to get val into pref slot\r\n 47\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 rotqby $10,$7,$9\r\n 48\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 chd $11,0($6)\r\n 49\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 shufb $12,$10,$8,$11 # shuffle byte into dest\r\n 50\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 stqd $12,0($6)\r\n 51\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 ai $5,$5,2\r\n 52\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 ai $6,$6,2\r\n 53\r\n 54\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 #movsb\r\n 55\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 lqd $7,0($5)\r\n 56\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 lqd $8,0($6)\r\n 57\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 ai $9,$5,-3\u00a0\u00a0\u00a0\u00a0 # rotate needed to get val into pref slot\r\n 58\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 rotqby $10,$7,$9\r\n 59\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 cbd $11,0($6)\r\n 60\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 shufb $12,$10,$8,$11 # shuffle byte into dest\r\n 61\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 stqd $12,0($6)\r\n 62\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 ai $5,$5,1\r\n 63\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 ai $6,$6,1\r\n<\/pre>\n<p>Of the examples in this part, this is my biggest attempt at a &#8220;complete&#8221; implementation of the ia32 instructions, and even then it&#8217;s built on the assumption of natural alignment (of words and halfwords) not present in the ia32 code.\u00a0 Lots of effort to achieve some simple tasks.<\/p>\n<p>Making the extra assumption that the alignment of source and destination match, the three MOVS instructions can be combined and simplified to something like:<\/p>\n<pre style=\"padding-left: 30px;\"> 66\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 lqa $13,HelloWorldString\r\n 67\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 lqa $14,Destination\r\n 68\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 fsmbi $15,0x01ff\r\n 69\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 selb $16,$13,$14,$15 # copy only desired bytes into destination vector\r\n 70\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 stqa $14,Destination\r\n<\/pre>\n<p>With the further assumption that the destination was able to be trashed entirely, this could be reduced to just a load and a store.<\/p>\n<h3>2. Setting \/ clearing the DF flag<\/h3>\n<p>There is no DF flag.\u00a0 It could be simulated through the use of an offset stored in another register that is added to addresses using <em>lqx<\/em> &amp; <em>stqx<\/em> instructions, which would achieve the same kind of functionality.<\/p>\n<h3>3. Using Rep<\/h3>\n<p>REP is a fascinating little instruction (modifier? Appears to be a single byte in disassembly), but there&#8217;s no direct equivalent for the SPU.\u00a0 It could be mimicked in a general way on SPU using branches, but branches are a topic of later parts in the series so I&#8217;ll avoid that for now.<\/p>\n<p>Instead, because the length and alignments of source and destination are known, it can be (effectively) unrolled and branchless :)<\/p>\n<pre style=\"padding-left: 30px;\"> 85\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 #movl $HelloWorldString, %esi\r\n 86\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 #movl $DestinationUsingRep, %edi\r\n 87\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 #movl $25, %ecx # set the string length in ECX\r\n 88\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 #cld # clear the DF\r\n 89\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 #rep movsb\r\n 91\r\n 95\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 lqa $20,HelloWorldString\r\n 96\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 lqa $21,HelloWorldString+16\r\n 97\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 lqa $22,DestinationUsingRep+16\r\n 98\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 fsmbi $23,0x007f\r\n 99\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 selb $22,$21,$22,$23\r\n100\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 stqa $20,DestinationUsingRep\r\n101\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 stqa $22,DestinationUsingRep+16\r\n<\/pre>\n<p>A simple load and store for the first quadword, and a merge of the second with its destination.\u00a0 Again, making further assumptions about the destination memory would remove the need for lines 97&#8211;99.<\/p>\n<h3>4. Loading string from memory into EAX register<\/h3>\n<pre style=\"padding-left: 30px;\">106\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 #leal HelloWorldString, %esi\r\n107\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 #lodsb\r\n112\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 ila $24,HelloWorldString\r\n113\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 lqa $25,HelloWorldString\r\n114\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 ai $26,$24,-3\r\n115\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 rotqby $27,$25,$26\r\n116\r\n117\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 #dec %esi\r\n118\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 #lodsw\r\n120\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 ai $28,$24,-2\r\n121\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 rotqby $29,$25,$28\r\n122\r\n125\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 #subl $2, %esi # Make ESI point back to the original string\r\n126\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 #lodsl\r\n128\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 rotqby $30,$25,$24\r\n<\/pre>\n<p>Similar to 1. &#8212; loading data and rotating into the preferred slot of a register.\u00a0 Assumptions about source offset and that the loaded data doesn&#8217;t span a quadword boundary makes this much simpler than it would otherwise be.<\/p>\n<h3>5. Storing strings from EAX to memory<\/h3>\n<pre style=\"padding-left: 30px;\">\r\n  8\u00a0\u00a0\u00a0\u00a0 .align 4\r\n  9\u00a0\u00a0\u00a0\u00a0 shuf_AABABCDghijklmnop:\r\n 10\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 .int 0x00000100,0x01020317,0x18191a1b,0x1c1d1e1f\r\n\r\n133\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 #leal DestinationUsingStos, %edi\r\n134\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 #stosb\r\n135\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 #stosw\r\n136\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 #stosl\r\n137\r\n141\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 lqa $31,DestinationUsingStos\r\n142\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 lqa $32,shuf_AABABCDghijklmnop\r\n143\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 shufb $33,$30,$31,$32\r\n144\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 stqa $33,DestinationUsingStos\r\n<\/pre>\n<p>Rather than more repetitive merging and messing about, I chose to combine the three stores and get the same effect from a single shuffle &#8212; which includes merging with the contents of the destination quadword.<\/p>\n<p>(The shuffle name reveals the intended function: capital letters refer to bytes from the first register, lower-case from the second.\u00a0 I picked up this scheme from <a href=\"http:\/\/www.insomniacgames.com\/research_dev\/articles\/2008\/1500765\">Insomniac Games&#8217;s R&amp;D pages<\/a>.)<\/p>\n<h3>6. Comparing strings<\/h3>\n<pre style=\"padding-left: 30px;\">149\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 #cld\r\n150\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 #leal HelloWorldString, %esi\r\n151\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 #leal H3110, %edi\r\n152\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 #cmpsb\r\n153\r\n154\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 lqa $34,HelloWorldString\r\n155\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 lqa $35,H3110\r\n156\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 ceqb $36,$34,$35\r\n<\/pre>\n<p>There&#8217;s no byte-subtraction instruction for SPU, but <em>ceqb<\/em> will compare the bytes in two registers for equality.\u00a0 That should be enough to work out if two strings match, but doesn&#8217;t give the kind of ordering that you&#8217;ll get from strcmp().\u00a0\u00a0 Getting the result from <em>ceqb<\/em> and making some kind of use of it may require shifts, rotates or other shenanigans.<\/p>\n<p><a href=\"http:\/\/6cycles.maisonikkoku.com\/6Cycles\/6cycles\/Entries\/2010\/4\/17_n00b_tip__Psychic_computing.html\">Jaymin Kessler has a post on scanning through pixel values<\/a> that is also relevant for a number of string manipulation problems.<\/p>\n<h3>Previous assembly primer notes\u2026<\/h3>\n<p>Part 1 \u2014 System Organization \u2014 <a href=\"?p=631\">PPC<\/a> \u2014 <a href=\"?p=632\">SPU<\/a><br \/>\nPart 2 \u2014 Memory Organisation \u2014 <a href=\"?p=633\">SPU<\/a><br \/>\nPart 3 \u2014 GDB Usage Primer \u2014 <a href=\"?p=634\">PPC &amp; SPU<\/a><br \/>\nPart 4 \u2014 Hello World \u2014 <a href=\"https:\/\/brnz.org\/hbr\/?p=635\">PPC<\/a> \u2014 <a href=\"?p=634\">SPU<\/a><br \/>\nPart 5 &#8212; Data Types &#8212; <a href=\"https:\/\/brnz.org\/hbr\/?p=685\">PPC &amp; SPU<\/a><br \/>\nPart 6 &#8212; Moving Data &#8212; <a href=\"https:\/\/brnz.org\/hbr\/?p=724\">PPC<\/a> &#8212; <a href=\"https:\/\/brnz.org\/hbr\/?p=701\">SPU<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>These are my notes for where I can see SPU varying from ia32, as presented in the video Part 7 \u2014 Working with Strings. The ia32 instructions covered in this video (MOVSx, LODSx, STOSx, CMPSx, CLD, STD) clearly highlight many of the differences between that arch and SPU: Implied operand registers e.g. MOVSx using %esi &hellip; <a href=\"https:\/\/brnz.org\/hbr\/?p=793\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Assembly Primer Part 7 \u2014 Working with Strings \u2014 SPU&#8221;<\/span><\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[5,26],"tags":[38,40],"_links":{"self":[{"href":"https:\/\/brnz.org\/hbr\/index.php?rest_route=\/wp\/v2\/posts\/793"}],"collection":[{"href":"https:\/\/brnz.org\/hbr\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/brnz.org\/hbr\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/brnz.org\/hbr\/index.php?rest_route=\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/brnz.org\/hbr\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=793"}],"version-history":[{"count":18,"href":"https:\/\/brnz.org\/hbr\/index.php?rest_route=\/wp\/v2\/posts\/793\/revisions"}],"predecessor-version":[{"id":810,"href":"https:\/\/brnz.org\/hbr\/index.php?rest_route=\/wp\/v2\/posts\/793\/revisions\/810"}],"wp:attachment":[{"href":"https:\/\/brnz.org\/hbr\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=793"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/brnz.org\/hbr\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=793"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/brnz.org\/hbr\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=793"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}