{"id":335,"date":"2009-09-21T10:55:45","date_gmt":"2009-09-21T00:55:45","guid":{"rendered":"http:\/\/brnz.org\/hbr\/?p=335"},"modified":"2010-11-05T11:19:53","modified_gmt":"2010-11-05T01:19:53","slug":"and-now-i-know-adventures-with-double-precision","status":"publish","type":"post","link":"https:\/\/brnz.org\/hbr\/?p=335","title":{"rendered":"And now I know &#8211; adventures in double precision"},"content":{"rendered":"<p>Refining the buddhabrot renderer, I&#8217;ve added vectorisation to iterate two points at once, which gives (at least) twice the performance. Huzzah.<\/p>\n<p>To begin with, I lifted code from <a href=\"http:\/\/ozlabs.org\/~jk\/diary\/tech\/cell\/hackfest08-solution-4.diary\/\">one of the later revisions<\/a> of Jeremy&#8217;s Mandelbrot renderer. This was written for single precision float, whereas I&#8217;ve been working in double precision for this buddhabrot code.\u00a0 A few things worth noting about the change from single to double precision:<\/p>\n<ul>\n<li>Double precision numbers behave differently to single precision on the SPU (see section 9 of the <a href=\"https:\/\/www-01.ibm.com\/chips\/techlib\/techlib.nsf\/techdocs\/76CA6C7304210F3987257060006F2C44\">SPU ISA doc<\/a>) &#8211; I was bitten by infs and NaNs.<\/li>\n<li>When browsing that document, I missed the large &#8220;Optional v1.2&#8221; marking on instructions like <em>dfcgt<\/em>. 
To be clear, the Cell BE SPU does not support this instruction.<\/li>\n<li>GCC does include <em>vec_ullong2 spu_cmpgt(vec_double2, vec_double2)<\/em>, but in the absence of <em>dfcgt<\/em> it takes  forty extra instructions to achieve the same result (yeah, that&#8217;s what I get for using general intrinsics)<\/li>\n<\/ul>\n<p>When starting to use double precision, I was expecting much lower performance than single precision on the SPU, but I had not fully understood how much lower &#8211; from the <a href=\"https:\/\/www-01.ibm.com\/chips\/techlib\/techlib.nsf\/techdocs\/1741C509C5F64B3300257460006FD68D\"> Programming Handbook<\/a>, page 71:<\/p>\n<blockquote><p>Although double-precision instructions have 13-clock-cycle latencies, on the Cell\/B.E. processor, only the final seven cycles are pipelined. <strong>No other instructions are dual-issued with double-precision instructions, and no instructions of any kind are issued for six cycles after a double-precision instruction is issued.<\/strong><\/p><\/blockquote>\n<p>Ouch.\u00a0 I knew this, but I didn&#8217;t <em>know<\/em> it &#8211; a run of <em>spu_timing<\/em> on the generated assembly really rammed it home.<\/p>\n<pre style=\"padding-left: 30px;\">0\u00a0 0123456789012\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 dfs\u00a0 $75,$45,$44\r\n0\u00a0\u00a0 ------7890123456789\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 dfma $46,$59,$47\r\n0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 ------4567890123456\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 dfa\u00a0 
$43,$45,$44\r\n0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 ------1234567890123\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 dfa\u00a0 $42,$80,$75\r\n0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 ------8901234567890\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 dfm\u00a0 $32,$46,$46\r\n0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 ------5678901234567\u00a0\u00a0 frds $40,$43\r\n0\u00a0 01234\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 ------23456789 dfm\u00a0 $33,$42,$42\r\n0\u00a0 012345678901\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 ------9 dfm\u00a0 $36,$42,$81<\/pre>\n<p>(Oh, and I&#8217;ve noticed again that <em>dfma<\/em> and friends use RT as an operand, which presumably makes register scheduling even more fun. The above fragment is from a heavily unrolled inner loop.)<\/p>\n<p>At some point, I&#8217;ll try to measure the practical difference between double and single precision for this program, to see what (if anything) would be lost by switching over to single precision. Or perhaps there&#8217;s some other way around the problem &#8211; I&#8217;ve been considering fixed point or even multi-single precision fp alternatives.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Refining the buddhabrot renderer, I&#8217;ve added vectorisation to iterate two points at once, which gives (at least) twice the performance. Huzzah. 
To begin with, I lifted code from one of the later revisions of Jeremy&#8217;s Mandelbrot renderer. This was written for single precision float, whereas I&#8217;ve been working in double precision for this buddhabrot code.\u00a0 &hellip; <a href=\"https:\/\/brnz.org\/hbr\/?p=335\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;And now I know &#8211; adventures in double precision&#8221;<\/span><\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[5,4],"tags":[35],"_links":{"self":[{"href":"https:\/\/brnz.org\/hbr\/index.php?rest_route=\/wp\/v2\/posts\/335"}],"collection":[{"href":"https:\/\/brnz.org\/hbr\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/brnz.org\/hbr\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/brnz.org\/hbr\/index.php?rest_route=\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/brnz.org\/hbr\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=335"}],"version-history":[{"count":11,"href":"https:\/\/brnz.org\/hbr\/index.php?rest_route=\/wp\/v2\/posts\/335\/revisions"}],"predecessor-version":[{"id":344,"href":"https:\/\/brnz.org\/hbr\/index.php?rest_route=\/wp\/v2\/posts\/335\/revisions\/344"}],"wp:attachment":[{"href":"https:\/\/brnz.org\/hbr\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=335"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/brnz.org\/hbr\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=335"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/brnz.org\/hbr\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=335"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}