{"id":541,"date":"2010-10-29T23:43:48","date_gmt":"2010-10-29T13:43:48","guid":{"rendered":"http:\/\/brnz.org\/hbr\/?p=541"},"modified":"2010-11-05T22:55:35","modified_gmt":"2010-11-05T12:55:35","slug":"averaging-more-unsigned-chars","status":"publish","type":"post","link":"https:\/\/brnz.org\/hbr\/?p=541","title":{"rendered":"Averaging more unsigned chars"},"content":{"rendered":"<p>Some further thoughts, continuing on from my previous <a href=\"https:\/\/brnz.org\/hbr\/?p=508\">average post<\/a>&#8230;<\/p>\n<h3>Alternate methods<\/h3>\n<p>sumb (sum bytes in halfwords) was an instruction I had overlooked,  and was pointed out to me by ralferoo.\u00a0 sumb calculates the sums of the  four bytes of each word in two quadwords at a time. An add, rotate  and shuffle would be all that would be needed to turn the result from  two sumb calls into the desired averages.<\/p>\n<p>Unfortunately, the input  format I&#8217;m using isn&#8217;t well suited to sumb and it would appear to  require a prohibitive number of shuffles to prepare the data  appropriately &#8211; that said, there&#8217;s at least one place where I shuffle  data before calling average4() that may be able to utilise sumb, so I intend to keep it in mind.<\/p>\n<p>The avgb instruction calculates the average of the bytes in two quadwords.\u00a0 It would be nice to be able to call avgb(avgb(a,b),avgb(c,d)) and to have the final result in 9 cycles, but there&#8217;s a fix-up necessary to correct the rounding that takes place in the calculation of the first two averages, and I&#8217;ve not yet been able to wrap my head around the correct method to do so.<\/p>\n<h3>Approximating<\/h3>\n<p>There are plenty of ways to very quickly get a result that is often &#8212; but not always &#8212; correct (like avgb).\u00a0 One of these methods may be suitable for my particular needs, but I won&#8217;t know until later.\u00a0 My goal for now is to attain a result that is as correct as possible and consider ways of speeding it 
up later, if needed.<\/p>\n<h3>Adding<\/h3>\n<p>I&#8217;m annoyed with myself that I missed this one, as I&#8217;ve seen it several times recently: rounding can be performed correctly with an addition and truncation. Where I had<\/p>\n<pre>    \/\/ add up the lower bits\r\n    qword L = si_a(si_a(si_andbi(a,3),si_andbi(b,3)),\r\n                   si_a(si_andbi(c,3),si_andbi(d,3)));\r\n\r\n    \/\/ shift right 2 bits, again masking out shifted-in high bits\r\n    R = si_a(R, si_andbi(si_rotqmbii(L,-2), 3));\r\n\r\n    \/\/ shift right and mask for the rounding bit\r\n    R = si_a(R, si_andbi(si_rotqmbii(L,-1), 1));<\/pre>\n<p>adding 2 to each uchar of L before truncating with rotqmbii rounds halves up in the same shift (the sum of the low bits is at most 12, so adding 2 cannot carry into the neighbouring byte), which means that the last line can be eliminated altogether, so the whole function now looks like:<\/p>\n<pre>qword average4(qword a, qword b, qword c, qword d) {\r\n    \/\/ shift each right by 2 bits, masking shifted-in bits from the result\r\n    qword au = si_andbi(si_rotqmbii(a, -2), 0x3f);\r\n    qword bu = si_andbi(si_rotqmbii(b, -2), 0x3f);\r\n    qword cu = si_andbi(si_rotqmbii(c, -2), 0x3f);\r\n    qword du = si_andbi(si_rotqmbii(d, -2), 0x3f);\r\n\r\n    \/\/ add them all up\r\n    qword R = si_a(si_a(au,bu), si_a(cu,du));\r\n\r\n    \/\/ add up the lower bits\r\n    qword L = si_a(si_a(si_andbi(a,3),si_andbi(b,3)),\r\n                   si_a(si_andbi(c,3),si_andbi(d,3)));\r\n\r\n    \/\/ add 2 to every byte (0x0202 in each halfword) so the shift below rounds\r\n    L = si_a(L, si_ilh(0x202));\r\n\r\n    \/\/ shift right 2 bits, again masking out shifted-in high bits\r\n    R = si_a(R, si_andbi(si_rotqmbii(L,-2), 3));\r\n\r\n    return R;\r\n}<\/pre>\n<p>The difference is pretty minor: it saves a couple of instructions, and when the function is not inlined it&#8217;s no faster.\u00a0 For the program it&#8217;s used in I&#8217;m seeing around a 1.5% runtime reduction.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Some further thoughts, continuing on from my previous average post&#8230; Alternate methods sumb (sum bytes in halfwords) was an instruction I had 
overlooked, and was pointed out to me by ralferoo.\u00a0 sumb calculates the sums of the four bytes of each word in two quadwords at a time. An add, rotate and shuffle would be &hellip; <a href=\"https:\/\/brnz.org\/hbr\/?p=541\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Averaging more unsigned chars&#8221;<\/span><\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[5,26,4],"tags":[30,36],"_links":{"self":[{"href":"https:\/\/brnz.org\/hbr\/index.php?rest_route=\/wp\/v2\/posts\/541"}],"collection":[{"href":"https:\/\/brnz.org\/hbr\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/brnz.org\/hbr\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/brnz.org\/hbr\/index.php?rest_route=\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/brnz.org\/hbr\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=541"}],"version-history":[{"count":6,"href":"https:\/\/brnz.org\/hbr\/index.php?rest_route=\/wp\/v2\/posts\/541\/revisions"}],"predecessor-version":[{"id":565,"href":"https:\/\/brnz.org\/hbr\/index.php?rest_route=\/wp\/v2\/posts\/541\/revisions\/565"}],"wp:attachment":[{"href":"https:\/\/brnz.org\/hbr\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=541"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/brnz.org\/hbr\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=541"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/brnz.org\/hbr\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=541"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}