{"id":402,"date":"2017-05-15T16:02:36","date_gmt":"2017-05-15T16:02:36","guid":{"rendered":"http:\/\/imalogic.com\/blog\/?p=402"},"modified":"2017-05-15T16:26:52","modified_gmt":"2017-05-15T16:26:52","slug":"distance-squared-optimization-using-sse-instruction-set","status":"publish","type":"post","link":"https:\/\/imalogic.com\/blog\/2017\/05\/15\/distance-squared-optimization-using-sse-instruction-set\/","title":{"rendered":"Distance Squared optimization using SSE instruction set"},"content":{"rendered":"<body><p><\/p>\n<h2>K-MEANS \/ Distance Squared optimisation using SSE instruction<\/h2>\n<p>A long time ago, a friend ask me to try some KMEANS algorithm optimization.<a href=\"https:\/\/i0.wp.com\/imalogic.com\/blog\/wp-content\/uploads\/2017\/05\/362679_3042860063.jpg?ssl=1\"><img data-recalc-dims=\"1\" decoding=\"async\" data-attachment-id=\"407\" data-permalink=\"https:\/\/imalogic.com\/blog\/2017\/05\/15\/distance-squared-optimization-using-sse-instruction-set\/362679_3042860063\/\" data-orig-file=\"https:\/\/i0.wp.com\/imalogic.com\/blog\/wp-content\/uploads\/2017\/05\/362679_3042860063.jpg?fit=748%2C499&amp;ssl=1\" data-orig-size=\"748,499\" data-comments-opened=\"0\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"362679_3042860063\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/i0.wp.com\/imalogic.com\/blog\/wp-content\/uploads\/2017\/05\/362679_3042860063.jpg?fit=748%2C499&amp;ssl=1\" class=\"size-medium wp-image-407 alignright\" src=\"https:\/\/i0.wp.com\/imalogic.com\/blog\/wp-content\/uploads\/2017\/05\/362679_3042860063.jpg?resize=300%2C200&#038;ssl=1\" alt=\"\" width=\"300\" height=\"200\" loading=\"lazy\" srcset=\"https:\/\/i0.wp.com\/imalogic.com\/blog\/wp-content\/uploads\/2017\/05\/362679_3042860063.jpg?resize=300%2C200&amp;ssl=1 300w, https:\/\/i0.wp.com\/imalogic.com\/blog\/wp-content\/uploads\/2017\/05\/362679_3042860063.jpg?resize=120%2C80&amp;ssl=1 120w, https:\/\/i0.wp.com\/imalogic.com\/blog\/wp-content\/uploads\/2017\/05\/362679_3042860063.jpg?resize=480%2C320&amp;ssl=1 480w, https:\/\/i0.wp.com\/imalogic.com\/blog\/wp-content\/uploads\/2017\/05\/362679_3042860063.jpg?w=748&amp;ssl=1 748w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/><\/a><\/p>\n<p>The perfect piece of code candidate for optimization was selected: a little function used to calculate the distance squared between two points.<\/p>\n<p>Like a good chocolate mousse, SSE2 was used to optimize and perform distance squared operation using \u201cbig register\u201d<\/p>\n<p><b>SSE2<\/b> (<b>Streaming SIMD Extensions 2<\/b>), is one of the Intel <a title=\"SIMD\" href=\"https:\/\/en.wikipedia.org\/wiki\/SIMD\">SIMD<\/a> (Single Instruction, Multiple Data) <a class=\"mw-redirect\" title=\"Processor supplementary instruction\" href=\"https:\/\/en.wikipedia.org\/wiki\/Processor_supplementary_instruction\">processor supplementary instruction<\/a> sets first introduced by <a title=\"Intel\" href=\"https:\/\/en.wikipedia.org\/wiki\/Intel\">Intel<\/a> with the initial version of the <a title=\"Pentium 4\" href=\"https:\/\/en.wikipedia.org\/wiki\/Pentium_4\">Pentium 4<\/a> in 2001. It extends the earlier <a title=\"Streaming SIMD Extensions\" href=\"https:\/\/en.wikipedia.org\/wiki\/Streaming_SIMD_Extensions\">SSE<\/a> instruction set, and is intended to fully replace <a title=\"MMX (instruction set)\" href=\"https:\/\/en.wikipedia.org\/wiki\/MMX_(instruction_set)\">MMX<\/a>. Intel extended SSE2 to create <a title=\"SSE3\" href=\"https:\/\/en.wikipedia.org\/wiki\/SSE3\">SSE3<\/a> in 2004. SSE2 added 144 new instructions to SSE, which has 70 instructions. Competing chip-maker <a class=\"mw-redirect\" title=\"AMD\" href=\"https:\/\/en.wikipedia.org\/wiki\/AMD\">AMD<\/a> added support for SSE2 with the introduction of their <a title=\"Opteron\" href=\"https:\/\/en.wikipedia.org\/wiki\/Opteron\">Opteron<\/a> and <a title=\"Athlon 64\" href=\"https:\/\/en.wikipedia.org\/wiki\/Athlon_64\">Athlon 64<\/a> ranges of <a title=\"X86-64\" href=\"https:\/\/en.wikipedia.org\/wiki\/X86-64\">AMD64<\/a> 64-bit CPUs in 2003.<a href=\"https:\/\/i0.wp.com\/imalogic.com\/blog\/wp-content\/uploads\/2017\/05\/616705-3-65HWPEI2.png?ssl=1\"><img data-recalc-dims=\"1\" decoding=\"async\" data-attachment-id=\"404\" data-permalink=\"https:\/\/imalogic.com\/blog\/2017\/05\/15\/distance-squared-optimization-using-sse-instruction-set\/616705-3-65hwpei2\/\" data-orig-file=\"https:\/\/i0.wp.com\/imalogic.com\/blog\/wp-content\/uploads\/2017\/05\/616705-3-65HWPEI2.png?fit=354%2C262&amp;ssl=1\" data-orig-size=\"354,262\" data-comments-opened=\"0\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"616705-3-65HWPEI2\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/i0.wp.com\/imalogic.com\/blog\/wp-content\/uploads\/2017\/05\/616705-3-65HWPEI2.png?fit=354%2C262&amp;ssl=1\" class=\"size-medium wp-image-404 alignleft\" src=\"https:\/\/i0.wp.com\/imalogic.com\/blog\/wp-content\/uploads\/2017\/05\/616705-3-65HWPEI2.png?resize=300%2C222&#038;ssl=1\" alt=\"\" width=\"300\" height=\"222\" loading=\"lazy\" srcset=\"https:\/\/i0.wp.com\/imalogic.com\/blog\/wp-content\/uploads\/2017\/05\/616705-3-65HWPEI2.png?resize=300%2C222&amp;ssl=1 300w, https:\/\/i0.wp.com\/imalogic.com\/blog\/wp-content\/uploads\/2017\/05\/616705-3-65HWPEI2.png?w=354&amp;ssl=1 354w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/><\/a><\/p>\n<p>Most of the SSE2 instructions implement the integer vector operations also found in MMX. They use the XMM registers instead of the MMX registers, which are wider and allow for significant performance improvements in specialized applications.<\/p>\n<p>\u00a0<\/p>\n<p>After several days of try\/error, finally i reach to gain some time ( around \u00a022% )\u00a0\u2013 So, yes assembler optimisation can be usefull but should be used for well selected part of code (innerloop). But assembler optimization still need expensive time to do.<\/p>\n<p>\u00a0<\/p>\n<p>A next try should be to use GPU capabilities to reduce time calculation. But need also to adapt the code to pass massive input\/output data using specific texture buffer.<\/p>\n<p><strong>Original C++ vs optimized SSE version<\/strong><\/p>\n<pre class=\"code highlight\"><code>\/**\r\n * Returns the distance squared between two points.\r\n **\/\r\nScalar distSq(const Point &amp;lhs, const Point &amp;rhs) {\r\n ASSERT(lhs.getNumDimensions() == rhs.getNumDimensions());\r\n Scalar result = 0;\r\n for (int i = 0; i &lt; lhs.getNumDimensions(); i++)\r\n result += (lhs[i] - rhs[i]) * (lhs[i] - rhs[i]);\r\n return result;\r\n}<span id=\"LC56\" class=\"line\">\t<\/span><\/code><\/pre>\n<pre>double distSq(double* a,double* b)\r\n{\r\n unsigned int i=0;\r\n unsigned int size = data-&gt;nCol; \r\n double fres = 0.0;\r\n\r\n__declspec(align(16)) double ftmp[2] = { 0.0, 0.0 };\r\n __m128d mres, mres1,mres2;\r\n __m128d m1,m2; \r\n\r\n\r\n mres = _mm_xor_pd(mres,mres); \r\n\r\n\r\n for (unsigned int i = 0; i &lt; size \/ 16 ; i+=1) \u00a0{\r\n\u00a0 \u00a0mres1 = _mm_sub_pd(_mm_loadu_pd(&amp;a[16*i]),_mm_loadu_pd(&amp;b[16*i])); \r\n\u00a0 \u00a0m1 = _mm_mul_pd(mres1,mres1); \r\n\u00a0 \u00a0mres2 = _mm_sub_pd(_mm_loadu_pd(&amp;a[16*i+2]),_mm_loadu_pd(&amp;b[16*i+2])); \r\n\u00a0 \u00a0mres = _mm_add_pd(mres,m1);\/\/ \r\n\u00a0 \u00a0m2 = _mm_mul_pd(mres2,mres2); \r\n\u00a0 \u00a0mres1 = _mm_sub_pd(_mm_loadu_pd(&amp;a[16*i+4]),_mm_loadu_pd(&amp;b[16*i+4])); \r\n\u00a0 \u00a0mres = _mm_add_pd(mres,m2);\/\/ \r\n\u00a0 \u00a0m1 = _mm_mul_pd(mres1,mres1); \r\n\u00a0 \u00a0mres2 = _mm_sub_pd(_mm_loadu_pd(&amp;a[16*i+6]),_mm_loadu_pd(&amp;b[16*i+6])); \r\n\u00a0 \u00a0mres = _mm_add_pd(mres,m1);\/\/ \r\n\u00a0 \u00a0m2 = _mm_mul_pd(mres2,mres2); \r\n\u00a0 \u00a0mres1 = _mm_sub_pd(_mm_loadu_pd(&amp;a[16*i+8]),_mm_loadu_pd(&amp;b[16*i+8])); \r\n\u00a0 \u00a0mres = _mm_add_pd(mres,m2);\/\/ \r\n\u00a0 \u00a0m1 = _mm_mul_pd(mres1,mres1); \r\n\u00a0 \u00a0mres2 = _mm_sub_pd(_mm_loadu_pd(&amp;a[16*i+10]),_mm_loadu_pd(&amp;b[16*i+10])); \r\n\u00a0 \u00a0mres = _mm_add_pd(mres,m1);\/\/ \r\n\u00a0 \u00a0m2 = _mm_mul_pd(mres2,mres2); \r\n\u00a0 \u00a0mres1 = _mm_sub_pd(_mm_loadu_pd(&amp;a[16*i+12]),_mm_loadu_pd(&amp;b[16*i+12])); \r\n\u00a0 \u00a0mres = _mm_add_pd(mres,m2);\/\/ \r\n\u00a0 \u00a0m1 = _mm_mul_pd(mres1,mres1); \r\n\u00a0 \u00a0mres2 = _mm_sub_pd(_mm_loadu_pd(&amp;a[16*i+14]),_mm_loadu_pd(&amp;b[16*i+14])); \r\n\u00a0 \u00a0mres = _mm_add_pd(mres,m1);\/\/ \r\n\u00a0 \u00a0m2 = _mm_mul_pd(mres2,mres2); \r\n\u00a0 \u00a0mres = _mm_add_pd(mres,m2);\/\/ \r\n }\r\n _mm_store_pd(ftmp, mres); \r\nfres = ftmp[0] + ftmp[1]; \r\n\r\n if ((size % 16) != 0) {\r\n\u00a0 \u00a0for (unsigned int i = size - (size % 16); i &lt; size; i++) fres += (a[i]-b[i]) * (a[i]-b[i]);\r\n }\r\n return fres;\r\n}<\/pre>\n<h1>Test Result<\/h1>\n<p>RUNNING K-MEANS WITH UNIFORM SEEDING, looking for 25 clusters<br>\n=============================================================<\/p>\n<p>Original : Potential at 140 iterations: 1.50983e+008 at Time: <strong>2.21731<\/strong><\/p>\n<p><strong>SSE optimized<\/strong> : Potential at 140 iterations: 1.50983e+008 at Time: <strong>1.74357 <span style=\"color: #339966;\">+ 21,3%<\/span><\/strong><\/p>\n<p>RUNNING K-MEANS++, looking for 25 clusters<br>\n==========================================<\/p>\n<p>Original :\u00a0Potential at 30 iterations: 1.64884e+007 at Time: <strong>0.471352<\/strong><\/p>\n<p><strong>SSE optimized<\/strong> : Potential at 30 iterations: 1.64884e+007 at Time: <strong>0.368786 <span style=\"color: #339966;\">+ 21,7%<\/span><\/strong><\/p>\n<p>RUNNING K-MEANS++ WITH RETRIES DURING SEEDING, looking for 25 clusters<br>\n======================================================================<\/p>\n<p>Original :\u00a0Potential at 45 iterations: 1.62709e+007 at Time: <strong>0.705721<\/strong><\/p>\n<p><strong>SSE optimized<\/strong> : Potential at 45 iterations: 1.62709e+007 at Time: <strong>0.552436 \u00a0<span style=\"color: #339966;\">+ 21,8%<\/span><\/strong><\/p>\n<\/body>","protected":false},"excerpt":{"rendered":"<p>K-MEANS \/ Distance Squared optimisation using SSE instruction A long time ago, a friend ask me to try some KMEANS<\/p>\n","protected":false},"author":1,"featured_media":404,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2},"jetpack_post_was_ever_published":false},"categories":[7],"tags":[],"class_list":["post-402","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-coding"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"https:\/\/i0.wp.com\/imalogic.com\/blog\/wp-content\/uploads\/2017\/05\/616705-3-65HWPEI2.png?fit=354%2C262&ssl=1","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/p8J21V-6u","jetpack-related-posts":[],"_links":{"self":[{"href":"https:\/\/imalogic.com\/blog\/wp-json\/wp\/v2\/posts\/402","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/imalogic.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/imalogic.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/imalogic.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/imalogic.com\/blog\/wp-json\/wp\/v2\/comments?post=402"}],"version-history":[{"count":1,"href":"https:\/\/imalogic.com\/blog\/wp-json\/wp\/v2\/posts\/402\/revisions"}],"predecessor-version":[{"id":408,"href":"https:\/\/imalogic.com\/blog\/wp-json\/wp\/v2\/posts\/402\/revisions\/408"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/imalogic.com\/blog\/wp-json\/wp\/v2\/media\/404"}],"wp:attachment":[{"href":"https:\/\/imalogic.com\/blog\/wp-json\/wp\/v2\/media?parent=402"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/imalogic.com\/blog\/wp-json\/wp\/v2\/categories?post=402"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/imalogic.com\/blog\/wp-json\/wp\/v2\/tags?post=402"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}