Distance Squared optimization using SSE instruction set

K-MEANS / Distance Squared optimisation using SSE instruction

A long time ago, a friend ask me to try some KMEANS algorithm optimization.

The perfect piece of code candidate for optimization was selected: a little function used to calculate the distance squared between two points.

Like a good chocolate mousse, SSE2 was used to optimize and perform distance squared operation using “big register”

SSE2 (Streaming SIMD Extensions 2), is one of the Intel SIMD (Single Instruction, Multiple Data) processor supplementary instruction sets first introduced by Intel with the initial version of the Pentium 4 in 2001. It extends the earlier SSE instruction set, and is intended to fully replace MMX. Intel extended SSE2 to create SSE3 in 2004. SSE2 added 144 new instructions to SSE, which has 70 instructions. Competing chip-maker AMD added support for SSE2 with the introduction of their Opteron and Athlon 64 ranges of AMD64 64-bit CPUs in 2003.

Most of the SSE2 instructions implement the integer vector operations also found in MMX. They use the XMM registers instead of the MMX registers, which are wider and allow for significant performance improvements in specialized applications.

 

After several days of try/error, finally i reach to gain some time ( around  22% ) – So, yes assembler optimisation can be usefull but should be used for well selected part of code (innerloop). But assembler optimization still need expensive time to do.

 

A next try should be to use GPU capabilities to reduce time calculation. But need also to adapt the code to pass massive input/output data using specific texture buffer.

Original C++ vs optimized SSE version

/**
 * Returns the distance squared between two points.
 **/
Scalar distSq(const Point &lhs, const Point &rhs) {
 ASSERT(lhs.getNumDimensions() == rhs.getNumDimensions());
 Scalar result = 0;
 for (int i = 0; i < lhs.getNumDimensions(); i++)
 result += (lhs[i] - rhs[i]) * (lhs[i] - rhs[i]);
 return result;
}	
double distSq(double* a,double* b)
{
 unsigned int i=0;
 unsigned int size = data->nCol; 
 double fres = 0.0;

__declspec(align(16)) double ftmp[2] = { 0.0, 0.0 };
 __m128d mres, mres1,mres2;
 __m128d m1,m2; 


 mres = _mm_xor_pd(mres,mres); 


 for (unsigned int i = 0; i < size / 16 ; i+=1)  {
   mres1 = _mm_sub_pd(_mm_loadu_pd(&a[16*i]),_mm_loadu_pd(&b[16*i])); 
   m1 = _mm_mul_pd(mres1,mres1); 
   mres2 = _mm_sub_pd(_mm_loadu_pd(&a[16*i+2]),_mm_loadu_pd(&b[16*i+2])); 
   mres = _mm_add_pd(mres,m1);// 
   m2 = _mm_mul_pd(mres2,mres2); 
   mres1 = _mm_sub_pd(_mm_loadu_pd(&a[16*i+4]),_mm_loadu_pd(&b[16*i+4])); 
   mres = _mm_add_pd(mres,m2);// 
   m1 = _mm_mul_pd(mres1,mres1); 
   mres2 = _mm_sub_pd(_mm_loadu_pd(&a[16*i+6]),_mm_loadu_pd(&b[16*i+6])); 
   mres = _mm_add_pd(mres,m1);// 
   m2 = _mm_mul_pd(mres2,mres2); 
   mres1 = _mm_sub_pd(_mm_loadu_pd(&a[16*i+8]),_mm_loadu_pd(&b[16*i+8])); 
   mres = _mm_add_pd(mres,m2);// 
   m1 = _mm_mul_pd(mres1,mres1); 
   mres2 = _mm_sub_pd(_mm_loadu_pd(&a[16*i+10]),_mm_loadu_pd(&b[16*i+10])); 
   mres = _mm_add_pd(mres,m1);// 
   m2 = _mm_mul_pd(mres2,mres2); 
   mres1 = _mm_sub_pd(_mm_loadu_pd(&a[16*i+12]),_mm_loadu_pd(&b[16*i+12])); 
   mres = _mm_add_pd(mres,m2);// 
   m1 = _mm_mul_pd(mres1,mres1); 
   mres2 = _mm_sub_pd(_mm_loadu_pd(&a[16*i+14]),_mm_loadu_pd(&b[16*i+14])); 
   mres = _mm_add_pd(mres,m1);// 
   m2 = _mm_mul_pd(mres2,mres2); 
   mres = _mm_add_pd(mres,m2);// 
 }
 _mm_store_pd(ftmp, mres); 
fres = ftmp[0] + ftmp[1]; 

 if ((size % 16) != 0) {
   for (unsigned int i = size - (size % 16); i < size; i++) fres += (a[i]-b[i]) * (a[i]-b[i]);
 }
 return fres;
}

Test Result

RUNNING K-MEANS WITH UNIFORM SEEDING, looking for 25 clusters
=============================================================

Original : Potential at 140 iterations: 1.50983e+008 at Time: 2.21731

SSE optimized : Potential at 140 iterations: 1.50983e+008 at Time: 1.74357 + 21,3%

RUNNING K-MEANS++, looking for 25 clusters
==========================================

Original : Potential at 30 iterations: 1.64884e+007 at Time: 0.471352

SSE optimized : Potential at 30 iterations: 1.64884e+007 at Time: 0.368786 + 21,7%

RUNNING K-MEANS++ WITH RETRIES DURING SEEDING, looking for 25 clusters
======================================================================

Original : Potential at 45 iterations: 1.62709e+007 at Time: 0.705721

SSE optimized : Potential at 45 iterations: 1.62709e+007 at Time: 0.552436  + 21,8%