Branch-Free Blend()

Often, we want to evaluate simple expressions like:

result=(a<b) ? x : y

Clearly, this implies a conditional branch. That is not a huge problem, unless the code fragment absolutely needs to be as fast as possible.
A branch has several performance-impacting problems:

  • Because the outcome of the condition is not known in advance, the processor has to predict which path the code takes, and a mispredicted branch stalls the pipeline while the speculatively executed instructions are thrown away.
  • Automatic vectorization becomes difficult, as different vector lanes may have different condition outcomes; see the sketch after this list.
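
To illustrate the second point, here is a minimal, hypothetical sketch of the kind of loop where each element may take a different path (names invented for illustration):

/* Hypothetical example: clamp negative array entries to zero.  The
   outcome of the test differs from element to element, so every
   iteration may take a different path through the loop body. */
void clamp_negatives(double* v,int n){
  int i;
  for(i=0; i<n; ++i){
    if(v[i]<0.0){
      v[i]=0.0;
      }
    }
  }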

If we descend into the Intel assembly realm, we find that a floating-point comparison leaves an all-zeros or all-ones mask in the SSE or AVX register. Thus, we can make this code branch-free by combining the comparison with some logical masking:

__m128d result,c,a,b,x,y;
c=_mm_cmplt_sd(a,b);                                    // all ones if a<b, all zeros otherwise
result=_mm_or_pd(_mm_and_pd(c,x),_mm_andnot_pd(c,y));   // (c&x)|(~c&y)
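
For intuition, the same select-by-mask idea can be sketched with ordinary integer bit operations. The helper below is purely illustrative (it still builds the mask with a branchy ternary, whereas _mm_cmplt_sd produces it directly in the register):

#include <stdint.h>
#include <string.h>

// Illustrative sketch: pick the bits of x where the mask is all ones,
// the bits of y where it is all zeros.
static inline double blend_bits(double a,double b,double x,double y){
  uint64_t mask=(a<b) ? ~UINT64_C(0) : UINT64_C(0);   // all ones or all zeros
  uint64_t xb,yb,rb;
  double r;
  memcpy(&xb,&x,sizeof(xb));
  memcpy(&yb,&y,sizeof(yb));
  rb=(mask&xb)|(~mask&yb);                            // (mask&x)|(~mask&y)
  memcpy(&r,&rb,sizeof(r));
  return r;
  }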

Later Intel processors supporting SSE4.1 include a new _mm_blendv_pd() operation, which combines these three instructions into a single one:

__m128d result,c,a,b,x,y;
c=_mm_cmplt_sd(a,b);             // all ones if a<b, all zeros otherwise
result=_mm_blendv_pd(y,x,c);     // pick x where c is set, y otherwise

To make this a little bit more palatable for the casual user who isn't too familiar with Intel intrinsics programming, I've wrapped these into a more accessible form:

// Evaluate branch-free (a < b) ? x : y, if supported on processor
static inline double blend(double a,double b,double x,double y){
#if defined(__SSE4_1__)
  return _mm_cvtsd_f64(_mm_blendv_pd(_mm_set_sd(y),_mm_set_sd(x),_mm_cmplt_sd(_mm_set_sd(a),_mm_set_sd(b))));
#elif defined(__SSE2__)
  __m128d cc=_mm_cmplt_sd(_mm_set_sd(a),_mm_set_sd(b));
  return _mm_cvtsd_f64(_mm_or_pd(_mm_and_pd(cc,_mm_set_sd(x)),_mm_andnot_pd(cc,_mm_set_sd(y))));
#else
  return a<b ? x : y;
#endif
  }

And an equivalent version for single-precision math:

// Evaluate branch-free (a < b) ? x : y, if supported on processor
static inline float blend(float a,float b,float x,float y){
#if defined(__SSE4_1__)
  return _mm_cvtss_f32(_mm_blendv_ps(_mm_set_ss(y),_mm_set_ss(x),_mm_cmplt_ps(_mm_set_ss(a),_mm_set_ss(b))));
#elif defined(__SSE__)
  __m128 cc=_mm_cmplt_ss(_mm_set_ss(a),_mm_set_ss(b));
  return _mm_cvtss_f32(_mm_or_ps(_mm_and_ps(cc,_mm_set_ss(x)),_mm_andnot_ps(cc,_mm_set_ss(y))));
#else
  return a<b ? x : y;
#endif
  }

When compiled with optimization, these little routines perform very well indeed. In addition, GCC is clever enough to auto-vectorize when these routines are called in a loop.
Of course, for this to work, make sure these functions are defined in a header file so the compiler can see the bodies and inline them.
On machines where _mm_blendv_pd() and _mm_blendv_ps() are available, the resulting output is only two instructions: the compare and the blend.
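
As a usage illustration (the function name and loop below are hypothetical), this is the kind of loop the auto-vectorizer handles well, since the body is straight-line code with no branch:

// Hypothetical usage: element-wise minimum of two arrays, written with
// the branch-free blend() defined above.
void elementwise_min(double* r,const double* a,const double* b,int n){
  int i;
  for(i=0; i<n; ++i){
    r[i]=blend(a[i],b[i],a[i],b[i]);   // (a[i]<b[i]) ? a[i] : b[i]
    }
  }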
