Often, we want to evaluate simple expressions like:

```c
result = (a < b) ? x : y;
```

Clearly, this implies a conditional branch. That is not a huge problem, unless the code fragment absolutely needs to be as fast as possible.

A branch has several performance-impacting problems:

- Because the outcome of the condition is uncertain, the processor will frequently mispredict the path through the code and pay a branch-misprediction penalty.
- Automatic vectorization becomes difficult, since different vector lanes may see different condition outcomes.

If we descend into the Intel assembly realm, we find that a floating-point comparison leaves an all-zeros or all-ones mask in the SSE or AVX register. Thus, we can make this code branch-free by combining the comparison with some logical masking:

```c
__m128d result, c, a, b, x, y;
c = _mm_cmplt_sd(a, b);
result = _mm_or_pd(_mm_and_pd(c, x), _mm_andnot_pd(c, y));
```

Later Intel processors supporting SSE4.1 include a new _mm_blendv_pd() operation, which collapses these three logical instructions into a single one:

```c
__m128d result, c, a, b, x, y;
c = _mm_cmplt_sd(a, b);
result = _mm_blendv_pd(y, x, c);
```

To make this a little more palatable for the casual user who isn't too familiar with Intel intrinsic programming, I've wrapped these in a more accessible form:

```c
#if defined(__SSE__)
#include <immintrin.h>
#endif

// Evaluate branch-free (a < b) ? x : y, if supported on processor
static inline double blend(double a, double b, double x, double y)
{
#if defined(__SSE4_1__)
    return _mm_cvtsd_f64(_mm_blendv_pd(_mm_set_sd(y), _mm_set_sd(x),
                                       _mm_cmplt_sd(_mm_set_sd(a), _mm_set_sd(b))));
#elif defined(__SSE2__)
    __m128d cc = _mm_cmplt_sd(_mm_set_sd(a), _mm_set_sd(b));
    return _mm_cvtsd_f64(_mm_or_pd(_mm_and_pd(cc, _mm_set_sd(x)),
                                   _mm_andnot_pd(cc, _mm_set_sd(y))));
#else
    return a < b ? x : y;
#endif
}
```

And an equivalent version for single precision math:

```c
#if defined(__SSE__)
#include <immintrin.h>
#endif

// Evaluate branch-free (a < b) ? x : y, if supported on processor
static inline float blend(float a, float b, float x, float y)
{
#if defined(__SSE4_1__)
    return _mm_cvtss_f32(_mm_blendv_ps(_mm_set_ss(y), _mm_set_ss(x),
                                       _mm_cmplt_ss(_mm_set_ss(a), _mm_set_ss(b))));
#elif defined(__SSE__)
    __m128 cc = _mm_cmplt_ss(_mm_set_ss(a), _mm_set_ss(b));
    return _mm_cvtss_f32(_mm_or_ps(_mm_and_ps(cc, _mm_set_ss(x)),
                                   _mm_andnot_ps(cc, _mm_set_ss(y))));
#else
    return a < b ? x : y;
#endif
}
```

When compiled with optimization, these little routines perform very well indeed. In addition, GCC is clever enough to auto-vectorize loops that call them.

Of course, for this to work, make sure these functions are defined in a header file, so the compiler can inline them at every call site.

On machines where _mm_blendv_pd() and _mm_blendv_ps() are available, the resulting output is only two instructions: the compare and the blend.