Handling vector loop left-overs with masked loads and stores

In the previous example we used a nice little trick which was not available until AVX came on the scene: masked loads and stores.
In this note we'll go a little deeper into the use of masked loads and stores, and how they can greatly help in handling left-overs after vector loops, as well as in dealing with data structures whose size is simply not a whole multiple of the natural vector size.

We start with a simple problem of adding two 3D vectors:

// Define a mask for double precision 3d-vector
#define MMM _mm256_set_epi64x(0,~0,~0,~0)
 
// We want to do c=a+b vector-sum
FXVec3d a,b,c;
 
// Set a and b somehow
 
// Use AVX intrinsics
_mm256_maskstore_pd(&c[0],MMM,_mm256_add_pd(_mm256_maskload_pd(&a[0],MMM),
                                            _mm256_maskload_pd(&b[0],MMM)));

This was pretty easy, right? Note that Intel defined these masked loads and stores in such a way that the masked-off memory locations are not touched at all; i.e. they're not simply loaded and written back with their original values, they are never accessed. This is important because you don't want to incur a segmentation violation when your vector happens to be the last thing in a memory page!
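To make the page-boundary point concrete, here is a minimal sketch (my own illustration, not part of the original example, assuming a POSIX system with 4 KiB pages): it places a 3-element double vector in the last 24 bytes of an accessible page, right in front of an inaccessible guard page. The masked load succeeds because the fourth lane is never read, whereas an unmasked 32-byte load at the same address would fault.

#include <immintrin.h>
#include <stdio.h>
#include <sys/mman.h>

// Mask with the lower three 64-bit lanes enabled, top lane disabled
#define MMM _mm256_set_epi64x(0,~0,~0,~0)

int main(void){
  // Map two pages, then make the second one inaccessible (guard page).
  // 4096-byte pages are assumed here for simplicity.
  unsigned char* page=(unsigned char*)mmap(NULL,2*4096,PROT_READ|PROT_WRITE,MAP_PRIVATE|MAP_ANONYMOUS,-1,0);
  if(page==(unsigned char*)MAP_FAILED) return 1;
  mprotect(page+4096,4096,PROT_NONE);

  // Place the 3 doubles in the last 24 bytes of the accessible page
  double* v=(double*)(page+4096-3*sizeof(double));
  v[0]=1.0; v[1]=2.0; v[2]=3.0;

  __m256d x=_mm256_maskload_pd(v,MMM);   // OK: the masked-off 4th lane is never read
//__m256d y=_mm256_loadu_pd(v);          // Would fault: reads 8 bytes into the guard page

  double out[4];
  _mm256_storeu_pd(out,x);
  printf("%g %g %g\n",out[0],out[1],out[2]);
  return 0;
  }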

Next, we move on to a slightly more sophisticated use: wrapping up the left-overs of a vector loop. Note that with the masked load and store, we can typically perform the last iteration in vector mode as well; we don't have to resort to plain scalar code like you had to do with SSE.

void add_vectors(float* result,const float* a,const float* b,int n){
  register __m256i mmmm;
  register __m256 aaaa,bbbb,rrrr;
  register int i=0;

  // Vector loop adds 8 pairs of floats at a time
  while(i<n-8){
    aaaa=_mm256_loadu_ps(&a[i]);
    bbbb=_mm256_loadu_ps(&b[i]);
    rrrr=_mm256_add_ps(aaaa,bbbb);
    _mm256_storeu_ps(&result[i],rrrr);
    i+=8;
    }

  // Load the mask at index n-i; this should be in the range 0...8.
  mmmm=_mm256_castps_si256(_mm256_load_ps((const float*)mask8i[n-i]));

  // Use masked loads
  aaaa=_mm256_maskload_ps(&a[i],mmmm);
  bbbb=_mm256_maskload_ps(&b[i],mmmm);

  // Same vector operation as main loop
  rrrr=_mm256_add_ps(aaaa,bbbb);

  // Use masked store
  _mm256_maskstore_ps(&result[i],mmmm,rrrr);
  }

Note that the loop deliberately stops one vector short when n is a multiple of eight: since the mop-up code is executed unconditionally, we'd rather do it with an actual payload than with all data masked out.
Also note that we don't have a special case for n==0. In the rare case that this happens, we will just execute the mop-up code with an all-zeroes mask!
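As a quick usage sketch (my own, hypothetical example): with n=11 the main loop handles the first 8 floats and the masked mop-up handles the remaining 3, using mask8i[3]; with n=16 the main loop handles only the first 8 and the mop-up processes the last 8 with the all-ones mask mask8i[8].

// Hypothetical usage of add_vectors(); n=11 is not a multiple of 8
float a[11],b[11],c[11];
for(int j=0; j<11; ++j){ a[j]=(float)j; b[j]=2.0f*j; }

add_vectors(c,a,b,11);    // main loop: elements 0..7, masked mop-up: elements 8..10
add_vectors(c,a,b,0);     // fine too: mop-up runs with the all-zeroes mask, stores nothing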

What's left to do is to build a little array with the mask values; due to the observation above, it will have 9 entries, not 8!

static __align(32) const int  mask8i[9][8]={
  { 0, 0, 0, 0, 0, 0, 0, 0},
  {-1, 0, 0, 0, 0, 0, 0, 0},
  {-1,-1, 0, 0, 0, 0, 0, 0},
  {-1,-1,-1, 0, 0, 0, 0, 0},
  {-1,-1,-1,-1, 0, 0, 0, 0},
  {-1,-1,-1,-1,-1, 0, 0, 0},
  {-1,-1,-1,-1,-1,-1, 0, 0},
  {-1,-1,-1,-1,-1,-1,-1, 0},
  {-1,-1,-1,-1,-1,-1,-1,-1}
  };

In the above, the __align() macro is fleshed out differently depending on your compiler; however, it should ensure that the array is aligned on a 32-byte boundary (the size of an AVX vector).
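For instance, one plausible way to define such a macro (an assumption on my part, covering GCC/Clang and MSVC; adjust for other compilers) would be:

// Hypothetical definition of the __align() macro (not from the original code)
#if defined(_MSC_VER)
#define __align(n)  __declspec(align(n))         // MSVC
#else
#define __align(n)  __attribute__((aligned(n)))  // GCC, Clang
#endif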

Bottom line: the new AVX masked loads and stores solve a real problem that was always a bit awkward to handle before; they allow the mop-up code to be vectorized the same as the main loop, which is important as you may want to ensure the last couple of numbers get “the same treatment” as the rest of them.
