FUN With AVX

Over the last few years, Intel and AMD have added 256-bit vector support to their processors. This support for wider vectors is commonly known as AVX (Advanced Vector Extensions).

Wider vectors also introduce more processor state. To use these features, it's therefore not enough to have a CPU capable of AVX; your operating system and compiler need to be aware of it as well.

For maximum portability, I recommend using the Intel intrinsics. These are supported by GCC, LLVM, as well as late-model Microsoft and Intel compilers. The advantages of using intrinsics are:

  1. They are easier for the developer to work with, since you can embed them in your regular C/C++ code.
  2. The “smarts” of the compiler regarding register-assignments, common subexpression elimination, and other data-flow analysis goodies are at your disposal.
  3. Target architecture setting of the compiler will automatically use the new VEX instruction encoding, even for code originally written with SSE in mind.

The matrix classes in FOX were originally vectorized for SSE (actually, SSE2/SSE3). Compiling with -mavx will automatically kick in the new VEX encoding for these same SSE intrinsics. This is nice because AVX supports three-operand instructions (of the form A = B op C) rather than the old two-operand instructions (A = A op B). This means you can typically make do with fewer registers, and quite possibly eliminate useless mov instructions. Your code will be correspondingly smaller and faster, with no work at all!
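To make this concrete, here is a minimal sketch of SSE-intrinsic code that needs no source changes. The function name addFloat4 is made up for illustration and is not taken from FOX. Built normally, GCC or LLVM should emit the legacy two-operand form (addps); built with -mavx, the same source should come out VEX-encoded (vaddps). Checking the generated assembly (for example with -S) is the way to confirm this for your compiler.

```cpp
#include <cassert>
#include <xmmintrin.h>   // SSE intrinsics: __m128, _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps

// Hypothetical helper (not part of FOX): r[i] = a[i] + b[i] for i = 0..3.
// The source is plain SSE; only the compiler flags decide the instruction encoding.
void addFloat4(float* r, const float* a, const float* b) {
  assert(r != nullptr && a != nullptr && b != nullptr);
  __m128 va = _mm_loadu_ps(a);            // Unaligned load of 4 floats
  __m128 vb = _mm_loadu_ps(b);
  _mm_storeu_ps(r, _mm_add_ps(va, vb));   // 4 single-precision adds at once
}
```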

However, this does not fully exploit the new goodies AVX brings to the table. The most obvious benefit is the wider vectors, which means each instruction can work on twice as much data as before.

For example, given 4×4 double-precision matrices a[4][4] and b[4][4], we can now add a and b, storing the sum in r[4][4], like this:

  __m256d va,vb,vc;               // Each vector holds one row of 4 doubles
  va=_mm256_loadu_pd(a[0]);       // Row 0
  vb=_mm256_loadu_pd(b[0]);
  vc=_mm256_add_pd(va,vb);
  _mm256_storeu_pd(r[0],vc);
  va=_mm256_loadu_pd(a[1]);       // Row 1
  vb=_mm256_loadu_pd(b[1]);
  vc=_mm256_add_pd(va,vb);
  _mm256_storeu_pd(r[1],vc);
  va=_mm256_loadu_pd(a[2]);       // Row 2
  vb=_mm256_loadu_pd(b[2]);
  vc=_mm256_add_pd(va,vb);
  _mm256_storeu_pd(r[2],vc);
  va=_mm256_loadu_pd(a[3]);       // Row 3
  vb=_mm256_loadu_pd(b[3]);
  vc=_mm256_add_pd(va,vb);
  _mm256_storeu_pd(r[3],vc);

This little code fragment performs 16 double-precision adds in just 4 vector add instructions!
Note that where SSE code used __m128 (float), __m128i (integer), and __m128d (double), you now declare vector variables as __m256 (float), __m256i (integer), or __m256d (double).
The penalty for accessing unaligned memory addresses is much less for AVX, and thus we can use unaligned loads and stores, at a very modest (and usually not measurable) speed penalty. If you really want to go all out, however, remember to align things to 32 bytes now, not 16 bytes like you did for SSE!
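As a rough sketch of both options, the matrix add can be written once with unaligned accesses and once with aligned ones. The names Mat4d, addUnaligned, and addAligned are hypothetical and not FOX's API. The target("avx") attribute is a GCC/Clang extension that lets a single function use AVX without compiling the whole file with -mavx; other compilers would need a different mechanism.

```cpp
#include <cassert>
#include <cstdint>
#include <immintrin.h>   // AVX intrinsics: __m256d, _mm256_load_pd, _mm256_loadu_pd, ...

// Hypothetical wrapper: alignas(32) places the matrix on a 32-byte boundary.
// Each row is 4 doubles = 32 bytes, so every row is then aligned as well.
struct alignas(32) Mat4d { double m[4][4]; };

// Unaligned variant: works for any pointer to 16 contiguous doubles (row-major).
__attribute__((target("avx")))
void addUnaligned(double* r, const double* a, const double* b) {
  for (int i = 0; i < 4; ++i) {
    __m256d va = _mm256_loadu_pd(a + 4 * i);
    __m256d vb = _mm256_loadu_pd(b + 4 * i);
    _mm256_storeu_pd(r + 4 * i, _mm256_add_pd(va, vb));
  }
}

// Aligned variant: _mm256_load_pd/_mm256_store_pd require 32-byte alignment
// and fault if the address is not aligned.
__attribute__((target("avx")))
void addAligned(Mat4d& r, const Mat4d& a, const Mat4d& b) {
  assert(reinterpret_cast<std::uintptr_t>(&r) % 32 == 0);
  assert(reinterpret_cast<std::uintptr_t>(&a) % 32 == 0);
  assert(reinterpret_cast<std::uintptr_t>(&b) % 32 == 0);
  for (int i = 0; i < 4; ++i) {
    __m256d va = _mm256_load_pd(a.m[i]);
    __m256d vb = _mm256_load_pd(b.m[i]);
    _mm256_store_pd(r.m[i], _mm256_add_pd(va, vb));
  }
}
```

Callers should only invoke these functions after confirming at run time that AVX is usable, since executing them on a machine without AVX raises an illegal-instruction fault.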

For FOX’s matrix classes, compatibility with existing end-user code means variables cannot be relied upon to be aligned, and thus unaligned accesses are used throughout. This obviates the need for end-user code to be updated for alignment restrictions.

Many of the usual suspects from SSE have extended equivalents in the AVX world: _mm_add_ps() becomes _mm256_add_ps(), _mm_addsub_ps() becomes _mm256_addsub_ps(), and so on.
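A small side-by-side sketch of one such pair follows. The wrapper names addsub4 and addsub8 are invented for illustration. Both intrinsics subtract in the even-numbered lanes and add in the odd-numbered lanes; the 256-bit version simply handles eight floats instead of four. The target attributes are GCC/Clang extensions, used here so the file compiles without extra flags.

```cpp
#include <cassert>
#include <immintrin.h>   // Pulls in both the SSE3 and the AVX intrinsics

// SSE3 version: 4 floats. r[0]=a[0]-b[0], r[1]=a[1]+b[1], r[2]=a[2]-b[2], r[3]=a[3]+b[3].
__attribute__((target("sse3")))
void addsub4(float* r, const float* a, const float* b) {
  assert(r != nullptr && a != nullptr && b != nullptr);
  _mm_storeu_ps(r, _mm_addsub_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}

// AVX version: same pattern over 8 floats, using the _mm256_ prefix.
__attribute__((target("avx")))
void addsub8(float* r, const float* a, const float* b) {
  assert(r != nullptr && a != nullptr && b != nullptr);
  _mm256_storeu_ps(r, _mm256_addsub_ps(_mm256_loadu_ps(a), _mm256_loadu_ps(b)));
}
```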

Detection of AVX on your CPU.

Detection of AVX can be done using the CPUID instruction. However, unlike SSE3 and SSE4.x, the introduction of AVX did more than add new instructions to the processor. It also added new state, due to the wider vector registers. Consequently, just knowing that your CPU can do AVX isn’t enough.

You also need the operating system to support the extension, because the extra state in the processor must be saved and restored when the operating system preempts your process. Consequently, executing AVX instructions on an operating system which does not support them will likely result in an “Illegal Instruction” exception. To put it bluntly, your program will core-dump!

Fortunately, Operating System support for AVX can also be detected, using CPUID together with the new XGETBV instruction. There are three steps involved:

  1. Check AVX support, using CPUID function code #1. The ECX and EDX registers are used to return a number of feature bits describing various extensions to the core x86 instruction set. The one we’re looking for in this case is ECX bit #28. If on, we’ve got AVX in the hardware.
  2. Next, Intel recommends checking ECX bit #27. This feature bit represents the OSXSAVE feature. XSAVE is basically a faster, extensible way to save processor state, and the bit is set only when the Operating System has enabled it. If it is off, AVX is not usable, and the XGETBV instruction needed in the next step is not available either.
  3. Finally, a new register (XCR0) in the CPU indicates whether the Operating System has enabled saving of the full AVX state. Just like the processor tick counter, this register is read using a new magic instruction: XGETBV. Executed with ECX set to 0, XGETBV populates the EDX:EAX register pair with feature flags indicating the processor state the Operating System is aware of. At this time, x86 processors support three processor-state subsets: x87 FPU state (bit #0), SSE state (bit #1), and AVX state (bit #2), all in the EAX register. For AVX, bits #1 and #2 must both be on; this indicates the Operating System indeed saves the AVX state and has enabled AVX instructions to be available.
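The three steps above can be sketched roughly as follows. This is not how FOX implements it; it is a minimal version assuming GCC or Clang on x86, where <cpuid.h> provides __get_cpuid() and inline assembly is available. The function names readXCR0 and avxUsable are made up for this example.

```cpp
#include <cassert>
#include <cpuid.h>   // GCC/Clang helper header providing __get_cpuid()

// Read extended control register 0 with XGETBV (ECX=0 selects XCR0).
// Must only be called after the OSXSAVE bit has been confirmed, or it faults.
static unsigned long long readXCR0() {
  unsigned int eax, edx;
  __asm__ volatile("xgetbv" : "=a"(eax), "=d"(edx) : "c"(0));
  return (static_cast<unsigned long long>(edx) << 32) | eax;
}

// Returns true only if the CPU has AVX and the OS saves/restores AVX state.
bool avxUsable() {
  unsigned int eax, ebx, ecx, edx;
  if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) return false;  // CPUID function #1
  if (!(ecx & (1u << 28))) return false;   // Step 1: AVX present in hardware?
  if (!(ecx & (1u << 27))) return false;   // Step 2: OSXSAVE enabled by the OS?
  unsigned long long xcr0 = readXCR0();    // Step 3: which state does the OS manage?
  assert((xcr0 & 0x1) != 0);               // Sanity check: x87 bit #0 is always set
  return (xcr0 & 0x6) == 0x6;              // SSE (bit #1) and AVX (bit #2) both on
}
```

A program would typically call avxUsable() once at startup and select an AVX or SSE code path based on the result.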

All this sounds pretty complicated, unless you’re an assembly-language programmer. So, to make life a bit easier, the FOX CPUID APIs have been updated to do some of this hard work for you.

To perform simple feature tests, use the new fxCPUFeatures() API. It returns bit-flags for most instruction sets added on top of plain x86. If the operating system does not support the extended state, it simply disables the flags for AVX, AVX2, FMA, XOP, and FMA4.
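The masking idea can be illustrated with a stand-in. Note that everything named below (the FEAT_ constants, their values, and usableFeatures) is hypothetical and chosen only for this sketch; consult the FOX headers for the real flag names returned by fxCPUFeatures().

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical flag values for illustration only; NOT the constants FOX defines.
enum : std::uint32_t {
  FEAT_SSE2 = 1u << 0,
  FEAT_AVX  = 1u << 1,
  FEAT_AVX2 = 1u << 2,
  FEAT_FMA  = 1u << 3,
  FEAT_XOP  = 1u << 4,
  FEAT_FMA4 = 1u << 5
};

// Extensions that depend on the OS saving the 256-bit register state.
const std::uint32_t NEEDS_AVX_STATE = FEAT_AVX | FEAT_AVX2 | FEAT_FMA | FEAT_XOP | FEAT_FMA4;
const std::uint32_t ALL_FEATURES = FEAT_SSE2 | NEEDS_AVX_STATE;

// Hypothetical helper: given the raw hardware flags and whether the OS saves
// AVX state, return the flags a program may actually rely on.
std::uint32_t usableFeatures(std::uint32_t hardware, bool osSavesAvxState) {
  assert((hardware & ~ALL_FEATURES) == 0);   // Only known flags expected
  return osSavesAvxState ? hardware : (hardware & ~NEEDS_AVX_STATE);
}
```

With this scheme, application code tests a single flag and never has to repeat the operating-system check itself.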

More on AVX in subsequent posts.

 
