## Modified Rodrigues Parameters

Modified Rodrigues Parameters are a compact way to describe orientations. We’re all familiar with 3×3 matrices, and most of us also know about quaternions.
While a 3×3 matrix requires 9 numbers, a quaternion requires only 4, so a considerable amount of space is saved by using quaternions. In addition, quaternions are nice to work with.

However, quaternions are not the most compact form for representing rotations in three dimensions; only three numbers are actually needed. One can encode the axis of rotation in a unit vector, and the angle can then be represented by its length. Obviously, different methods can be used to encode the angle.

In Modified Rodrigues Parameters (MRP) representation, the orientation is encoded as:

```
MRP = E * tan(phi/4)
```

where E is the unit vector (e1,e2,e3) describing the axis of rotation, and phi is the angle of rotation about this axis.
To be able to use the MRP, we often need to convert it to other representations: a quaternion, or a 3×3 rotation matrix.

The quaternion Q=(q1,q2,q3,q4) can be obtained from MRP m=(m1,m2,m3) as follows:

```
q1 = 2 * m1 * D
q2 = 2 * m2 * D
q3 = 2 * m3 * D
q4 = (1-|m|*|m|) * D
```

where D is given as:

```
D = 1 / (1 + |m|*|m|)
```

and |m| is the length of the MRP vector m as in:

```
|m| = sqrt(m1*m1+m2*m2+m3*m3)
```

Converting a quaternion to Modified Rodrigues Parameter (MRP) m is even easier:

```
m1 = q1 / k
m2 = q2 / k
m3 = q3 / k
```

where k is defined as:

```
k = 1 + q4
```

In our definition of the quaternion above, the real component is the last number.
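Put into code, the two conversions above might look like this (a sketch, with the quaternion stored as (q1,q2,q3,q4), real part last, as in the text):

```c
#include <math.h>

// MRP -> quaternion, per the equations above.
void mrp_to_quat(const double m[3],double q[4]){
  double mm=m[0]*m[0]+m[1]*m[1]+m[2]*m[2];   // |m|^2
  double D=1.0/(1.0+mm);
  q[0]=2.0*m[0]*D;
  q[1]=2.0*m[1]*D;
  q[2]=2.0*m[2]*D;
  q[3]=(1.0-mm)*D;
  }

// Quaternion -> MRP; assumes q4 != -1 (i.e. not a full 360-degree rotation).
void quat_to_mrp(const double q[4],double m[3]){
  double k=1.0+q[3];
  m[0]=q[0]/k;
  m[1]=q[1]/k;
  m[2]=q[2]/k;
  }
```

A round trip through both functions should reproduce the original MRP, which makes for an easy sanity check.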

Converting the MRP to a 3×3 matrix M is quite a bit trickier, but fortunately this procedure is adequately documented in several publications. The following expressions can be used:

```
M = I + a*X + b*(X*X)
```

where I is the 3×3 identity matrix, and X is the 3×3 skew-symmetric matrix:

```
       0   -m3    m2
X =   m3     0   -m1
     -m2    m1     0
```

with a and b calculated as:

```
a = 4 * (1-|m|*|m|) / ((1+|m|*|m|) * (1+|m|*|m|))
b = 8 / ((1+|m|*|m|) * (1+|m|*|m|))
```

The above equations are detailed in the paper “A Survey of Attitude Representations,” Malcolm D. Shuster, Journal of the Astronautical Sciences, Vol. 41, No. 4, Oct-Dec 1993.
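A direct implementation of these expressions might look as follows (a sketch; it expands X*X using the identity X*X = m*mᵀ − |m|²·I, which holds for the skew-symmetric matrix built from m; the matrix is stored row-major):

```c
// Hypothetical sketch: convert MRP m to a row-major 3x3 rotation matrix,
// per M = I + a*X + b*X*X.
void mrp_to_matrix(const double m[3],double M[9]){
  double mm=m[0]*m[0]+m[1]*m[1]+m[2]*m[2];
  double d=(1.0+mm)*(1.0+mm);
  double a=4.0*(1.0-mm)/d;
  double b=8.0/d;
  // Diagonal terms use X*X's diagonal m_i^2-|m|^2; off-diagonal terms
  // combine a*X with b*m_i*m_j.
  M[0]=1.0+b*(m[0]*m[0]-mm); M[1]=-a*m[2]+b*m[0]*m[1]; M[2]= a*m[1]+b*m[0]*m[2];
  M[3]= a*m[2]+b*m[0]*m[1];  M[4]=1.0+b*(m[1]*m[1]-mm); M[5]=-a*m[0]+b*m[1]*m[2];
  M[6]=-a*m[1]+b*m[0]*m[2];  M[7]= a*m[0]+b*m[1]*m[2];  M[8]=1.0+b*(m[2]*m[2]-mm);
  }
```

For the 90-degree z-axis rotation from before, this reproduces the familiar rotation matrix with zeros on the diagonal of the upper-left 2×2 block.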

Converting a 3×3 matrix to the Modified Rodrigues Parameter representation, i.e. the opposite of the procedure above, does not appear to be documented in any literature I’ve been able to find. So, let’s take a stab at it, using our existing knowledge of how to extract the axis and angle from a 3×3 rotation matrix M:

```
cos(phi) = (m11+m22+m33-1)/2
r1 = (m32-m23)
r2 = (m13-m31)
r3 = (m21-m12)
```

Note that R=(r1,r2,r3) has length 2*sin(phi), so at this point we have both sin(phi) and cos(phi). Thus, the angle phi can be extracted by:

```
phi = atan2(|R|/2, (m11+m22+m33-1)/2)
```

which can be simplified to:

```
phi = atan2(|R|, m11+m22+m33-1)
```

Then the MRP m can be obtained by:

```
m1 = r1*tan(phi/4)/|R|
m2 = r2*tan(phi/4)/|R|
m3 = r3*tan(phi/4)/|R|
```

The above solution works quite well; however, it involves expensive calls to tan() and atan2(), which we’d rather avoid.
Fortunately, we can in fact avoid them! First, recall that:

```
cos(phi/2) = (1/2) * sqrt(m11+m22+m33+1)
```

Now, apply the half-angle formula for cos(), which yields the following identities:

```
cos^2(phi/4) = (1 + (1/2) * sqrt(m11+m22+m33+1))/2
sin^2(phi/4) = (1 - (1/2) * sqrt(m11+m22+m33+1))/2
```

then we easily obtain the required:

```
tan(phi/4) = sqrt( (1-sqrt(m11+m22+m33+1)/2) / (1+sqrt(m11+m22+m33+1)/2) )
```

With some manipulation, we can produce the most compact calculations yet:

```
r1 = (m32 - m23)
r2 = (m13 - m31)
r3 = (m21 - m12)
RR = r1*r1 + r2*r2 + r3*r3
C  = sqrt(m11 + m22 + m33 + 1)/2
t  = sqrt((1 - C) / ((1 + C) * RR))
m1 = r1 * t
m2 = r2 * t
m3 = r3 * t
```

This is probably the fastest possible method to convert a 3×3 matrix into an MRP;
note, however, that a practical implementation should be wary of |R| = 0, which represents
the no-rotation case. In this case, the answer is simply the zero vector.

## Scanning UTF-8 Unicode Text

Unicode text may be encoded as 32-bit, 16-bit, or plain 8-bit units in the computer. Of these representations, the 8-bit encoding has become quite popular, not least because it has a few very nice properties:

1. It’s very compact, taking up very little extra space, especially for most western characters.
2. It is insensitive to byte-order issues, since each character is a sequence of bytes in the same order regardless of the byte-endianness of the processor.
3. UTF-8 is easily processed by software that already processes text as bytes.  In most cases, software may stay oblivious to the fact that it’s processing UTF-8 instead of ASCII.
4. The encoding is such that the first byte of a multi-byte sequence can be recognized as such, without having to find the start of the string.  A very important property in a world where some data may get clobbered in transmission.

With the current Unicode character set, code points may be encoded in 1 to 4 bytes, where plain US-ASCII remains encoded as just one byte [this is why UTF-8 is so friendly to old software].

Below is a little table that illustrates how various Unicode characters are encoded:

```
Hex Range                 Binary                          Encoding
-----------------------   -----------------------------   -----------------------------------
U-00000000 - U-0000007F   0000 0000 0000 0000 0xxx xxxx   0xxxxxxx
U-00000080 - U-000007FF   0000 0000 0000 0yyy yyxx xxxx   110yyyyy 10xxxxxx
U-00000800 - U-0000FFFF   0000 0000 zzzz yyyy yyxx xxxx   1110zzzz 10yyyyyy 10xxxxxx
U-00010000 - U-001FFFFF   000u uuzz zzzz yyyy yyxx xxxx   11110uuu 10zzzzzz 10yyyyyy 10xxxxxx
```

As you can see, the leading byte of a UTF-8 sequence may be recognized because its most significant bits are either “0x” or “11”.  Bytes with most significant bits equal to “10” are always followers.

Whereas most software can continue to work on text on a byte-by-byte basis, some software does need to be aware of where one character’s encoding ends and the next one begins.

This should be quite straightforward, just scan forward or backward until the leading byte of a sequence is found.

This can be done with a very simple test:

```
if((byte & 0xC0)!=0x80){ /* byte is a leading byte of a sequence */ }
```

Of course, scanning through the text this way requires touching every byte in the sequence.  To scan backwards, we really have no choice but to do it this way; but scanning forwards through UTF-8 text strings can fortunately be made a little faster.
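For the backward direction, the scan might be sketched as follows (a hypothetical helper, assuming well-formed UTF-8 and index > 0):

```c
// Step back to the start of the preceding UTF-8 sequence by skipping
// follower bytes, i.e. bytes whose top two bits are "10".
int utf8_dec(const unsigned char* string,int index){
  do{
    index--;
    }
  while(index>0 && (string[index]&0xC0)==0x80);
  return index;
  }
```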

To realize how, look carefully at the encoding of the leading byte: it can tell you how many more bytes will follow!

Thus, one way to step forward through UTF-8 text quickly would be to simply keep a table of encoding lengths:

```
const unsigned char codesize[256]={
  1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
  1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
  1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
  1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
  1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
  1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
  1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
  1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
  1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
  1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
  1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
  1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
  2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
  2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
  3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,
  4,4,4,4,4,4,4,4,5,5,5,5,6,6,1,1
  };
```

[Yes, the encoding allows for sequences of up to 6 bytes; these do not, however, represent defined Unicode characters, as only 1,114,112 code points are reserved].  To ensure a string scanner won’t get stuck, illegal leading-byte patterns have been given a length of 1, so we at least advance to the next byte.

One would use the table as follows:

```
byte=string[index];
index+=codesize[byte];
```

or simply as:

```
index+=codesize[string[index]];
```

All this seems just fine, but of course it takes a table of numbers; it’s not a large table, that’s true, but accessing it might entail some extra memory references [esp. if the table hasn’t been cached yet]. Nevertheless, this may be a pretty fast way to get through the text being scanned, as not all bytes of a character need to be visited.

There is a way to ditch the table, and quickly compute the size without using loops:

```
byte=string[index];
index+=((byte >= 0xC0)+(byte >= 0xE0)+(byte >= 0xF0) + 1u);
```

This code fragment is branch-free on x86 processors, which allow the result of a comparison to be turned into a 1 or a 0; these can then simply be added together.  On some other processors this trick may not work, however; if conditional branches are involved, the code above may not perform as well as hoped.

Back to the codesize table method.  How big does the table actually have to be, for legal Unicode code points? It turns out, the answer is 16, with each entry being only two bits.

Since we can only count to 3 with two bits, we’re short by one.  However, realize we’re always stepping at least one byte forward, so if we let the table encode only how many extra bytes to step beyond the leader byte, two bits will be enough.

With 16 entries of 2 bits each, the entire table will fit in a single 32-bit integer.  This is bread-and-butter for current processors.
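To see where the 32-bit constant comes from, one can derive it by packing, for each top nibble of a leading byte, the number of extra bytes into two bits (a derivation sketch; the function name is illustrative):

```c
#include <stdint.h>

// Pack the per-nibble "extra bytes to step" counts into a 32-bit integer,
// two bits per entry; entry i covers leading bytes with top nibble i.
uint32_t pack_utf8_table(void){
  static const int extra[16]={
    0,0,0,0,0,0,0,0,   // 0x0?-0x7?: ASCII, one byte total
    0,0,0,0,           // 0x8?-0xB?: followers; step one byte to resync
    1,1,               // 0xC?-0xD?: two-byte sequences
    2,                 // 0xE?:      three-byte sequences
    3};                // 0xF?:      four-byte sequences
  uint32_t table=0;
  int i;
  for(i=0;i<16;i++){
    table|=(uint32_t)extra[i]<<(i*2);
    }
  return table;        // packs to 0xE5000000
  }
```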

Left to work out is how we will index into this table using the top few bits of the leading UTF-8 byte.  The answer is quite simple, with some shift operations:

```
byte=string[index];
index+=((0xE5000000 >> ((byte>>4)<<1)) & 3)+1;
```

The code above works well if your processor has a fast shifter that can perform arbitrary (not compile-time determined) shifts; this is probably the case for most processors sold today (x86, ARM, etc).

It is not clear which of these methods is best; this will be highly processor-dependent.  However, I believe the last method is likely to be the fastest portable one, as only shifts, additions, and masking are used; these operations tend to be present on most modern CPUs.
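As a usage sketch, here is a little character counter built on the comparison-based step from above (the function name is hypothetical; it assumes well-formed UTF-8):

```c
// Count the characters in a UTF-8 string of n bytes by hopping from
// leading byte to leading byte.
int utf8_count(const unsigned char* s,int n){
  int index=0,count=0;
  while(index<n){
    unsigned char byte=s[index];
    index+=(byte>=0xC0)+(byte>=0xE0)+(byte>=0xF0)+1;
    count++;
    }
  return count;
  }
```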

## Making a virtue out of sin

Signal processing often requires us to compute sin(). Indeed, we often have to compute cos(), too! In fact, we often need to do a lot of them, and do them quickly too! The C Math Library of course contains perfectly fine sin() and cos() functions. However, accepting arbitrary arguments and performing lots of error checking makes them kinda slow.

The quest was on to come up with something better. One could replace sin() and cos() with a big lookup table, and interpolate between entries. This works, if you don’t mind the vast quantities of memory thus wasted, or the algorithmic complexity of reducing the argument to the right quadrants, and so on.

Fortunately, better ways are possible with small polynomials, particularly if we can limit the argument a little bit. Since sin() and cos() repeat endlessly, restricting the argument to the repetition interval seems convenient.

How accurate should it be? Well, that depends. For double precision, the least significant bit falls at about 1E-16, so an algorithm that gets down to that level should be good enough for most purposes.

For the sin() function, we get this:

```
double mysin(double x){
  const double A = -7.28638965935448382375e-18;
  const double B = 2.79164354009975374566e-15;
  const double C = -7.64479307785677023759e-13;
  const double D = 1.60588695928966278105e-10;
  const double E = -2.50521003012188316353e-08;
  const double F = 2.75573189892671884365e-06;
  const double G = -1.98412698371840334929e-04;
  const double H = 8.33333333329438515047e-03;
  const double I = -1.66666666666649732329e-01;
  const double J = 9.99999999999997848557e-01;
  register double x2 = x*x;
  return (((((((((A*x2+B)*x2+C)*x2+D)*x2+E)*x2+F)*x2+G)*x2+H)*x2+I)*x2+J)*x;
  }
```

For the cos() function, we get:

```
double mycos(double x){
  const double c1 = 3.68396216222400477886e-19;
  const double c2 = -1.55289318377801496607e-16;
  const double c3 = 4.77840439714556611532e-14;
  const double c4 = -1.14706678499029860238e-11;
  const double c5 = 2.08767534780769871595e-09;
  const double c6 = -2.75573191273279748439e-07;
  const double c7 = 2.48015873000796780048e-05;
  const double c8 = -1.38888888888779804960e-03;
  const double c9 = 4.16666666666665603386e-02;
  const double c10 = -5.00000000000000154115e-01;
  const double c11 = 1.00000000000000001607e+00;
  register double x2=x*x;
  return (((((((((c1*x2+c2)*x2+c3)*x2+c4)*x2+c5)*x2+c6)*x2+c7)*x2+c8)*x2+c9)*x2+c10)*x2+c11;
  }
```

Of course, often we want both sin() and cos(), and it is possible to get both, for the price of one, thanks to SSE2:

```
void mysincos(double x,double* psin,double* pcos){
  const __m128d c1 =_mm_set_pd( 3.68396216222400477886e-19,-7.28638965935448382375e-18);
  const __m128d c2 =_mm_set_pd(-1.55289318377801496607e-16, 2.79164354009975374566e-15);
  const __m128d c3 =_mm_set_pd( 4.77840439714556611532e-14,-7.64479307785677023759e-13);
  const __m128d c4 =_mm_set_pd(-1.14706678499029860238e-11, 1.60588695928966278105e-10);
  const __m128d c5 =_mm_set_pd( 2.08767534780769871595e-09,-2.50521003012188316353e-08);
  const __m128d c6 =_mm_set_pd(-2.75573191273279748439e-07, 2.75573189892671884365e-06);
  const __m128d c7 =_mm_set_pd( 2.48015873000796780048e-05,-1.98412698371840334929e-04);
  const __m128d c8 =_mm_set_pd(-1.38888888888779804960e-03, 8.33333333329438515047e-03);
  const __m128d c9 =_mm_set_pd( 4.16666666666665603386e-02,-1.66666666666649732329e-01);
  const __m128d c10=_mm_set_pd(-5.00000000000000154115e-01, 9.99999999999997848557e-01);
  const __m128d c11=_mm_set_pd( 1.00000000000000001607e+00, 0.00000000000000000000e+00);
  register __m128d x1x1=_mm_set1_pd(x);
  register __m128d x2x2=_mm_mul_pd(x1x1,x1x1);
  register __m128d x1x2=_mm_unpacklo_pd(x1x1,x2x2);
  register __m128d rr=c1;
  rr=_mm_add_pd(_mm_mul_pd(rr,x2x2),c2);
  rr=_mm_add_pd(_mm_mul_pd(rr,x2x2),c3);
  rr=_mm_add_pd(_mm_mul_pd(rr,x2x2),c4);
  rr=_mm_add_pd(_mm_mul_pd(rr,x2x2),c5);
  rr=_mm_add_pd(_mm_mul_pd(rr,x2x2),c6);
  rr=_mm_add_pd(_mm_mul_pd(rr,x2x2),c7);
  rr=_mm_add_pd(_mm_mul_pd(rr,x2x2),c8);
  rr=_mm_add_pd(_mm_mul_pd(rr,x2x2),c9);
  rr=_mm_add_pd(_mm_mul_pd(rr,x2x2),c10);
  rr=_mm_add_pd(_mm_mul_pd(rr,x1x2),c11);
  _mm_storeh_pd(pcos,rr);
  _mm_storel_pd(psin,rr);
  }
```

This uses Intel intrinsics, which should work on a variety of C/C++ compilers. Of course, it goes without saying that if your code is to be remotely portable, you will need to surround this code with the appropriate preprocessor incantations so as to fall back on the vanilla implementation should the processor not be an x86 CPU but something else.

So, just how close are these functions to the C Math Library versions? Well, GNUPlot to the rescue once more. Plotting the difference between the math library version and the polynomial version on the -pi to pi interval for cos() and sin(), respectively, produces plots that look blocky. This is not a bug! They’re close enough that the error is in the last few significant bits of the result, so subtracting these numbers yields values very, very close to zero.

Of course, these sin() and cos() replacements only work if the argument is in the -pi to pi interval. If you can ensure your argument is within this range, you’re done. If it isn’t, however, the following little inline will help:

```
inline double wrap(double x){
  return x-nearbyint(x*0.159154943091895335768883763373)*6.28318530717958647692528676656;
  }
```

A call to nearbyint() compiles to a single instruction on x86 CPUs, so it’s pretty fast.

The magic constants come from a nice paper on the topic: “Fast Polynomial Approximations to Sine and Cosine,” Charles K. Garrett, Feb 2012.

## Fast, Vectorizable Arc Tangent

Arctangent(x) (or atan(x)) is an often-used function to obtain an angle given a slope. Closely related is atan2(y,x), which returns the full angle.
While the library implementation is usually not horrible in terms of performance, it can still be beneficial to have a special implementation so that vectorization is possible.
I’m not going to give the vector version here, but a really fast scalar version. When inlined, chances are good your compiler may auto-vectorize it; if it doesn’t, an implementation using Intel intrinsics should be straightforward.
Thus, the version presented here may become 4x faster with SSE, 8x faster with AVX, and 16x faster with the new AVX512 ISA extension.
Regardless, the versions presented here are quite a bit faster than the library functions, even before being inlined.

The trick to obtaining high performance is to eliminate conditional branches from the code; thus, don’t expect proper error handling for NaN’s and so on. If that is important to you, stick with the library version. However, if you need speed, read on!

Let’s start with atan(x). To get good accuracy, we use the following identities to reduce the argument, apply a polynomial approximation, then reverse the effects of the argument reduction prior to delivering the final result.
First, a recap of some of the common identities:

• atan(-x) = -atan(x)
• atan(1/x) = 0.5*pi-atan(x), for x>0

The first one is easy:

```
double z=fabs(x);
```

Next, we use a nice trick to reduce the argument z down to the 0..1 range:

```
double a=fmin(z,1.0)/fmax(z,1.0);
```

This expression yields a=z if z<1, and a=1/z if z>1. It does this without branches, and as we know from previous articles, this is important as conditional branches don’t vectorize.

In the next step, we run our a through a polynomial (more about that later):

```
double s=a*a;
double p=a+a*s*(c3+s*(c5+...c39)); // More details to follow
```

Now we need to reverse the argument reduction (2nd identity), for which we recall the handy blend() function from the previous article:

```
p=blend(1.0,z,0.5*pi-p,p); // p = 1<z ? pi/2 - p : p
```

The final step is to reverse the sign removal (1st identity). We use the standard copysign() function for this:

```
result=copysign(p,x); // result = 0<x ? p : -p
```

And we’re done!

Of course, we swept something under the rug: how to get the polynomial approximation of atan(x) on the limited interval x=0…1.

The best mathematical technique is a so-called Chebyshev polynomial. These polynomials have the nice property of minimizing the maximum error between the true function and the approximation, over the entire interval. Thus, they’re a lot better than Taylor approximations, which are only accurate near a particular value for x.

However, due to numerical errors, the mathematically pure Chebyshev polynomial may not give the most accurate answer. Thankfully, some brilliant people at Inria developed a nice piece of software (http://sollya.gforge.inria.fr/) to calculate the coefficients for us, given a desired numerical precision.

For double precision, we want accuracy down to about 1E-16 (as we have only 16 digits). For single precision we should shoot for the 1E-6 or 1E-7 ballpark. Beyond that, we would include more polynomial terms but not actually improve numerical results.

So, without further ado, I present here the double precision version:

```
// Fast arctan(x) algorithm
// Error in the order of ~2.22E-16
double fastatan(double x){
  const double c3 =-3.3333333333331788272957396657147910445928573608398E-1;
  const double c5 = 1.9999999999746972956238266760919941589236259460449E-1;
  const double c7 =-1.4285714270985122587021010076568927615880966186523E-1;
  const double c9 = 1.1111110670649392007103273272150545381009578704834E-1;
  const double c11=-9.0909011195370925673131523581105284392833709716797e-2;
  const double c13= 7.6922118180920429075797528639668598771095275878906e-2;
  const double c15=-6.6658528038443493057840782967105042189359664916992e-2;
  const double c17= 5.8772701139760089028563072588440263643860816955566e-2;
  const double c19=-5.2390921524556287314222657869322574697434902191162e-2;
  const double c21= 4.6735478230248365949517364015264320187270641326904e-2;
  const double c23=-4.0917561483705074121264289033206296153366565704346e-2;
  const double c25= 3.4052860223616393531287371843063738197088241577148e-2;
  const double c27=-2.5807287359851206060001871378517535049468278884888e-2;
  const double c29= 1.6958625295544118433133107259891403373330831527710e-2;
  const double c31=-9.1701096131817233514382792236574459820985794067383e-3;
  const double c33= 3.8481862661788874928336934289063719916157424449921e-3;
  const double c35=-1.1612018409582773505878128261770143581088632345199e-3;
  const double c37= 2.2237585554124372289389044432539321860531345009804e-4;
  const double c39=-2.0191399088793571194207221441985211640712805092335e-5;
  double z=fabs(x);                              // z=|x|
  double a=fmin(z,1.0)/fmax(z,1.0);              // a = 1<z ? 1/z : z
  double s=a*a;                                  // a^2
  double p=a+a*s*(c3+s*(c5+s*(c7+s*(c9+s*(c11+s*(c13+s*(c15+s*(c17+s*(c19+s*(c21+s*(c23+s*(c25+s*(c27+s*(c29+s*(c31+s*(c33+s*(c35+s*(c37+s*c39))))))))))))))))));
  p=blend(1.0,z,1.57079632679489661923132-p,p);  // r = 1<z ? pi/2 - p : p
  return copysign(p,x);
  }
```

Since single precision has much less accuracy, we may implement that version with many fewer terms, and yet still match the library version quite closely.

Behold the single-precision version:

```
// Fast arctan(x) algorithm
// Error in the order of ~2.4E-7
float fastatanf(float x){
  const float c3 =-3.33319157361984252929687E-1f;
  const float c5 = 1.99689209461212158203125E-1f;
  const float c7 =-1.40156880021095275878906E-1f;
  const float c9 = 9.90508347749710083007812E-2f;
  const float c11=-5.93664981424808502197265E-2f;
  const float c13= 2.41728331893682479858398E-2f;
  const float c15=-4.67213569208979606628417E-3f;
  float z=fabsf(x);                                // z=|x|
  float a=fminf(z,1.0f)/fmaxf(z,1.0f);             // a = 1<z ? 1/z : z
  float s=a*a;                                     // a^2
  float p=a+a*s*(c3+s*(c5+s*(c7+s*(c9+s*(c11+s*(c13+s*c15))))));
  p=blend(1.0f,z,1.57079632679489661923132f-p,p);  // r = 1<z ? pi/2 - p : p
  return copysignf(p,x);
  }
```

So, just how fast are these functions compared to the c-library equivalents?
Keeping in mind that the measurements often influence the result, I’ve clocked the non-inlined versions of these functions at around 25 cycles (on a 4 GHz Skylake processor). The library version of atan() was clocked at around 64 cycles, while the library version of atanf() was a surprisingly good 33 cycles.
But keep in mind, with inlining and vectorization the new functions may become substantially faster still, depending on the vector length. Thus, for AVX we should be able to expect 8x speed improvement for the single-precision atanf(), or around 3.125 cycles per argument.

The other interesting function we can now implement, with very minor alterations, is atan2(y,x). It’s my favorite, since you get the full angle.

The code is very similar, but we need a bit more work to reverse the argument reduction. Fortunately, blend() can be used for all of these.

The full code for the double precision version is below:

```
// Fast arctan2(y,x) algorithm
// Error in the order of ~4.44E-16
double fastatan2(double y,double x){
  const double c3 =-3.3333333333331788272957396657147910445928573608398E-1;
  const double c5 = 1.9999999999746972956238266760919941589236259460449E-1;
  const double c7 =-1.4285714270985122587021010076568927615880966186523E-1;
  const double c9 = 1.1111110670649392007103273272150545381009578704834E-1;
  const double c11=-9.0909011195370925673131523581105284392833709716797e-2;
  const double c13= 7.6922118180920429075797528639668598771095275878906e-2;
  const double c15=-6.6658528038443493057840782967105042189359664916992e-2;
  const double c17= 5.8772701139760089028563072588440263643860816955566e-2;
  const double c19=-5.2390921524556287314222657869322574697434902191162e-2;
  const double c21= 4.6735478230248365949517364015264320187270641326904e-2;
  const double c23=-4.0917561483705074121264289033206296153366565704346e-2;
  const double c25= 3.4052860223616393531287371843063738197088241577148e-2;
  const double c27=-2.5807287359851206060001871378517535049468278884888e-2;
  const double c29= 1.6958625295544118433133107259891403373330831527710e-2;
  const double c31=-9.1701096131817233514382792236574459820985794067383e-3;
  const double c33= 3.8481862661788874928336934289063719916157424449921e-3;
  const double c35=-1.1612018409582773505878128261770143581088632345199e-3;
  const double c37= 2.2237585554124372289389044432539321860531345009804e-4;
  const double c39=-2.0191399088793571194207221441985211640712805092335e-5;
  double absx=fabs(x);                       // |x|
  double absy=fabs(y);                       // |y|
  double a=fmin(absx,absy)/fmax(absx,absy);  // a = |x|<|y| ? |x|/|y| : |y|/|x|
  double s=a*a;
  double r;
  r=a+a*s*(c3+s*(c5+s*(c7+s*(c9+s*(c11+s*(c13+s*(c15+s*(c17+s*(c19+s*(c21+s*(c23+s*(c25+s*(c27+s*(c29+s*(c31+s*(c33+s*(c35+s*(c37+s*c39))))))))))))))))));
  r=blend(absy,absx,r,1.57079632679489661923132-r);
  r=blend(0.0,x,r,3.1415926535897932384626433832795028841971693993751-r);
  r=blend(0.0,y,r,-r);
  return r;
  }
```

The single-precision version atan2f(y,x) is almost identical, except for the number of terms in the polynomial:

```
// Fast arctan2f(y,x) algorithm
// Error in the order of ~4.8E-7
float fastatan2f(float y,float x){
  const float c3 =-3.33319157361984252929687E-1f;
  const float c5 = 1.99689209461212158203125E-1f;
  const float c7 =-1.40156880021095275878906E-1f;
  const float c9 = 9.90508347749710083007812E-2f;
  const float c11=-5.93664981424808502197265E-2f;
  const float c13= 2.41728331893682479858398E-2f;
  const float c15=-4.67213569208979606628417E-3f;
  float absx=fabsf(x);                          // |x|
  float absy=fabsf(y);                          // |y|
  float a=fminf(absx,absy)/fmaxf(absx,absy);    // a = |x|<|y| ? |x|/|y| : |y|/|x|
  float s=a*a;
  float r;
  r=a+a*s*(c3+s*(c5+s*(c7+s*(c9+s*(c11+s*(c13+s*c15))))));
  r=blend(absy,absx,r,1.57079632679489661923132f-r);
  r=blend(0.0f,x,r,3.1415926535897932384626433832795028841971693993751f-r);
  r=blend(0.0f,y,r,-r);
  return r;
  }
```

The new atan2() fares very well against the standard library implementation: 25 cycles vs 138 cycles. The single-precision version clocks in at 26 cycles vs 71 cycles for the standard library. All with the non-inlined versions.

For those for which these functions are still not fast enough, I have a couple of suggestions:

• Use inlined versions of these functions.
• Make sure your target architecture’s instruction-set extensions include SSE4.1 (for blend) or, better yet, AVX and AVX2. If you’re blessed with a brand-new AVX512-capable machine, make sure its ISA extension is enabled when you compile.
• Rearranging the polynomial evaluation using e.g. Estrin’s scheme may yield additional improvements.

The error w.r.t. the standard C library is very minimal, but keep in mind the code above does not handle floating-point oddities like NaNs and infinities.

Just feed them nice numbers and you should be OK.

## Branch-Free Blend()

Often, we want to evaluate simple expressions like:

```
result=(a<b) ? x : y;
```

Clearly, this implies performing a conditional branch. This is not a huge problem, unless the code fragment absolutely needs to be as fast as possible.
A branch has several performance-impacting problems:

• Due to the uncertain outcome of the condition, the processor will often mispredict the path through the code.
• Automatic vectorization will be difficult, as different vector columns will have different condition outcomes.

If we descend into the Intel assembly realm, we find that a comparison of floating point numbers leaves an all-zeros or all-ones mask in the SSE or AVX register. Thus, we can make this code branch-free by combining the comparison with some logical masking:

```
__m128d result,c,a,b,x,y;
c=_mm_cmplt_sd(a,b);
result=_mm_or_pd(_mm_and_pd(c,x),_mm_andnot_pd(c,y));
```

Later Intel processors supporting SSE4.1 include a new _mm_blendv_pd() operation, which combines these three instructions into a single one:

```
__m128d result,c,a,b,x,y;
c=_mm_cmplt_sd(a,b);
result=_mm_blendv_pd(y,x,c);
```

To make this a little bit more palatable for the casual user who isn’t too familiar with Intel intrinsic programming, I’ve wrapped these into some more accessible form:

```
// Evaluate branch-free (a < b) ? x : y, if supported on processor
static inline double blend(double a,double b,double x,double y){
#if defined(__SSE4_1__)
  return _mm_cvtsd_f64(_mm_blendv_pd(_mm_set_sd(y),_mm_set_sd(x),_mm_cmplt_sd(_mm_set_sd(a),_mm_set_sd(b))));
#elif defined(__SSE2__)
  __m128d cc=_mm_cmplt_sd(_mm_set_sd(a),_mm_set_sd(b));
  return _mm_cvtsd_f64(_mm_or_pd(_mm_and_pd(cc,_mm_set_sd(x)),_mm_andnot_pd(cc,_mm_set_sd(y))));
#else
  return a<b ? x : y;
#endif
  }
```

And an equivalent version for single precision math:

```
// Evaluate branch-free (a < b) ? x : y, if supported on processor
static inline float blend(float a,float b,float x,float y){
#if defined(__SSE4_1__)
  return _mm_cvtss_f32(_mm_blendv_ps(_mm_set_ss(y),_mm_set_ss(x),_mm_cmplt_ps(_mm_set_ss(a),_mm_set_ss(b))));
#elif defined(__SSE__)
  __m128 cc=_mm_cmplt_ss(_mm_set_ss(a),_mm_set_ss(b));
  return _mm_cvtss_f32(_mm_or_ps(_mm_and_ps(cc,_mm_set_ss(x)),_mm_andnot_ps(cc,_mm_set_ss(y))));
#else
  return a<b ? x : y;
#endif
  }
```

When compiled with optimization, these little routines perform very well indeed. In addition, GCC is clever enough to auto-vectorize when these routines are in a loop.
Of course, for this to work, make sure these functions are declared in a header file.
On machines where _mm_blendv_pd() and _mm_blendv_ps() are available, the resulting output is only two instructions: the compare and the blend.
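As a usage sketch, here is how blend() replaces a ternary inside a loop; the scalar fallback is reproduced so the sketch stands alone, and the intrinsic versions above behave identically:

```c
// Scalar form of the branch-free blend: (a < b) ? x : y.
static inline double blend(double a,double b,double x,double y){
  return a<b ? x : y;
  }

// Clamp negative entries of an array to zero; with the intrinsic blend()
// declared in a header, GCC can auto-vectorize loops of this shape.
void clamp_nonneg(double* v,int n){
  int i;
  for(i=0;i<n;i++){
    v[i]=blend(v[i],0.0,0.0,v[i]);   // v[i]<0 ? 0.0 : v[i]
    }
  }
```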

## Masked Loads and Stores

AVX brought with it a nice little trick that was not available before: masked loads and stores.
In this note we’ll go a little deeper into the use of masked loads and stores, and how they can greatly help in handling left-overs after vector loops, as well as in dealing with data structures that are simply not a whole multiple of the natural vector size.

```
// Define a mask for double precision 3d-vector
#define MMM _mm256_set_epi64x(0,~0,~0,~0)

// We want to do c=a+b vector-sum
FXVec3d a,b,c;

// Set a and b somehow

// Use AVX intrinsics
_mm256_maskstore_pd(&c,MMM,_mm256_add_pd(_mm256_maskload_pd(&a,MMM),
                                         _mm256_maskload_pd(&b,MMM)));
```

This was pretty easy, right? Note that Intel defined these masked loads and stores in such a way that masked-off locations are not touched; i.e. they’re not simply loaded and written back with the original values, they are never accessed at all. This is important, as you don’t want to incur segmentation violations when your vector happens to be the last thing in a memory page!

Next, we move on to a slightly more sophisticated use: the wrap-up of left-overs of a vector loop. Note that with the masked load and store, we can typically perform the last operation in vector mode as well; we don’t have to resort to plain scalar code like we had to with SSE.

```
void add_vectors(float* result,const float* src1,const float* src2,int n){
  register __m256i mmmm;
  register __m256 aaaa,bbbb,rrrr;
  register int i=0;

  // Vector loop adds 8 pairs of floats at a time
  while(i<n-8){
    aaaa=_mm256_loadu_ps(&src1[i]);
    bbbb=_mm256_loadu_ps(&src2[i]);
    rrrr=_mm256_add_ps(aaaa,bbbb);
    _mm256_storeu_ps(&result[i],rrrr);
    i+=8;
    }

  // Load the mask at index n-i; this should be in the range 0...8.
  mmmm=_mm256_castps_si256(_mm256_load_ps((const float*)mask8i[n-i]));

  // Use masked loads
  aaaa=_mm256_maskload_ps(&src1[i],mmmm);
  bbbb=_mm256_maskload_ps(&src2[i],mmmm);

  // Same vector operation as main loop
  rrrr=_mm256_add_ps(aaaa,bbbb);

  // Use masked store
  _mm256_maskstore_ps(&result[i],mmmm,rrrr);
  }
```

Note that the loop deliberately stops one vector short if n is a multiple of eight: since the mop-up code is executed unconditionally, we'd rather do this with actual payload, not with all data masked out.
Also note that we don’t have a special case for n==0. In the rare case that this happens, we will just execute the mop-up code with an all-zeroes mask!

All that's left to do is to build a little array with mask values; due to the observation above, it will have 9 entries, not 8!

```
static __align(32) const int mask8i[9][8]={
  { 0, 0, 0, 0, 0, 0, 0, 0},
  {-1, 0, 0, 0, 0, 0, 0, 0},
  {-1,-1, 0, 0, 0, 0, 0, 0},
  {-1,-1,-1, 0, 0, 0, 0, 0},
  {-1,-1,-1,-1, 0, 0, 0, 0},
  {-1,-1,-1,-1,-1, 0, 0, 0},
  {-1,-1,-1,-1,-1,-1, 0, 0},
  {-1,-1,-1,-1,-1,-1,-1, 0},
  {-1,-1,-1,-1,-1,-1,-1,-1}
  };
```

In the above, the __align() macro is fleshed out differently depending on your compiler; however, it should ensure that the array is aligned to a multiple of 32 bytes (the size of an AVX vector).
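As a sketch, the __align() macro could be fleshed out as follows; the exact spellings vary by compiler and version, so treat this as an illustration rather than a definitive portability layer:

```c
#include <stdint.h>

// Possible definition of the __align() macro; adjust to your compiler.
#if defined(_MSC_VER)
#define __align(n) __declspec(align(n))
#else
#define __align(n) __attribute__((aligned(n)))
#endif

// Example: a 32-byte aligned mask row, suitable for _mm256_load_ps()
static __align(32) const int row[8]={-1,-1,-1,-1,0,0,0,0};
```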

Bottom line: the new AVX masked loads and stores solve a real problem that was always a bit awkward to handle before: they allow the mop-up code to be vectorized just like the main loop, which is important if you want the last couple of numbers to get “the same treatment” as the rest of them.

## FUN With AVX

Over the last few years, Intel and AMD have added 256-bit vector-support to their processors. The support for these wider vectors is commonly known as AVX (Advanced Vector Extensions).

Since wider vectors also introduce more processor state, it's not enough to have a CPU capable of AVX; your operating system and compiler need to be aware of it as well.

For maximum portability, I recommend using the Intel intrinsics. These are supported by GCC, LLVM, as well as late-model Microsoft and Intel compilers. The advantages of using intrinsics are:

1. They're easier for the developer to work with, since you can embed them in your regular C/C++ code.
2. The “smarts” of the compiler regarding register-assignments, common subexpression elimination, and other data-flow analysis goodies are at your disposal.
3. With the appropriate target-architecture setting, the compiler will automatically use the new VEX instruction encoding, even for code originally written with SSE in mind.

The matrix classes in FOX were originally vectorized for SSE (actually, SSE2/SSE3). Compiling with -mavx will automatically kick in the new VEX encoding for these same SSE intrinsics.  This is nice because AVX supports three-operand instructions (of the form A = B op C) rather than the old two-operand instructions (A = A op B).  This means you can typically make do with fewer registers, and quite possibly eliminate useless mov instructions. Your code will be correspondingly smaller and faster, with no work at all!

However, this of course does not fully exploit the new goodies AVX brings to the table.  The most obvious benefit is the wider vectors, which of course means you can work with twice as much data as before.

For example, given 4×4 double precision matrices a and b, we can now add them into result r like:

```
// a, b, and r are double[4][4] matrices
__m256d aaaa,bbbb,rrrr;

aaaa=_mm256_loadu_pd(&a[0][0]);
bbbb=_mm256_loadu_pd(&b[0][0]);
rrrr=_mm256_add_pd(aaaa,bbbb);
_mm256_storeu_pd(&r[0][0],rrrr);

aaaa=_mm256_loadu_pd(&a[1][0]);
bbbb=_mm256_loadu_pd(&b[1][0]);
rrrr=_mm256_add_pd(aaaa,bbbb);
_mm256_storeu_pd(&r[1][0],rrrr);

aaaa=_mm256_loadu_pd(&a[2][0]);
bbbb=_mm256_loadu_pd(&b[2][0]);
rrrr=_mm256_add_pd(aaaa,bbbb);
_mm256_storeu_pd(&r[2][0],rrrr);

aaaa=_mm256_loadu_pd(&a[3][0]);
bbbb=_mm256_loadu_pd(&b[3][0]);
rrrr=_mm256_add_pd(aaaa,bbbb);
_mm256_storeu_pd(&r[3][0],rrrr);
```

This little code fragment performs 16 double precision adds in just 4 vector instructions!
Note that unlike the old SSE code, you now declare vector variables as __m256 (float), __m256i (integer), or __m256d (double).
The penalty for accessing unaligned memory addresses is much less for AVX, and thus we can use unaligned loads and stores, at a very modest (and usually not measurable) speed penalty. If you really want to go all out, however, remember to align things to 32 bytes now, not 16 bytes like you did for SSE!
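If you do want 32-byte aligned storage, one way is C11's aligned_alloc (on Windows, _aligned_malloc plays a similar role); note that C11 requires the requested size to be a multiple of the alignment. A minimal sketch:

```c
#include <stdlib.h>

// Allocate a 4x4 double matrix aligned to 32 bytes.
// 16 doubles = 128 bytes, which is a multiple of the 32-byte alignment.
double* alloc_matrix4x4(void){
  return aligned_alloc(32,16*sizeof(double));
  }
```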

For FOX’s matrix classes, compatibility with existing end-user code requires that variables can not be relied upon to be aligned, and thus unaligned accesses are used throughout. This obviates the need for end-user code to be updated for alignment restrictions.

Many of the usual suspects from SSE have extended equivalents in the AVX world: _mm_add_ps() becomes _mm256_add_ps(), _mm_addsub_ps() becomes _mm256_addsub_ps(), and so on.

Detection of AVX on your CPU.

Detection of AVX can be done using the CPUID instruction. However, unlike SSE3 and SSE4x, the introduction of AVX not only added new instructions to the processor, it also added new state, due to the wider vector registers. Consequently, just knowing that your CPU can do AVX isn't enough.

You also need the operating system to support the extension, because the extra state in the processor must be saved and restored when the Operating System preempts your process. Consequently, executing AVX instructions on an Operating System which does not support it will likely result in an “Illegal Instruction” exception.  To put it bluntly, your program will core-dump!

Fortunately, Operating System support for AVX is now also available through CPUID. There are three steps involved:

1. Check AVX support, using CPUID function code #1.  The ECX and EDX registers are used to return a number of feature bits, various extensions to the core x86 instruction set.  The one we’re looking for in this case is ECX bit #28. If on, we’ve got AVX in the hardware.
2. Next, Intel recommends checking ECX bit #27. This feature bit represents the OSXSAVE feature. XSAVE is basically a faster way to save processor state; if not supported by the O.S. then AVX is likely not available.
3. Finally, a new register is available in the CPU indicating the Operating System has enabled state-saving of the full AVX state. Just like the processor tick counter, this register can be read using a new magic instruction: XGETBV. XGETBV populates the EDX:EAX register pair with feature flags indicating which processor state the Operating System is aware of. At this time, x86 processors support three processor-state subsets: x87 FPU state, SSE state, and AVX state.  This information is represented by three bits in the EAX register.  For AVX, bit #2 indicates the Operating System indeed saves AVX state and has enabled AVX instructions to be available.
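The three steps above can be sketched as follows; this assumes GCC/Clang on x86 (using their <cpuid.h> helper and a tiny inline-asm XGETBV), so consider it illustrative rather than a portable implementation:

```c
#include <cpuid.h>

// Read XCR0 via XGETBV; only legal after the OSXSAVE check below.
static unsigned long long xgetbv0(void){
  unsigned int lo,hi;
  __asm__ __volatile__("xgetbv" : "=a"(lo),"=d"(hi) : "c"(0));
  return ((unsigned long long)hi<<32)|lo;
  }

// Returns 1 if AVX is usable: supported by both the CPU and the OS.
int have_avx(void){
  unsigned int eax,ebx,ecx,edx;
  if(!__get_cpuid(1,&eax,&ebx,&ecx,&edx)) return 0;
  if(!(ecx&(1u<<28))) return 0;       // Step 1: CPU supports AVX
  if(!(ecx&(1u<<27))) return 0;       // Step 2: OSXSAVE enabled
  return (xgetbv0()&0x6)==0x6;        // Step 3: OS saves SSE+AVX state
  }
```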

All this sounds pretty complicated, unless you're an assembly-language programmer.  So, to make life a bit easier, the FOX CPUID APIs have been updated to do some of this hard work for you.

To perform simple feature tests, use the new fxCPUFeatures() API. It returns bit-flags for most instruction-sets added on top of plain x86. In the case of AVX, it simply disables AVX, AVX2, FMA, XOP, and FMA4 if the operating system does not support the extended state.

More on AVX in subsequent posts.

## Fast Power of Two Test

A quick and nifty check to see if a number is a power of two is to note that subtracting one from a power of two yields a number whose bits are all ones below the original (single) set bit. For example:

```
  1000
-    1
------
  0111
```

Note that the resulting number has no bits in common with the original number; this property holds only for powers of two. For any other number, some bits are left standing.
Thus, we can test for power-of-two-ness by:

```
inline bool ispowof2(int x){
  return (x&(x-1))==0;    // Note: also returns true for x==0
  }
```

Which is probably the fastest possible way to do this.

## The Case for Negative Semaphore Counts

The new FXParallel feature being worked on in FOX has an atomic variable for tracking completed tasks of a task-group.  This works roughly as follows:

```
if(atomicAdd(&counter,-1)==1){
  completion.post();
  }
```

In other words, when the specified number of tasks has completed, the counter reaches zero (atomicAdd() returns the previous value of the variable) and then we ping the semaphore.

Of course, behind the scenes, a semaphore is maintaining an atomic counter already! Maintaining our own atomic variable therefore seems redundant.  Wouldn't it be great if we could just initialize a semaphore to a negative value, post() it until it reaches 1, and thereby release the waiting thread from its call to wait()?

Sadly, you can’t initialize semaphores to a negative value.  But it would be great if you could, and (at least on Linux) it could be done with minimal effort. In fact, simply changing the semaphore variable to signed int and allowing it to be initialized to a negative value is really all it takes.
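Lacking negative-count semaphores, the completion pattern described above can be sketched with C11 atomics and a POSIX semaphore; the names counter, completion, and task_* here are illustrative, not FOX's actual API:

```c
#include <stdatomic.h>
#include <semaphore.h>

static atomic_int counter;
static sem_t completion;

// Prepare to wait for ntasks completions
void task_group_init(int ntasks){
  atomic_store(&counter,ntasks);
  sem_init(&completion,0,0);
  }

// Each finishing task calls this; the last one pings the semaphore.
// atomic_fetch_sub returns the previous value, like atomicAdd(&counter,-1).
void task_done(void){
  if(atomic_fetch_sub(&counter,1)==1){
    sem_post(&completion);
    }
  }

// Block until all tasks have completed
void task_group_wait(void){
  sem_wait(&completion);
  }
```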

## Dell XPS15 Surgery

So I got this Dell XPS15 laptop, and decided to bump up the specs a bit.  A quick visit to NewEgg and I received a new 1TB laptop drive, two RAM sticks, and a nice 120GB SSD drive; the latter was especially nice since I needed some serious space to host both Fedora Linux and Windows 7 on this.

Now, this XPS15 is a great little machine, but service friendly it is not! Whereas my old clunker laptop had nice access panels for memory, drive, and batteries, the XPS15 needs deep surgery to get at these things.

Fortunately, a service manual exists that details the steps (and detailed they are!).  First, to get at the gizzards you will need a mini Torx #T5 screwdriver.  This is of course not something the average Joe has lying around, fortunately there are a few places that sell these.

After the Torx screwdrivers arrived, I managed to pry the bottom off. Along the way, I discovered a secret button on the bottom, hidden under the rubber sheet, that apparently exists to diagnose battery charge levels. You'll find this button under the front-left corner.

Once the bottom was off, another surprise awaited: in order to get to the second DIMM slot, you will need to remove the motherboard! And the motherboard won’t come off without removing the fan, the processor heat-pipe, the battery, the hard drive, the SSD, and detaching the miniscule connectors that hold all this together.

If this were a car, it would be French!

At any rate, once I got this far, of course I was committed! So off the system fan came, and the heat-pipe came, and the battery, and the hard drive, and the SSD, and all the miniscule connectors.

Finally, the hidden DIMM slot revealed itself, and the RAM was put in.  Now the challenge was to put humpty-dumpty together again. Needless to say, I was especially concerned about reinstalling the processor heat-sink. Some solvent was brought into action to clean off the remaining heat-conducting compound, and a fresh dollop of silver heat paste was applied.

The remaining dominoes quickly fell into place, except for one mysteriously missing screw near the middle of the case.

Typical!  Usually, I have parts left over 😉

After a few fretful moments, it dawned on me that one of the case screws for the bottom doubled as fastener for the motherboard itself! After a few more sweaty moments the bottom was back on, and the moment of truth was upon me.

Press the power button…. And it fired right up!

Moral of the story: modern hardware is not really designed for upgradability. So, unless you have nerves of steel, just buy a loaded version right off the bat and don't bother upgrading it!
