# Fast, Vectorizable Arc Tangent

Arctangent(x) (or atan(x)) is an often-used function to obtain the angle given the slope. Closely related is atan2(y,x) which usually returns the full angle.
While the library implementation is usually not horrible in terms of performance, it can still be beneficial to have a special implementation so vectorization is possible.
I’m not going to give the vector version here, but a really fast scalar version. When inlined, chances are good your compiler may auto-vectorize them. However, implementation using Intel intrinsics should be straighforward if your compiler doesn’t auto-vectorize.
Thus, the version presented here may become 4x faster with SSE, 8x faster with AVX, and 16x faster with the new AVX512 ISA extension.
Regardless, the versions presented here are quite a bit faster than the library functions, even before being inlined.

The trick to obtaining high performance is to eliminate conditional branches from the code; thus, don’t expect proper error handling for NaN’s and so on. If that is important to you, stick with the library version. However, if you need speed, read on!

Lets start with atan(x). To get good accuracy, we use the following identities to reduce the argument, use a polynomial approximation, then reverse the effects of the argument reduction prior to delivering the final result.
First, a recap of some of the common identities:

• atan(-x) = -atan(x)
• atan(1/x) = 0.5*pi-atan(x), for x>0

The first one is easy:

```1 double z=fabs(x)```

Next, we use a nice trick to reduce the argument z down to the 0..1 range:

```2 double a=fmin(z,1.0)/fmax(z,1.0);```

This expression yields a=z if z<1, and a=1/z if z>1. It does this without branches, and as we know from previous articles, this is important as conditional branches don’t vectorize.

The next step, we run our a through a polynomial (more about that later).

```3 4 double s=a*a; double p=a+a*s*(c3+s*(c5+...c39)); // More details to follow```

Now we need to reverse the argument reduction (2nd identity), for which we recall the handy blend() function from the previous article:

```5 p=blend(1.0,z,0.5*pi-p,p); // p = 1<z ? pi/2 - p : p```

The final step is to reverse the sign removal (1st identity). We use the standard copysign() function for this:

```6 result=copysign(p,x); // result = 0<x ? p : -p```

And we’re done!

Of course, we swept something under the rug, how to get the polynomial approximation for atan(x) on the limited interval for x=0…1.

The best mathematical technique is a so-called Chebyshev polynomial. These polynomials have the nice property of minimizing the maximum error between the true function and the approximation, over the entire interval. Thus, they’re a lot better than Taylor approximations, which are only accurate near a particular value for x.

However, due to numerical errors, the mathematically pure Chebyshev polynomial may not give the most accurate answer. Thankfully, some brilliant people at Inria developed a nice piece of software (http://sollya.gforge.inria.fr/) to calculate the coefficients for us, given a desired numerical precision.

For double precision, we want accuracy down to about 1E-16 (as we have only 16 digits). For single precision we should shoot for the 1E-6 or 1E-7 ballpark. Beyond that, we would include more polynomial terms but not actually improve numerical results.

So, without further ado, I present here the double precision version:

```1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 // Fast arctan(x) algorithm // Error in the order of ~2.22E-16 double fastatan(double x){ const double c3 =-3.3333333333331788272957396657147910445928573608398E-1; const double c5 = 1.9999999999746972956238266760919941589236259460449E-1; const double c7 =-1.4285714270985122587021010076568927615880966186523E-1; const double c9 = 1.1111110670649392007103273272150545381009578704834E-1; const double c11=-9.0909011195370925673131523581105284392833709716797e-2; const double c13= 7.6922118180920429075797528639668598771095275878906e-2; const double c15=-6.6658528038443493057840782967105042189359664916992e-2; const double c17= 5.8772701139760089028563072588440263643860816955566e-2; const double c19=-5.2390921524556287314222657869322574697434902191162e-2; const double c21= 4.6735478230248365949517364015264320187270641326904e-2; const double c23=-4.0917561483705074121264289033206296153366565704346e-2; const double c25= 3.4052860223616393531287371843063738197088241577148e-2; const double c27=-2.5807287359851206060001871378517535049468278884888e-2; const double c29= 1.6958625295544118433133107259891403373330831527710e-2; const double c31=-9.1701096131817233514382792236574459820985794067383e-3; const double c33= 3.8481862661788874928336934289063719916157424449921e-3; const double c35=-1.1612018409582773505878128261770143581088632345199e-3; const double c37= 2.2237585554124372289389044432539321860531345009804e-4; const double c39=-2.0191399088793571194207221441985211640712805092335e-5; double z=fabs(x); // z=|x| double a=fmin(z,1.0)/fmax(z,1.0); // a = 1<z ? 1/z : z double s=a*a; // a^2 double p=a+a*s*(c3+s*(c5+s*(c7+s*(c9+s*(c11+s*(c13+s*(c15+s*(c17+s*(c19+s*(c21+s*(c23+s*(c25+s*(c27+s*(c29+s*(c31+s*(c33+s*(c35+s*(c37+s*c39)))))))))))))))))); p=blend(1.0,z,1.57079632679489661923132-p,p); // r = 1<z ? pi/2 - p : p return copysign(p,x); }```

Since single precision has much less accuracy, we may implement the that version with many fewer terms, and yet still match the library version quite closely.

Behold the single-precision version:

```1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 // Fast arctan(x) algorithm // Error in the order of ~2.4E-7 float fastatanf(float x){ const float c3 =-3.33319157361984252929687E-1f; const float c5 = 1.99689209461212158203125E-1f; const float c7 =-1.40156880021095275878906E-1f; const float c9 = 9.90508347749710083007812E-2f; const float c11=-5.93664981424808502197265E-2f; const float c13= 2.41728331893682479858398E-2f; const float c15=-4.67213569208979606628417E-3f; float z=fabsf(x); // z=|x| float a=fminf(z,1.0f)/fmaxf(z,1.0f); // a = 1<z ? 1/z : z float s=a*a; // a^2 float p=a+a*s*(c3+s*(c5+s*(c7+s*(c9+s*(c11+s*(c13+s*c15)))))); p=blend(1.0f,z,1.57079632679489661923132f-p,p); // r = 1<z ? pi/2 - p : p return copysignf(p,x); }```

So, just how fast are these functions compared to the c-library equivalents?
Keeping in mind that the measurements often influence the result, I’ve clocked the non-inlined version of these functions at around 25 cycles (on 4 GHz Skylake processor). The library version of atan() was clocked around 64 cycles, while the library version of atanf() was a surprisingly good 33 cycles.
But keep in mind, with inlining and vectorization the new functions may become substantially faster still, depending on the vector length. Thus, for AVX we should be able to expect 8x speed improvement for the single-precision atanf(), or around 3.125 cycles per argument.

The other interesting function we can now implement, with very minor alterations, is atan2(y,x). Its my favorite since you get the full angle.

The code is very similar, but we need a bit more work to reverse the argument reduction. Fortunately, blend() can be used for all of these.

The full code for the double precision version is below:

```1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 // Fast arctan(x) algorithm // Error in the order of ~4.44E-16 double fastatan2(double y,double x){ const double c3 =-3.3333333333331788272957396657147910445928573608398E-1; const double c5 = 1.9999999999746972956238266760919941589236259460449E-1; const double c7 =-1.4285714270985122587021010076568927615880966186523E-1; const double c9 = 1.1111110670649392007103273272150545381009578704834E-1; const double c11=-9.0909011195370925673131523581105284392833709716797e-2; const double c13= 7.6922118180920429075797528639668598771095275878906e-2; const double c15=-6.6658528038443493057840782967105042189359664916992e-2; const double c17= 5.8772701139760089028563072588440263643860816955566e-2; const double c19=-5.2390921524556287314222657869322574697434902191162e-2; const double c21= 4.6735478230248365949517364015264320187270641326904e-2; const double c23=-4.0917561483705074121264289033206296153366565704346e-2; const double c25= 3.4052860223616393531287371843063738197088241577148e-2; const double c27=-2.5807287359851206060001871378517535049468278884888e-2; const double c29= 1.6958625295544118433133107259891403373330831527710e-2; const double c31=-9.1701096131817233514382792236574459820985794067383e-3; const double c33= 3.8481862661788874928336934289063719916157424449921e-3; const double c35=-1.1612018409582773505878128261770143581088632345199e-3; const double c37= 2.2237585554124372289389044432539321860531345009804e-4; const double c39=-2.0191399088793571194207221441985211640712805092335e-5; double absx=fabs(x); // |x| double absy=fabs(y); // |y| double a=fmin(absx,absy)/fmax(absx,absy); // a = |x|<|y| ? |x|/|y| : |y|/|x| double s=a*a; double r; r=a+a*s*(c3+s*(c5+s*(c7+s*(c9+s*(c11+s*(c13+s*(c15+s*(c17+s*(c19+s*(c21+s*(c23+s*(c25+s*(c27+s*(c29+s*(c31+s*(c33+s*(c35+s*(c37+s*c39)))))))))))))))))); r=blend(absy,absx,r,1.57079632679489661923132-r); r=blend(0.0,x,r,3.1415926535897932384626433832795028841971693993751-r); r=blend(0.0,y,r,-r); return r; }```

The single-precision version atan2f(y,x) is almost identical, except for the number of terms in the polynomial:

```1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 // Fast arctan(x) algorithm // Error in the order of ~4.8E-7 float fastatan2f(float y,float x){ const float c3 =-3.33319157361984252929687E-1f; const float c5 = 1.99689209461212158203125E-1f; const float c7 =-1.40156880021095275878906E-1f; const float c9 = 9.90508347749710083007812E-2f; const float c11=-5.93664981424808502197265E-2f; const float c13= 2.41728331893682479858398E-2f; const float c15=-4.67213569208979606628417E-3f; float absx=fabsf(x); // |x| float absy=fabsf(y); // |y| float a=fminf(absx,absy)/fmaxf(absx,absy); // a = |x|<|y| ? |x|/|y| : |y|/|x| float s=a*a; float r; r=a+a*s*(c3+s*(c5+s*(c7+s*(c9+s*(c11+s*(c13+s*c15)))))); r=blend(absy,absx,r,1.57079632679489661923132f-r); r=blend(0.0f,x,r,3.1415926535897932384626433832795028841971693993751f-r); r=blend(0.0f,y,r,-r); return r; }```

The new atan2() fares very well against the standard library implementation: 25 cycles vs 138 cycles. The single-precision version clocks in at 26 cycles vs 71 cycles for the standard library. All with the non-inlined versions.

For those for which these functions are still not fast enough, I have a couple of suggestions:

• Use inlined versions of these functions.
• Make sure your target architecture model instruction-set extensions include SSE4.1 (for blend) or better yet, AVX and AVX2. If you’re blessed with a brand-new AVX512 capable machine, make sure its ISA extension is enabled when you compile.
• Rearranging polynomial evaluation using e.g. Estrin scheme may yield additional improvements.

The error w.r.t. the standard c-library is very minial, but keep in mind the code above does not handle floating point oddities like NaN’s and Infinity and such.

Just feed them nice numbers and you should be OK.

This entry was posted in FOX. Bookmark the permalink.