# Hidden Pitfalls of the Espresso Square Root Functions

When writing physics or collision code, it is natural to use the GHS `sqrt` or `sqrtf` functions to calculate a square root. Unfortunately, the Espresso CPU does not have a full precision square root function built into its instruction set. Instead, these functions are software emulated, which causes a large and unexpected loss in performance when called. Fortunately, the Espresso CPU does have an answer to this problem: the `frsqrte` and `fres` instructions and their intrinsic functions.

## The Math Estimate Instructions

### frsqrte Instruction Overview

The Floating-point Reciprocal Square Root Estimate (`frsqrte`) instruction takes a double-precision value and calculates a close approximation to the reciprocal of the square root of the floating-point value. The accuracy of the estimate is guaranteed to be within 1 part in 4096 of the reciprocal of the true square root value1. This quickly provides a close approximation of the reciprocal of the requested answer, but we still need it reciprocated for use as a square root value, which leads to the second instruction.

### fres Instruction Overview

The Floating-point Reciprocal Estimate (`fres`) instruction calculates the reciprocal of a single precision value. The accuracy of the calculation is guaranteed to be within 1 part in 4096 of the true value2. When paired with the square root estimate (`frsqrte`), the final value is a closely approximated number with a combined error of approximately 1 part in 2000 — an accuracy of a little more than 3 decimal places within the mantissa.

### Using the Math Estimate Instruction

The easiest way to access the math estimate instructions is through the intrinsic functions `__FRSQRTE` and `__FRES`. The square root estimate function uses a double parameter (double-precision floating-point number) while the reciprocal function uses a single-precision floating-point number. Both functions return a single-precision floating-point number. They both map closely to the CPU’s internal instructions, so switching to full inline assembler is not needed to use their full potential. Below is a sample vector magnitude function that demonstrates one possible use of the estimate functions:

```float magnitude (float x, float y, float z)
{
float magnitudeSquared = (x*x) + (y*y) + (z*z);
return __FRES( __FRSQRTE( magnitudeSquared ) );
}
```

The corresponding assembler output of the above function:

```magnitude(float, float, float):
fmuls f13, f2, f2
fmadds f0, f1, f1, f13
fmadds f13, f3, f3, f0
frsqrte f0, f13
frsp f13, f0
fres f1, f13
blr
```

Of interest, is the extra `frsp` instruction that is inserted after the `frsqrte` instruction. This is the Floating-Round to Single-Precision instruction. The `frsqrte` instruction works internally with double-precision values, but is guaranteed to return a single-precision value if one is passed into it3. This instruction is needed because the current intrinsic is accepting a double parameter, but even with this additional instruction, the overall execution is much more efficient than the software emulated version. To get around this added instruction, the GHS compiler version 5.3.19 includes a `__FRSQRTEF` function.

There are a couple of disadvantages to using this method beyond loss of precision. When computing the square root of 0, the above code returns `NAN` instead of the expected value of zero. This is because the estimate functions work off reciprocal values. As far as the estimate operation is concerned, 1 / 0 = NAN. When this number is fed into the reciprocal estimate function, NAN is returned again. If there is any chance that your data includes a zero, ensure that you check for this initial condition when you validate that the values are non-negative. One final detail to consider is that the estimate functions do not set the `errno` value that the C standard requires.

### PPC Math

While researching how the compiler handles square root functions, I came across the following in the GHS compiler `include/ppc` directory: `ppc_math.h`. This header file unlinks the standard library's `sqrt` and `sqrtf` functions and replaces them with Espresso chip optimized versions. It uses the math estimate instructions as a first estimate into a Newton's Square Root Approximation Method. Because this first guess is so accurate, the single-precision version needs only 3 loops to arrive at a final value, and the double-precision version loops a total of 5 times. These functions run much faster than the standard `sqrt` and `sqrtf` functions. I have included them in the data gathered below. There is also a built-in check for zero that prevents the NAN error that occurs with the bare estimate functions discussed above. If the input range is guaranteed not to include a zero, this check may be removed by defining `__GHS_NO_FAST_SQRT_ZERO_CHECK` before including the `math_ppc.h` file.

### Paired Singles

While these instructions are not compatible with paired-single math "as is", there are two instructions which accomplish approximately the same results: `ps_rsqrte` and `ps_res`. Both of these instructions have nearly the same accuracy error bounds and runtime performance as their single floating-point counterparts, though `ps_rsqrte` takes a single-precision float as its input instead of a double4. These two instructions should be used in situations where multiple square root values can be calculated in parallel because they increase overall performance.

## Test Case

To understand how much speed may be gained by using the math estimate functions, I set up each of the following test cases in `for` loops with 10,000 iterations. Each iteration of the `for` loop obtains a number from a large table of floating-point values, runs a square root method, and then stores the value in a global register. The following 8 test cases were set up and run once with the "`-Osize`" and once with the "`-Ospeed`" optimization levels:

1. A straight call to `sqrt`. This is a double-precision square root function. Using this function for single-precision math is overkill, but it is included for completeness.
2. A call to `sqrtf`. This is the correct function to call for single-precision results.
3. A call to the PPC-specific `sqrt` function. It uses hardware-specific knowledge to speed up execution.
4. A call to the PPC-specific `sqrtf` function. This single-precision function is faster than the generic library version.
5. A call to `__FRSQRTE` for the reciprocal square root and a regular floating-point divide to slightly increase the precision.
6. A call to both `__FRSQRTE` and `__FRES` after checking for zero to help prevent an incorrect NAN value from being generated.
7. A call to `__FRSQRTE` and `__FRES` for the best possible speed increase using intrinsic-level functions.
8. Same as 6, with an inline-assembly instruction in place of the `__FRSQRTE` call to remove the extra `frsp` instruction.
NOTE:
Adding inline-assembler is generally not recommended because it hamstrings the optimization capability of the compiler and can add hard-to-track errors. The compiler manual is specific about the risks of using inline-assembler5.
9. A single call to `fadd` to demonstrate how fast a simple FPU instruction is in comparison.

Running both speed and size optimized versions of the test demonstrates the differences that occur when the compiler can use the CPU’s superscalar abilities to pipeline away the loop overhead calls between the various instructions when loop-unrolling and instruction-reordering optimizations are allowed.

To isolate the time to that of just the math instructions, I subtracted the time required to read from the large float table and store that value back into a global register from the execution time.

### Test Case Build Instructions

The test case may be found at `\$CAFE_ROOT/system/src/demo/os/sqrt_test`.

To build the slower version, change the `CC_OPTIMIZATION_FLAGS` in `makefile` to the first value.

To build the faster version, change the `CC_OPTIMIZATION_FLAGS` in `makefile` to the second value.

#### To make and run the demo:

1. Run `cafe.bat`, and navigate to the `\$CAFE_ROOT/system/src/demo/os/sqrt_test` directory.
2. type: `make NDEBUG=TRUE run`

To test the PPC specific math files, remove the comments around `#include "ppc_math.h"` at the top of the source file.

NOTE:
If you change the build setting in `makefile`, touch the source file for a build to occur.

## Results

The test cases were run over 10,000 iterations. The following data indicates how much faster the estimate instructions are compared to the software emulated versions: Figure 1 – A comparison of the execution times in microseconds of seven ways to take
the square root on Wii U using a single floating-point add operation (
`fadd`). Note the
difference between the software emulated square root and the hardware estimate version.

As expected, there is a vast difference between using the estimate instruction and using the software emulated functions, especially when using the instructions in combination with "`-Ospeed`" rather than "`-Osize`". Interestingly, using the PPC math specific functions nearly doubled the performance of the basic square root. Still, the estimate functions are definitely the winners in terms of execution speed, even when checking for zero on input to prevent the NAN from being generated.

To magnify the results further, here is the same data normalized against the cost of a single `fadd` instruction. This provides a feel for the overall cost against a simple FPU operation. I have split the data into two charts because the differences between the size and speed optimizations are so dramatic. Figure 2 – Data from the size optimized build normalized against a simple FPU operation.
The estimate functions are running around 5x faster than the software emulated versions.

Each successive test case provides a speed increase, running an order of magnitude faster by the time our last two cases are tested. Next, the optimized for speed chart: Figure 3 – Data from the speed optimized build normalized against a simple FPU operation.
CPU pipelining allows the hardware estimation functions to run lightning fast.

The difference is even more dramatic in this case. The intrinsic-based function is running more than 30 times faster than the native square root, and a respectable 13 times faster than the PPC square root function. One last tidbit this chart offers is that using inline assembly actually decreased performance in the speed optimization pass. This is because the compiler cannot pipeline the CPU quite as effectively as the intrinsic functions. Remember, the compiler must respect the exact order and registers used in any inline assembly instruction blocks, which may interfere with the compiler's ability to effectively pipeline instructions for the CPU. The lesson here is that inline assembly is not always a win in terms of performance, even when decreasing the number of instructions.

## Summary

The Espresso CPU does not contain a full-precision square root instruction. Carefully choose your functions when doing any math that requires this functionality. If your algorithm needs full-precision square root functions but does not require `errno`, consider including the ppc_math libraries. However, if the algorithm can also handle a 1-part in 2000 precision loss, using the square root and reciprocal estimate functions wins back so much performance that they should not be ignored.

Back to 1Espresso Developer User Manual, page 39.
Back to 2Espresso Developer User Manual, page 39.
Back to 3Espresso Developer User Manual, page 337.
Back to 4Espresso Developer User Manual, page 215 and 216.
Back to 5Build_ppc.pdf in the GHS/Docs directory, pages 593-594.

## Revision History

2013/08/09 Convert PDF files to HTML.

CONFIDENTIAL