Optimizing Math on the Espresso CPU

Using the paired singles and locked cache DMA features on the Wii U console to improve performance.

Games rely heavily on large numbers of floating-point math operations to create interactive 3D worlds. The Espresso CPU provides two features that dramatically improve floating-point performance: paired singles and locked cache DMA.

Paired Singles Functions

Each 64-bit floating-point register on the Espresso CPU may store a 64-bit double value, or it may disable half the register when storing a 32-bit single value. The Espresso floating-point registers also have a Paired Single mode that enables one register to hold two independent 32-bit singles.


Figure 1: Each floating-point type has a different layout in the floating-point registers

Using Paired Singles Functions

The Green Hills Software (GHS) compiler provides paired single functions that may be called from C or C++ code. These include functions to load (__PSQ_L*), store (__PSQ_ST*), and perform a variety of math operations (__PS_*). Several functions perform multiple math operations with a single instruction, such as the __PS_MADD function, which performs a multiply and an add. The paired single C functions are intrinsic functions that each compile down to a single assembly instruction. The paired single instructions are faster than the equivalent group of standard instructions, and the reduced instruction count saves instruction cache space.
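As a rough illustration of how these intrinsics combine, the sketch below applies a fused multiply-add across arrays two singles at a time. The f32x2 pair type and the exact intrinsic signatures are assumptions made for this sketch; consult the GHS MULTI documentation for the real declarations.

    /* Sketch only. Assumed (not confirmed by this document): the f32x2
     * two-float type and these intrinsic shapes --
     *   f32x2 __PSQ_L(const void *addr, int w, int gqr);
     *   void  __PSQ_ST(void *addr, f32x2 val, int w, int gqr);
     *   f32x2 __PS_MADD(f32x2 a, f32x2 b, f32x2 c);  returns a * b + c
     */

    /* d[i] = a[i] * b[i] + c[i], two floats per iteration (n even). */
    void MaddArray(f32 *d, const f32 *a, const f32 *b, const f32 *c, u32 n)
    {
        for (u32 i = 0; i < n; i += 2) {
            f32x2 pa = __PSQ_L(&a[i], 0, 0);   /* one psq_l loads two singles */
            f32x2 pb = __PSQ_L(&b[i], 0, 0);
            f32x2 pc = __PSQ_L(&c[i], 0, 0);
            f32x2 pd = __PS_MADD(pa, pb, pc);  /* two multiply-adds in one instruction */
            __PSQ_ST(&d[i], pd, 0, 0);         /* one psq_st stores two singles */
        }
    }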

Creating Paired Singles Assembly

The paired single functions map closely to specific Espresso assembly instructions, which makes a function written with paired single functions simple to convert to assembly. Writing a paired single function in assembly helps control the order of instructions. The Espresso CPU supports out-of-order execution, which allows different logic units to execute instructions in parallel if there are no dependencies between the instructions. The instruction completion queue on the Espresso is only six instructions deep, so there is a limited range of instructions that may be in progress at any time. Interleaving paired single load/store instructions and math instructions takes advantage of the parallelism between the Espresso Load/Store Unit and Floating-Point Unit.
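For example, one output row of a 4x4 matrix multiply might be scheduled as follows. This is an illustrative fragment, not the demo's actual code; the register assignments and assembler syntax details are assumptions, but the pattern shows loads interleaved between dependent math instructions so the Load/Store Unit and Floating-Point Unit stay busy together.

    # Row 0 of c = a * b, row-major 4x4 singles; r3 = a, r4 = b, r5 = c.
    psq_l     f0,  0(r3), 0, 0     # a00, a01
    psq_l     f4,  0(r4), 0, 0     # b00, b01
    psq_l     f5, 16(r4), 0, 0     # b10, b11
    ps_muls0  f8, f4, f0           # (a00*b00, a00*b01)    FPU works...
    psq_l     f6, 32(r4), 0, 0     # b20, b21              ...while the LSU loads
    ps_madds1 f8, f5, f0, f8       # += (a01*b10, a01*b11)
    psq_l     f7, 48(r4), 0, 0     # b30, b31
    psq_l     f1,  8(r3), 0, 0     # a02, a03
    ps_madds0 f8, f6, f1, f8       # += (a02*b20, a02*b21)
    ps_madds1 f8, f7, f1, f8       # += (a03*b30, a03*b31)
    psq_st    f8,  0(r5), 0, 0     # c00, c01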

Using assembly is not always necessary. The example code demonstrates that the difference between hand-tuned assembly and assembly created by the compiler is fairly small. Changes in GHS MULTI 5.3.12 and later further improve paired single code generation.

Improving Performance with Paired Singles

The example demo performs a 4x4 matrix multiply on 3200 matrices. A 4x4 matrix multiply requires 64 multiplies, 48 additions, 32 loads and 16 stores for a total of 160 instructions. This operation may be written with paired singles using 32 math operations, 16 loads and 8 stores for a total of 56 instructions.
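Those counts follow from a straightforward C implementation of the standard multiply, sketched below with illustrative names (Mtx44 and MtxMul44 are not the demo's actual symbols): each of the 16 outputs needs 4 multiplies and 3 adds, and each input element is loaded once.

    /* Standard single-float 4x4 multiply: 16 outputs x (4 multiplies
     * + 3 adds) = 64 multiplies and 48 adds, with 32 loads (a and b)
     * and 16 stores (c). */
    typedef f32 Mtx44[4][4];

    void MtxMul44(Mtx44 c, const Mtx44 a, const Mtx44 b)
    {
        for (int i = 0; i < 4; i++) {
            for (int j = 0; j < 4; j++) {
                c[i][j] = a[i][0] * b[0][j]
                        + a[i][1] * b[1][j]
                        + a[i][2] * b[2][j]
                        + a[i][3] * b[3][j];
            }
        }
    }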

Theoretically, the paired single matrix multiply is more than twice as fast as the standard multiply. In practice, the demo shows that the paired single logic runs only 13% faster than the standard multiply. Applying the optimization techniques of loop unrolling, interleaving loads with math operations to take advantage of instruction parallelism, and working on multiple matrices simultaneously results in another 11% improvement to the paired single logic. These results are still much lower than the theoretical gain from using paired singles. It appears that, when loading the matrix data from Mem2, the cost of data cache misses overshadows the improvements from using paired singles and the other optimizations.

The memory latency concern is confirmed in a situation where the data is already in the L2 data cache: there, the paired single version runs 84% faster than the standard multiply. Applying loop unrolling and data pipelining makes the logic run an additional 27% faster, for a total improvement of 130% over the standard multiply. This is much closer to the theoretical performance of the paired single operations.

In real-world environments, it is unlikely that the necessary data will be in the normal caches when needed, which is why streaming data from memory to the locked cache using DMA might improve performance.

Avoiding Pitfalls with Paired Singles

When writing C/C++ code with paired singles, a few steps may be taken to enable the compiler to generate optimal assembly code.

Using Locked Cache

The Espresso CPU has 32 KB of data cache and 32 KB of instruction cache. Both of these blocks of memory have extremely fast 1-cycle access times. The data cache is divided into two parts: 16 KB is used as a normal L1 data cache and 16 KB is used as a locked cache. Unlike previous Nintendo consoles, the locked cache is always enabled.



Figure 2: Data in memory can take multiple paths to get to and from a single CPU.

Using normal L1 and L2 cache is transparent to the programmer and happens automatically when reading or writing to memory. Data memory requests from the CPU for addresses in main memory first check whether the values exist in the L1 data cache. If the data is not present, the system checks the L1 data cache of the other cores, the L2 data cache, and the L2 data caches of the other cores. If the data is not in any of the caches, it is then loaded from main memory and added to the cache, potentially overwriting existing cached data. Since main memory is much slower than the CPU, using multiple levels of cache is faster than having the CPU read directly from memory when repeatedly accessing small amounts of data in a short amount of time.

Locked cache is a discrete chunk of high-speed memory that does not automatically map into main memory like normal cache. Locked cache is accessible to the CPU via a unique address range and may be used as a high-speed scratch pad. It may also be used with a DMA mechanism that transfers data between main memory and the locked cache up to three times as fast as percolating the data from memory to the CPU through normal data cache. However, the process of using the DMA to transfer data to and from locked cache is not automatic like normal cache; it must be explicitly coded into the program using the LCLoadDMABlocks and LCStoreDMABlocks functions.

This means that there are two paths that the same data in memory may take to reach the CPU: via normal cache or via DMA to locked cache. The data may exist in both paths and contain different values. Unlike normal data cache, which performs interventions to enforce data coherency across cores, there is no mechanism to synchronize data in normal cache with data in locked cache. The program must invalidate or flush data that might be in normal data cache before working with it in locked cache.
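A single DMA round trip might look like the sketch below. LCLoadDMABlocks and LCStoreDMABlocks are the functions described above; LCAlloc, LCDealloc, LCWaitDMAQueue, DCFlushRange, and DoMath are assumed or hypothetical names used only for illustration, so check the SDK headers for the actual allocation, synchronization, and cache-management calls and their signatures.

    /* Sketch of one DMA round trip through locked cache. The DMA
     * transfer size is assumed to be counted in 32-byte cache blocks. */
    void ProcessBuffer(f32 *src, u32 bytes)
    {
        u32  blocks = bytes / 32;             /* assumed 32-byte DMA blocks */
        f32 *lc     = (f32 *)LCAlloc(bytes);  /* region in locked cache     */

        /* The source may have stale copies in normal data cache:
         * flush them to memory before the DMA reads it. */
        DCFlushRange(src, bytes);

        LCLoadDMABlocks(lc, src, blocks);     /* Mem2 -> locked cache        */
        LCWaitDMAQueue(0);                    /* wait for the queue to drain */

        DoMath(lc, bytes / sizeof(f32));      /* hypothetical computation    */

        LCStoreDMABlocks(src, lc, blocks);    /* locked cache -> Mem2        */
        LCWaitDMAQueue(0);
        LCDealloc(lc);
    }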

Processing with the LCSTREAM Library

The LCSTREAM library in the SDK is provided as an example of streaming with DMA to and from locked cache while performing computation on the streamed data. The library demonstrates how to load data from Mem2 into locked cache using the DMA, perform work on it, and then write it back to memory using the DMA in a double-buffered process. The source code to the LCSTREAM library is available in the $CAFE_ROOT/system/src/lib/lcstream folder.
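The essence of the pattern is double buffering: while the CPU computes on one half of a locked cache allocation, the DMA engine fills or drains the other half. The sketch below is only in the spirit of LCSTREAM, not its actual interface; Work is a placeholder, the DMA queue is assumed to process requests in order, and LCWaitDMAQueue is assumed to wait until at most the given number of queue entries remain.

    /* Double-buffered streaming sketch: chunks of chunkBytes each,
     * with chunkBytes a multiple of the 32-byte DMA block size. */
    void StreamProcess(f32 *mem, u32 chunks, u32 chunkBytes)
    {
        u32  blocks = chunkBytes / 32;
        u32  elems  = chunkBytes / sizeof(f32);
        f32 *lcA    = (f32 *)LCAlloc(chunkBytes * 2);
        f32 *lcB    = lcA + elems;

        LCLoadDMABlocks(lcA, mem, blocks);               /* prime buffer A */
        for (u32 i = 0; i < chunks; i++) {
            f32 *cur  = (i & 1) ? lcB : lcA;
            f32 *next = (i & 1) ? lcA : lcB;
            if (i + 1 < chunks)                          /* queue the next load... */
                LCLoadDMABlocks(next, mem + (i + 1) * elems, blocks);
            LCWaitDMAQueue(1);                           /* cur is loaded; next may still be moving */
            Work(cur, elems);                            /* ...and compute while it moves */
            LCStoreDMABlocks(mem + i * elems, cur, blocks);
        }
        LCWaitDMAQueue(0);                               /* drain the final store */
        LCDealloc(lcA);
    }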

Improving Performance with Locked Cache DMA

Normal L1 data cache and locked cache are different parts of the same block of memory, so there is no difference in performance between accessing data in L1 data cache and accessing data in locked cache. However, when data does not exist in normal L1 and L2 cache or in locked cache, loading it from Mem2 through a DMA transfer to locked cache is considerably faster than waiting for it to load from Mem2 through the L2 and L1 data cache.



Figure 3: Locked cache DMA performance is between cached and uncached performance.

The sample code found in the example project verifies that using the LCSTREAM library to perform the paired single matrix multiply results in logic that runs 165%-190% faster than the paired single multiply on the normal cache when the data is not already in either cache. However, loading data through the DMA to locked cache is slower than accessing data that is already loaded in normal L2 data cache. The normal precached paired single matrix multiply runs 153%-190% faster than streaming with DMA to locked cache to perform matrix multiplies.

Since precached performance is better than streaming performance through locked cache, using locked cache with DMA is not always the best solution for accessing memory. DMA to locked cache is the best solution for an algorithm if the algorithm uses more data than can fit into L2 data cache, if the data is never in L2 data cache when needed, or if it causes other, more important data to be pushed out of L2 data cache. When data is not cached, using DMA to locked cache is the fastest way to transfer data directly between Espresso and Mem2. Code that may benefit from using DMA streaming to locked cache has spikes in L2 data cache misses. The Wii U CPU Profiler may be used to check for situations where data cache misses affect performance.

Summary

Using paired singles to perform floating-point math operations may more than double floating-point performance. However, memory performance may be an obstacle to achieving peak performance. Locked cache may be used as a high-speed scratch pad or for transferring data directly from main memory using the DMA mechanism. When data does not already exist in cache, streaming data with DMA to locked cache is almost three times as fast as transferring data through normal data cache.

Appendix

The sample code in the example project demonstrates a matrix multiply of 3200 matrices using a combination of standard, paired single, and locked cache techniques.

Running the demo:

  1. In cafe.bat:
    1. Change the CAFE_ROOT variable to point to the correct directory for the SDK with the Wii U CPU Profiler installed.
    2. If Cygwin is not installed at C:\cygwin, change the CYGWIN_PATH variable in cafe.bat to point to the correct directory.
  2. Make sure the SDK and CAT-DEV are already configured according to the SDK guides.
  3. Double-click cafe.bat.
  4. At the command prompt, type cd $CAFE_ROOT/system/src/demo/os/espresso_math
  5. At the command prompt, type make run.
  6. The demo runs and prints one set of results.
    1. Press the A Button on the debug controller to run the test with a clear cache.
    2. Press the X Button to run the test with the matrix data precached.
    3. Press the B Button to exit.
  7. After exiting, at the command prompt, type cafestop to stop the CAT-DEV.

Each pass prints microsecond timings for all tests; lower numbers are faster. The Raw time is a C-style matrix multiply. The PS time uses paired single C functions. The ASM time uses paired single assembly. The ASMx2 result performs two independent matrix multiplies at the same time. The final tests run the same functions using DMA streaming to the locked cache.

If you comment out the #define OPTION_LOAD_INTERLEAVE on line 35, you disable interleaving load/store instructions with math instructions, which takes advantage of the parallelism between the Load/Store Unit and Floating-Point Unit in the CPU. If you comment out the #define OPTION_LOOP_UNROLL on line 37, you disable the loop unrolling optimizations, which perform multiple operations per loop iteration to reduce the relative overhead of the looping logic.
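As a rough illustration of what OPTION_LOOP_UNROLL changes (hypothetical shape, not the demo's actual code; MtxMul44 is the illustrative multiply sketched earlier, and count is assumed to be a multiple of four):

    /* Rolled: one multiply per iteration; the loop branch and counter
     * update are paid on every matrix. */
    for (u32 i = 0; i < count; i++)
        MtxMul44(c[i], a[i], b[i]);

    /* Unrolled by four: loop overhead is paid once per four multiplies,
     * and the independent calls give the out-of-order units more work
     * to overlap. Assumes count is a multiple of four. */
    for (u32 i = 0; i < count; i += 4) {
        MtxMul44(c[i + 0], a[i + 0], b[i + 0]);
        MtxMul44(c[i + 1], a[i + 1], b[i + 1]);
        MtxMul44(c[i + 2], a[i + 2], b[i + 2]);
        MtxMul44(c[i + 3], a[i + 3], b[i + 3]);
    }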

Revision History

2013/08/05 Convert PDF files to HTML.

