There are opportunities in the development process where even a small amount of optimization may make the difference between a game that is performant and one that is not. This topic focuses primarily on Paired Singles (PS) optimizations, ignoring other factors that may also affect performance, such as memory, caching, and data organization.
Three implementations using Paired Singles are discussed: an initial C version, an inline-PS version, and, finally, a hand-coded assembly version, demonstrating that, while compiler-generated code has come a long way, there is still a case for hand-crafted assembly code.
For additional optimization tips, see Optimizing Math on the Espresso CPU, which looks at memory and cache behavior in more depth, and other articles in Hints and Tips. For more information about how to use Paired Singles, see Paired Singles Optimizations.
The floating-point unit (FPU) on the Espresso CPU has 32 64-bit registers. Each register can hold one of three different data formats at any given time: single-precision floating-point, double-precision floating-point, or Paired Singles.
Each instruction for the FPU is specifically encoded to operate on one of the three supported formats. This enables the FPU to handle any of the data types for a given instruction without having to switch modes.
The FPU has a 3-stage pipeline (multiply, add, normalize) for processing floating-point arithmetic. In general, single-precision arithmetic instructions have a 1-cycle throughput and 3-cycle latency. Because the multiplication stage takes up to two cycles for doubles, double-precision arithmetic instructions may take twice as long: 2-cycle throughput and 4-cycle latency. In addition to these two data types, a third data type, Paired Singles, is available.
For more information, see page 29 in the IBM Espresso RISC Microprocessor Developer's User Manual on your local Nintendo software development support group website. Floating-point instruction latencies and throughputs are listed in Table 6-7, starting on page 214.
Paired Singles mode allows the FPU to operate on two separate 32-bit floating-point values at the same time, theoretically doubling the FPU performance. Because the CPU has 64-bit internal data paths, it can load a Paired Singles in a single cycle, reducing memory access time in comparison to cases where the program loads two 32-bit single-precision values into separate registers over the course of two instructions. Effectively, Paired Singles are 2-way SIMD single-precision, floating-point instructions. Factors that inhibit the 2x speed-up are discussed in the following section. Nevertheless, when used correctly, Paired Singles mode may make a significant difference in time-critical code.
There are various instructions that move single-precision data between the upper and lower fields of the 64-bit register, enabling interaction with Paired Singles registers without incurring a load/store penalty.
The test case is a straight-forward cloth simulation demo using Verlet integration. Optimizations are performed at the code level rather than the algorithmic level. The demo shows a textured mesh, representing a piece of cloth being affected by an external force (gravity).
The demo is located at $CAFE_ROOT/system/src/demo/gx2/misc/pairedSingles. To run the demo:

- Set the CAFE_ROOT variable to point to the correct directory for the SDK with the Wii U CPU Profiler installed.
- If Cygwin is not installed at C:\cygwin, change the CYGWIN_PATH variable to point to the correct directory.
With the controller, you may cycle between the standard-C, Paired Singles, and Assembly versions of both the integration and constraints functions. Each mode shows the amount of time taken in microseconds (µs). There are two parts to the algorithm: integrating and constraining the cloth mesh. A Verlet algorithm is used to implement the integration part using the formula:

Pnew = 2·P − Pprev + F·Δt²

Where P is a 3-dimensional point, Pprev is its position from the previous step, F is a 3D-vector force acting on each point, and Δt is a time constant.
Each point is connected horizontally and vertically to its neighbors by a spring-like constraint. Upon initialization, all constraints are considered at rest, which means that the distances between each of the points are the same. To simulate the mesh being held in place, the top two points are fixed. With each iteration, the points are affected by gravity and pushed back toward their preferred location using the following formula:

Δ = P2 − P1
P1 ← P1 + 0.5 · (|Δ| − d) / |Δ| · Δ
P2 ← P2 − 0.5 · (|Δ| − d) / |Δ| · Δ

Where P1 and P2 are the two constrained points and d is the rest length.
Since there are multiple constraints, the constraints calculation is executed more than once.
The first function profiled is the Verlet integration function. In Figure 1, the function benefits reasonably by using Paired Singles, reducing the total time almost linearly from 17 µs (C) to 13 µs (inline-PS) to 11 µs (Assembly) for 24x24=576 points.
As a baseline, the theoretical minimal execution time for this function reduces to roughly 7.6µs, assuming that each instruction takes a single cycle, including loads and stores, at 16 instructions times 576 points. Since this function primarily uses the FPU, there is little opportunity to use other execution units.
A high-level review of the instruction make-up of the inner loops of all the implementations (see Figures 5-8 in the Appendix for disassembly) shows the following.
The first thing that immediately stands out is that the inline-PS and assembler versions have significantly fewer instructions than the C-version (far-right column). While, in theory, this is expected to result in a linear reduction in time as well, it does not seem to be the case. Obviously, there are other factors that affect performance.
First, compare the inline-PS and assembler versions. The PS-load, PS-store, and ALU instruction counts are identical, as is the branch instruction count. The only difference between the two versions is the number of miscellaneous ALU instructions: 7 for the inline-PS version and only 1 for the assembler version. The extra instructions in the inline-PS version are partly a result of not using the count register, and of setting up registers for address generation and constants. Note that all of these are integer instructions and ultimately end up in one of the IUs.
To get a rough estimate of the effects on performance, the following calculation is performed:
576 points (24x24) times an additional 6 instructions comes to a total of 3456 instructions. At 1.2 GHz, the theoretical time per instruction is 1/1.2e9 ≈ 834 picoseconds, or 0.834 nanoseconds. Multiplying this out comes to an additional ~2.9 µs, which is roughly in line with the measurement when ignoring any potential gains made through parallel execution, branch prediction, and cache hits. A closer look at the disassembly reveals that 2 or 3 instructions (add, compare, and branch) are potential candidates for parallel execution, bringing the difference closer to the observed 2 µs.
The performance gain between the C and inline-PS versions is primarily a result of the reduced number of loads and stores. For the Verlet function, it takes a minimum of 6 single-precision floating-point loads (current and previous position) and 3 stores, resulting in a total of 9 accesses to memory. Conversely, the Paired Singles version requires only 4 loads and 2 stores, reducing the number of memory accesses by 33%.
Returning to the C-version, the combined number of loads and stores is more than double that of the inline-PS and assembler versions: 15 versus 6. Yet, while the total number of instructions is almost twice that of the inline-PS version (40 vs. 22), most of these do not appear to have much effect on the overall performance. Reviewing the Espresso Core Block Diagram on page 23 of the IBM Espresso RISC Microprocessor Developer's User Manual and the disassembly (Figure 5: Verlet C-version, 40 instructions in the Appendix), it becomes clear that the compiler attempted to interleave FPU and integer instructions to maximize parallel execution.
While the number of miscellaneous instructions is significantly greater than the inline-PS version, the extra overhead is masked by parallel execution. The reduction of integer instructions in the hand-coded assembler version results in an additional 2 µs gain.
Moving on, the constraints function shows a more interesting picture. First, the increase in performance between the C and inline-PS/assembler versions is much more pronounced, while the difference between the inline-PS and assembler versions is merely 100 µs over 10 iterations!
The C-version executes at roughly 115 µs/iteration, whereas the inline-PS and assembler versions execute at 51 µs and 44 µs respectively. Without taking memory access into account, an approximately 2x performance increase is obtained by switching to inline-PS / assembler.
A significant portion of the C-version's time is spent in the external sqrtf function. This function alone is responsible for roughly 500 µs of total time, or 50 µs per iteration. While this does not necessarily relate to optimizing PS, it is interesting to point out that the call could be replaced with the intrinsic reciprocal square-root estimate (visible as ps_rsqrte in the inline-PS and assembler versions).
The inner-loop instruction make-up reveals that there are many more differences in the constraints function compared to the previous Verlet function.
The difference in instruction count between the PS- and C-versions is minimal: 54 versus 55 instructions. As with the previous function, there appears to be little correlation between the number of instructions and execution time.
Taking a look at the memory access patterns for inline-PS, a total of 6+4+5=15 memory accesses occurred. The C-version, on the other hand, has 14+6+2=22 memory accesses. Looking at the disassembly in Figure 8 and Figure 9, a much clearer picture emerges about the distribution and scheduling of instructions rather than the quantity. Most of the instructions in the C-version (Figure 8) are single-precision floating-point instructions, executed back-to-back. While the compiler does a good job of trying to manage dependencies between FP registers, there is little room for parallelization.
On the flip-side, the inline-PS and assembler versions show a much more balanced distribution between instructions, thereby allowing a more efficient use of the ALU, IU and FPU. In turn, the additional miscellaneous instructions execute in parallel and are essentially free.
This also explains the relatively small difference between the hand-optimized assembler version and the compiler generated inline-PS version. While the hand-optimized assembler version uses fewer instructions, the inline-PS version manages, to a certain degree, to hide the additional instruction execution time.
Using Paired Singles improves performance primarily because there are fewer load and store instructions. Secondary to that, fewer instructions are used, contributing to an overall improvement in performance.
Both functions benefit from using inline-PS and assembler, but in different ways. The Verlet function is small and may be reduced to a minimum in assembler. Due to the deterministic structure of the function, the gains are relatively small and consist solely of a reduction in the number of memory accesses.
On the surface, the constraints function performed even better, but the results were skewed by an external call to the sqrtf function. Because the constraints function is more complex than the Verlet function, there is also more room for parallelization.
There are many factors to consider when optimizing code. First and foremost, identify a bottleneck in the code and understand WHY it is a bottleneck. Often, restructuring the data and/or changing the algorithm results in larger performance increases than rewriting a function using inline-PS or assembler. In our example, aligning the data to a 4-word vector, with the 4th component zeroed out, allows easier loads and stores for Paired Singles.
Using Verlet over an nth-order Runge-Kutta for a function that does not need a tremendous amount of accuracy decreased the number of calculations and made the actual integration more stable.
When optimizing code for the Wii U, understand that the best ways to improve performance are to reduce the number of memory accesses and to make efficient use of parallel execution across the CPU's execution units.
While reducing the number of instructions does make a difference for functions that are fixed and deterministic, the constraints function demonstrates that more complex functions may also create more opportunity for parallelization.
There are many factors that affect performance. Using Paired Singles may improve code density, and reduce the number of load and store instructions and ALU operations. NOT using Paired Singles may waste precious cycles. Combined with other optimization methods, such as using Locked Cache, Paired Singles may make a significant difference in the performance of your code.
Depending on the size of the functions, available time, and familiarity with PowerPC assembly, the developer must determine whether to use assembler or inline-PS. While assembly language can provide better results, it is generally more time-consuming to write and maintain.
In conclusion, for a significant increase in performance, use Paired Singles where it makes sense, and avoid converting between register modes.
verletLoop:
    slwi    r12, r8, 5
    lis     r7, %hiadj(force)
    addis   r10, r12, %hiadj(points)
    addi    r7, r7, %lo(force)
    addi    r12, r10, %lo(points)
    lfs     f8, 8(r7)
    lfs     f9, 8(r12)
    lfs     f12, 0x18(r12)
    fadds   f0, f9, f9
    lfs     f11, 0(r12)
    lwz     r0, 4(r12)
    lfs     f13, 0x14(r12)
    stw     r0, 0x14(r12)
    fsubs   f7, f0, f12
    lwz     r0, 0xc(r12)
    lfs     f10, 4(r12)
    fadds   f12, f11, f11
    lfs     f0, 0x10(r12)
    fadds   f6, f10, f10
    lfs     f10, 4(r7)
    fsubs   f11, f12, f0
    lfs     f4, 0xc(r12)
    lfs     f12, 0(r7)
    fsubs   f13, f6, f13
    lwz     r11, 0(r12)
    stw     r0, 0x1c(r12)
    lwz     r10, 8(r12)
    addi    r8, r8, 1
    stw     r11, 0x10(r12)
    fmadds  f10, f10, f5, f13
    stw     r10, 0x18(r12)
    fmadds  f9, f8, f5, f7
    stfs    f4, 0x14(sp)
    fmadds  f11, f12, f5, f11
    stfs    f10, 4(r12)
    cmpwi   r8, 0x240
    stfs    f9, 8(r12)
    stfs    f4, 0xc(r12)
    stfs    f11, 0(r12)
    blt     verletLoop
Figure 5: Verlet C-version, 40 instructions
verletLoop:
    lis     r11, %hiadj(points)
    slwi    r0, r9, 5
    addi    r11, r11, %lo(points)
    add     r12, r11, r0
    addi    r10, r12, 0x10
    psq_l   f6, 0(r12), 0, 0
    psq_l   f10, 0(r10), 0, 0
    ps_add  f11, f6, f6
    ps_sub  f11, f11, f10
    psq_l   f7, 8(r12), 0, 0
    ps_madd f10, f13, f12, f11
    psq_l   f8, 8(r10), 0, 0
    ps_add  f9, f7, f7
    ps_sub  f9, f9, f8
    ps_madd f11, f0, f12, f9
    psq_st  f10, 0(r12), 0, 0
    psq_st  f11, 8(r12), 0, 0
    addi    r9, r9, 1
    psq_st  f6, 0(r10), 0, 0
    cmpwi   r9, 0x240
    psq_st  f7, 8(r10), 0, 0
    blt     verletLoop
Figure 6: Verlet 'inline-PS' version, 22 instructions
verletLoop:
    psq_l   f3, 0(r4), 0, 0
    psq_l   f4, 8(r4), 0, 0
    psq_l   f5, 0x10(r4), 0, 0
    psq_l   f6, 0x18(r4), 0, 0
    ps_add  f7, f3, f3
    ps_add  f8, f4, f4
    ps_sub  f7, f7, f5
    ps_sub  f8, f8, f6
    ps_madd f7, f1, f0, f7
    ps_madd f8, f2, f0, f8
    psq_st  f7, 0(r4), 0, 0
    psq_st  f8, 8(r4), 0, 0
    psq_st  f3, 0x10(r4), 0, 0
    psq_st  f4, 0x18(r4), 0, 0
    addi    r4, r4, 0x20
    bdnz    verletLoop
Figure 7: Verlet ASM-version, 16 instructions
ConstraintLoop:
    lis     r12, %hiadj(constraints)
    slwi    r0, r28, 3
    subi    r12, r12, %lo(constraints)
    add     r12, r12, r0
    lwz     r0, 4(r12)
    lis     r31, %hiadj(points)
    lwz     r10, 0(r12)
    slwi    r11, r0, 5
    addi    r31, r31, %lo(points)
    slwi    r0, r10, 5
    lfsx    f9, r31, r11
    lfsx    f10, r31, r0
    add     r29, r31, r0
    add     r30, r31, r11
    lfs     f13, 8(r29)
    lfs     f12, 8(r30)
    fsubs   f29, f9, f10
    lfs     f0, 4(r29)
    fsubs   f27, f12, f13
    lfs     f11, 4(r30)
    fmr     f1, f29
    fsubs   f28, f11, f0
    fmr     f3, f27
    fmr     f2, f28
    bl      0x2000e2c ; call to vecLength3
    lis     r9, %hiadj(.L967)
    lfs     f11, %lo(.L967)(r9)
    fsubs   f0, f1, f11
    lis     r10, %hiadj(.L1642)
    fdivs   f7, f0, f1
    lfs     f12, %lo(.L1642)(r10)
    fmuls   f7, f12, f7
    fmuls   f10, f29, f7
    lfs     f8, 0(r29)
    fmuls   f9, f28, f7
    lfs     f11, 4(r29)
    fadds   f8, f8, f10
    lfs     f12, 8(r29)
    fmuls   f0, f27, f7
    fadds   f11, f11, f9
    fadds   f12, f12, f0
    stfs    f8, 0(r29)
    stfs    f11, 4(r29)
    stfs    f12, 8(r29)
    lfs     f13, 0(r30)
    lfs     f7, 4(r30)
    fsubs   f13, f13, f10
    lfs     f8, 8(r30)
    fsubs   f7, f7, f9
    fsubs   f8, f8, f0
    stfs    f13, 0(r30)
    addi    r28, r28, 1
    stfs    f7, 4(r30)
    cmpwi   r28, 0x450
    stfs    f8, 8(r30)
    blt     0x2001230
Figure 8: SatisfyConstraints C-version, 55 instructions
ConstraintLoop:
    slwi      r0, r10, 3
    lis       r8, %hiadj(constraints)
    add       r11, r8, r0
    lwzu      r8, %lo(constraints)(r11)
    lis       r7, %hiadj(points)
    lwz       r0, 4(r11)
    slwi      r12, r8, 5
    addi      r7, r7, %lo(points)
    slwi      r8, r0, 5
    add       r12, r7, r12
    add       r8, r7, r8
    psq_l     f2, 0(r12), 0, 0
    psq_l     f4, 0(r8), 0, 0
    psq_l     f5, 8(r8), 0, 0
    ps_sub    f6, f4, f2
    psq_l     f3, 8(r12), 0, 0
    ps_mul    f9, f6, f6
    ps_sub    f7, f5, f3
    ps_mul    f10, f7, f7
    ps_sum0   f12, f9, f10, f9
    lis       r8, %hiadj(restLength)
    ps_add    f0, f12, f10
    addi      r8, r8, %lo(restLength)
    ps_rsqrte f8, f0
    psq_l     f0, 0(r8), 0, 0
    lis       r8, %hiadj(half)
    ps_res    f12, f8
    addi      r8, r8, %lo(half)
    ps_sub    f9, f12, f0
    psq_l     f10, 0(r8), 0, 0
    ps_muls0  f8, f9, f8
    ps_muls0  f9, f8, f10
    ps_muls0  f0, f6, f9
    ps_add    f8, f2, f0
    psq_st    f8, 0(r12), 0, 0
    lwz       r0, 0(r11)
    ps_muls0  f9, f7, f9
    slwi      r8, r0, 5
    ps_add    f10, f3, f9
    add       r8, r7, r8
    psq_st    f10, 8(r8), 0, 0
    lwz       r0, 4(r11)
    slwi      r0, r0, 5
    ps_sub    f12, f4, f0
    add       r8, r7, r0
    psq_st    f12, 0(r8), 0, 0
    lwz       r0, 4(r11)
    addi      r10, r10, 1
    slwi      r0, r0, 5
    ps_sub    f0, f5, f9
    add       r8, r7, r0
    cmpwi     r10, 0x450
    psq_st    f0, 8(r8), 0, 0
    blt       ConstraintLoop
Figure 9: SatisfyConstraints inline-PS version, 54 instructions
ASC_InnerLoop:
    lwz       r6,0*4(r5)
    lwz       r7,1*4(r5)
    slwi      r6,r6,5
    slwi      r7,r7,5
    psq_lux   fp2,r6,r3,0,0
    psq_l     fp3,8(r6),0,0
    psq_lux   fp4,r7,r3,0,0
    psq_l     fp5,8(r7),0,0
    ps_sub    fp6,fp4,fp2
    ps_sub    fp7,fp5,fp3
    ps_mul    fp8,fp6,fp6
    ps_mul    fp9,fp7,fp7
    ps_sum0   fp8,fp8,fp9,fp8
    ps_add    fp8,fp8,fp9
    ps_rsqrte fp8,fp8
    ps_res    fp9,fp8
    ps_sub    fp9,fp9,fp1
    ps_muls0  fp9,fp9,fp8
    ps_muls0  fp9,fp9,fp0
    ps_muls0  fp6,fp6,fp9
    ps_muls0  fp7,fp7,fp9
    ps_add    fp2,fp2,fp6
    ps_add    fp3,fp3,fp7
    ps_sub    fp4,fp4,fp6
    ps_sub    fp5,fp5,fp7
    psq_st    fp2,0(r6),0,0
    psq_st    fp3,8(r6),0,0
    psq_st    fp4,0(r7),0,0
    psq_st    fp5,8(r7),0,0
    addi      r5,r5,2*4
    bdnz      ASC_InnerLoop
Figure 10: SatisfyConstraints ASM-version, 31 instructions