Optimizing Performance with Paired Singles

There are opportunities in the development process where even a small amount of optimization may make the difference between a game that is performant and one that is not. This topic focuses primarily on Paired Singles (PS) optimizations, ignoring other factors that may also affect performance, such as memory, caching, and data organization.

Three implementations are discussed: an initial C version, an inline-PS version, and, finally, a hand-coded assembly version. The last demonstrates that, while compiler-generated code has come a long way, there is still a case for hand-crafted assembly code.

For additional optimization tips, see Optimizing Math on the Espresso CPU, which looks at memory and cache behavior in more depth, and other articles in Hints and Tips. For more information about how to use Paired Singles, see Paired Singles Optimizations.

Background

The floating-point unit (FPU) on the Espresso CPU has 32 64-bit registers. Each register can hold one of three different data formats at any given time:

  1. A single-precision (32-bit) floating-point value.
  2. A double-precision (64-bit) floating-point value.
  3. A Paired Singles value: two independent single-precision (32-bit) floating-point values.

Each instruction for the FPU is specifically encoded to operate on one of the three supported formats. This enables the FPU to handle any of the data types for a given instruction without having to switch modes.

The FPU has a 3-stage pipeline for processing floating-point arithmetic. In general, single-precision arithmetic instructions have a 1-cycle throughput and a 3-cycle latency (multiply, add, normalize). Because the multiplication stage takes up to two cycles for double-precision operands, double-precision arithmetic instructions are slower, with a 2-cycle throughput and a 4-cycle latency. In addition to these two data types, a third data type, Paired Singles, is available.

For more information, see page 29 in the IBM Espresso RISC Microprocessor Developer's User Manual on your local Nintendo software development support group website. Floating-point instruction latencies and throughputs are listed in Table 6-7, starting on page 214.

Paired Singles mode allows the FPU to operate on two separate 32-bit floating-point values at the same time, theoretically doubling FPU performance. Because the CPU has 64-bit internal data paths, it can load a Paired Singles value in a single cycle, reducing memory access time compared to loading two 32-bit single-precision values into separate registers over two instructions. Effectively, Paired Singles are 2-way SIMD single-precision, floating-point instructions. Factors that prevent the full 2x speed-up are discussed in the sections that follow. Nevertheless, when used correctly, Paired Singles mode may make a significant difference in time-critical code.
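As a conceptual illustration only (plain C, with the pairing described in comments rather than expressed through the toolchain's intrinsics), the following sketch shows the work that one paired-single operation replaces:

/* Scalar form: four single-precision loads, two adds, and two stores. */
void AddTwoSingles(float *dst, const float *a, const float *b)
{
    dst[0] = a[0] + b[0];
    dst[1] = a[1] + b[1];
}

/* Paired Singles form (hand assembly or compiler paired-single support):
 * one psq_l per operand, one ps_add, and one psq_st move both elements
 * at once -- roughly half the instructions for the same work.           */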

There are also instructions (for example, the ps_merge00/ps_merge01/ps_merge10/ps_merge11 family) that move single-precision data between the upper and lower fields of the 64-bit registers, enabling values to be rearranged within Paired Singles registers without incurring a load/store penalty.

Test Case

The test case is a straightforward cloth-simulation demo using Verlet integration. Optimizations are performed at the code level rather than the algorithmic level. The demo shows a textured mesh representing a piece of cloth being affected by an external force (gravity).

The demo is located at $CAFE_ROOT/system/src/demo/gx2/misc/pairedSingles. To run the demo:

  1. Edit cafe.bat as follows:
    1. Change the CAFE_ROOT variable to point to the correct directory for the SDK with the Wii U CPU Profiler installed.
    2. If Cygwin is not installed at C:\cygwin, change the CYGWIN_PATH variable to point to the correct directory.
  2. Double-click cafe.bat.
  3. At the command prompt, type cd $CAFE_ROOT/system/src/demo/gx2/misc/pairedSingles/.
  4. At the command prompt, type make run.
  5. The demo runs and outputs the scene to the TV:
    1. Press the A Button to reset the simulation.
    2. Press the B Button to cycle between the different versions.

Using the controller, you can cycle between the standard-C, Paired Singles, and Assembly versions of both the integration and constraints functions. Each mode shows the amount of time taken in microseconds (µs). There are two parts to the algorithm: integrating and constraining the cloth mesh. A Verlet algorithm is used to implement the integration part using the formula:

	P(t+Δt) = 2·P(t) − P(t−Δt) + F·Δt²

where P is a 3-dimensional point, F is a 3D-vector force acting on each point, and Δt is a time constant.
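For reference, a minimal C sketch of one integration step is shown below. The structure mirrors the formula above; the type, field, and parameter names (Vec3, prevPoints, dtSq, and so on) are illustrative and are not taken from the demo's source.

/* Illustrative padded 3D point; the 4th word makes paired-single
 * loads and stores straightforward (see the data-layout discussion later). */
typedef struct {
    float x, y, z, pad;
} Vec3;

/* One Verlet step per point: newP = 2*P - prevP + F*dt^2.
 * 'dtSq' is the time constant squared, precomputed outside the loop. */
static void VerletIntegrate(Vec3 *points, Vec3 *prevPoints,
                            const Vec3 *force, float dtSq, int count)
{
    int i;
    for (i = 0; i < count; ++i) {
        Vec3 cur = points[i];
        points[i].x = 2.0f * cur.x - prevPoints[i].x + force->x * dtSq;
        points[i].y = 2.0f * cur.y - prevPoints[i].y + force->y * dtSq;
        points[i].z = 2.0f * cur.z - prevPoints[i].z + force->z * dtSq;
        prevPoints[i] = cur;   /* the current position becomes the previous one */
    }
}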

Each point is connected horizontally and vertically to its neighbors by a spring-like constraint. Upon initialization, all constraints are considered at rest, which means that the distance between each pair of connected points equals the constraint's rest length. To simulate the mesh being held in place, the top two points are fixed. With each iteration, the points are affected by gravity and pushed back toward their preferred locations using the following formula for a constraint between points P1 and P2:

	Δ = P2 − P1
	P1 += 0.5 · ((|Δ| − restLength) / |Δ|) · Δ
	P2 −= 0.5 · ((|Δ| − restLength) / |Δ|) · Δ

Because there are multiple constraints, the constraints calculation is executed more than once.
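A corresponding C sketch of one constraint-relaxation step follows, reusing the illustrative Vec3 struct from the previous sketch. It matches the structure of the disassembly in Figure 8, but it is not the demo's actual code.

#include <math.h>

/* Relax one spring constraint between points p1 and p2 toward its rest
 * length; half of the correction is applied to each endpoint.          */
static void SatisfyConstraint(Vec3 *p1, Vec3 *p2, float restLength)
{
    float dx = p2->x - p1->x;
    float dy = p2->y - p1->y;
    float dz = p2->z - p1->z;
    float len  = sqrtf(dx * dx + dy * dy + dz * dz);
    float diff = 0.5f * (len - restLength) / len;

    p1->x += dx * diff;   p2->x -= dx * diff;
    p1->y += dy * diff;   p2->y -= dy * diff;
    p1->z += dz * diff;   p2->z -= dz * diff;
}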

Initial Observations

The first function profiled is the Verlet integration function. As shown in Figure 1, the function benefits noticeably from Paired Singles, with the total time dropping from 17 µs (C) to 13 µs (inline-PS) to 11 µs (Assembly) for 24x24=576 points.

As a baseline, the theoretical minimum execution time for this function is roughly 7.6 µs, assuming that each instruction, including loads and stores, takes a single cycle: 16 instructions times 576 points. Because this function primarily uses the FPU, there is little opportunity to use other execution units.


Figure 1: Overall performance timing for Verlet (µs)

A high-level review of the instruction make-up of the inner loops of the three Verlet implementations (see Figures 5 through 7 in the Appendix for the disassembly) shows the following.


Figure 2: Verlet inner-loop instruction make-up (C,PS,ASM)

The first thing that immediately stands out is that the inline-PS and assembler versions have significantly fewer instructions than the C-version (far-right column). While, in theory, this is expected to produce a correspondingly linear reduction in time, that does not seem to be the case. Clearly, other factors affect performance.

First, compare the inline-PS and assembler versions. The PS-load, PS-store, and PS-ALU instruction counts are identical, as is the branch instruction count. The only difference between the two functions is the number of miscellaneous ALU instructions: 7 for the inline-PS version versus only 1 for the assembler version. This increase is partly a result of the inline-PS version not using the count register and having to set up registers for address generation and constants inside the loop. Note that all of these are integer instructions and are ultimately dispatched to one of the integer units (IUs).

To get a rough estimate of the effects on performance, the following calculation is performed:

576 points (24x24) times an additional 6 instructions gives a total of 3456 instructions. At 1.2 GHz, the theoretical time per instruction is 1/1.2e9 ≈ 834 picoseconds¹, or 0.834 nanoseconds. Multiplying this out comes to an additional ~2.9 µs, which is roughly in line with the measurements, given that the estimate ignores any potential gains from parallel execution, branch prediction, and cache hits. A closer look at the disassembly reveals that 2 or 3 of these instructions (add, compare, and branch) are candidates for parallel execution, bringing the estimate closer to the observed 2 µs difference.
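The back-of-the-envelope arithmetic can be reproduced with the following throwaway C program. The 1.2 GHz clock rate and instruction counts are the figures quoted above; dual dispatch is still ignored.

#include <stdio.h>

int main(void)
{
    const double clockHz   = 1.2e9;     /* clock rate used in the estimate above */
    const int    numPoints = 24 * 24;   /* 576 points */
    const int    extraInsn = 6;         /* extra integer instructions per point */

    double nsPerInsn = 1e9 / clockHz;                            /* ~0.834 ns */
    double extraUs   = numPoints * extraInsn * nsPerInsn * 1e-3; /* ~2.9 us   */

    printf("per-instruction time: %.3f ns\n", nsPerInsn);
    printf("additional time:      %.2f us\n", extraUs);
    return 0;
}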

The performance gain between the C and inline-PS versions is primarily a result of the reduced number of loads and stores. The scalar Verlet function needs a minimum of 6 single-precision floating-point loads (current and previous position) and 3 stores, for a total of 9 memory accesses. By contrast, the Paired Singles version requires only 4 loads and 2 stores, reducing the number of memory accesses by 33%.

Returning to the C-version, the combined number of loads and stores is more than double that of the inline-PS and assembler versions: 15 versus 6. Yet, while the total number of instructions is almost twice that of the inline-PS version (40 vs. 22), most of the extra instructions do not appear to have much effect on overall performance. Reviewing the Espresso Core Block Diagram on page 23 of the IBM Espresso RISC Microprocessor Developer's User Manual alongside the disassembly (Figure 5 in the Appendix), it becomes clear that the compiler interleaved FPU and integer instructions to maximize parallel execution.

While the number of miscellaneous instructions is significantly greater than in the inline-PS version, the extra overhead is largely masked by parallel execution. The further reduction of integer instructions in the hand-coded assembler version yields an additional 2 µs gain.

Moving on, the constraints function shows a more interesting picture. First, the increase in performance between the C and inline-PS/assembler versions is much more pronounced, while the difference between the inline-PS and assembler versions is merely 100 µs over 10 iterations!


Figure 3: Overall performance timing for constraints (µs) over 10 iterations

The C-version executes at roughly 115 µs/iteration, whereas the inline-PS and assembler versions execute at 51 µs and 44 µs respectively. Without taking memory access into account, an approximately 2x performance increase is obtained by switching to inline-PS / assembler.

NOTE:
Unfortunately, the reason for this improvement has little to do with Paired Singles. The disassembled C-version reveals a branch to the sqrtf function. This function alone is responsible for roughly 500 µs of the total time, or 50 µs per iteration. While this does not directly relate to optimizing with PS, it is worth pointing out that the call could be replaced with the reciprocal square-root estimate intrinsic _FRSQRTE followed by the reciprocal estimate _FRES, as sketched below.
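A hedged sketch of that replacement follows. It assumes the toolchain exposes _FRSQRTE and _FRES as single-precision intrinsics, as named in the note above; the exact header and prototypes depend on the SDK. The result is an estimate, so a Newton-Raphson refinement step can be added if more precision is required.

/* sqrt(x) ~= 1 / (1/sqrt(x)): chain the two estimate intrinsics.
 * Assumes float _FRSQRTE(float) and float _FRES(float) are available. */
static inline float ApproxSqrtf(float x)
{
    float rsqrtEst = _FRSQRTE(x);   /* estimate of 1/sqrt(x)              */
    return _FRES(rsqrtEst);         /* estimate of 1/(1/sqrt(x)), sqrt(x) */
}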

The inner-loop instruction make-up reveals that the differences between the versions of the constraints function are greater than they were for the Verlet function.


Figure 4: Constraints inner-loop instruction make-up

The difference in instruction count between the PS- and C-versions is minimal: 54 versus 55 instructions. As with the previous function, there appears to be little correlation between the number of instructions and the execution time.

Looking at the memory access patterns, the inline-PS version makes a total of 6+4+5=15 memory accesses, whereas the C-version makes 14+6+2=22. The disassembly in Figure 8 and Figure 9 paints a much clearer picture: what matters is the distribution and scheduling of the instructions rather than their quantity. Most of the instructions in the C-version (Figure 8) are single-precision floating-point instructions executed back-to-back. While the compiler does a good job of managing the dependencies between FP registers, there is little room for parallelization.

On the flip side, the inline-PS and assembler versions show a much more balanced mix of instruction types, allowing more efficient use of the ALU, IU, and FPU. In turn, the additional miscellaneous instructions execute in parallel and are essentially free.

This also explains the relatively small difference between the hand-optimized assembler version and the compiler-generated inline-PS version. While the hand-optimized assembler version uses fewer instructions, the inline-PS version manages, to a certain degree, to hide the additional instruction execution time.

Analysis

Using Paired Singles improves performance primarily because there are fewer load and store instructions. Secondarily, fewer instructions are executed overall, which also contributes to the improvement.

Both functions benefit from using inline-PS and assembler, but in different ways. The Verlet function is small and may be reduced to a minimum in assembler. Due to the deterministic structure of the function, the gains are relatively modest and come primarily from a reduction in the number of memory accesses.

On the surface, the constraints function performed even better, but the results were skewed by an external call to the sqrtf function. Because the constraints function is more complex than the Verlet function, there is also more room for parallelization.

There are many factors to consider when optimizing code. First and foremost, identify a bottleneck in the code and understand WHY it is a bottleneck. Often, restructuring the data and/or changing the algorithm results in larger performance increases than rewriting a function using inline-PS or assembler. In our example, aligning the data to a 4-word vector, with the 4th component zeroed out, allows easier loads and stores for Paired Singles.
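As an illustration of that layout (the names are assumptions, not the demo's declarations, and the exact alignment syntax depends on the compiler):

/* Each point padded to a 4-word (16-byte) record so that (x,y) and
 * (z,pad) can each be moved with one 64-bit psq_l / psq_st access.  */
typedef struct {
    float x;
    float y;
    float z;
    float pad;   /* kept at 0.0f; the unused lane is harmless */
} ClothPoint;

/* With this layout, loading a point takes two paired-single loads
 * instead of three scalar lfs instructions, and likewise for stores. */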

Using Verlet integration instead of an nth-order Runge-Kutta method, for a simulation that does not need a tremendous amount of accuracy, decreased the number of calculations and made the integration more stable.

When optimizing code for the Wii U, understand that the best way to optimize performance is to:

  1. Reduce the number of memory accesses (PS helps here) and ensure they do not span multiple cache lines.
  2. Organize the code and data to maximize parallel execution (IU/FPU/BU/memory/etc.)
  3. Reduce the number of instructions where possible.

While reducing the number of instructions does make a difference for functions that are fixed and deterministic, the constraints function demonstrates that more complex functions may also create more opportunity for parallelization.

Summary

There are many factors that affect performance. Using Paired Singles may improve code density and reduce the number of load and store instructions and ALU operations; NOT using Paired Singles may waste precious cycles. Combined with other optimization methods, such as using Locked Cache, Paired Singles may make a significant difference in the performance of your code.

Depending on the size of the functions, the available time, and familiarity with PowerPC assembly, the developer must determine whether to use assembler or inline-PS. While assembly language can provide better results, it is generally more time-consuming to write and maintain.

In conclusion, for a significant increase in performance, use Paired Singles where it makes sense, and avoid converting between register modes.

Appendix


verletLoop:
	slwi      	r12, r8, 5							
	lis       	r7, %hiadj(force)
	addis     	r10, r12, %hiadj(points)
	addi      	r7, r7, %lo(force)
	addi      	r12, r10, %lo(points)
	lfs       	f8, 8(r7)
	lfs       	f9, 8(r12)
	lfs       	f12, 0x18(r12)
	fadds     	f0, f9, f9
	lfs       	f11, 0(r12)
	lwz       	r0, 4(r12)
	lfs       	f13, 0x14(r12)
	stw       	r0, 0x14(r12)
	fsubs     	f7, f0, f12
	lwz       	r0, 0xc(r12)
	lfs       	f10, 4(r12)
	fadds     	f12, f11, f11
	lfs       	f0, 0x10(r12)
	fadds     	f6, f10, f10
	lfs       	f10, 4(r7)
	fsubs     	f11, f12, f0
	lfs       	f4, 0xc(r12)
	lfs       	f12, 0(r7)
	fsubs     	f13, f6, f13
	lwz       	r11, 0(r12)
	stw       	r0, 0x1c(r12)
	lwz       	r10, 8(r12)
	addi      	r8, r8, 1
	stw       	r11, 0x10(r12)
	fmadds    	f10, f10, f5, f13
	stw       	r10, 0x18(r12)
	fmadds    	f9, f8, f5, f7
	stfs      	f4, 0x14(sp)
	fmadds    	f11, f12, f5, f11
	stfs      	f10, 4(r12)
	cmpwi     	r8, 0x240
	stfs      	f9, 8(r12)
	stfs      	f4, 0xc(r12)
	stfs      	f11, 0(r12)
	blt       	verletLoop

Figure 5: Verlet C-version, 40 instructions


verletLoop:
	lis       	r11, %hiadj(points)						
	slwi      	r0, r9, 5
	addi      	r11, r11, %lo(points)
	add       	r12, r11, r0
	addi      	r10, r12, 0x10
	psq_l     	f6, 0(r12), 0, 0
	psq_l     	f10, 0(r10), 0, 0
	ps_add    	f11, f6, f6
	ps_sub    	f11, f11, f10
	psq_l     	f7, 8(r12), 0, 0
	ps_madd   	f10, f13, f12, f11
	psq_l     	f8, 8(r10), 0, 0
	ps_add    	f9, f7, f7
	ps_sub    	f9, f9, f8
	ps_madd   	f11, f0, f12, f9
	psq_st    	f10, 0(r12), 0, 0
	psq_st    	f11, 8(r12), 0, 0
	addi      	r9, r9, 1
	psq_st    	f6, 0(r10), 0, 0
	cmpwi     	r9, 0x240
	psq_st    	f7, 8(r10), 0, 0
	blt       	verletLoop

Figure 6: Verlet 'inline-PS' version, 22 instructions


verletLoop:
	psq_l     	f3, 0(r4), 0, 0							
	psq_l     	f4, 8(r4), 0, 0
	psq_l     	f5, 0x10(r4), 0, 0
	psq_l     	f6, 0x18(r4), 0, 0
	ps_add    	f7, f3, f3
	ps_add    	f8, f4, f4
	ps_sub    	f7, f7, f5
	ps_sub    	f8, f8, f6
	ps_madd   	f7, f1, f0, f7
	ps_madd   	f8, f2, f0, f8
	psq_st    	f7, 0(r4), 0, 0
	psq_st    	f8, 8(r4), 0, 0
	psq_st    	f3, 0x10(r4), 0, 0
	psq_st    	f4, 0x18(r4), 0, 0
	addi      	r4, r4, 0x20
	bdnz      	verletLoop

Figure 7: Verlet ASM-version, 16 instructions


ConstraintLoop:
	lis       	r12, %hiadj(constraints)					
	slwi      	r0, r28, 3
	subi      	r12, r12, %lo(constraints)
	add       	r12, r12, r0
	lwz       	r0, 4(r12)
	lis       	r31, %hiadj(points)
	lwz       	r10, 0(r12)
	slwi      	r11, r0, 5
	addi      	r31, r31, %lo(points)
	slwi      	r0, r10, 5
	lfsx      	f9, r31, r11
	lfsx      	f10, r31, r0
	add       	r29, r31, r0
	add       	r30, r31, r11
	lfs       	f13, 8(r29)
	lfs       	f12, 8(r30)
	fsubs     	f29, f9, f10
	lfs       	f0, 4(r29)
	fsubs     	f27, f12, f13
	lfs       	f11, 4(r30)
	fmr       	f1, f29
	fsubs     	f28, f11, f0
	fmr       	f3, f27
	fmr       	f2, f28
	bl        	0x2000e2c            ; call to vecLength3
	lis       	r9, %hiadj(.L967)
	lfs       	f11, %lo(.L967)(r9)
	fsubs     	f0, f1, f11
	lis       	r10, %hiadj(.L1642)
	fdivs     	f7, f0, f1
	lfs       	f12, %lo(.L1642)(r10)
	fmuls     	f7, f12, f7
	fmuls     	f10, f29, f7
	lfs       	f8, 0(r29)
	fmuls     	f9, f28, f7
	lfs       	f11, 4(r29)
	fadds     	f8, f8, f10
	lfs       	f12, 8(r29)
	fmuls     	f0, f27, f7
	fadds     	f11, f11, f9
	fadds     	f12, f12, f0
	stfs      	f8, 0(r29)
	stfs      	f11, 4(r29)
	stfs      	f12, 8(r29)
	lfs       	f13, 0(r30)
	lfs       	f7, 4(r30)
	fsubs     	f13, f13, f10
	lfs       	f8, 8(r30)
	fsubs     	f7, f7, f9
	fsubs     	f8, f8, f0
	stfs      	f13, 0(r30)
	addi      	r28, r28, 1
	stfs      	f7, 4(r30)
	cmpwi     	r28, 0x450
	stfs      	f8, 8(r30)
	blt       	0x2001230

Figure 8: SatisfyConstraints C-version, 55 instructions


ConstraintLoop:
	slwi      	r0, r10, 3						    	
	lis       	r8, %hiadj(constraints)
	add       	r11, r8, r0
	lwzu      	r8, %lo(constraints)(r11)
	lis       	r7, %hiadj(points)
	lwz       	r0, 4(r11)
	slwi      	r12, r8, 5
	addi      	r7, r7, %lo(points)
	slwi      	r8, r0, 5
	add       	r12, r7, r12
	add       	r8, r7, r8
	psq_l     	f2, 0(r12), 0, 0
	psq_l     	f4, 0(r8), 0, 0
	psq_l     	f5, 8(r8), 0, 0
	ps_sub    	f6, f4, f2
	psq_l     	f3, 8(r12), 0, 0
	ps_mul    	f9, f6, f6
	ps_sub    	f7, f5, f3
	ps_mul    	f10, f7, f7
	ps_sum0   	f12, f9, f10, f9
	lis       	r8, %hiadj(restLength)
	ps_add    	f0, f12, f10
	addi      	r8, r8, %lo(restLength)
	ps_rsqrte 	f8, f0
	psq_l     	f0, 0(r8), 0, 0
	lis       	r8, %hiadj(half)
	ps_res    	f12, f8
	addi      	r8, r8, %lo(half)
	ps_sub    	f9, f12, f0
	psq_l     	f10, 0(r8), 0, 0
	ps_muls0  	f8, f9, f8
	ps_muls0  	f9, f8, f10
	ps_muls0  	f0, f6, f9
	ps_add    	f8, f2, f0
	psq_st    	f8, 0(r12), 0, 0
	lwz       	r0, 0(r11)
	ps_muls0  	f9, f7, f9
	slwi      	r8, r0, 5
	ps_add    	f10, f3, f9
	add       	r8, r7, r8
	psq_st    	f10, 8(r8), 0, 0
	lwz       	r0, 4(r11)
	slwi      	r0, r0, 5
	ps_sub    	f12, f4, f0
	add       	r8, r7, r0
	psq_st    	f12, 0(r8), 0, 0
	lwz       	r0, 4(r11)
	addi      	r10, r10, 1
	slwi      	r0, r0, 5
	ps_sub    	f0, f5, f9
	add       	r8, r7, r0
	cmpwi     	r10, 0x450
	psq_st    	f0, 8(r8), 0, 0
	blt       	ConstraintLoop
	

Figure 9: SatisfyConstraints inline-PS version, 54 instructions


ASC_InnerLoop:
	lwz		r6,0*4(r5)							
	lwz		r7,1*4(r5)
	slwi		r6,r6,5
	slwi		r7,r7,5
	psq_lux		fp2,r6,r3,0,0
	psq_l		fp3,8(r6),0,0
	psq_lux		fp4,r7,r3,0,0
	psq_l		fp5,8(r7),0,0
	ps_sub		fp6,fp4,fp2
	ps_sub		fp7,fp5,fp3
	ps_mul		fp8,fp6,fp6
	ps_mul		fp9,fp7,fp7
	ps_sum0		fp8,fp8,fp9,fp8
	ps_add		fp8,fp8,fp9
	ps_rsqrte	fp8,fp8
	ps_res		fp9,fp8
	ps_sub		fp9,fp9,fp1
	ps_muls0	fp9,fp9,fp8
	ps_muls0	fp9,fp9,fp0
	ps_muls0	fp6,fp6,fp9
	ps_muls0	fp7,fp7,fp9
	ps_add		fp2,fp2,fp6
	ps_add		fp3,fp3,fp7
	ps_sub		fp4,fp4,fp6
	ps_sub		fp5,fp5,fp7
	psq_st		fp2,0(r6),0,0
	psq_st		fp3,8(r6),0,0
	psq_st		fp4,0(r7),0,0
	psq_st		fp5,8(r7),0,0
	addi		r5,r5,2*4
	bdnz       	ASC_InnerLoop
	

Figure 10: SatisfyConstraints ASM-version, 31 instructions



¹ Does not include the possibility of dispatching 2 instructions at a time.

Revision History

2013/11/01 Convert PDF files to HTML.

