Paired Single Optimizations

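To make the best use of the GHS compiler, math developers should be aware of the following, each of which is illustrated by the examples below: a union that aliases scalar and paired-single members restricts the optimizer; mixing scalar and paired-single code produces lower-quality code than either style alone; and the compiler does not auto-vectorize scalar code into paired-single code.

Therefore, the following is advised: do not use unions with paired singles, do not mix scalar and paired-single code within the same computation, and write paired-single code explicitly with the intrinsics.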

Code Generation Examples

The following examples illustrate the suggested restrictions by describing the generated code in terms of instruction count and cycles per call.

Example 1 - Scalar Vector Class (baseline)

Let’s start with a basic example before adding complexity. Consider the following C++ code, which computes a cross product and returns the squared magnitude of the resulting vector:

float ps_test1(float a1, float a2, float a3, float b1, float b2, float b3)
{
   ScalarVec a(a1, a2, a3, 1.0f);
   ScalarVec b(b1, b2, b3, 1.0f);
   ScalarVec c = a.cross(b);
   return c.d2();
}	

Suppose the vector class is defined as follows:

class ScalarVec
{
	public:
		ScalarVec() {}	// needed by cross(), which default-constructs its result
		ScalarVec(float x_in, float y_in, float z_in, float w_in)
			: x(x_in), y(y_in), z(z_in), w(w_in)  {}
		ScalarVec cross(const ScalarVec &b) const
		{
			ScalarVec ret;
			const ScalarVec &a = *this; 
			ret.x = a.y*b.z - a.z*b.y;
			ret.y = -(a.x * b.z - a.z*b.x); 
			ret.z = a.x*b.y-a.y*b.x; 
			ret.w = 1;
			return ret;
		}
		float d2()  { return x*x + y*y + z*z;} 
		float x,y,z,w;
};	

The result of calculating the distance squared of the cross product is:

GHS
fmuls	f12, f3, f4
fmuls	f0, f2, f4
fmsubs	f13, f1,f6, f12
fmuls	f11, f3, f5
fneg	f8, f13
fmsubs	f10, f2,f6, f11
fmuls	f13, f8, f8
fmsubs	f12, f1,f5, f0
fmadds	f9, f10,f10, f13
fmadds	f1, f12,f12, f9
blr
Num Instructions: 11
Cycles per call: 28

The compiler has inlined the constructor, cross, and d2 calls, and has used fused multiply-add and multiply-subtract instructions (fmadds, fmsubs).
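
To make the data flow in the listing above easier to follow, here is a minimal sketch that replays its ten arithmetic instructions in portable C++. It assumes that fmuls, fmsubs, fmadds, and fneg behave as single-precision multiply, fused multiply-subtract (a*c - b), fused multiply-add (a*c + b), and negate. The helper functions and the name ps_test1_replay are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-ins for the instructions in the listing.
// Assumed semantics: fmsubs d,a,c,b -> a*c - b ; fmadds d,a,c,b -> a*c + b.
static inline float fmuls(float a, float c)           { return a * c; }
static inline float fmsubs(float a, float c, float b) { return std::fma(a, c, -b); }
static inline float fmadds(float a, float c, float b) { return std::fma(a, c, b); }

// Replays the listing with f1..f6 = a1, a2, a3, b1, b2, b3.
float ps_test1_replay(float f1, float f2, float f3, float f4, float f5, float f6)
{
    float f12 = fmuls(f3, f4);          // a3*b1
    float f0  = fmuls(f2, f4);          // a2*b1
    float f13 = fmsubs(f1, f6, f12);    // a1*b3 - a3*b1
    float f11 = fmuls(f3, f5);          // a3*b2
    float f8  = -f13;                   // fneg: c.y
    float f10 = fmsubs(f2, f6, f11);    // c.x = a2*b3 - a3*b2
    f13       = fmuls(f8, f8);          // c.y^2
    f12       = fmsubs(f1, f5, f0);     // c.z = a1*b2 - a2*b1
    float f9  = fmadds(f10, f10, f13);  // c.x^2 + c.y^2
    return fmadds(f12, f12, f9);        // c.z^2 + c.x^2 + c.y^2
}

// Sanity check: (1,2,3) x (4,5,6) = (-3, 6, -3), squared magnitude 54.
void check_ps_test1_replay()
{
    assert(ps_test1_replay(1, 2, 3, 4, 5, 6) == 54.0f);
}
```

Each line of the sketch corresponds to one instruction in the listing, in the same order, which shows that the whole of ps_test1 reduces to register-to-register arithmetic with no memory traffic.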

Example 2 - Scalar Vector Class with a union (Poor choice)

Given Example 1, suppose you want to start vectorizing it. Should you use a union to alias the scalar components with paired-single members? The idea may occur to you if you are familiar with how gcc vectorization works (e.g. http://ds9a.nl/gcc-simd/example.html).

Answer: NO. This is a poor choice with the GHS compiler. Do not use a union to alias paired-single members with scalars. If you are porting code from another platform that does this, consider changing it.

This can be demonstrated by just adding the union (without using it) to the code in Example 1:

class ScalarVecBad
{
	public:
		ScalarVecBad() {}	// needed by cross(), which default-constructs its result
		ScalarVecBad(float x_in, float y_in, float z_in, float w_in)
			: x(x_in), y(y_in), z(z_in), w(w_in) {}
		ScalarVecBad cross(const ScalarVecBad &b) const
		{
			ScalarVecBad ret; 
			const ScalarVecBad &a = *this; 
			ret.x = a.y*b.z - a.z*b.y;
			ret.y = -(a.x * b.z - a.z*b.x); 
			ret.z = a.x*b.y-a.y*b.x; 
			ret.w = 1;
			return ret;
		}		
		float d2()
		{
			return x*x + y*y + z*z;
		}
 
		union {
			struct {
				float x;
				float y;
				float z;
				float w;
			};
			struct {
			   __vec2x32float__ psa;
			   __vec2x32float__ psb;
			};
		};
};	

The result of calculating the distance squared of the cross product is now:

GHS
fmuls	f13, f3, f5
fmuls	f12, f3, f4
fmsubs	f10, f2,f6, f13
fmuls	f11, f2, f4
stwu	sp, -72(sp)
stfs	f10, 8(sp)
fmsubs	f7, f1,f6, f12
lwz	r0, 8(sp)
fmsubs	f12, f1,f5, f11
stw	r0, 56(sp)
fneg	f9, f7
stfs	f12, 16(sp)
lwz	r11, 16(sp)
stfs	f9, 12(sp)
lwz	r9, 12(sp)
stw	r9, 60(sp)
lfs	f11, 60(sp)
lfs	f13, 56(sp)
fmuls	f8, f11, f11
stw	r11, 64(sp)
lfs	f9, 64(sp)
fmadds	f10, f13,f13, f8
addi	sp, sp, 72
fmadds	f1, f9,f9, f10
blr
Num Instructions: 25
Cycles per call: 50

The instruction count has more than doubled (11 to 25) and the cycles per call have nearly doubled (28 to 50) for the same amount of work. The union has restricted the GHS optimizer, which now generates a large number of unnecessary stack reads and writes. Do not use unions with paired singles.

Example 3 - Paired Single Vector Class (baseline)

The previous examples were written using scalar code, but in a real system you will want a much richer vector class using paired single operations.

class PSVec
{
	public:
		inline f32x2 makeps(float a, float b) { f32x2 ret  = {a,b}; return ret; }
		PSVec(float x_in, float y_in, float z_in, float w_in)
		{	psa = makeps(x_in, y_in);  psb = makeps(z_in, w_in); }
		PSVec() {}
		PSVec cross(const PSVec &b) const
		{
			PSVec dst; 
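			f32x2 fp0, fp1;
			f32x2 fp2 = {psb[0], psb[0]};		// (a.z, a.z)
			f32x2 fp3 = {b.psb[0], b.psb[0]};	// (b.z, b.z)
			f32x2 fp4, fp5, fp6, fp7, fp8, fp9, fp10;
			fp1 = b.psa;				// (b.x, b.y)
			fp0 = psa;				// (a.x, a.y)
			fp6 = __PS_MERGE10(fp1, fp1);		// (b.y, b.x)
			fp4 = __PS_MUL(fp1, fp2);		// (b.x*a.z, b.y*a.z)
			fp7 = __PS_MULS0(fp1, fp0);		// (b.x*a.x, b.y*a.x)
			fp5 = __PS_MSUB(fp0, fp3, fp4);		// (a.x*b.z - b.x*a.z, a.y*b.z - b.y*a.z)
			fp8 = __PS_MSUB(fp0, fp6, fp7);		// (a.x*b.y - b.x*a.x, a.y*b.x - b.y*a.x)
			fp9 = __PS_MERGE11(fp5, fp5);		// (c.x, c.x)
			fp10 = __PS_MERGE01(fp5, fp8);		// (-c.y, -c.z)
			dst.psa = fp9;
			fp10 = __PS_NEG(fp10);			// (c.y, c.z)
			dst.psb = fp10;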
			return dst;
		}
 
		float d2()
		{
			__vec2x32float__ xxyy = __PS_MUL(psa, psa);
			__vec2x32float__ xxyy_zzww = __PS_MADD(psb,psb, xxyy);
			__vec2x32float__ sum = __PS_SUM0(xxyy_zzww, xxyy_zzww, xxyy);
			return sum[0];
		}
 
		__vec2x32float__ psa;
		__vec2x32float__ psb;
};	

The result of calculating the distance squared of the cross product is:

GHS
ps_merge00	f7, f4, f5
ps_merge00	f10, f1, f2
ps_muls0	f12, f7, f10
ps_merge10	f11, f7, f7
ps_msub	f9, f10, f11,f12
ps_merge00	f12, f6, f6
ps_merge00	f11, f3, f3
ps_mul	f0, f7, f11
ps_msub	f13, f10, f12,f0
ps_merge01	f8, f13, f9
ps_neg	f0, f8
ps_merge11	f13, f13, f13
ps_mul	f8, f13, f13
ps_madd	f9, f0, f0,f8
ps_sum0	f10, f9, f9,f8
fmr	f1, f10
blr
Num Instructions: 17
Cycles per call: 49

The code generated by the GHS compiler is short and clean: every arithmetic operation is a paired-single instruction, and there are no stack accesses.
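
For readers who want to step through the d2() data flow without target hardware, the following is a minimal portable sketch. The PS2 type and the ps_mul, ps_madd, and ps_sum0 helpers are hypothetical stand-ins. Their semantics are an assumption modeled on the instructions of the same names (element-wise multiply, element-wise multiply-add, and a sum that places a.ps0 + b.ps1 in slot 0), not the compiler's actual intrinsics.

```cpp
#include <cassert>

// Hypothetical two-float stand-in for the paired-single type.
struct PS2 { float ps0, ps1; };

// Assumed semantics, modeled on the ps_mul, ps_madd, and ps_sum0 instructions.
static inline PS2 ps_mul(PS2 a, PS2 c)
{
    return { a.ps0 * c.ps0, a.ps1 * c.ps1 };
}
static inline PS2 ps_madd(PS2 a, PS2 c, PS2 b)
{
    return { a.ps0 * c.ps0 + b.ps0, a.ps1 * c.ps1 + b.ps1 };
}
// Slot 0 = a.ps0 + b.ps1, slot 1 = c.ps1.
static inline PS2 ps_sum0(PS2 a, PS2 c, PS2 b)
{
    return { a.ps0 + b.ps1, c.ps1 };
}

// Same three steps as PSVec::d2(), for psa = (x, y) and psb = (z, w).
float d2_emulated(PS2 psa, PS2 psb)
{
    PS2 xxyy      = ps_mul(psa, psa);                     // (x*x, y*y)
    PS2 xxyy_zzww = ps_madd(psb, psb, xxyy);              // (z*z + x*x, w*w + y*y)
    PS2 sum       = ps_sum0(xxyy_zzww, xxyy_zzww, xxyy);  // slot 0: z*z + x*x + y*y
    return sum.ps0;
}

// Sanity check: (1, 2, 3, 1) has squared length 1 + 4 + 9 = 14; w is ignored.
void check_d2_emulated()
{
    PS2 xy = { 1.0f, 2.0f };
    PS2 zw = { 3.0f, 1.0f };
    assert(d2_emulated(xy, zw) == 14.0f);
}
```

Under these assumptions only slot 0 of the final sum is returned, so the w component never contributes to the result.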

Example 4 - Mixing Scalar with Vector code (bad idea)

Example 3 is acceptable because both the squared-magnitude (d2) and cross-product code are written using the paired-single intrinsics. But suppose d2 were instead written with scalar arithmetic on the individual elements:

float d2()
{
   return psa[0]*psa[0]+psa[1]*psa[1]+psb[0]*psb[0]+psb[1]*psb[1];
}

The result of calculating the distance squared of the cross product is now:

GHS
(assembly not shown)
Num Instructions: 25
Cycles per call: 53

As you might expect, mixing scalar code (d2) with paired-single code (the cross product) generates lower-quality code than either purely paired-single code (17 instructions, 49 cycles in Example 3) or purely scalar code (11 instructions, 28 cycles in Example 1). Do not mix them.

Example 5 - Autovectorization

Consider the following code to add 4 values at a time. This example was deliberately written to look like something that could be vectorized.

float add4values(float a1, float a2, float a3, float a4, float b1, float b2, float b3, float b4)
{
	for (int i = 0; i < 1000; i++)
	{
		a1 += b1;
		a2 += b2;
 
		a3 += b3;
		a4 += b4;
	}
	return a1+a2+a3+a4;
}

This code could be vectorized to use paired singles. The compiler generates the following:

GHS
	li	r0, 100
	mtctr	r0
.L844:
	fadds	f1, f1, f5
	fadds	f2, f2, f6
	fadds	f4, f4, f8
	fadds	f1, f1, f5
	fadds	f2, f2, f6
	fadds	f4, f4, f8
	fadds	f1, f1, f5
	fadds	f3, f3, f7
	fadds	f4, f4, f8
	fadds	f1, f1, f5
	fadds	f2, f2, f6
	fadds	f3, f3, f7
	fadds	f2, f2, f6
	fadds	f4, f4, f8
	fadds	f3, f3, f7
	fadds	f1, f1, f5
	fadds	f3, f3, f7
	fadds	f4, f4, f8
	fadds	f2, f2, f6
	fadds	f3, f3, f7
	fadds	f4, f4, f8
	fadds	f2, f2, f6
	fadds	f3, f3, f7
	fadds	f4, f4, f8
	fadds	f1, f1, f5
	fadds	f2, f2, f6
	fadds	f4, f4, f8
	fadds	f3, f3, f7
	fadds	f2, f2, f6
	fadds	f1, f1, f5
	fadds	f3, f3, f7
	fadds	f4, f4, f8
	fadds	f2, f2, f6
	fadds	f1, f1, f5
	fadds	f3, f3, f7
	fadds	f2, f2, f6
	fadds	f1, f1, f5
	fadds	f4, f4, f8
	fadds	f3, f3, f7
	fadds	f1, f1, f5
	bdnz	.L844
	fadds	f0, f1, f2
	fadds	f13, f0, f3
	fadds	f1, f13, f4
	blr
Num Instructions: 47
Cycles per call: 4420

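The GHS compiler does not auto-vectorize, so it will never convert this code to paired-single code. It does partly compensate for this by unrolling the loop: the listing above performs 10 iterations of the source loop per pass (40 fadds instructions) and loops 100 times instead of 1000.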
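
Because the compiler will not do this conversion, a paired version has to be written by hand. The following portable sketch shows the shape such a rewrite could take. The Pair type and the pair_add helper are hypothetical stand-ins; on the target their role would be played by the paired-single type and a paired-single add, which this document does not show, so treat the mapping as an assumption.

```cpp
#include <cassert>

// Hypothetical two-float stand-in for the paired-single type.
struct Pair { float ps0, ps1; };

// Assumed element-wise add of both slots.
static inline Pair pair_add(Pair a, Pair b)
{
    return { a.ps0 + b.ps0, a.ps1 + b.ps1 };
}

// Same computation as add4values, with the four accumulators held in two pairs.
float add4values_paired(float a1, float a2, float a3, float a4,
                        float b1, float b2, float b3, float b4)
{
    Pair a12 = { a1, a2 };
    Pair a34 = { a3, a4 };
    const Pair b12 = { b1, b2 };
    const Pair b34 = { b3, b4 };
    for (int i = 0; i < 1000; i++)
    {
        a12 = pair_add(a12, b12);   // a1 += b1; a2 += b2;
        a34 = pair_add(a34, b34);   // a3 += b3; a4 += b4;
    }
    return a12.ps0 + a12.ps1 + a34.ps0 + a34.ps1;
}

// Sanity check: 1000 * (1 + 2 + 3 + 4) = 10000.
void check_add4values_paired()
{
    assert(add4values_paired(0, 0, 0, 0, 1, 2, 3, 4) == 10000.0f);
}
```

The loop body drops from four scalar adds to two paired adds per iteration. Whether that is faster on the target than the unrolled scalar loop would need to be measured.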

Revision History

2014/06/24 Reworked external link.
2013/05/08 Automated cleanup pass.
2012/06/15 Initial version.

