Convert Linear Textures to Tiled Textures
Performance Considerations

Summary

When working with textures on the Wii U, it is ideal to have the textures in a tiled [1] format for efficiency reasons. However, this raises the following two questions:

Question 1: What is the most efficient way to convert linear textures into the tiled format?
Question 2: What is the most efficient way to modify tiled textures during runtime?

Converting Linear Textures to Tiled Format

The ideal case is to pre-tile the textures offline. However, if this is not convenient or possible, you may tile the texture at runtime. If your textures are linear aligned as defined by the GX2 documentation, GX2CopySurface is the fastest way to convert linear textures to tiled textures. If your textures are not linear aligned, they need special handling. For such a linear special texture, there are three conversion options:

Option 1: Use the slow GX2CopySurface on the linear special texture.
Option 2: Use the hardware tiling aperture feature to convert the texture.
Option 3: Use the CPU to copy to a linear aligned texture so that the fast GX2CopySurface may be used.

Given these three options, it is faster to use the second option for small textures and the third option for medium to large textures. The reasoning and evidence are provided in the test sections below.

Modifying Tiled Textures During Runtime

There are two options for modifying an already tiled texture at runtime:

Option 1: Use the hardware tiling aperture feature to directly modify the tiled texture.
Option 2: Modify a linear copy of the texture and reconvert it to the tiled format.

For large modifications, the second option is the clear winner. The reasoning and evidence are provided in the test sections below.

The Hardware Tiling Aperture Feature

To facilitate access to tiled textures, the Wii U has up to 30 hardware tiling apertures [2] available. The limit on the amount of memory that may be addressed is 256 MB to 512 MB. For more information about using the hardware tiling aperture feature, see GX2 Texture APIs.
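
To make the idea of a tiled layout concrete, the following C sketch reorders a linear image into 8x8-pixel tiles stored one after another. This is a deliberately simple toy layout chosen for illustration only. It is not one of the GX2 tile modes, and the names TILE_DIM, toy_tiled_index, and toy_tile_image are hypothetical. A tiling aperture performs a remapping of this general kind, so that the CPU can keep addressing the texture linearly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy layout: the image is split into TILE_DIM x TILE_DIM pixel tiles,
 * tiles are stored row-major, and pixels inside a tile are row-major.
 * This is NOT an actual GX2 tile mode. */
#define TILE_DIM 8u

/* Map pixel (x, y) of an image `width` pixels wide to its index in the
 * toy tiled layout. `width` must be a multiple of TILE_DIM. */
static size_t toy_tiled_index(uint32_t x, uint32_t y, uint32_t width)
{
    assert(width % TILE_DIM == 0 && x < width);
    const size_t tilesPerRow = width / TILE_DIM;
    const size_t tileIndex = (size_t)(y / TILE_DIM) * tilesPerRow + x / TILE_DIM;
    const size_t inTile = (size_t)(y % TILE_DIM) * TILE_DIM + x % TILE_DIM;
    return tileIndex * TILE_DIM * TILE_DIM + inTile;
}

/* Reorder a tightly packed linear image into the toy tiled layout. */
static void toy_tile_image(uint32_t *tiled, const uint32_t *linear,
                           uint32_t width, uint32_t height)
{
    assert(height % TILE_DIM == 0);
    for (uint32_t y = 0; y < height; ++y)
        for (uint32_t x = 0; x < width; ++x)
            tiled[toy_tiled_index(x, y, width)] = linear[(size_t)y * width + x];
}
```

In this toy layout, pixel (8, 0) of a 16-pixel-wide image lands at index 64, immediately after the 64 pixels of the first tile.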

Tests Performed

Each test uses a highly modified version of the tileAperture demo. For the test, we deal with the texture linearly because this is the most likely scenario in which a linear texture is converted to a tiled texture. For specialized versions of converting or creating textures, alternative algorithms may be more efficient.

Two additional major changes were made to the demo:

Each test was performed by updating textures with sizes ranging from 64x64 pixels to 1024x1024 pixels, at 32-bits per pixel (RGBA8 format). For each test, the following metrics were collected:

The aggregate of these times is the total time required to create or modify a tiled texture in the given test.
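
As a minimal sketch of this aggregation, the following C fragment sums per-step timings into a total. It assumes that the collected metrics are the ones named in the Timing results table in the Appendix (Update, CPU, GPU, Alloc, Free, Copy). The struct and function names are hypothetical and are not taken from the demo source.

```c
#include <assert.h>

/* Per-test timings, in whatever unit the timer reports.
 * Field names follow the "Timing results" table in the Appendix. */
typedef struct TestTimings {
    double update;  /* the test operation itself, e.g. read-modify-write */
    double cpu;     /* GX2Invalidate on the CPU cache */
    double gpu;     /* GX2Invalidate on the GPU texture cache */
    double alloc;   /* obtaining a hardware tiling aperture */
    double free_;   /* freeing the hardware tiling aperture */
    double copy;    /* GX2CopySurface */
} TestTimings;

/* Total time for one test: the sum of all of its steps. */
static double total_time(const TestTimings *t)
{
    assert(t != 0);
    return t->update + t->cpu + t->gpu + t->alloc + t->free_ + t->copy;
}
```

Steps that a given method does not perform (for example, Alloc and Free when no tiling aperture is used) would simply be left at zero.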

For a closer to “real world” example, the cache is dirtied in each loop iteration using a buffer that is the size of the cache, and the instruction cache is invalidated. This simulates other parts of an application using the CPU.

NOTE:
SDK 2.09.11 was used to generate the data for this topic.

Copy-and-Tile Using GX2CopySurface on a Linear Aligned Texture

This test provides a baseline for comparing the alternative methods presented here. If the textures are linear aligned, use GX2CopySurface because it offers the best performance. Figure 1 shows the results of this test.

GX2CopySurface(linearTex, tiledTex);
GX2DrawDone();


Figure 1. Time to copy-and-tile a linear aligned texture using GX2CopySurface.

Copy-and-Tile Using GX2CopySurface on a Linear Special Texture

This test demonstrates why it is beneficial to look at the other options when starting with a linear special texture. As seen in Figure 2, GX2CopySurface for a linear special texture is slow.

GX2CopySurface(linearTex, tiledTex);
GX2DrawDone();


Figure 2. Time to copy-and-tile a linear special texture using GX2CopySurface.

Convert Using the Tiling Aperture

This test demonstrates how long it takes to write data using the hardware tiling aperture. The test is optimized to provide the most efficient algorithm available.

// Tiling Aperture Convert
for (u32 y = 0; y < texHeight; ++y)
{
    const u32 tiledLineStart = y * tiledPitch;
    const u32 specialLineStart = y * specialPitch;

    for (u32 x = 0; x < texWidth; x += 8)
    {
        u32 *writePtr = &tiledTexAddr[tiledLineStart + x];
        u32 *readPtr = &linearSpecialAddr[specialLineStart + x];

        writePtr[0] = readPtr[0];
        writePtr[1] = readPtr[1];
        writePtr[2] = readPtr[2];
        writePtr[3] = readPtr[3];
        writePtr[4] = readPtr[4];
        writePtr[5] = readPtr[5];
        writePtr[6] = readPtr[6];
        writePtr[7] = readPtr[7];
    }
}

GX2Invalidate(CPU, tiledTex);
GX2Invalidate(TEXTURE, tiledTex);

Figure 3 shows the results for writing to a texture using the tiling aperture. Since the aperture uses uncached memory, clearing or flushing of the data-cache is not required, resulting in linear growth as the size of the texture increases.


Figure 3. Conversion time using the tiling aperture.

Convert Using the CPU

This test demonstrates how long it takes to write data without using the hardware tiling aperture. The CPU performs the conversion from linear special to linear aligned, and then uses the GPU to convert the new linear aligned texture to a tiled texture.

// CPU Convert
for (u32 y = 0; y < texHeight; ++y)
{
    const u32 alignedLineStart = y * alignedPitch;
    const u32 specialLineStart = y * specialPitch;

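    // OSBlockMove takes a size in bytes: texWidth pixels * 4 bytes per pixel.
    OSBlockMove(
        &alignedTexAddr[alignedLineStart],
        &specialTexAddr[specialLineStart],
        texWidth * 4,
        FALSE);
}

// Flush the CPU-written linear aligned texture before the GPU reads it.
DCFlushRange(alignedTexAddr, alignedPitch * texHeight * 4);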
GX2CopySurface(alignedTex, tiledTex);
GX2DrawDone();

Figure 4 shows the results of this test. The tiling aperture was always faster than the CPU version. However, depending on your cache usage this outcome could change. In our testing the entire cache is marked Modified. This creates extra cache pressure for the CPU conversion method to flush the contents of each cache line before writing the new value. If the entire cache is in an Exclusive or Invalid state instead, the CPU conversion may be quicker than the tiling aperture.


Figure 4. Conversion using CPU including DCFlushRange and GX2CopySurface.
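
The row-by-row repitching step in the CPU listing can be sketched in portable C as follows. This is an off-target stand-in, not the demo code: memcpy replaces OSBlockMove so that the fragment runs on any host, and the function name repitch_rows is hypothetical. Pitches are given in pixels, while the copy size is given in bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy `height` rows of `width` 32-bit pixels between two buffers whose
 * pitches (row strides, in pixels) differ. The size passed to memcpy is
 * in bytes, so the pixel count is multiplied by the pixel size. */
static void repitch_rows(uint32_t *dst, uint32_t dstPitch,
                         const uint32_t *src, uint32_t srcPitch,
                         uint32_t width, uint32_t height)
{
    assert(width <= dstPitch && width <= srcPitch);
    for (uint32_t y = 0; y < height; ++y) {
        memcpy(&dst[(size_t)y * dstPitch],
               &src[(size_t)y * srcPitch],
               (size_t)width * sizeof(uint32_t));
    }
}
```

Only the visible pixels of each row are copied. Padding between the end of a row and the start of the next one in the destination is left untouched.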

Convert Using Locked Cache

In this test, a more optimized and efficient version is created that does not use the tiling aperture. The locked cache was used to see what speed improvements we could get. Additional notes on our locked cache implementation are provided as comments in the source code.

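// Locked Cache Convert
u32 *lcLines = (u32*)LCAlloc(1024 * 4);
// Row size in 32-byte blocks. The mask maps 128 blocks (a 1024-pixel row) to 0.
const u32 transferSize = ((texWidth * 4) / 32) & 127;

for (u32 y = 0; y < texHeight; ++y)
{
    const u32 alignedLineStart = y * alignedPitch;
    const u32 specialLineStart = y * specialPitch;

    // DMA one row from the linear special texture into the locked cache.
    LCLoadDMABlocks(
        lcLines,
        &specialTexAddr[specialLineStart],
        transferSize);

    LCWaitDMAQueue(0);

    // DMA the row from the locked cache out to the linear aligned texture.
    LCStoreDMABlocks(
        &alignedTexAddr[alignedLineStart],
        lcLines,
        transferSize);

    LCWaitDMAQueue(0);
}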

LCDealloc(lcLines);
GX2CopySurface(alignedTex, tiledTex);
GX2DrawDone();

Figure 5 shows the results of the test. When compared with the other implementation options, the locked cache actually performs the worst for the smallest textures due to the overhead of starting a DMA transfer. However, as texture size increases, the locked cache block transfer speed surpasses the startup overhead, allowing it to quickly become the best performing option for larger texture sizes.

NOTE:
For the locked cache option to be viable, each line of the linear special texture must start on a 64-byte boundary.


Figure 5. Conversion using locked cache including GX2CopySurface.
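
The transfer-size arithmetic in the locked cache listing can be checked in isolation with the following C sketch. It assumes, as the "& 127" mask in the listing suggests, that a block count of 0 stands for a full 128-block transfer. That reading is an assumption and should be confirmed against the locked cache API reference. All names here are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define LC_BLOCK_BYTES  32u   /* one DMA block, per the "/ 32" in the listing */
#define LC_MAX_BLOCKS  128u   /* (1024 * 4) bytes of locked cache / 32 */

/* Number of 32-byte blocks needed for one row of RGBA8 pixels. */
static uint32_t row_block_count(uint32_t texWidth)
{
    const uint32_t rowBytes = texWidth * 4u;
    assert(rowBytes % LC_BLOCK_BYTES == 0);
    const uint32_t blocks = rowBytes / LC_BLOCK_BYTES;
    assert(blocks >= 1 && blocks <= LC_MAX_BLOCKS);
    return blocks;
}

/* Encode the count the way the listing does: "& 127" turns 128 into 0.
 * ASSUMPTION: the DMA call treats 0 as a full 128-block transfer. */
static uint32_t encode_block_count(uint32_t blocks)
{
    return blocks & (LC_MAX_BLOCKS - 1u);
}

/* Check the 64-byte row alignment that the NOTE above requires. */
static int row_is_dma_aligned(const void *rowStart)
{
    return ((uintptr_t)rowStart % 64u) == 0;
}
```

For the texture sizes used in these tests, a 64-pixel row needs 8 blocks and a 1024-pixel row needs 128 blocks, which exactly fills the 4096-byte locked cache allocation and is encoded as 0.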

Modify Using the Tiling Aperture

This test demonstrates how long it takes to read in data using the tiling aperture, make a modification to that data, and then write that data out to the original texture. This takes significantly longer than the conversion test.

The algorithm presented here is substantially different from the one used in the Convert Using the Tiling Aperture test case. For the read-modify-write case, this simple loop performed better than the unrolled loop used in that test.

     // Tiling Aperture Modify
     for(u32 y = 0; y < texHeight; ++y)
     {
         const u32 lineStart = y * pitch;
         const u32 yBaseColor = (y * f) << 8;
    
         for (u32 x = 0; x < texWidth; ++x)
         {
             const u32 color = ((x << 8) * yBaseColor) | 0xff;
             const u32 temp = texAddr[lineStart + x];
             texAddr[lineStart + x] = temp ^ color;
         }
     }

     GX2Invalidate(CPU, texAddr);
     GX2Invalidate(TEXTURE, texAddr);

Figure 6 shows the results of this test. While the graph is linear, this method performs much worse than any other modification option. For small texture areas, the cost is reasonable, but this option should be avoided for large texture modifications.


Figure 6. Modification using the tiling aperture.

Modify Using the CPU

This test demonstrates how long it takes to read in data from a linear aligned texture, modify that data, write back to the linear aligned texture, and then convert it into a tiled texture using the GPU.

     // CPU Modify
     for (u32 y = 0; y < texHeight; ++y)
     {
         const u32 lineStart = y * alignedPitch;
         const u32 yBaseColor = (y * f) << 8;

         for (u32 x = 0; x < texWidth; x += 8)
         {
             u32 *writePtr = &alignedTexAddr[lineStart + x];
             writePtr[0] ^= (((x + 0) << 8) * yBaseColor) | 0xff;
             writePtr[1] ^= (((x + 1) << 8) * yBaseColor) | 0xff;
             writePtr[2] ^= (((x + 2) << 8) * yBaseColor) | 0xff;
             writePtr[3] ^= (((x + 3) << 8) * yBaseColor) | 0xff;
             writePtr[4] ^= (((x + 4) << 8) * yBaseColor) | 0xff;
             writePtr[5] ^= (((x + 5) << 8) * yBaseColor) | 0xff;
             writePtr[6] ^= (((x + 6) << 8) * yBaseColor) | 0xff;
             writePtr[7] ^= (((x + 7) << 8) * yBaseColor) | 0xff;
         }
     }

     // Flush the modified linear aligned texture before the GPU reads it.
     DCFlushRange(alignedTexAddr, alignedPitch * texHeight * 4);
     GX2CopySurface(linearTex, tiledTex);
     GX2DrawDone();

Figure 7 shows the results of this test. With this method, the data is cached, and performance is much better than with the tiling aperture method. While cache lines do need to be pulled in, modifying the data benefits from already having the data in cache.


Figure 7. Modification using CPU including DCFlushRange and GX2CopySurface.
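
The read-modify-write pattern used by the modify tests can be exercised off-target with the following C sketch. It applies the same XOR color pattern as the listings to an ordinary memory buffer. The value of f is not given in this topic, so it is a parameter here, and the function name xor_pattern is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* XOR the test pattern from the listings into a linear buffer.
 * `pitch` is the row stride in pixels. `f` is the per-row scale factor
 * (its value is not given in the text, so it is a parameter here). */
static void xor_pattern(uint32_t *tex, uint32_t pitch,
                        uint32_t width, uint32_t height, uint32_t f)
{
    assert(width <= pitch);
    for (uint32_t y = 0; y < height; ++y) {
        const uint32_t yBaseColor = (y * f) << 8;
        uint32_t *row = &tex[(size_t)y * pitch];
        for (uint32_t x = 0; x < width; ++x) {
            /* The low byte is always 0xff, so every pixel changes. */
            const uint32_t color = ((x << 8) * yBaseColor) | 0xffu;
            row[x] ^= color;
        }
    }
}
```

Because XOR is its own inverse, applying the pattern twice restores the original contents. This gives a simple way to check that a modify path touches every pixel exactly once.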

Modify Using the Locked Cache

This test creates a more optimized and efficient version that does not use the tiling aperture. There are not as many optimization options available for the modify scenario: because we depend on the data already in the cache lines, we cannot zero them before use. The ability of the locked cache to perform DMA transfers on blocks of data provides a good performance increase over the Modify Using the CPU test.

     // Locked Cache Modify
     u32 *lcLines = (u32*)LCAlloc(1024 * 4);
     const u32 transferSize = ((texWidth * 4) / 32) & 127;

     for (u32 y = 0; y < texHeight; ++y)
     {
         const u32 lineStart = y * alignedPitch;
         const u32 yBaseColor = (y * f) << 8;
    
         LCLoadDMABlocks(
             lcLines,
             &alignedTexAddr[lineStart],
             transferSize);

         LCWaitDMAQueue(0);

         for (u32 x = 0; x < texWidth; x += 8)
         {
             u32 *writePtr = &lcLines[x];
             writePtr[0] ^= (((x + 0) << 8) * yBaseColor) | 0xff;
             writePtr[1] ^= (((x + 1) << 8) * yBaseColor) | 0xff;
             writePtr[2] ^= (((x + 2) << 8) * yBaseColor) | 0xff;
             writePtr[3] ^= (((x + 3) << 8) * yBaseColor) | 0xff;
             writePtr[4] ^= (((x + 4) << 8) * yBaseColor) | 0xff;
             writePtr[5] ^= (((x + 5) << 8) * yBaseColor) | 0xff;
             writePtr[6] ^= (((x + 6) << 8) * yBaseColor) | 0xff;
             writePtr[7] ^= (((x + 7) << 8) * yBaseColor) | 0xff;
         }
    
         LCStoreDMABlocks(
             &alignedTexAddr[lineStart],
             lcLines,
             transferSize);

         LCWaitDMAQueue(0);
     }

     LCDealloc(lcLines);
     GX2CopySurface(linearTex, tiledTex);
     GX2DrawDone();

Figure 8 shows the results. The locked cache method is faster still than the Modify Using the CPU test.


Figure 8. Modification using locked cache, including GX2CopySurface.

Compare Methods

In Figure 9, the conversion methods are compared, showing the difference in speeds between the different methods. In Figure 10, all texture sizes are compared, showing which of the three methods is the fastest for a specific size. The tiling aperture method is faster for small textures, and the locked cache is faster for all other sizes.


Figure 9. Comparison between conversion methods.



Figure 10. Compare conversion methods to show the fastest method for a specific size.

Figure 11 compares the speeds of the modification methods. Figure 12 compares all texture sizes to show the fastest method for each size. The CPU and locked cache methods benefit from cached data and are the clear winners. If the locked cache is available, it always performs best.

While the CPU and locked cache methods are faster for modification operations, the additional requirement of keeping two copies in memory (one linear source texture and one tiled destination texture) may be prohibitive, particularly when using large textures. However, the poor speed of the tiling aperture for the larger textures may make it worth allocating the extra memory.
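
The extra memory for the linear copy is straightforward to estimate. The following C sketch computes it from the row pitch, height, and pixel size. The function name is hypothetical, and any alignment padding that an allocator may add is ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes needed for a linear copy of a texture with the given row pitch
 * (in pixels), height, and bytes per pixel (4 for RGBA8). */
static size_t linear_copy_bytes(uint32_t pitch, uint32_t height,
                                uint32_t bytesPerPixel)
{
    assert(bytesPerPixel > 0);
    return (size_t)pitch * height * bytesPerPixel;
}
```

For the sizes tested here, a 64x64 RGBA8 texture needs 16,384 bytes (16 KB) for its linear copy, and a 1024x1024 RGBA8 texture needs 4,194,304 bytes (4 MB).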


Figure 11. Compare modification methods.



Figure 12. Compare modification methods to show the fastest method for a specific size.

Conclusion

The data demonstrates that there are significant costs associated with converting a linear texture to a tiled texture. Ideally, all textures are converted offline, and then loaded in the tiled format.

If you must convert a texture at load time, GX2CopySurface provides the best performance if you are starting with a linear aligned texture. If your texture is not linear aligned, there are three more methods for conversion, with the optimal method depending on the size of the texture. Generally, it is recommended that the locked cache method be used if that option is available. If it is not available, use the tiling aperture since it provides a consistent execution time. Finally, CPU conversion may be slightly faster or slower depending on how modified the cache is.
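
These recommendations can be summarized as a small selection function. The following C sketch is one possible reading of them, not code from the demo. The enum, the function name, and in particular the smallTexturePixels threshold are hypothetical. This topic gives no numeric crossover point, so the threshold must come from your own measurements.

```c
#include <assert.h>
#include <stddef.h>

typedef enum ConvertMethod {
    CONVERT_GX2_COPY_SURFACE,  /* linear aligned source: copy directly */
    CONVERT_TILING_APERTURE,   /* small textures, or no locked cache */
    CONVERT_LOCKED_CACHE       /* all other linear special textures */
} ConvertMethod;

/* Pick a conversion method for a texture of `pixelCount` pixels.
 * `smallTexturePixels` is a HYPOTHETICAL, caller-measured size at or
 * below which the tiling aperture is the faster choice. */
static ConvertMethod choose_convert_method(int isLinearAligned,
                                           int lockedCacheUsable,
                                           size_t pixelCount,
                                           size_t smallTexturePixels)
{
    assert(pixelCount > 0);
    if (isLinearAligned)
        return CONVERT_GX2_COPY_SURFACE;
    if (!lockedCacheUsable || pixelCount <= smallTexturePixels)
        return CONVERT_TILING_APERTURE;
    return CONVERT_LOCKED_CACHE;
}
```

Here lockedCacheUsable should be false both when the locked cache is not available and when the rows of the source texture do not start on a 64-byte boundary.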

If you want to modify a tiled texture during runtime, our research indicates that the locked cache method is the fastest if the locked cache is available and you can afford the additional memory for a linear copy of the texture.

If you need additional assistance to assess the benefits of these techniques for your game, contact your local Nintendo software developer support group.

Appendix

To build and run the demo

  1. Open cafe.bat in the root of the SDK.
  2. At the command prompt, change directories to $CAFE_ROOT/system/src/demo/gx2/texture/textureLinear2Tiled/.
  3. At the command prompt, type make or, to build a no debug version, type make NDEBUG=TRUE.
  4. At the command prompt, type test.sh. This invokes test.sh with the path to the RPX to run. The demo runs for approximately 7.5 hours. When the demo finishes, there will be a file with timings for each test that is named * results.txt.

    You may also build and run the demo in one step. At the command prompt, type make test or, to build and run a no debug version of the demo, type make test NDEBUG=TRUE.

Timing results are laid out in the following format:

Timing results

  Result    Description
  Width     The width of the texture.
  Height    The height of the texture.
  Total     The total execution time of the test on average.
  Update    Time required to perform the test operation (for example, Read-Modify-Write).
  CPU       GX2Invalidate on the texture from the CPU cache.
  GPU       GX2Invalidate on the texture from the GPU texture cache.
  Alloc     Time to obtain a hardware tiling aperture.
  Free      Time to free the obtained hardware tiling aperture.
  Copy      GX2CopySurface execution time.

You may customize which tests are run by passing in arguments. Here is a list of the accepted arguments and what they do:

  -c <convert>          Specifies if the Convert or Modify test is performed.
                             Set 0 for 'Modify'.
                             Set 1 for 'Convert'.
                        If this option is omitted both tests are performed.
						
  -h <texture_height>   Fixes the texture height to a specific size.
                        If this option is omitted, all multiples of 64 up to 1024 are used.
						
  -i <iterations>       Change the number of iterations in the timing loop. (Default is 1000).
  
  -R                    Sets the output to 'Full Report' mode. (Default).
  
  -r                    Sets the output to 'Total Table' mode.
  
  -S                    Sets the application to perform LINEAR_SPECIAL timing and overrides -t, -i, and -W options.
                        NOTE: This option is extremely slow. For '-i 1000' the test takes about 13 hours to complete.
						
  -t <test_method>      Sets the test method number that is used for this execution.
                             Set 0 for 'Tiling Aperture'.
                             Set 1 for 'CPU'.
                             Set 2 for 'Locked Cache'.
                             Set 3 for 'GX2CopySurface'.
                        If this option is omitted all test methods are performed.
						
  -w <texture_width>    Fixes the texture width to a specific size.
                        If this option is omitted all multiples of 64 up to 1024 are used.

Example, depending on how you previously built and ran the demo:

caferun "$CAFE_ROOT/system/bin/ghs/cafe/demo/gx2/texture/textureLinear2Tiled/NDEBUG/textureLinear2Tiled.rpx" -w 512 -h 512 -i 100 -r -t 2 -c 1

OR

caferun "$CAFE_ROOT/system/bin/ghs/cafe/demo/gx2/texture/textureLinear2Tiled/DEBUG/textureLinear2Tiled.rpx" -w 512 -h 512 -i 100 -r -t 2 -c 1

[1] Tiling is a reordering of linear data into a more hardware-friendly format.

[2] A tiling aperture remaps a linear-mapped address space into a tiled (non-linear) address space.

Revision History

2011/02/21 Initial version.

