Wii U Copy Implementations and Optimizations

Waiting for memory to be copied from one location to another may consume a considerable proportion of the game frame time. This topic compares various copy implementations and optimizations that may be used on the Wii U to minimize these times and copy data as swiftly as possible.

Copy Functions

For each of the copy functions, the source and destination buffers were 64-byte aligned. To ensure that all of the functions were compared on an equal footing, the cache was invalidated prior to each copy. Source for the custom implementations is given in the appendix.
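
As a rough illustration of the methodology, the measurement loop for the CPU-side functions might look like the sketch below. The coreinit cache and timer calls (DCFlushRange, DCInvalidateRange, OSGetSystemTime, OSTicksToMicroseconds) are the standard ones used elsewhere in this topic; timeCopyMicroseconds and CopyFn are hypothetical names used only for illustration.

// Hypothetical benchmark helper: time one cold-cache run of a copy function.
typedef void (*CopyFn)(void *dest, const void *src, u32 numWords);

static u64 timeCopyMicroseconds(CopyFn copyFn, void *dest, const void *src, u32 numWords)
{
    // Write back and evict the source, and evict the destination, so the
    // copy starts with a cold cache for both buffers.
    DCFlushRange((void *)src, numWords*4);
    DCInvalidateRange(dest, numWords*4);

    OSTime start = OSGetSystemTime();
    copyFn(dest, src, numWords);
    OSTime end = OSGetSystemTime();

    return OSTicksToMicroseconds(end - start);
}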

memcpy

Internally, memcpy uses OSBlockMove with data flushing turned on. Data may be any size or alignment. With data flushing turned on, memcpy is at its most efficient when copying small amounts of cached memory. The source and destination buffers may overlap¹, which technically makes this a memmove rather than a memcpy.

OSBlockMove

OSBlockMove is an optimized memcopy function in the Wii U SDK. It is completely safe in that it works with any buffer alignment and size, and it offers the option to flush the CPU cache on completion to prevent cache pollution. Unlike the other copy methods, the source and destination buffers may overlap¹. Internally, OSBlockMove has two implementation-specific paths: one for data that is already loaded into the cache and a second for uncached data. Because our benchmarks always start with a cold cache, we tested the latter case; this path appears to use the Espresso dcbz instruction.
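
For reference, a call might look like the following sketch, assuming the (destination, source, size, flush) parameter order described above; the wrapper name is a placeholder.

// Sketch, assuming the (dst, src, size, flush) parameter order described above.
static void copyAndFlush(void *dst, const void *src, u32 size)
{
    // TRUE flushes the copied range from the CPU cache on completion,
    // avoiding cache pollution when the data will not be read again soon.
    // Pass FALSE instead to leave the copied data in the cache.
    OSBlockMove(dst, src, size, TRUE);
}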

dcbz

Data cache block zero (dcbz) is an implementation of memcpy that uses the dcbz instruction to save time by not reading the destination buffer from memory, since the entire buffer is going to be overwritten anyway. This optimization requires 32-byte aligned source and destination buffers, along with sizes that are multiples of 32 bytes.

dcbzdcbf

Building on the dcbz instruction, this implementation also uses the data cache block flush (dcbf) instruction to flush the source buffer from the cache every time a cache line is filled. This ensures that the cache is free for future reads in the overall copy operation. Because this is a cache optimization, data must be 32-byte aligned, with copies in multiples of 32 bytes.

dcbz_64

This implementation is the same as dcbz, except that it copies 8 bytes at a time instead of 4 by using the 64-bit floating-point registers, which means it needs only half as many load/store instructions. Note that once a thread uses a floating-point instruction, the system begins to save the floating-point context on all future thread switches. This may have a minor (~0.002μs per switch) performance penalty on any thread that is not already using floating-point operations. This implementation has the same alignment and size requirements as dcbz.

dcbzdcbf_64

Similar to dcbz_64, this implementation is the same as dcbzdcbf but with 8-byte copies instead of 4-byte copies. The same floating-point context warning as for dcbz_64 applies, and the alignment and size requirements are the same as for dcbzdcbf.

LCDMA

A memcopy implementation unique to the Cafe CPU, LCDMA uses the Espresso locked cache: the 16KB region of L1 data cache on each core that is reserved for allocation and use by an application. LCDMA bypasses the regular cache, which requires that the source and destination buffers be flushed or invalidated from the data cache prior to use. LCDMA has strict alignment requirements: the source and destination buffers must be 64-byte aligned, and the copy size must be a multiple of 64 bytes.

DMAE

This memcopy implementation uses the DMA controller on the GPU rather than one of the CPU cores. Unlike the other memcopy implementations, this function may be run asynchronously, in parallel with other operations on the GPU and CPU. It uses the high-speed bus attached to the GPU, but it must share that bandwidth with GPU rendering operations. DMAE uses its own GPU command buffer, which is consumed in parallel with graphics commands.

DMAE operations require that both buffers be 8-byte aligned, with sizes in multiples of 4 bytes. Since DMAE ignores the CPU cache, both ranges must be flushed or invalidated prior to use.
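
The appendix shows a simple blocking wrapper. For the asynchronous case, a sketch along the following lines could queue the copy and collect it later; it assumes a DMAECopyMem call that queues the copy and returns a 64-bit queue timestamp, and a DMAEWaitDone call that reports whether that timestamp has retired. Check the DMAE reference for the exact names and signatures.

// Sketch of an asynchronous DMAE copy (names, signatures, and the u64
// timestamp type are assumptions; see the DMAE reference).
// Addresses must be 8-byte aligned and the size a multiple of 4 bytes.
static u64 dmaeCopyBegin(void *dest, const void *src, u32 numWords)
{
    // DMAE ignores the CPU cache, so flush the source and invalidate the
    // destination before queueing the copy.
    DCFlushRange((void *)src, numWords*4);
    DCInvalidateRange(dest, numWords*4);

    // Queue the copy on the GPU DMA engine and return immediately.
    return DMAECopyMem(dest, src, numWords, DMAE_ENDIAN_NONE);
}

static void dmaeCopyEnd(u64 timestamp)
{
    // Block until the queued copy has completed.
    while(!DMAEWaitDone(timestamp))
    {
        // Other CPU work could be done here instead of spinning.
    }
}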

Memcopy Comparison
[Graphs: Speed vs. Copy Size for each copy implementation on Core 1, including a zoomed-in view]

Graph Analysis

All graphs are for Core 1, with the copy size ranging between 0 and 1.5 times the size of the L2 cache (0 to 3 MB). If the same relative size range is applied to Core 0 or Core 2 (for example, 0 to 768KB), the graphs of the results are the same. This is particularly important to note for the zoomed-in Speed vs. Copy Size graph. Aside from the obvious point that DMAE is FAST, there are a few important characteristics to note:

Conclusions

DMAE is the fastest copy method for copies of any significant size. However, because it consumes GPU bandwidth, its use should be application-dependent. Heavily GPU-bound or memory-bound applications should be cautious about using DMAE, except during loading screens, while backing up MEM1 when the HOME Button is pressed, and at other times when the GPU is not under heavy use. Conversely, a CPU-bound application sees an overall performance gain when using DMAE for general copies greater than 5KB in size.

If you are copying data using the CPU, use LCDMA, provided that locked cache space is available. If locked cache space is not available, use dcbzdcbf_64 or dcbz_64, depending on the copy size and the current CPU core. Even on a thread that otherwise has no floating-point operations, the speed gain from these implementations outweighs the minor performance penalty of the floating-point context switch.
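
As a rough illustration of these guidelines, a general-purpose copy wrapper might dispatch as follows. memcpy_dmae, memcpy_lcdma, and memcpy_dcbzdcbf_64 are the appendix implementations; gpuIsBusy and lockedCacheBlockFree are hypothetical application-side checks, and the 5KB threshold is the figure quoted above.

// Hypothetical dispatch wrapper following the guidelines above. Assumes the
// buffers are 64-byte aligned and the size is a multiple of 64 bytes, so
// every path's alignment requirement is met.
extern BOOL gpuIsBusy(void);            // hypothetical: is the GPU under heavy load?
extern BOOL lockedCacheBlockFree(void); // hypothetical: is a locked-cache block available?

void fastCopy(void *dest, const void *src, u32 numWords)
{
    u32 numBytes = numWords * 4;

    if(numBytes > 5*1024 && !gpuIsBusy())
    {
        // Large copies: hand the work to the GPU DMA engine.
        memcpy_dmae(dest, src, numWords);
    }
    else if(lockedCacheBlockFree())
    {
        // CPU copy through the locked cache.
        memcpy_lcdma(dest, src, numWords);
    }
    else
    {
        // Fall back to the cache-optimized 64-bit copy.
        memcpy_dcbzdcbf_64(dest, src, numWords);
    }
}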

Appendix

LCDMA

extern void *lc4kbBlock; // This points to 4KB of LCAlloc memory

// Note: Addresses and size must be 64-byte aligned
void memcpy_lcdma(void *dest, const void *src, u32 numWords)
{
    // Unfortunately, not all SDK functions are const correct.
    u32 *srcWord = reinterpret_cast<u32*>(const_cast<void*>(src));
    u32 *dstWord = reinterpret_cast<u32*>(dest);

    // LCDMA does not respect the cache, so we need to
    // clear it manually.
    DCFlushRange(src, numWords*4);
    DCInvalidateRange(dest, numWords*4);

    // Copy 4KB at a time. Maximum for LCDMA.
    for(; numWords >= 1024; numWords -= 1024, srcWord += 1024, dstWord += 1024)
    {
        LCLoadDMABlocks(lc4kbBlock, srcWord, 0);
        LCStoreDMABlocks(dstWord, lc4kbBlock, 0);
    }

    // Copy whatever is left over.
    if(numWords)
    {
        u32 blocksLeft = (numWords+7) >> 3;

        LCLoadDMABlocks(lc4kbBlock, srcWord, blocksLeft);
        LCStoreDMABlocks(dstWord, lc4kbBlock, blocksLeft);
    }
}
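
The lc4kbBlock pointer above is assumed to come from the locked cache allocator. A minimal setup sketch follows, assuming the LCEnableDMA/LCDisableDMA/LCAlloc/LCDealloc calls; return-code checks are omitted, so consult the locked cache documentation for the exact conventions.

// Sketch: obtaining the 4KB locked-cache block used by memcpy_lcdma
// (function names are assumptions; see the locked cache documentation).
void *lc4kbBlock;

void initLockedCacheBlock(void)
{
    LCEnableDMA();                // enable locked-cache DMA on this core
    lc4kbBlock = LCAlloc(4*1024); // reserve 4KB of the 16KB locked cache
}

void shutdownLockedCacheBlock(void)
{
    LCDealloc(lc4kbBlock);        // release the block
    lc4kbBlock = NULL;
    LCDisableDMA();
}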

DMAE

// Note: Addresses must be 8-byte aligned, size must be 4 byte aligned.
void memcpy_dmae(void *dest, const void *src, u32 numWords)
{
    // DMAE does not respect the cache. Clear it manually.
    DCFlushRange(src, numWords*4);
    DCInvalidateRange(dest, numWords*4); // the destination range must also be flushed or invalidated

    while(!DMAECopyMemWait(dest, src, numWords, DMAE_ENDIAN_NONE));
}

dcbz

// Note: Addresses must be 32-byte aligned.
asm void memcpy_dcbz_asm(dest, src, numWords)
{
%reg dest, src, numWords %lab outer_loop, inner_loop

    srwi r5, numWords, 3    // r5 = num cache blocks.
    li r7, 8                // 8 moves per cache block (32 bytes total, 4 byte chunks).

    subi dest, dest, 4      // Move dest back 4, this is required for use with stwu later.
    subi src, src, 4        // Move src back 4, this is required for use with lwzu later.
    li r9, 4                // Word size.

outer_loop:
    dcbz r9, dest           // dcbz on current section of r3(dest).
    mtctr r7                // store r7(8) in the counter.

inner_loop:
    lwzu r6, 4(src)         // Load the word from src+4 into r6. Move src forward 4.
    stwu r6, 4(dest)        // Store the word into dest+4, move dest forward 4.

    bdnz inner_loop         // Decrement the counter and loop if not zero.

    subic. r5, r5, 1        // decrement r5(num cache blocks) and update CR.
    bne outer_loop          // Loop if there are still cache blocks left.
}

void memcpy_dcbz(void *dest, const void *src, u32 numWords)
{
    memcpy_dcbz_asm(dest, src, numWords);
}

dcbzdcbf

// Note: Addresses must be 32-byte aligned.
asm void memcpy_dcbzdcbf_asm(dest, src, numWords)
{
%reg dest, src, numWords %lab outer_loop, inner_loop

    srwi r5, numWords, 3    // r5 = num cache blocks.
    li r7, 8                // 8 moves per cache block (32 bytes total, 4 byte chunks).

    subi dest, dest, 4      // Move dest back 4, this is required for use with stwu later.
    subi src, src, 4        // Move src back 4, this is required for use with lwzu later.
    li r9, 4                // Word size.

outer_loop:
    dcbz r9, dest           // dcbz on current section of r3(dest).
    mtctr r7                // store r7(8) in the counter.

inner_loop:
    lwzu r6, 4(src)         // Load the word from src+4 into r6. Move src forward 4.
    stwu r6, 4(dest)        // Store the word into dest+4, move dest forward 4.

    bdnz inner_loop         // Decrement the counter and loop if not zero.

    dcbf 0, src             // flush the datacache at src for the next iteration.

    subic. r5, r5, 1        // decrement r5(num cache blocks) and update CR.
    bne outer_loop          // Loop if there are still cache blocks left.
}

void memcpy_dcbzdcbf(void *dest, const void *src, u32 numWords)
{
    memcpy_dcbzdcbf_asm(dest, src, numWords);
}

dcbz_64

// Note: Addresses must be 32-byte aligned.
asm void memcpy_dcbz_64_asm(dest, src, numWords)
{
%reg dest, src, numWords %lab outer_loop, inner_loop

    srwi r5, numWords, 3    // r5 = num cache blocks.
    li r7, 4                // 4 moves per cache block (32 bytes total, 8 byte chunks).

    subi dest, dest, 8      // Move dest back 8, this is required for use with stfdu later.
    subi src, src, 8        // Move src back 8, this is required for use with lfdu later.
    li r9, 8                // DWord size.

outer_loop:
    dcbz r9, dest           // dcbz on current section of r3(dest).
    mtctr r7                // store r7(4) in the counter.

inner_loop:
    lfdu f6, 8(src)         // Load the dword from src+8 into f6. Move src forward 8.
    stfdu f6, 8(dest)       // Store the dword into dest+8, move dest forward 8.

    bdnz inner_loop         // Decrement the counter and loop if not zero.

    subic. r5, r5, 1        // decrement r5(num cache blocks) and update CR.
    bne outer_loop          // Loop if there are still cache blocks left.
}

void memcpy_dcbz_64(void *dest, const void *src, u32 numWords)
{
    memcpy_dcbz_64_asm(dest, src, numWords);
}   

dcbzdcbf_64

// Note: Addresses must be 32-byte aligned.
asm void memcpy_dcbzdcbf_64_asm(dest, src, numWords)
{
%reg dest, src, numWords %lab outer_loop, inner_loop

    srwi r5, numWords, 3    // r5 = num cache blocks.
    li r7, 4                // 4 moves per cache block (32 bytes total, 8-byte chunks).

    subi dest, dest, 8      // Move dest back 8, this is required for use with stfdu later.
    subi src, src, 8        // Move src back 8, this is required for use with lfdu later.
    li r9, 8                // DWord size.

outer_loop:
    dcbz r9, dest           // dcbz on current section of r3(dest).
    mtctr r7                // store r7(4) in the counter.

inner_loop:
    lfdu f6, 8(src)         // Load the dword from src+8 into f6. Move src forward 8.
    stfdu f6, 8(dest)       // Store the dword into dest+8, move dest forward 8.

    bdnz inner_loop         // Decrement the counter and loop if not zero.

    dcbf 0, src             // flush the datacache at src for the next iteration.

    subic. r5, r5, 1        // decrement r5(num cache blocks) and update CR.
    bne outer_loop          // Loop if there are still cache blocks left.
}

void memcpy_dcbzdcbf_64(void *dest, const void *src, u32 numWords)
{
    memcpy_dcbzdcbf_64_asm(dest, src, numWords);
}

_________________________

¹ Beginning with SDK 2.09.22, there is a bug that causes the data to be zeroed if the source and destination buffers are the same.

Revision History

2013/08/05 Converted from PDF to HTML format.


CONFIDENTIAL