Espresso CPU Performance Guide

This guide provides techniques for enhancing the performance of programs running on the Espresso application processor.

Sections

Compiler

  1. Compiler inlining tips

    Inlining flags are commonly used for code optimization, but inline functions can produce link errors when building with -Omaxdebug or -Onoinline; use -Odebug instead. If you get a link error because a function marked inline was not inlined, change inline to "static inline", as in the sketch below.
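
    A minimal sketch of the workaround follows; the function itself is
    hypothetical.

    // With plain "inline", the compiler may emit a call to this function
    // but no out-of-line definition under -Omaxdebug/-Onoinline, which
    // then fails at link time. "static inline" gives each translation
    // unit that needs a copy its own local definition.
    static inline int clamp_s32(int x, int lo, int hi)
    {
        return (x < lo) ? lo : ((x > hi) ? hi : x);
    }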

  2. Compiler intermodule optimization

    Intermodule optimizations can be enabled with -OI. However, this occasionally results in empty relocations (pointing to NULL sections). A warning for this condition can be enabled in makerpl with the -warnempty flag; the option exists to help capture a reproducible case for improving the GHS compiler.

Floating-Point

  1. Use paired-single instruction intrinsics and/or types

    Espresso's instruction set includes two-way SIMD instructions for single-precision floating-point computation. The compiler provides intrinsics that give access to these instructions without resorting to assembly language. The paired-single extensions add a new data type, __vec2x32float, and intrinsics are provided for several PS instructions as well as some key float instructions. See the PS Intrinsics Courseware notes for information on the instructions. GHS compiler version 5.3.6 improves code generation for code written with PS intrinsics.
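
    The following sketch is illustrative only: the __vec2x32float type is
    described above, but the intrinsic name __PS_ADD is an assumption made
    here for readability; consult the PS Intrinsics Courseware notes for
    the actual intrinsic names and signatures.

    // Sketch: element-wise sum of arrays of float pairs using paired
    // singles. __PS_ADD stands in for the compiler's paired-single add
    // intrinsic (one ps_add per pair instead of two fadds).
    void add_pairs(__vec2x32float* dst, const __vec2x32float* a,
                   const __vec2x32float* b, u32 count)
    {
        u32 i;
        for (i = 0; i < count; i++) {
            dst[i] = __PS_ADD(a[i], b[i]);
        }
    }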

Cache and Memory

  1. Use Espresso's DMA

    Espresso has a DMA engine per core. In combination with the locked cache, the primary benefits of using DMAs are:

    • It is the fastest way to transfer data to or from Espresso. The DMA bandwidth is 2.6 to 3.6 times the cache bandwidth using load/store instructions.
    • It provides parallelization of computation and asynchronous data transfer.
    • It allows you to compute on a large data set (larger than L2 size) without displacing other cached data from the L2.

    The DMA API is listed under the cache section of the MAN pages.

    Uses include animation blending, skinning, streaming algorithms, and gather/scatter algorithms. The source code for an example streaming API library is provided with the SDK. See stream.h for a description of the stream model and API.
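
    The following is a rough sketch of the DMA pattern; the LC function
    names and signatures used here (LCEnableDMA, LCAlloc, LCLoadDMABlocks,
    LCWaitDMAQueue, and so on) are assumptions to be verified against the
    MAN pages, and Compute is a placeholder for the actual work.

    // Sketch: stream a large buffer through the locked cache in
    // 4 KB chunks (128 x 32-byte blocks per DMA transfer).
    #define CHUNK_BLOCKS 128
    #define CHUNK_BYTES  (CHUNK_BLOCKS * 32)

    extern void Compute(u8* data, u32 size);  // placeholder for real work

    void process_buffer(u8* src, u32 num_chunks)
    {
        u32 i;
        u8* chunk;

        LCEnableDMA();                       // enable this core's DMA engine
        chunk = (u8*)LCAlloc(CHUNK_BYTES);   // scratch area in the locked cache

        for (i = 0; i < num_chunks; i++) {
            // Queue a DMA read of the next chunk into the locked cache.
            LCLoadDMABlocks(chunk, src + i * CHUNK_BYTES, CHUNK_BLOCKS);
            LCWaitDMAQueue(0);               // block until the queue drains
            Compute(chunk, CHUNK_BYTES);     // placeholder computation
        }

        LCDealloc(chunk);
        LCDisableDMA();
    }

    A real implementation would allocate two chunks and double-buffer them
    so that the DMA of chunk i+1 overlaps the computation on chunk i,
    which is where the parallelization benefit listed above comes from.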

  2. Use the dcbz (data cache block zero) instruction

    Use the dcbz (data cache block zero) PPC instruction when an entire cache block (32 bytes) will be written. Espresso's cache performs write allocation on a cache miss; if the entire block is going to be overwritten, there is no need to read it from memory first, so the latency of reading the missing cache block can be avoided. The dcbz instruction allocates the cache block in the cache if it is not already present. The DCZeroRange function can be used to dcbz multiple cache blocks. See the Espresso manual for more information on the dcbz instruction.

    Functions that can benefit from dcbz include memcpy or memset.
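
    As a sketch of the memset case, a zero fill can be built directly on
    DCZeroRange; the alignment test below is illustrative, and the exact
    DCZeroRange signature should be confirmed in the MAN pages.

    #include <string.h>

    // Sketch: zero num_bytes at ptr. dcbz operates on whole 32-byte
    // blocks, so fall back to memset for unaligned or partial blocks.
    void zero_fill(void* ptr, u32 num_bytes)
    {
        if ((((u32)ptr) | num_bytes) & 31) {
            memset(ptr, 0, num_bytes);   // unaligned: plain stores
            return;
        }
        DCZeroRange(ptr, num_bytes);     // allocate + zero each block
                                         // without reading memory
    }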

  3. Do not use the dcbt (data cache block touch) instruction

    Do not use the dcbt (data cache block touch) instruction. Espresso does not allow more than one outstanding data miss per core. If the prefetch generates a cache miss, a subsequent load/store that misses the cache will have to wait until the prefetch is finished. With Espresso's 6-entry completion queue and the long latency to MEM2, the prefetching is unlikely to be effective. A normal load/store to demand fetch the data will often be a better choice.

    Likewise, avoid using the DCTouchRange function. With only one data miss outstanding, a loop of dcbt instructions will not allow data prefetching and computation to overlap.

  4. memcpy performance

    The following table lists the single-core copy bandwidth achievable with various implementations of memcpy. Copy bandwidth is defined as the copy data size divided by the transfer time. The total system bandwidth used is twice the copy bandwidth, because every byte is transferred twice: once to read the source buffer and once to write the destination buffer.

    These initial numbers were obtained on current devkits (CAT-DEV or CAT-R). They are expected to improve on future units.

    memcpy Implementation      Copy Bandwidth From MEM2 (MB/s)
    Standard C Library         100
    Using dcbz                 170
    dcbz + dcbf                230
    LC/DMA                     520

  5. write performance (e.g., memset)

    The following table lists the single core MEM2 bandwidth possible for various implementations of memset. These initial numbers were obtained on current devkits.

    memset Implementation      MEM2 Bandwidth (MB/s)
    Standard C Library         160
    Using dcbz                 1000
    LC/DMA                     1900 (expected to reach 3600)

  6. 64-bit loads and stores

    Espresso has 64-bit floating-point load and store instructions. In C, pointers to floating-point doubles (f64) can be used to move data with half as many instructions as 32-bit integer loads/stores.

    However, by using floating-point instructions, the thread now becomes a floating-point thread. A floating-point thread has its floating-point registers saved/restored on thread context switches. If the thread was not already a floating-point thread, using 64-bit loads/stores will increase its thread context switch time. Depending on the thread switching frequency, this overhead may or may not be noticeable.
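
    A minimal sketch of the idea, assuming both buffers are 8-byte
    aligned:

    // Sketch: move data through the 64-bit FP load/store path. The
    // compiler turns each assignment into one lfd + one stfd (8 bytes),
    // versus two lwz + two stw pairs for 32-bit integer copies. Note
    // that this makes the calling thread a floating-point thread.
    void copy_f64(double* dst, const double* src, u32 num_doubles)
    {
        u32 i;
        for (i = 0; i < num_doubles; i++) {
            dst[i] = src[i];
        }
    }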

  7. Block algorithms to take advantage of L1d and L2 sizes

    Changing algorithms to work on smaller blocks of data at a time, rather than all of the aggregate data at once, can increase performance. The first goal here would be to block the data so that it fits in L2. Afterwards, if possible, blocking to fit in L1d will gain additional performance.
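
    A schematic example of blocking follows; BLOCK_BYTES is a tunable
    placeholder, and PassA/PassB stand in for the real passes over the
    data.

    // Sketch: run two passes tile by tile rather than each pass over the
    // whole data set. Size the tile so it fits in L2 first; if the inner
    // passes allow it, shrink toward the L1d size for further gains.
    #define BLOCK_BYTES (16 * 1024)

    extern void PassA(float* p, u32 n);   // placeholder passes
    extern void PassB(float* p, u32 n);

    void process_all(float* data, u32 num_floats)
    {
        u32 per_block = BLOCK_BYTES / sizeof(float);
        u32 base;

        for (base = 0; base < num_floats; base += per_block) {
            u32 n = (num_floats - base < per_block) ? (num_floats - base)
                                                    : per_block;
            PassA(data + base, n);   // both passes touch the same tile
            PassB(data + base, n);   // while it is still cache-resident
        }
    }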

Synchronization and Atomics

  1. Use 32-bit atomics instead of 64-bit atomics

    Espresso does not have native 64-bit integers. The 64-bit atomics API acquires a spinlock to atomically update the two 32-bit halves, and our implementation also disables and re-enables interrupts, each of which requires a system call. As a result, our 64-bit API implementation is much slower than the 32-bit atomics APIs. If 64-bit values are not needed, use the 32-bit atomics API instead.
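
    For example (OSAddAtomic and OSAddAtomic64 are used here as assumed
    names for the 32-bit and 64-bit atomic-add entry points; substitute
    the actual API names):

    // Sketch: a shared event counter. If the count fits in 32 bits,
    // prefer the 32-bit atomic, which maps to a native load-reserve/
    // store-conditional sequence, over the 64-bit one, which takes a
    // spinlock plus two system calls to disable/enable interrupts.
    static volatile s32 s_events;

    void on_event(void)
    {
        OSAddAtomic(&s_events, 1);   // 32-bit atomic add (assumed name)
    }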

  2. Avoid busy loops with synchronizations

    Busy loops involving synchronization cause two performance-related issues: memory bus traffic and starvation. The first is due to the sync instruction required in locking primitives; each sync generates a memory transaction. A typical busy polling loop acquires a lock, sees that a condition is not met, and then releases the lock. To avoid this unnecessary locking, poll on the condition without taking the lock until it appears true, then acquire the lock and check the condition again; if it is not satisfied, go back to the polling step. This is the test-and-test-and-set locking pattern (a sketch appears below).

    Starvation is caused by a core continually acquiring and releasing a lock without helping the system make progress as a whole. In such a tight loop, a core can make it difficult for another core to successfully obtain the lock.
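
    A sketch of the polling pattern described above; Lock and Unlock are
    placeholders for the actual locking primitives, and g_ready is a
    hypothetical condition word set by another core.

    extern volatile u32 g_ready;     // condition word, written elsewhere
    extern void Lock(void);          // placeholder: sync-bearing acquire
    extern void Unlock(void);        // placeholder: release

    // Sketch: poll with plain loads first so the sync-bearing lock
    // operations stay off the bus until the condition looks true.
    void wait_until_ready(void)
    {
        for (;;) {
            while (!g_ready) {
                // read-only spin: no sync traffic, and the lock stays
                // free for cores doing useful work
            }
            Lock();
            if (g_ready) {
                // consume the condition under the lock
                Unlock();
                return;
            }
            Unlock();                // lost the race: resume polling
        }
    }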

Examples

  1. memcpy using dcbz

    If the data size is large, and/or there is a possibility that the destination buffer is not cached, using dcbz can significantly improve the speed of memcpy.

    The following implementations assume that the source and destination buffers are cache block size (32-byte) aligned, and that the data size is a multiple of the cache block size.

    Let the standard C library's memcpy have bandwidth A. Using dcbz on the destination buffer will increase the copy bandwidth to 1.7 x A. There is no need to read in the destination buffer from memory because it will be completely overwritten. The following is an example implementation.

    // void memcpy_dcbz(void* dest, void* src, u32 num_words)
            .global memcpy_dcbz
    memcpy_dcbz:
            srwi   r5, r5, 3            // r5 <- number of cache blocks
            li     r7, 8                // 8 lwz+stw per cache block
            li     r9, 0                // index
    memcpy_dcbz_outer_loop:
            dcbz   r9, r3               // dcbz dest cache block
            mtctr  r7
    memcpy_dcbz_inner_loop:
            lwzx   r6, r9, r4
            stwx   r6, r9, r3
            addi   r9, r9, 4
            bdnz   memcpy_dcbz_inner_loop
            subic. r5, r5, 1
            bne    memcpy_dcbz_outer_loop
            blr
    

    In addition to using dcbz, using dcbf on the source buffer will further increase the copy bandwidth to 2.3 x A. This works well when the copy size is greater than one half the L2 size, because the source and destination buffers cannot then both fit inside the L2. For cases where the copy size is less than one half the L2 size, this approach assumes that the source buffer will not be reused soon, because the dcbf pushes the source buffer out of the L2.

    // void memcpy_dcbzdcbf(void* dest, void* src, u32 num_words)
            .global memcpy_dcbzdcbf
    memcpy_dcbzdcbf:
            srwi   r5, r5, 3            // r5 <- number of cache blocks
            li     r7, 8                // 8 lwz+stw per cache block
            li     r9, 0                // index
    memcpy_dcbzdcbf_outer_loop:
            dcbz   r9, r3               // dcbz dest cache block
            mtctr  r7
    memcpy_dcbzdcbf_inner_loop:
            lwzx   r6, r9, r4
            stwx   r6, r9, r3
            addi   r9, r9, 4
            bdnz   memcpy_dcbzdcbf_inner_loop
        subi   r10, r9, 4           // r10 <- offset inside the source block just copied
        dcbf   r10, r4              // flush that source block from the cache
            subic. r5, r5, 1
            bne    memcpy_dcbzdcbf_outer_loop
            blr
    

Revision History

2013/08/06 Removed references to older SDKs.
2013/05/08 Automated cleanup pass.
2011/10/25 Added sections 3.6, 3.7, and 4.2.
2011/09/26 Made minor text changes.
2011/09/23 Added performance tables for memcpy and memset.
2011/09/15 Initial version.


CONFIDENTIAL