This guide provides techniques for enhancing the performance of programs running on the Espresso application processor.
Inlining flags are commonly used for code optimization, but functions marked inline can cause link errors when building with -Onoinline; using -Odebug instead is suggested. If you get a link error because a function marked inline was not inlined, change inline to static inline.
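As a minimal illustration (clamp_u8 is a hypothetical example function, not part of the SDK), defining a header function as static inline gives each translation unit its own local copy, so the symbol always resolves even when the compiler chooses not to inline the call:

```c
/* In a header: marked only "inline", the compiler may emit an out-of-line
 * call with no definition in any object file when it declines to inline,
 * producing a link error. "static inline" gives every translation unit
 * its own internal-linkage copy, so the call always resolves. */
static inline int clamp_u8(int v)
{
    if (v < 0)   return 0;
    if (v > 255) return 255;
    return v;
}
```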
Intermodule optimizations may be turned on with -OI. However, this occasionally results in empty relocations (relocations pointing to NULL sections). A warning for this case can be enabled in makerpl with the -warnempty flag; the option exists to help find a reproducible case for improving the GHS compiler.
Espresso's instruction set contains two-way SIMD instructions for single-precision floating-point computations. The compiler provides intrinsics to enable a programmer to access these instructions without using assembly language. The paired-single extensions support a new data type, __vec2x32float. Several PS instructions and some key float instructions have been provided with equivalent intrinsics. See the PS Intrinsics Courseware notes for information on the instructions. GHS compiler version 5.3.6 brings improvements to code written using PS intrinsics.
Espresso has a DMA engine per core, which is used in combination with the locked cache. The DMA API is listed under the cache section of the MAN pages.
Uses include animation blending, skinning, streaming algorithms, and gather/scatter algorithms. The source code for an example streaming API library is provided with the SDK. See stream.h for a description of the stream model and API.
dcbz (data cache block zero) instruction
Use the dcbz (data cache block zero) PPC instruction when an entire cache block (32 bytes) will be written. Espresso's cache performs write allocation on a cache miss. If the entire cache block will be written, there is no need to read the cache block from memory, so the latency of reading the missing cache block can be avoided. The dcbz instruction allocates the cache block in the cache if it is not already there. The DCZeroRange function can be used to dcbz multiple cache blocks. See the Espresso manual for more information on the dcbz instruction.
Functions that write entire cache blocks, such as memcpy, can benefit from dcbz.
dcbt (data cache block touch) instruction
Do not use the dcbt (data cache block touch) instruction. Espresso does not allow more than one outstanding data miss per core. If the prefetch generates a cache miss, a subsequent load/store that misses the cache will have to wait until the prefetch is finished. With Espresso's 6-entry completion queue and the long latency to MEM2, the prefetching is unlikely to be effective. A normal load/store to demand fetch the data will often be a better choice.
Likewise, avoid using the DCTouchRange function. With only one data miss outstanding, a loop of dcbt instructions will not allow data prefetching and computation to overlap.
The following table lists the single-core copy bandwidth possible for various implementations of memcpy. Copy bandwidth is defined as copy data size divided by the transfer time. The total system bandwidth used is twice the copy bandwidth because twice the copied data size is actually transferred: once to read the source buffer and once to write the destination buffer.
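The copy-versus-system bandwidth distinction can be made concrete with a small calculation (the 50 MB / 0.5 s figures below are made up for illustration, not measured on hardware):

```c
/* Copy bandwidth: copied data size divided by transfer time. */
static double copy_bandwidth_mbs(double megabytes_copied, double seconds)
{
    return megabytes_copied / seconds;
}

/* Total system bandwidth: the source is read and the destination is
 * written, so twice the copied size actually crosses the memory bus. */
static double system_bandwidth_mbs(double copy_bw_mbs)
{
    return 2.0 * copy_bw_mbs;
}
```

For example, copying 50 MB in 0.5 s is a copy bandwidth of 100 MB/s, but it consumes 200 MB/s of system bandwidth.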
These initial numbers were obtained on current devkits (CAT-DEV or CAT-R). They are expected to improve on future units.
| Implementation | Copy Bandwidth From MEM2 (MB/s) |
| --- | --- |
| Standard C Library | 100 |
The following table lists the single-core MEM2 bandwidth possible for various implementations of memset. These initial numbers were obtained on current devkits.
| Implementation | MEM2 Bandwidth (MB/s) |
| --- | --- |
| Standard C Library | 160 |
| LC/DMA | 1900 (3600 soon) |
Espresso has 64-bit floating-point load and store instructions. In C, pointers to floating-point doubles (f64) can be used to move data with half as many instructions as 32-bit integer loads/stores.
However, by using floating-point instructions, the thread now becomes a floating-point thread. A floating-point thread has its floating-point registers saved/restored on thread context switches. If the thread was not already a floating-point thread, using 64-bit loads/stores will increase its thread context switch time. Depending on the thread switching frequency, this overhead may or may not be noticeable.
Changing algorithms to work on smaller blocks of data at a time, rather than all of the aggregate data at once, can increase performance. The first goal here would be to block the data so that it fits in L2. Afterwards, if possible, blocking to fit in L1d will gain additional performance.
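As a sketch of the blocking idea (the 128-element matrix and 32-element tile size here are illustrative; real tile sizes should be tuned so a working set fits first in L2, then in L1d), the loops are restructured so each tile stays cache-resident while it is processed:

```c
#define N 128   /* matrix edge (illustrative size) */
#define B 32    /* tile edge, chosen so a B x B tile of floats fits in L1d */

static float g_src[N][N], g_dst[N][N];

/* A naive transpose walks g_dst with a large stride and misses the cache
 * on nearly every store. The blocked version processes one B x B tile at
 * a time, so the tile's source rows and destination columns stay cached
 * until the tile is finished. */
static void transpose_blocked(void)
{
    for (int ii = 0; ii < N; ii += B)
        for (int jj = 0; jj < N; jj += B)
            for (int i = ii; i < ii + B; i++)
                for (int j = jj; j < jj + B; j++)
                    g_dst[j][i] = g_src[i][j];
}
```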
Espresso does not have native 64-bit integers. The 64-bit atomics API acquires a spinlock to atomically update two 32-bit values. In addition, our implementation disables and enables interrupts, each requiring a system call. Our 64-bit API implementation is much slower than the 32-bit atomics APIs. If 64-bit values are not needed, use the 32-bit atomics API instead.
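A portable C11 sketch of the cost difference (generic stdatomic/pthread code, not the Cafe OS atomics API): the 32-bit counter updates with a single lock-free atomic operation, while a 64-bit counter on a 32-bit core must be guarded by a lock:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint32_t fast_counter;   /* one lock-free atomic update */
static uint64_t slow_counter;           /* two 32-bit words: needs a lock */
static pthread_mutex_t slow_lock = PTHREAD_MUTEX_INITIALIZER;

static void *counter_worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        /* 32-bit path: a single atomic read-modify-write. */
        atomic_fetch_add(&fast_counter, 1);

        /* 64-bit path on a 32-bit core: lock, update, unlock. */
        pthread_mutex_lock(&slow_lock);
        slow_counter++;
        pthread_mutex_unlock(&slow_lock);
    }
    return NULL;
}
```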
Busy loops involving synchronization cause two performance-related issues: memory bus traffic and starvation. The first is due to the sync instruction required by locking primitives; each sync generates a memory transaction. A typical busy polling loop acquires a lock, sees that a condition is not met, and then releases the lock. To avoid this unnecessary locking, poll on the condition without holding the lock until it appears true, then acquire the lock and check the condition again. If the condition is not satisfied, go back to the polling step. This is the test-and-test-and-set lock algorithm.
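The polling scheme described above can be sketched as a test-and-test-and-set spinlock in portable C11 (generic stdatomic code, not the OS locking primitive):

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

typedef struct { atomic_int locked; } ttas_lock_t;

static void ttas_acquire(ttas_lock_t *l)
{
    for (;;) {
        /* Test: spin with plain loads first. Reading a cached line
         * generates no bus traffic, unlike the atomic exchange. */
        while (atomic_load_explicit(&l->locked, memory_order_relaxed))
            sched_yield();      /* back off; stand-in for a pause */

        /* Test-and-set: the lock looked free, so try to take it. */
        if (!atomic_exchange_explicit(&l->locked, 1, memory_order_acquire))
            return;             /* acquired */
        /* Lost the race: return to read-only polling. */
    }
}

static void ttas_release(ttas_lock_t *l)
{
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}

/* Demo: threads bump a shared counter under the lock. */
static ttas_lock_t demo_lock;
static long demo_count;

static void *ttas_worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 50000; i++) {
        ttas_acquire(&demo_lock);
        demo_count++;
        ttas_release(&demo_lock);
    }
    return NULL;
}
```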
Starvation is caused by a core continually acquiring and releasing a lock without helping the system make progress as a whole. In such a tight loop, a core can make it difficult for another core to successfully obtain the lock.
If the data size is large and/or there is a possibility that the destination buffer is not cached, using dcbz can significantly improve the speed of memcpy.
The following implementations assume that the source and destination buffers are cache block size (32-byte) aligned, and that the data size is a multiple of the cache block size.
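A small helper (the name is hypothetical) can verify these preconditions before dispatching to a dcbz-based copy:

```c
#include <stdint.h>

#define CACHE_BLOCK_BYTES 32u

/* Returns nonzero when both buffers are 32-byte (cache block) aligned and
 * the size is a whole number of cache blocks -- the preconditions assumed
 * by the dcbz-based copy routines. */
static int cache_copy_args_ok(const void *dst, const void *src, uint32_t bytes)
{
    return ((uintptr_t)dst % CACHE_BLOCK_BYTES == 0) &&
           ((uintptr_t)src % CACHE_BLOCK_BYTES == 0) &&
           (bytes % CACHE_BLOCK_BYTES == 0);
}
```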
Let the standard C library's memcpy have bandwidth A. Using dcbz on the destination buffer will increase the copy bandwidth to 1.7 x A: there is no need to read the destination buffer from memory because it will be completely overwritten. The following is an example implementation.
```asm
// void memcpy_dcbz(void* dest, void* src, u32 num_words)
.global memcpy_dcbz
memcpy_dcbz:
    srwi    r5, r5, 3               // r5 <- number of cache blocks
    li      r7, 8                   // 8 lwz+stw per cache block
    li      r9, 0                   // index
memcpy_dcbz_outer_loop:
    dcbz    r9, r3                  // dcbz dest cache block
    mtctr   r7
memcpy_dcbz_inner_loop:
    lwzx    r6, r9, r4
    stwx    r6, r9, r3
    addi    r9, r9, 4
    bdnz    memcpy_dcbz_inner_loop
    subic.  r5, r5, 1
    bne     memcpy_dcbz_outer_loop
    blr
```
In addition to using dcbz, using dcbf (data cache block flush) on the source buffer will further increase the copy bandwidth to 2.3 x A. This works well when the copy size is greater than one half the L2 size, because the source and destination buffers cannot both fit inside the L2. For the cases where the copy size is less than one half the L2 size, this assumes that the source buffer is not soon reused, because the dcbf will push the source buffer out of the L2.
```asm
// void memcpy_dcbzdcbf(void* dest, void* src, u32 num_words)
.global memcpy_dcbzdcbf
memcpy_dcbzdcbf:
    srwi    r5, r5, 3               // r5 <- number of cache blocks
    li      r7, 8                   // 8 lwz+stw per cache block
    li      r9, 0                   // index
memcpy_dcbzdcbf_outer_loop:
    dcbz    r9, r3                  // dcbz dest cache block
    mtctr   r7
memcpy_dcbzdcbf_inner_loop:
    lwzx    r6, r9, r4
    stwx    r6, r9, r3
    addi    r9, r9, 4
    bdnz    memcpy_dcbzdcbf_inner_loop
    subi    r10, r9, 4
    dcbf    r10, r4                 // dcbf(src)
    subic.  r5, r5, 1
    bne     memcpy_dcbzdcbf_outer_loop
    blr
```
2013/08/06 Removed references to older SDKs.
2013/05/08 Automated cleanup pass.
2011/10/25 Added sections 3.6, 3.7, and 4.2.
2011/09/26 Made minor text changes.
2011/09/23 Added performance tables for memcpy and memset.
2011/09/15 Initial version.