Multicore Processing

As advances in raw processor speeds decline, multicore processors are becoming increasingly popular choices to maximize the number of calculations that may be completed in a given period of time. The potential for increased performance by writing multicore programs comes at the cost of code complexity. This topic addresses potential pitfalls when writing code for the Wii U multicore processor, Espresso.

Introduction

A preview of the Cafe process switching model was introduced in SDK 2.03. Additional changes, feature enhancements, and performance improvements have been implemented since then. Process switching allows system applications to be separated from the game in both execution and development.

Process Definition

On Cafe, a process is composed of a single RPX file and a separate memory address space that is isolated from other processes. Process state is managed by the kernel. Each process also has a system message queue. Unlike some other operating systems, the Cafe kernel schedules processes, not threads. Threads are scheduled in user mode.

A process may use CPU time on Core 2 while it is not in the foreground, or may set a configuration option to specify that it should not receive any processor time while in the background. See "Background Process" below for more information on this special mode.

NOTE:
There are three cores available for processes: Core 0, Core 1, and Core 2.

Foreground Process

At any one time, only one process is allowed to interact with the user by displaying graphics, playing audio, or receiving input from controllers. This process is called the foreground process. Similarly, all resources that are available only to the foreground process are called foreground-only resources. The foreground process has access to all three cores.

For more information about processes that are in the foreground state, see Foreground State.

For more information about moving from the foreground to the background, see Moving to the Background.

Background Process

If your application has been requested to move to the background, it still has some actions it can take while waiting to return to the foreground.

For more information about processes that are in the background state, see Background State.

For more information about moving from the background to the foreground, see Moving to the Foreground.

Exiting

In certain circumstances, a process will need to exit. ProcUIProcessMessages returns PROCUI_STATUS_EXIT to indicate this. At this point, the process must clean up any resources and return from main.
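
As a minimal sketch of such a loop (the header name and loop structure are assumptions; see the ProcUI Library Overview for the authoritative pattern):

    #include <cafe/os.h>
    #include <cafe/procui.h>    // Header name assumed; adjust to your SDK layout.

    int main(int argc, char *argv[])
    {
        BOOL quit = FALSE;
        while (!quit)
        {
            switch (ProcUIProcessMessages(TRUE))    // TRUE: block until a message arrives.
            {
            case PROCUI_STATUS_RELEASE_FOREGROUND:
                // Finish any in-flight drawing, then release the foreground.
                ProcUIDrawDoneRelease();
                break;
            case PROCUI_STATUS_EXIT:
                // Clean up remaining resources here, then return from main.
                quit = TRUE;
                break;
            default:
                // PROCUI_STATUS_IN_FOREGROUND / IN_BACKGROUND: do per-frame work.
                break;
            }
        }
        return 0;
    }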

Foreground and Background Switching Libraries

For more information about the libraries specific to foreground and background switching, see:

ProcUI Library Overview
SYSAPP Library

Demos for Process Switching

$CAFE_ROOT/system/src/demo/os/switch

Espresso CPU Information

NOTE:
For the purpose of this discussion, we briefly explain the Espresso cache coherency model. For more complete information about the Espresso CPU than is presented here, see the IBM Espresso RISC Microprocessor Developer's User Manual, Version 2 (available on your local Nintendo developer support group website), the Espresso CPU Performance Guide, and Cache Performance Implications of Shared Data.

Each of the three cores tracks operations on the other cores' caches: a core may both snoop and intervene in data cache operations on other cores. A snoop occurs when a core writes to its cache; the other cores observe the write and invalidate that cache line if it is present in their own caches. An intervention occurs when the cache on one core can fulfill a request from another core (for example, a data cache miss in the Core 0 L1 cache may be fulfilled by the Core 1 L1 cache, rather than by the Core 0 L2 cache). However, instruction cache (ICache) misses are not fetched from other cores' caches.


Figure 1: On the left, cores 1 and 2 snoop the Core 0 write to L1 cache, invalidating that line in their own
cache. On the right, Core 0 intervenes and provides a copy of a cache line that missed the Core 1 cache.

General Advice

General multicore computing advice is applicable on Espresso. The need for communication between threads is a primary consideration when deciding which code to offload to other cores. Each point where threads must synchronize or share data increases the overall complexity of the program and the potential for data inconsistency, deadlocks, and synchronization overhead. Coding and debugging effort tends to increase with the amount of synchronization, making the management of this complexity an important consideration.

Tasks with few synchronization points make good candidates for moving to their own threads. An ideal task is one which can run independently, given some input, and then only interact when complete.

File Loading and Decompression

Loading assets and decompressing them tend to be well-suited tasks for their own threads on separate cores. A simple file load exemplifies the ideal task: it communicates only to receive information about the path to load and buffer to write to, and to indicate when it is finished. Similarly, decompression communicates only to receive information about the input and output buffers, and to indicate when it is finished. In both cases, the actual work normally does not need further input.

In the case of streamed loading, more communication may be desired with the loading thread, for example, to load gameplay assets or to cancel or change a request based on player behavior. In this example, a delay in communication is usually not problematic: the loading thread may check a shared object every frame (or less frequently) to determine if the current asset loading is still needed. Reducing the frequency and immediacy of checks may alleviate some of the synchronization overhead.
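
The following hypothetical sketch illustrates this pattern. The LoadRequest structure and the chunked read are illustrative, not SDK types, and a real loader would use OS synchronization primitives rather than bare volatile flags for anything stronger than a polled cancel hint:

    #include <cafe/os.h>

    // Illustrative request object shared between the game and loader threads.
    typedef struct LoadRequest
    {
        const char   *path;        // Asset to load.
        void         *buffer;      // Destination buffer.
        volatile BOOL cancelled;   // Written by the game thread, at most once per frame.
        volatile BOOL complete;    // Written by the loader when finished.
    } LoadRequest;

    static void LoaderLoop(LoadRequest *req)
    {
        for (;;)
        {
            // Poll the shared object between chunks, not between bytes;
            // infrequent checks keep synchronization overhead low.
            if (req->cancelled)
            {
                break;
            }
            // Read the next chunk of req->path into req->buffer here
            // (for example, with the FS library). When the final chunk
            // lands, set req->complete and break.
        }
    }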

Non-Essential Effects

Graphical or audio effects that improve the feel of the game but are not essential to the gameplay may also be good candidates for moving to other cores. Procedural content and animations or physics calculations, such as cloth, hair, water, or debris, which are not likely to affect the actual gameplay experience, do not require absolute synchronization in most cases. Synchronizing these kinds of effects once per frame with the update thread is usually sufficient to keep them looking good.

Rendering

Performing the GX2 calls for a scene may consume a large portion of the CPU time. Moving rendering work to a different core from the game update thread may increase performance. The update thread can produce a buffer of rendering commands, which are then consumed by the rendering thread. This reduces the interaction by having only a single synchronization point per frame when the commands are passed on to the rendering thread.

The tradeoff is that the rendering thread runs a frame behind the game update thread, but the potential increase in frame rate from splitting the work may more than offset the added latency.
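
One way to express the single per-frame synchronization point is a one-deep OSMessageQueue that hands the finished command buffer from the update thread to the rendering thread. This is a sketch under assumptions: the OS_MESSAGE_BLOCK flag name and the OSMessage field layout should be verified against the os headers.

    #include <cafe/os.h>

    static OSMessageQueue sFrameQueue;
    static OSMessage      sFrameStorage[1];    // Depth 1: one frame in flight.

    void InitFrameQueue(void)
    {
        OSInitMessageQueue(&sFrameQueue, sFrameStorage, 1);
    }

    // Update thread: publish the command buffer built this frame.
    void SubmitFrame(void *commandBuffer)
    {
        OSMessage msg;
        msg.message = commandBuffer;
        OSSendMessage(&sFrameQueue, &msg, OS_MESSAGE_BLOCK);    // Blocks if the renderer is a frame behind.
    }

    // Rendering thread: wait for the next frame's commands, then draw.
    void RenderLoop(void)
    {
        for (;;)
        {
            OSMessage msg;
            OSReceiveMessage(&sFrameQueue, &msg, OS_MESSAGE_BLOCK);
            // Issue the GX2 calls recorded in msg.message here.
        }
    }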

Parallelizable Tasks

When processing a large amount of data with a CPU-intensive task, it may be beneficial to break the work up across all of the available cores. However, creating and destroying threads as needed may incur unnecessary runtime penalties. In these cases, it is usually better to maintain a pool of threads that can be called on to perform additional work. More than one thread per core in the pool is not recommended for CPU-bound code, but if tasks include waiting for asynchronous operations to complete, more threads may be useful.

The Cafe SDK provides a Multiprocessor (MP) task API, found in $CAFE_ROOT/system/include/cafe/mp.h, that contains a simple task queue that may be used for creating a thread pool. Each thread that calls MPRunTasksFromTaskQ processes up to deque_granularity tasks at a time, until no more remain. Alternatively, tasks may be dequeued and run manually with MPDequeTask or MPDequeTasks. Tasks in the queue are processed in order, with no priorities. However, creating several pools that are serviced by threads with different priorities might enable higher-priority tasks to be completed first. For more information about the MP library, see Cafe Core OS (COS) Overview.
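
A hedged sketch of such a pool follows. The exact types and signatures should be checked against mp.h; the MPTaskQ type name, the MPTaskFunc argument convention, and the CrunchChunk work function here are assumptions.

    #include <cafe/os.h>
    #include <cafe/mp.h>

    #define NUM_TASKS 32

    static MPTaskQ  sTaskQ;                    // Task queue (type name per mp.h).
    static MPTask   sTasks[NUM_TASKS];         // Task objects.
    static MPTask  *sQBuffer[NUM_TASKS];       // Storage used by the queue itself.

    // Hypothetical work function: crunch one chunk of a large data set.
    static u32 CrunchChunk(u32 chunkIndex, u32 unused)
    {
        // CPU-intensive work on chunk 'chunkIndex' goes here.
        return 0;
    }

    // Called once to fill the queue; each pool thread (one per core)
    // then calls MPRunTasksFromTaskQ, draining tasks until none remain.
    void RunParallelJob(void)
    {
        u32 i;
        MPInitTaskQ(&sTaskQ, sQBuffer, NUM_TASKS);
        for (i = 0; i < NUM_TASKS; i++)
        {
            MPInitTask(&sTasks[i], CrunchChunk, i, 0);
            MPEnqueTask(&sTaskQ, &sTasks[i]);
        }
        MPStartTaskQ(&sTaskQ);
        MPRunTasksFromTaskQ(&sTaskQ, 4);    // deque_granularity = 4.
    }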

Espresso Specifics

Threads and Core Affinity

Wii U applications should take advantage of the fixed hardware platform configuration when dividing work between cores. Core affinity may be specified in the attr parameter of the OSCreateThread function. Because there is no automatic thread balancer on the Wii U, specifying the attr parameter is essential: unless specified otherwise, a thread runs on the same core as the thread that created it. Threads may be moved with the OSSetThreadAffinity function, but this should not be done frequently.
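
For example, a worker thread might be pinned to Core 2 at creation time, as in this sketch. The entry-point signature and the OS_THREAD_ATTR_AFFINITY_CORE2 constant name are assumptions; verify them against the os headers.

    #include <cafe/os.h>

    #define WORKER_STACK_SIZE (64 * 1024)

    static OSThread sWorker;
    static u8       sWorkerStack[WORKER_STACK_SIZE];

    // Entry-point signature per the os headers; (int, void *) is assumed here.
    static int WorkerMain(int intArg, void *ptrArg)
    {
        // Long-running work pinned to Core 2 goes here.
        return 0;
    }

    void StartWorkerOnCore2(void)
    {
        // The stack grows down, so pass the top of the stack buffer.
        OSCreateThread(&sWorker, WorkerMain, 0, NULL,
                       sWorkerStack + WORKER_STACK_SIZE, WORKER_STACK_SIZE,
                       16,                               // Mid-range priority.
                       OS_THREAD_ATTR_AFFINITY_CORE2);   // Run only on Core 2.
        OSResumeThread(&sWorker);
    }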

Unlike other operating systems, Wii U thread switching is a user-mode concept, left to the control of the individual application. Rather than running at a set rate, the scheduler runs only after interrupts and certain other events occur. When it does run, the scheduler selects the highest priority thread to run, with the currently-running thread getting preference if there are other threads with the same priority. For more information about the thread scheduler, see Cafe Core OS (COS) Overview.

OS Details

The OS performs some tasks on different cores. When these tasks run, the CPU cache is affected, which has the potential to arbitrarily slow down application code. Currently, most sound processing is performed on the core from which AXInit is called. Some graphics processing is performed on Core 0, and all background processes run on Core 2, although these are implementation details subject to change between SDK revisions.

When a process is moved to the background, it has access only to Core 2, and the Filesystem and Network libraries continue to work on that core. For this reason, it is recommended that you use Core 2 for those features. For more information, see Application Lifecycle.

The operations of the underlying OS may affect the cache performance of an application. When an OS routine runs, its instructions usually evict something from the ICache, which does not allow interventions. In addition, any data needed by the routine may not be in cache, causing evictions of the data that is currently there. Applications should avoid running code that would suffer greatly from such an interruption on cores 0 and 2. However, the benefit of parallelizing such code may outweigh the risk of cache eviction by an OS routine.

Cache Implications for Concurrency

Espresso's automatic cache coherency, snooping and invalidating cache lines written by other cores, relieves the programmer of the burden of maintaining coherency manually. However, it comes at a cost: when a cache line is invalidated, it must be fetched again. The data caches in Espresso are usually able to intervene by retrieving cache lines from each other rather than from memory. Even so, tasks with frequent writes from separate cores may be slowed significantly by the interactions between the caches. The following table shows that writes to separate cache lines are over 50% faster than writes to the same cache line; that is, the same amount of data is written in less than two-thirds of the time.

Same Cache Line    Separate Cache Lines    Speed Ratio
49280 ticks        31225 ticks             1.0 : 1.58
Table 1: Cycle cost of each core performing 100,000 writes.

With this in mind, consider the following techniques to safeguard performance when parallelizing code.


Figure 2: Grouping writes to align with cache lines. Core 0 performs the work for 0x00-0x1F, and Core 2 handles 0x20-0x2B.
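
For instance, false sharing can be avoided by padding per-core data out to the 32-byte Espresso cache line, as in this illustrative sketch (the alignment syntax is compiler-specific):

    #include <cafe/os.h>

    #define CACHE_LINE_SIZE 32    // Espresso L1 cache line size in bytes.

    // Pad each core's accumulator to a full cache line so a write on one
    // core never invalidates a line another core is writing.
    typedef struct PerCoreCounter
    {
        u32 value;
        u8  pad[CACHE_LINE_SIZE - sizeof(u32)];
    } PerCoreCounter;

    // The array itself must start on a line boundary for the padding to
    // hold; the alignment syntax here is compiler-specific.
    static PerCoreCounter sCounters[3] __attribute__((aligned(CACHE_LINE_SIZE)));

    void CountSomething(void)
    {
        // Each core touches only its own element, hence its own line.
        sCounters[OSGetCoreId()].value++;
    }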

Judicious use of DCZeroRange may increase the speed of the code by as much as 60%. DCTouchRange, in contrast, is usually detrimental and should be avoided. The Espresso CPU Performance Guide explains the full implications of DCZeroRange and DCTouchRange on the hardware. The accompanying source code demonstrates the results of using these functions when parallelizing operations across the three cores.
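
As an illustrative sketch of the DCZeroRange pattern (buffer names are hypothetical): zeroing the destination lines establishes them in the cache without fetching their stale contents from memory, which pays off only when every zeroed byte is subsequently overwritten.

    #include <cafe/os.h>

    // dst must be 32-byte aligned and nBytes a multiple of the line size,
    // so that no partial line holding live data is zeroed.
    void FillBuffer(u8 *dst, const u8 *src, u32 nBytes)
    {
        u32 i;
        // Establish the destination lines in cache as zeros, with no
        // read from memory; every zeroed byte is overwritten below.
        DCZeroRange(dst, nBytes);
        for (i = 0; i < nBytes; i++)
        {
            dst[i] = src[i];
        }
    }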

Demonstration

The included source code demonstrates the performance degradation caused by false sharing among cores, and the effectiveness of using the dcbt and dcbz instructions to manipulate the cache across all three cores. The source code may be found in the $CAFE_ROOT/system/src/demo/os/multicore folder.

To run the demo:

  1. In cafe.bat:
    1. Change the CAFE_ROOT variable to point to the correct directory for the SDK with the Wii U CPU Profiler installed.
    2. If Cygwin is not installed at C:\cygwin, change the CYGWIN_PATH variable in cafe.bat to point to the correct directory.
  2. Ensure that the SDK and CAT-DEV are configured as specified in the CAT-DEV QuickStart Guide.
  3. Double-click cafe.bat.
  4. At the command prompt, type cd $CAFE_ROOT/system/src/demo/os/multicore.
  5. At the command prompt, type make run. The demo runs and prints the results to the monitor.
  6. Press CTRL+C to stop the printout loop.
  7. At the command prompt, type cafestop to stop the game.
  8. Optionally, follow the instructions in the Wii U CPU Profiler Package to profile this demo.

The test shows the results of several multicore computations.

The TestDCCosts calculations measure the results of using DCZeroRange and DCTouchRange on a large amount of data. Unrolling the loop and using DCZeroRange had the most positive effect, with the code running up to 62% faster. The test also demonstrates the null or negative effect of attempting to use DCTouchRange to prefetch data.


Figure 3: Output of multicore cache test program.

The TestInvalidationCosts function sets up three tests:

InvalidationTestFuncSameData: all three cores write to the same address.
InvalidationTestFuncFalseShare: each core writes to a different address within the same cache line.
InvalidationTestFuncIndependent: each core writes to an address in its own cache line.

These tests write to, and then read from, the specified address. The three functions are purposefully identical; they are separated to make it easier to compare their instrumented profiles.

In the instrumented profile of InvalidationTestFuncIndependent, where each core writes to independent cache lines, each core takes roughly the same amount of time and has relatively few cache misses. InvalidationTestFuncFalseShare shows that only one core achieved the speed of the individual cores in the other test. The other two cores' speed was more than halved because they had to contend with invalidations from Core 2, which caused their L1 caches to miss heavily.


Figure 4: Profile results comparing L1 cache misses when false sharing occurs between cores.

See Also

Application Lifecycle
ProcUI Library Overview
SYSAPP Library

Revision History

2013/07/08 Added to Cafe SDK.


CONFIDENTIAL