Offload CPU Processing to the GPU

Using the Wii U Console stream-out feature to take advantage of spare GPU cycles.

When a game is CPU-bound, there are blocks of time during which the graphics processing unit (GPU) is idle and can be put to work assisting the CPU. In the following diagram of a CPU-bound game, blocks A, B, and C are CPU tasks that must finish before drawing can start.

The GPU waits for the CPU to finish a draw call, and then starts processing the drawing commands. After the drawing is finished, the GPU waits for the next command.

The OpenGL Shading Language (GLSL) shader programs that run on the GPU can perform operations other than drawing. Stream-out shaders put the idle GPU time to use. Wii U supports stream-out functionality, which allows the CPU to read the results of GPU operations. With this functionality, two-way communication between the CPU and GPU can be established.

In the following diagram, the C block in each of the frames is rewritten as a stream-out shader. The draw logic is broken into sections so that as much data as possible can be rendered early, allowing shapes that depend on stream-out results to be rendered later. Offloading block C to the GPU frees enough time on the CPU that a new D block can be added.

Stream-out shaders are identical to rasterizing shaders, except that the stream-out shader does not pass the results of the vertex shader to the pixel shader like a rasterizing shader does. Instead, the vertex shader output is sent to a buffer that can be read by the CPU.

Creating a Stream-Out Shader

The shader used for stream-out is a standard vertex shader that includes a stream-out varyings file. Optionally, the stream-out shader can make use of a geometry shader. The stream-out shader does not include a pixel shader. The output variables that are normally passed from the vertex shader or geometry shader to the pixel shader are specified in the stream-out varyings file. Those variables are then captured by a stream-out buffer.

Vertex Shader

The vertex shader works like a typical vertex shader. The goal of the main function is to take the values from the input attributes and uniforms, and then write results to the output variables. The following is the overall layout of the sample vertex shader:

attribute vec3 a_vec3a;
attribute vec3 a_vec3b;
attribute vec3 a_vec3c;
uniform vec3 u_rayPos;
uniform vec3 u_rayDir;
varying float v_result;
varying float v_ptA;
varying float v_ptB;
varying float v_ptC;
void main()
{
    // Performs logic to assign to v_result, v_ptA, v_ptB, and v_ptC.
}

Vertex shaders work on a single vertex at a time, but the contents of each vertex are customizable and do not need to represent vertices of actual polygons. The logic in this example operates on three points at a time, so three position attributes are part of this virtual “vertex”. A point and a direction describing a ray are supplied as uniform variables, since the ray does not change with each vertex.
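The sample's shader body is not reproduced here, so the following is only a sketch of what the per-"vertex" logic could look like, written as a CPU reference in C. It assumes the shader performs a ray-triangle intersection on the three points and uses the common Möller-Trumbore method, which may differ from the sample's actual logic. The Vec3 type and all function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical 3-component vector, mirroring a GLSL vec3. */
typedef struct { float x, y, z; } Vec3;

static Vec3 vec3_sub(Vec3 a, Vec3 b)
{
    return (Vec3){ a.x - b.x, a.y - b.y, a.z - b.z };
}

static Vec3 vec3_cross(Vec3 a, Vec3 b)
{
    return (Vec3){ a.y * b.z - a.z * b.y,
                   a.z * b.x - a.x * b.z,
                   a.x * b.y - a.y * b.x };
}

static float vec3_dot(Vec3 a, Vec3 b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

/*
 * Moller-Trumbore ray-triangle test (an assumed technique, not taken
 * from the sample). Returns true when the ray starting at ray_pos with
 * direction ray_dir hits triangle (a, b, c), and writes the distance
 * along the ray to *t_out.
 */
bool ray_triangle_intersect(Vec3 ray_pos, Vec3 ray_dir,
                            Vec3 a, Vec3 b, Vec3 c, float *t_out)
{
    const float eps = 1e-6f;
    assert(t_out != NULL);

    Vec3 e1 = vec3_sub(b, a);
    Vec3 e2 = vec3_sub(c, a);
    Vec3 p = vec3_cross(ray_dir, e2);
    float det = vec3_dot(e1, p);
    if (det > -eps && det < eps)
        return false;               /* ray is parallel to the triangle */

    float inv_det = 1.0f / det;
    Vec3 s = vec3_sub(ray_pos, a);
    float u = vec3_dot(s, p) * inv_det;
    if (u < 0.0f || u > 1.0f)
        return false;               /* outside the first barycentric bound */

    Vec3 q = vec3_cross(s, e1);
    float v = vec3_dot(ray_dir, q) * inv_det;
    if (v < 0.0f || u + v > 1.0f)
        return false;               /* outside the second barycentric bound */

    float t = vec3_dot(e2, q) * inv_det;
    if (t < 0.0f)
        return false;               /* triangle is behind the ray origin */

    *t_out = t;
    return true;
}
```

In a stream-out shader, the three triangle points would arrive as the three attributes of one virtual vertex and the ray as uniforms, with the hit flag and distance written to output variables instead of being returned.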

Stream-Out Varyings

The stream-out varyings file is a line-break-separated list of output variables similar to the following example:


In the example, v_result, v_ptA, v_ptB, and v_ptC are output variables that are captured in the stream-out buffers. The gl_NextBuffer entry indicates that the variables that follow it go into a different buffer instead of being interleaved into a single buffer. The stream-out variables are identified to the stream-out functions by their zero-based index in this file, which makes the ordering important.

Setting Up a Stream-Out Shader

The attribute and uniform input data for a stream-out shader are set up as they are for a rasterizing shader, but a stream-out shader also requires a stream-out buffer to hold the stream-out data.

Setting Attributes

Attribute data is contained in a GX2RBuffer struct that acts as a vertex buffer. Initialize GX2RBuffer by using the GX2UTCreateVertexBuffer function or the more explicit function, GX2RCreateBuffer. The data may be copied into the buffer using the GX2UTFillBuffer function. The data may also be manually copied in after locking the buffer with the GX2RLockBuffer function.

Attribute streams pass the data from the buffers into the shader. Attribute streams are held in GX2AttribStream structs. Initialize attribute streams with GX2InitAttribStream. In the sample project, finding the attribute location in the vertex shader and initializing the attribute stream are wrapped in the DEMOGfxInitShaderAttribute demo library function.

Depending on the accuracy requirements of the function, the data may be compressed by changing the attribute types. If the shader accepts float values, the size of the buffer may be reduced by using data in float16, snorm, or unorm formats. The GPU converts these types of data into full 32-bit floating-point values before passing them into the shader.
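The trade-off between buffer size and accuracy can be illustrated with 8-bit normalized formats. The sketch below assumes the conventional unorm and snorm mappings (value / 255 and max(value / 127, -1)); the exact conversion rules applied by the Wii U GPU are not stated here, and all function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode an 8-bit unsigned normalized value to a float in [0, 1]. */
float unorm8_to_float(uint8_t v)
{
    return (float)v / 255.0f;
}

/* Decode an 8-bit signed normalized value to a float in [-1, 1].
 * Both -128 and -127 map to -1 under the assumed convention. */
float snorm8_to_float(int8_t v)
{
    float f = (float)v / 127.0f;
    return f < -1.0f ? -1.0f : f;
}

/* Encode a float as unorm8 on the CPU side, clamping to [0, 1] and
 * rounding to the nearest representable step. */
uint8_t float_to_unorm8(float f)
{
    if (f < 0.0f) f = 0.0f;
    if (f > 1.0f) f = 1.0f;
    return (uint8_t)(f * 255.0f + 0.5f);
}

/* Decode a whole unorm8 attribute array, as the GPU is described as
 * doing before the values reach the shader. */
void decode_unorm8_array(const uint8_t *src, float *dst, size_t count)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < count; ++i)
        dst[i] = unorm8_to_float(src[i]);
}
```

An 8-bit attribute occupies a quarter of the space of a 32-bit float, at the cost of a step size of 1/255 for unorm data, so this kind of compression suits inputs that do not need full floating-point precision.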

Setting Uniforms

Uniform variables are initialized by using the GX2GetVertexUniformVarOffset function to get the offset of the uniform variable. The resulting offset may be stored in a u32.

Initializing the Stream-Out Buffer

Stream-out data is stored in a GX2StreamOutBuffer struct, which contains a GX2RBuffer in the streamOutData member that receives the streamed data. Initialize it using GX2UTCreateBuffer.

To use the stream-out buffer, the correct GX2RResourceFlags must be set. Ensure that the stream-out buffer flag GX2R_BIND_STREAM_OUTPUT is set. In the sample, the CPU needs read access only, so the GX2R_USAGE_CPU_READ flag is set. The sample gives the GPU read and write access by setting the GX2R_USAGE_GPU_READ and GX2R_USAGE_GPU_WRITE flags.

The GX2StreamOutBuffer also contains a context struct. Allocate it from the heap with the size of GX2StreamOutContext and an alignment of GX2_STREAMOUT_CONTEXT_ALIGNMENT.
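The size-plus-alignment requirement can be sketched with standard C11 allocation. This is not the SDK's allocator: EXAMPLE_CONTEXT_SIZE and EXAMPLE_CONTEXT_ALIGNMENT are hypothetical stand-ins for the size of GX2StreamOutContext and for GX2_STREAMOUT_CONTEXT_ALIGNMENT, whose real values are not given here, and the helper names are assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical placeholder values; the real size and alignment come
 * from the SDK headers. */
#define EXAMPLE_CONTEXT_SIZE      20u
#define EXAMPLE_CONTEXT_ALIGNMENT 64u

/* Round size up to the next multiple of a power-of-two alignment. */
size_t round_up_to_multiple(size_t size, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (size + alignment - 1) & ~(alignment - 1);
}

/* Check that a pointer satisfies the requested alignment. */
bool is_aligned(const void *ptr, size_t alignment)
{
    return ((uintptr_t)ptr % alignment) == 0;
}

/* Allocate an aligned block. C11 aligned_alloc expects the size to be
 * a multiple of the alignment, so the size is rounded up first.
 * The caller releases the block with free(). */
void *alloc_aligned(size_t size, size_t alignment)
{
    return aligned_alloc(alignment, round_up_to_multiple(size, alignment));
}
```

On the console, the equivalent allocation would go through the heap functions that the SDK provides, with the real size and alignment constants in place of the placeholders.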

Rendering with a Stream-Out Shader

Most of the stream-out render loop looks like a typical render loop. The major difference is that, instead of sending the results of the vertex shader to the pixel shader, the results are captured into a stream-out buffer. Just as different rasterizing shaders may be used in a single render loop, rasterizing and stream-out shaders may be used in the same render loop.

Starting to Draw

Drawing with a stream-out shader is part of the overall render loop, which allows the rendering to start normally. Typically, this involves clearing the necessary color and depth buffers and setting the context state.

Stream-Out Pass

The attribute and uniform data are set as they would be for any other type of render pass by using GX2UTSetAttributeBuffer and GX2SetVertexUniform. The stream-out buffer and context need to be set using GX2RSetStreamOutBuffer and GX2RSetStreamOutContext. The vertex shader and fetch shader for the stream-out shader are set using GX2SetVertexShader and GX2SetFetchShader. Since there are no pixel shaders in this pass, the rasterizer is disabled by calling GX2SetRasterizerClipControl with GX2_DISABLE as the first argument. Stream-out is enabled by calling GX2SetStreamOutEnable with the GX2_ENABLE parameter.

A GX2Draw call starts the stream-out process. Since the vertices are not turned into specific shapes, the shape type is not important. The sample sets the shape type to GX2_PRIMITIVE_POINTS.

Immediately after drawing, the stream-out context must be captured using GX2SaveStreamOutContext.

Finally, a GX2Flush call sends the buffered command list to the GPU. The stream-out commands then have an associated timestamp, which can be found by calling GX2GetLastSubmittedTimeStamp.

Draw Pass

To set up the render loop for drawing again, disable stream-out by calling GX2SetStreamOutEnable with GX2_DISABLE. The rasterizer must be reenabled by calling GX2SetRasterizerClipControl with GX2_ENABLE as the first argument.

The shader for the draw pass is set by using GX2SetShaders. The attribute and uniform data can be set using GX2UTSetAttributeBuffer and GX2SetVertexUniform. Since the stream-out commands are already committed to the command list, the contents of rasterizing draw passes that may follow have no effect on the stream-out.

The final step in the render loop is to request a swap of the color buffer using GX2SwapBuffers and flush all remaining commands with GX2Flush.

Accessing Stream-Out Data

The stream-out buffers may not be accessed until the stream-out logic completes on the GPU. To determine if the stream-out logic is finished, check whether the stream-out command timestamp is retired by comparing the timestamp to the value returned by GX2GetRetiredTimeStamp.
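The retirement check reduces to a comparison of two timestamps. The sketch below assumes that timestamps are monotonically increasing 64-bit values and that a submitted timestamp counts as retired once the retired value reaches it. It takes the retired-timestamp source as a function pointer so that it runs without the SDK; the helper names and the fake GPU clock are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Callback that reports the most recently retired timestamp. On the
 * console this role would be played by the SDK's retired-timestamp
 * query. */
typedef uint64_t (*RetiredTimestampFn)(void);

/* True once the GPU has retired the given submitted timestamp
 * (assumes timestamps only ever increase). */
bool timestamp_is_retired(uint64_t submitted, uint64_t retired)
{
    return retired >= submitted;
}

/* Poll up to max_polls times; returns true as soon as the submitted
 * timestamp is retired, false if the poll budget runs out first. */
bool poll_until_retired(uint64_t submitted,
                        RetiredTimestampFn get_retired,
                        unsigned max_polls)
{
    assert(get_retired != NULL);
    for (unsigned i = 0; i < max_polls; ++i) {
        if (timestamp_is_retired(submitted, get_retired()))
            return true;
    }
    return false;
}

/* Stand-in for the GPU: each query advances the retired timestamp
 * by one. Exists only so the sketch can run on its own. */
static uint64_t fake_gpu_clock = 0;

uint64_t fake_get_retired_timestamp(void)
{
    return ++fake_gpu_clock;
}
```

Because the point of stream-out is to free CPU time, a game would normally perform other CPU work between polls rather than spin on the check.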

When the GPU finishes writing the stream-out data, it may be accessed in the streamOutData member of the GX2StreamOutBuffer struct. This buffer is a standard GX2RBuffer, and the data may be accessed using GX2RLockBuffer. Since the data only needs to be read, the best way to access the data is to use GX2RLockBufferEx with the second argument of GX2_OPTION_LOCK_READONLY.

The GPU has a different byte ordering than the CPU. Data fed to the GPU using attributes or uniforms is automatically byte-swapped by the GX2 library functions so that it can be understood by the GPU. Since the stream-out data could be interpreted by the CPU or fed back into the GPU using a function such as GX2DrawStreamOut, the stream-out data is left in the GPU's byte order and is not automatically byte-swapped for the CPU. Use GX2CopyEndianSwap to manually copy the data from the locked GX2RBuffer into a local buffer so that it can be interpreted by the CPU.
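The effect of a swapping copy can be sketched in portable C. This is an illustration of the idea only, not the implementation or signature of GX2CopyEndianSwap: it assumes the data consists of 32-bit elements (such as floats), and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Reverse the byte order of one 32-bit word. */
static uint32_t swap32(uint32_t v)
{
    return (v >> 24) |
           ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) |
           (v << 24);
}

/* Copy byte_count bytes from src to dst, reversing the bytes of each
 * 32-bit element along the way. byte_count must be a multiple of 4. */
void copy_endian_swap32(void *dst, const void *src, size_t byte_count)
{
    const unsigned char *s = src;
    unsigned char *d = dst;

    assert(dst != NULL && src != NULL);
    assert(byte_count % 4 == 0);

    for (size_t i = 0; i < byte_count; i += 4) {
        uint32_t word;
        memcpy(&word, s + i, sizeof word);   /* avoids unaligned access */
        word = swap32(word);
        memcpy(d + i, &word, sizeof word);
    }
}
```

After such a copy, the local buffer can be read as ordinary float values on the CPU, while the locked GX2RBuffer keeps its original byte order in case it is fed back into the GPU.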


If the game is CPU-bound, tasks can benefit from being offloaded to the GPU by using stream-out shaders.


The stream-out example demonstrates performing ray-triangle intersection on the GPU using stream-out.

To run the demo:

  1. In cafe.bat:
    1. Change the CAFE_ROOT variable to point to the correct directory for the SDK with the Wii U CPU Profiler installed.
    2. If Cygwin is not installed at C:\cygwin, change the CYGWIN_PATH variable in cafe.bat to point to the correct directory.
  2. Ensure that the SDK and CAT-DEV are configured as specified in the CAT-DEV QuickStart Guide.
  3. Double-click cafe.bat.
  4. At the command prompt, type cd $CAFE_ROOT/system/src/demo/gx2/streamout/mathSO.
  5. At the command prompt, type make run. The demo runs and prints the results to the monitor.
  6. Press CTRL+C to stop the printout loop.
  7. At the command prompt, type cafestop to stop the game.

Revision History

2013/08/05 Converted from PDF to HTML format.
2011/02/21 Initial version.