GPU7 Hardware Overview

Cafe's graphics processing unit, known as GPU7, is based on the Radeon series of graphics processors targeting Direct3D version 10.1 and OpenGL 3.3.

GPU7 High-Level Features

  • Unified shader architecture executes vertex, geometry, pixel, and compute shaders
  • Multi-sample anti-aliasing (2, 4, or 8 samples per pixel)
  • Read from multi-sample surfaces in the shader
  • 128-bit floating point HDR texture filtering
  • High resolution texture support (up to 8192 x 8192)
  • Indexed cube map arrays
  • 8 render targets
  • Independent blend modes per render target
  • Multi-sample masking
  • Hierarchical Z/stencil buffer
  • Early Z test and Fast Z Clear
  • Lossless Z & stencil compression
  • 2x/4x/8x/16x high quality adaptive anisotropic filtering modes
  • sRGB filtering (gamma/degamma)
  • Tessellation unit
  • Stream out support
  • Compute shader support

GPU7 Configuration

The following table shows the GPU7 configuration. The definitions for each configuration are described in the subsequent sections.

SIMDs                         2
Quad Pipes                    4 per SIMD
ALUs                          32 (2 SIMD * 4 Quad Pipe * 4 ALU per Quad Pipe)
Wavefront width               64 (4 Quad Pipe * 4 ALU per Quad Pipe * 4 cycles per instruction)
Stream Processors             160 (32 ALU * 5 Stream Processors per ALU)
GPRs                          256 vec4s (4 x 32-bit components) available per wavefront work item, totaling 32K vec4s (256 registers * 64 work items in a wavefront * 2 SIMD)
Render Backends               2
Texture Pipes                 2
Texture & Vertex Cache        2 x 8 KB L1, 2 x 32 KB L2
Vertex Reuse History Buffer   Up to 16 entries
SIMD Local Data Store         16 KB
SIMD Global Data Store        16 KB
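
The derived figures in this table follow from a handful of base counts. The sketch below reproduces that arithmetic in Python; the variable names are illustrative only and are not GX2 identifiers.

```python
# Base configuration values taken from the GPU7 configuration table.
SIMDS = 2
QUAD_PIPES_PER_SIMD = 4
ALUS_PER_QUAD_PIPE = 4
STREAM_PROCESSORS_PER_ALU = 5   # VLIW5: 4 "thin" + 1 "fat"
CYCLES_PER_INSTRUCTION = 4
GPRS_PER_WORK_ITEM = 256        # vec4 registers

alus_per_simd = QUAD_PIPES_PER_SIMD * ALUS_PER_QUAD_PIPE
total_alus = SIMDS * alus_per_simd
wavefront_width = alus_per_simd * CYCLES_PER_INSTRUCTION
stream_processors = total_alus * STREAM_PROCESSORS_PER_ALU
total_gprs = GPRS_PER_WORK_ITEM * wavefront_width * SIMDS

print(total_alus, wavefront_width, stream_processors, total_gprs)
# prints: 32 64 160 32768
```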

GPU7 Resource Limits

Vertex Attribute Buffers              16 (buffers contain many interleaved streams)
Vertex Attribute Streams              32
Texture Samplers (per shader type)    18 (only 16 supported in GX2)
Texture Resources (per shader type)   128 (only 18 supported in GX2)
Max Texture Size                      8Kx8K
Max 3D Texture Size                   8Kx8Kx8K
Max Texture Array Size                8Kx8Kx8K
Max Cube Map Size                     8Kx8Kx1635
Texture Anisotropy                    2, 4, 8, 16
Constant Buffers (per shader type)    16
Constant Registers (VS and PS only)   256 Vectors
Max Constant Buffer Size              4096 Vectors (16K scalar values)
MRTs                                  8
MSAA Samples                          2, 4, or 8
Viewport Scissor Rectangles           16 (only 1 supported in GX2)
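
As an illustration of how these limits might be checked in a tool, the following Python sketch compares a shader's resource usage against the table, using the GX2 figure wherever the table gives one. The dictionary keys and the `check_usage` helper are hypothetical names, not part of GX2. The byte size assumes each vector is four 32-bit components.

```python
# Per-shader-type limits copied from the resource limits table
# (GX2 figure used where the table lists one).
LIMITS = {
    "texture_samplers": 16,
    "texture_resources": 18,
    "constant_buffers": 16,
    "render_targets": 8,
    "scissor_rects": 1,
}
MAX_CONSTANT_BUFFER_VECTORS = 4096
BYTES_PER_VECTOR = 4 * 4  # assumption: vec4 of 32-bit components


def check_usage(usage):
    """Return (name, used, limit) for every entry that exceeds its limit."""
    return [(name, used, LIMITS[name])
            for name, used in usage.items()
            if used > LIMITS[name]]


max_constant_buffer_bytes = MAX_CONSTANT_BUFFER_VECTORS * BYTES_PER_VECTOR
violations = check_usage({"texture_samplers": 17, "render_targets": 4})
print(max_constant_buffer_bytes)  # prints: 65536
print(violations)                 # prints: [('texture_samplers', 17, 16)]
```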

GPU7 Block Diagram

GPU7BlockDiagram.jpg

GPU7 Blocks

Command Processor (CP)

  • Main GPU7 interface from the CPU
  • Parses commands from the ring buffer and command buffers
  • Handles state management
    • Shadows register updates out to memory
    • Restores register data from memory
  • Supports 2 levels of command buffer indirection
  • Handles surface synchronization and memory semaphores
  • Handles GPU7 progress feedback (timestamps, asynchronous query results, etc.)

Vertex Grouper Tessellator (VGT)

  • Fetches vertex index data from memory
  • Groups index data into primitives
  • Determines if a vertex can be re-used from a previous primitive
    • Compares against 14 previous unique index values if the GS is not enabled
    • Compares against 16 previous unique index values when the GS is enabled
  • Sends each unique vertex to the SPI for execution by the vertex shader
  • Sends output indices that form the primitive to the PA
  • Performs fixed function tessellation
    • Subdivides primitives based on the tessellation factor and mode
    • Supports discrete, continuous, and adaptive tessellation modes
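
The effect of the vertex reuse check can be sketched with a small simulation. The Python below counts how many vertices would reach the vertex shader for a given index stream. It assumes the history behaves as a first-in, first-out list of the most recent unique indices; the actual replacement policy is not described here, so treat this as a model rather than a description of the hardware. The function name is hypothetical.

```python
from collections import deque


def count_vertex_shader_runs(indices, history_size=14):
    """Count unique-vertex submissions for an index stream.

    history_size is 14 without a geometry shader and 16 with one.
    Assumption: the oldest entry is evicted first (FIFO).
    """
    history = deque(maxlen=history_size)
    runs = 0
    for index in indices:
        if index in history:
            continue              # vertex is reused, no new shader work
        history.append(index)     # evicts the oldest entry when full
        runs += 1
    return runs


# Two triangles sharing an edge: 6 indices, 4 unique vertices.
quad = [0, 1, 2, 2, 1, 3]
print(count_vertex_shader_runs(quad))  # prints: 4
```

Under this model, index orderings that revisit a vertex only after more than 14 (or 16) other unique indices lose the reuse and pay for the vertex shader again.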

Primitive Assembly (PA)

  • Performs clip plane testing (trivial reject, trivial accept)
  • Transforms each vertex from clip space into screen coordinates
  • Clips the primitive to the frustum planes and user-defined planes (if enabled)
  • Performs back-face culling
  • Calculates barycentric coordinates for the primitive

Scan Converter (SC)

  • Determines pixel coverage of the primitive
  • Performs scissoring
  • Groups pixels into 8x8 tiles (16 pixel quads)
  • Sends per-tile Z data to the HiZ block of the RB for HiZ occlusion testing
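
The relationship between pixels, quads, and tiles can be sketched as follows. This Python example assumes tiles and quads are aligned to the screen origin, which is not stated above; the function name is hypothetical.

```python
TILE_SIZE = 8   # tile edge in pixels
QUAD_SIZE = 2   # quad edge in pixels

QUADS_PER_TILE = (TILE_SIZE // QUAD_SIZE) ** 2  # 4 x 4 = 16


def locate_pixel(x, y):
    """Return ((tile_x, tile_y), (quad_x, quad_y)) for a screen pixel.

    The quad coordinates are relative to the containing tile.
    """
    tile = (x // TILE_SIZE, y // TILE_SIZE)
    quad = ((x % TILE_SIZE) // QUAD_SIZE, (y % TILE_SIZE) // QUAD_SIZE)
    return tile, quad


print(QUADS_PER_TILE)       # prints: 16
print(locate_pixel(13, 6))  # prints: ((1, 0), (2, 3))
```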

Shader Parameter Interpolator (SPI)

  • Interpolates attributes for each pixel of the primitive
  • Allocates the GPRs necessary for shader execution
  • Loads the input GPRs
  • Selects the SIMD used for shader execution

Sequencer (SQ)

  • Fetches the shader instructions from memory
  • Parses the shader instructions
    • ALU instructions are sent to the SIMD
    • Shader Export instructions (outputs from the shader) are sent to the SX
    • Texture fetch instructions are sent to the TEX and TC
  • Fetches shader constants from the register file or memory (if constant buffers are enabled)

Wavefront

  • GPU7 batches up work items into wavefronts before processing them
  • A shader execution wavefront has 64 work items
  • A work item is a single vertex, pixel, or primitive
  • Each wavefront has a single PC and execution state
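
Because work is issued in units of 64, a workload that is not a multiple of 64 leaves lanes idle. A minimal Python sketch of that cost, assuming a partially filled final wavefront is still launched as a full wavefront (the helper names are hypothetical):

```python
WAVEFRONT_SIZE = 64


def wavefronts_needed(work_items):
    """Number of wavefronts required, rounding up to whole wavefronts."""
    return -(-work_items // WAVEFRONT_SIZE)


def idle_lanes(work_items):
    """Lanes in the last wavefront that carry no work item."""
    return wavefronts_needed(work_items) * WAVEFRONT_SIZE - work_items


print(wavefronts_needed(100), idle_lanes(100))  # prints: 2 28
```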

Single Instruction Multiple Data Processor (SIMD)

  • Each SIMD consists of GPRs, ALUs, control and I/O
  • ALUs are arranged into "Quad Pipes" (4 per SIMD)
    • The quad pipe is so named because when it is processing pixels, they are typically from a 2x2 block on screen
  • Each quad pipe contains 4 VLIW5 ALUs
  • Each VLIW5 ALU contains 4 "thin" stream processors and 1 "fat" processor
    • All processors perform scalar math instructions on 32-bit floats.
    • The "thin" processors execute a small set of instruction types such as addition and multiplication, and are typically used for handling vec4 data.
    • The "fat" processor includes the "thin" features plus hardware transcendentals such as cosine and logarithm.
    • In the best case, the SQ schedules a vec4 operation and a scalar operation simultaneously. In other cases, only a single vec4 or a single scalar instruction executes at a time.
  • Instructions are completed on wavefronts in 4 cycle increments
    • 64 work items per wavefront / (4 Quad Pipe * 4 ALU per Quad Pipe) = 4 cycles per instruction on a wavefront
  • Multiple wavefronts may be held in a "runnable" state in a SIMD at one time
    • The number of runnable wavefronts depends on shader GPR usage.
    • When the running wavefront hits a memory access, another runnable wavefront may be run to hide latency
  • Has a dedicated texture block
    • Each SIMD has its own L1 cache
    • All SIMDs share an L2 cache
  • Each SIMD has a local data store, which enables sharing of data between work items within a wavefront. It is accessible from compute shaders; see GX2 Compute Shader Extension.
  • GPU7 has a global data store which allows for compute shaders to share data between SIMDs. This has not been enabled yet due to several hardware restrictions.
  • SIMDs export to the SX which will store the data for the next pipeline stage or write it out to memory.
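
To illustrate how GPR usage limits the number of runnable wavefronts, the Python sketch below divides a GPR budget by a shader's per-work-item GPR count. It assumes the budget is the portion of the 256 GPRs allocated to that shader type and that no other hardware limit applies; both are simplifications, and the function name is hypothetical.

```python
GPR_POOL = 256  # GPRs per work-item lane, shared by runnable wavefronts


def max_runnable_wavefronts(shader_gprs, gpr_budget=GPR_POOL):
    """Upper bound on wavefronts of one shader that fit in the GPR budget."""
    if shader_gprs < 1:
        raise ValueError("a shader uses at least one GPR")
    return gpr_budget // shader_gprs


for gprs in (8, 16, 40, 128):
    print(gprs, max_runnable_wavefronts(gprs))
# prints:
# 8 32
# 16 16
# 40 6
# 128 2
```

Under this model, a shader that uses fewer GPRs leaves room for more runnable wavefronts, which gives the SIMD more opportunities to hide memory latency.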

Shader Export (SX)

  • Stores shader output data for the next stage
    • position and attribute data from the vertex shader
    • pixel color from the pixel shader
  • Handles memory reads and writes from within the shader
    • stream out, scratch, GS ring buffers, scattered write, etc.

Texture Pipe (TEX)

  • Computes the texture coordinates and mip level
    • 1D, 2D, 3D, texture arrays, and cube maps
  • Performs texture filtering
    • Bilinear, Trilinear and Anisotropic filtering

Texture Cache (TC)

  • A memory cache to reduce granularity losses of external memory requests
  • Provides very high bandwidth access to texel data
  • Computes the texture address (including surface tiling) from the texture coordinates
  • Handles decompression for compressed textures

Render Backend (RB)

  • Performs per-sample operations
    • Z calculations
    • Z and stencil testing
    • alpha blending
  • Writes depth and the color data out to memory

Memory Controller (MC)

  • Handles all memory accesses for GPU7 and the CPU

GPU7 Functional Details

Hierarchical Z

  • The HiZ block computes a min and max Z value for each 8x8 pixel tile
  • The computed Z range is compared against the Z range in the HiZ buffer
  • The HiZ test has three possible results
    • If the HiZ test passes, the tile is trivially accepted
    • If it fails, the tile is rejected
    • If the result is unknown, a per-pixel Z test is performed
  • Per-pixel Z test can be performed before the pixel shader (early Z) or after the pixel shader (late Z)
    • Early Z can only be used if the pixel shader does not discard pixels or export Z, and if alpha test is not enabled
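
The range comparison can be sketched as a three-way classification. The Python below assumes a "less than" depth compare (smaller Z is closer); other compare functions would change the conditions. The function name and the string results are hypothetical.

```python
def hiz_test(tile_min, tile_max, buffer_min, buffer_max):
    """Classify a tile's Z range against the stored HiZ range.

    Assumption: the depth compare function is "less than".
    """
    if tile_max < buffer_min:
        return "pass"      # every pixel is closer than anything stored
    if tile_min >= buffer_max:
        return "fail"      # every pixel is at or behind everything stored
    return "unknown"       # ranges overlap, per-pixel Z test required


print(hiz_test(0.10, 0.20, 0.50, 0.90))  # prints: pass
print(hiz_test(0.95, 0.99, 0.50, 0.90))  # prints: fail
print(hiz_test(0.40, 0.60, 0.50, 0.90))  # prints: unknown
```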

Hierarchical Stencil

  • The HiStencil block contains 2 pretest states tested for each 8x8 pixel tile
  • The stencil reference value is compared against the pretest results in the HiS buffer
  • The HiS test has three possible results
    • If the HiS test passes, the tile is trivially accepted
    • If it fails, the tile is rejected
    • If the result is unknown, a per-pixel stencil test is performed

Texture Cache

  • The texture cache is split into two levels: an L1 cache and an L2 cache
  • The L1 cache is per texture pipe (GPU7 has 2 texture pipes)
  • The L2 cache is per memory channel (Cafe has 2 system memory channels)
  • Cache lines are 64 bytes for the L1 and L2 cache
  • Texture decompression occurs after the L1 cache is read and before the data is sent to the texture pipe
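
The cache sizes from the configuration table (8 KB per L1, 32 KB per L2) and the 64-byte line size give the following line counts. The Python sketch also shows how many uncompressed texels fit in one line for a given texel size; the texel sizes used are illustrative examples, and the helper name is hypothetical.

```python
CACHE_LINE_BYTES = 64
L1_BYTES = 8 * 1024    # per texture pipe
L2_BYTES = 32 * 1024   # per memory channel

l1_lines = L1_BYTES // CACHE_LINE_BYTES
l2_lines = L2_BYTES // CACHE_LINE_BYTES


def texels_per_line(bytes_per_texel):
    """Uncompressed texels that fit in one 64-byte cache line."""
    return CACHE_LINE_BYTES // bytes_per_texel


print(l1_lines, l2_lines)   # prints: 128 512
print(texels_per_line(4))   # 32-bit texel, prints: 16
print(texels_per_line(16))  # 128-bit texel, prints: 4
```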

Memory

  • Cafe has 2 different types of memory that can be used for graphics
    • 32 MB of high-bandwidth EDRAM (MEM1)
    • 2 GB of DDR3 system memory (MEM2)
  • MEM1 should be used for render targets, depth buffers, and auxiliary buffers
  • MEM2 should be used for all other graphics surfaces
    • Textures, vertex buffers, display buffers, etc.
  • For more information on memory usage, see Basic Memory Allocation
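
A rough way to judge whether a set of render targets fits in MEM1 is to add up the raw surface sizes. The Python sketch below does this for an illustrative 1280x720 color and depth buffer at 4 bytes per pixel. It ignores alignment, tiling padding, auxiliary buffers, and any part of MEM1 that is not available to the application, so it gives a lower bound on real usage. The resolution, formats, and helper name are assumptions for the example.

```python
MEM1_BYTES = 32 * 1024 * 1024


def surface_bytes(width, height, bytes_per_pixel, samples=1):
    """Raw surface size, ignoring alignment and tiling padding."""
    return width * height * bytes_per_pixel * samples


def fits_in_mem1(surfaces):
    """True if the raw sizes of all surfaces sum to at most 32 MB."""
    return sum(surfaces) <= MEM1_BYTES


for samples in (1, 4, 8):
    color = surface_bytes(1280, 720, 4, samples)
    depth = surface_bytes(1280, 720, 4, samples)
    print(samples, color + depth, fits_in_mem1([color, depth]))
# prints:
# 1 7372800 True
# 4 29491200 True
# 8 58982400 False
```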

Shader Engine

GPU7SIMDDiagram.jpg
  • GPU7 uses 5-way Very Large Instruction Word (VLIW) ALUs
    • 5 separate scalar operations can be executed by an ALU in a single clock, 1 operation per stream processor in the ALU.
  • 256 GPRs are available to each stream processor, shared between all shader types and runnable wavefronts
    • Reconfigurable fixed allocation, see GX2SetShaderModeEx
    • GPRs can hold data in one of the following formats:
      • 32-bit IEEE floats
      • 32-bit signed/unsigned integers
      • 64-bit IEEE doubles (reduced rate)
    • All ALUs can read data from any GPR
  • Address register (AR) for relative addressing of constants and GPR indexing
  • Shader execution is broken up into clauses controlled by a macro sequencer
    • A clause is a sequential list of instructions of the same type
      • E.g. ALU, texture fetch, or vertex fetch.
  • ALU constants can be read from memory (constant buffers) or a register file
    • Constant buffers
      • Constant register file cannot be used with constant buffers
      • Each ALU clause can access up to 2 banks of 16 constants

GPU7 Walkthrough

Vertex Path

  1. The CP instructs the VGT to start working on a new vertex list
  2. The VGT fetches triangle indices from memory (if necessary).
  3. The VGT assigns output indices for all non-duplicate vertices. Duplicate vertex work is minimized by tracking a history buffer (up to 14 entries).
  4. Non-duplicate vertices are sent to the SPI. Triangles (with their output indices) are sent to the PA.
  5. Once a block of 64 vertices has been received, the SPI creates a vertex thread, assigns it to a specific SIMD based on a round robin assignment, and initializes its GPRs with input data.
  6. The SQ parses the shader instructions associated with the vertex thread, sending commands to the appropriate SIMD, SX, TC, and TEX units as necessary.
  7. Completed vertex data is written into the parameter and position buffers at the appropriate (output) indices.
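
Step 5 can be sketched as batching followed by round-robin assignment. The Python below assumes strict alternation between the 2 SIMDs and assumes a final partial batch is flushed as its own thread; neither detail is specified above, and the function name is hypothetical.

```python
def assign_wavefronts(work_item_count, simd_count=2, wavefront_size=64):
    """Return (simd, first_item, end_item) per wavefront, end exclusive."""
    assignments = []
    starts = range(0, work_item_count, wavefront_size)
    for wave, start in enumerate(starts):
        end = min(start + wavefront_size, work_item_count)
        assignments.append((wave % simd_count, start, end))
    return assignments


print(assign_wavefronts(150))
# prints: [(0, 0, 64), (1, 64, 128), (0, 128, 150)]
```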

Vertex Path w/ Geometry Shaders

  1. The CP instructs the VGT to start working on a new vertex list
  2. The VGT fetches triangle indices from memory (if necessary).
  3. The VGT assigns output indices for all non-duplicate vertices. Duplicate vertex work is minimized by tracking a history buffer (up to 16 entries).
  4. Non-duplicate vertices are sent to the SPI.
  5. Once a block of 64 vertices has been received, the SPI creates an export thread, assigns it to a specific SIMD based on a round robin assignment, and initializes its GPRs with input data and the output ring buffer offset.
  6. The SQ parses the shader instructions associated with the export thread (vertex shader), sending commands to the appropriate SIMD, SX, TC, and TEX units as necessary.
  7. Completed vertex data is written into a ring buffer in memory.
  8. The VGT assembles primitives that have been written to the ring buffer and sends them to the SPI
  9. Once a block of 64 primitives has been received, the SPI creates a geometry thread, assigns it to a specific SIMD based on a round robin assignment, and initializes its GPRs with ring buffer offsets for each vertex of the primitive.
  10. The SQ parses the shader instructions associated with the geometry thread, sending commands to the appropriate SIMD, SX, TC, and TEX units as necessary.
  11. Completed primitive data is written into a ring buffer in memory.
  12. Vertices output by the geometry shader are sent to the SPI. Triangles (with their output indices) are sent to the PA.
  13. Once a block of 64 vertices has been received, the SPI creates a vertex thread, assigns it to a specific SIMD based on a round robin assignment, and initializes its GPRs with the ring buffer offset for each vertex.
  14. The SQ executes shader instructions to move the vertex data from the ring buffer to the parameter and position buffers at the appropriate (output) indices.

Pixel Path

  1. The PA fetches processed vertex positions from the position buffer and assembles triangles.
  2. The PA applies clipping and perspective transform, passing the resulting triangles to the SC.
  3. The SC subdivides each triangle into its component set of quads (2x2 pixels), and assigns barycentric values to each pixel.
  4. The SPI fetches processed vertex attributes (color, texture coordinates, normals etc.) from the parameter buffer, and generates interpolated values based on the barycentric coordinates.
  5. Once a block of 64 pixels (16 quads) has been received, the SPI creates a pixel thread, assigns it to a specific SIMD based on a round robin assignment, and initializes its GPRs with interpolated attribute values.
  6. The SQ parses the shader instructions associated with the pixel thread, sending commands to the appropriate SIMD, SX, TC, and TEX units as necessary.
  7. Completed pixel data is written to the RBs via the SX. The RBs apply the appropriate Z and blend operations, and write the resulting colors into the frame buffer.
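
The attribute interpolation in step 4 amounts to a weighted sum of the three vertex values. The Python sketch below shows that sum for one pixel. It assumes the barycentric weights are already perspective-corrected and sum to 1; the function name is hypothetical.

```python
def interpolate(bary, attr0, attr1, attr2):
    """Blend three per-vertex attribute tuples with barycentric weights."""
    b0, b1, b2 = bary
    return tuple(b0 * x + b1 * y + b2 * z
                 for x, y, z in zip(attr0, attr1, attr2))


# Vertex colors: red, green, blue.
red, green, blue = (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)

print(interpolate((0.5, 0.25, 0.25), red, green, blue))
# prints: (0.5, 0.25, 0.25)
```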

Compute Shader Path

  1. The CP instructs the VGT to start working on a new set of work-items
  2. The VGT outputs a unique identifier for each work-item in a compute dispatch call and sends it to the SPI
  3. Once a block of 64 work-items has been received (one wavefront), the SPI creates an export thread, assigns it to a specific SIMD based on a round robin assignment, and initializes its GPR with its unique identifier
  4. The SQ parses the shader instructions associated with the export thread (compute shader), sending commands to the appropriate SIMD, SX, TC, and TEX units as necessary.
  5. The VGT sends a dummy primitive to the PA, which discards it.

CONFIDENTIAL