Network Optimization on the Wii U

The following describes techniques to improve overall network throughput using the Wii U network API along with common pitfalls to avoid with its use. Techniques are described for large file downloads using TCP sockets and the Wii U LibCurl implementation and for performance improvements for UDP sockets used with low-latency gaming. All described optimization methods apply to both TCP and UDP except where otherwise noted.

Optimal Socket Architecture

Taking a top-down approach, there are a few things to note when building the network engine for your game on the Wii U. The first and most important thing is to have networking code on a separate thread. The Wii U supports non-blocking sockets in that calls to SOSend and SORecv can return SO_EWOULDBLOCK, but these calls still average about one to two milliseconds to return. There are rare cases in which they take much longer. As a result, having any socket operations on the main thread is decidedly a poor choice.

Given this per-call latency, non-blocking sockets offer no benefit on the Wii U. Use SOSelect with blocking sockets instead, so that as little time as possible is spent in the I/O call itself.

Rapid polling of socket I/O functions, such as in a tight loop, may clog the I/O pipeline and severely degrade system performance. A safe and overall reasonable network I/O threaded model polls all sockets using one SOSelect call without a timeout. This ensures that no calls to the I/O subsystem are wasted, and still allows data to be processed as soon as it arrives.

Once your socket I/O is on a separate thread, Core 2 is the best place to run that thread. Core 2 of the CPU always belongs to your application, even while it is paused and the user is in the HOME Menu. This means that as long as you keep your socket I/O calls isolated to Core 2, you should not have problems with socket timeouts or disconnects while your application is in the background.
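
The following is a minimal sketch of the threaded model described above. It is written against POSIX select() and recv() as stand-ins, because the exact SOSelect and SORecv signatures are not reproduced in this document. The function name wait_and_receive and its parameters are hypothetical. Passing no timeout makes the call block until at least one socket is readable, so the I/O thread makes no wasted calls into the I/O subsystem.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Blocks until at least one of the given sockets is readable, then reads
 * from the first readable one. POSIX select()/recv() stand in for
 * SOSelect/SORecv here (an assumption, not the Wii U API).
 * Returns the number of bytes read (0 means the peer closed), or -1 on
 * error. *ready_index receives the index of the socket that was read. */
ssize_t wait_and_receive(const int *socks, int count,
                         void *buf, size_t buf_size, int *ready_index)
{
    fd_set readable;
    int maxfd = -1;
    int i;

    assert(socks != NULL && count > 0 && buf != NULL && ready_index != NULL);

    FD_ZERO(&readable);
    for (i = 0; i < count; i++) {
        FD_SET(socks[i], &readable);
        if (socks[i] > maxfd)
            maxfd = socks[i];
    }

    /* NULL timeout: block until data arrives on any socket. */
    if (select(maxfd + 1, &readable, NULL, NULL, NULL) < 0)
        return -1;

    for (i = 0; i < count; i++) {
        if (FD_ISSET(socks[i], &readable)) {
            *ready_index = i;
            return recv(socks[i], buf, buf_size, 0);
        }
    }
    return -1; /* not reached when select() reports a ready socket */
}
```

Calling this function in a loop on the dedicated network thread gives one blocking wait per batch of incoming data, rather than one I/O call per socket per iteration.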

Minimizing Per-Call Latency

The best method of minimizing per-call latency on socket I/O calls is to use buffers that are aligned in both size and address with cache lines, such as with the #define PPC_IO_BUFFER_ALIGN. Using misaligned buffers causes the average latency for send and receive operations to increase to about 2 milliseconds. Using aligned buffers minimizes the latency to an average of 1 millisecond per call. If you notice dramatic spikes in the latency of send and receive operations, the most likely cause is putting too much stress on the I/O subsystem, such as making I/O calls in a tight loop. This is not exclusive to socket I/O. Any I/O functions, including file operations, may result in spikes in the call latency. For instance, while calling SORecv in a tight loop, spikes up to 33 milliseconds have been noted.
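
As an illustration of "aligned in both size and address", the sketch below rounds a requested size up to the alignment and allocates at an aligned address using the C11 aligned_alloc function. The value 64 is only an assumed stand-in for PPC_IO_BUFFER_ALIGN. Take the real value from the SDK headers. The names io_round_up and io_buffer_alloc are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed stand-in for the SDK's PPC_IO_BUFFER_ALIGN; must be a power
 * of two. The real value comes from the SDK headers. */
#define IO_BUFFER_ALIGN_ASSUMED 64u

/* Rounds size up to the next multiple of the alignment. */
size_t io_round_up(size_t size)
{
    size_t mask = (size_t)IO_BUFFER_ALIGN_ASSUMED - 1;
    return (size + mask) & ~mask;
}

/* Allocates a buffer whose address and size are both multiples of the
 * alignment. The padded size is returned through actual_size.
 * Release the buffer with free(). */
void *io_buffer_alloc(size_t min_size, size_t *actual_size)
{
    size_t padded;

    assert(min_size > 0 && actual_size != NULL);
    padded = io_round_up(min_size);
    *actual_size = padded;
    return aligned_alloc(IO_BUFFER_ALIGN_ASSUMED, padded);
}
```

With the assumed alignment, a request for 1500 bytes yields a 1536-byte buffer. Pass the padded size, not the requested size, as the buffer length when the whole buffer is handed to a receive call.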

To mitigate some of the per-call latency, SDK 2.09 and later include the SOSendToMulti function. This function may be used with UDP sockets to simultaneously send the same data to multiple peers. If you are sending the same data to multiple peers over UDP, SOSendToMulti allows you to do the same work with less stress on the I/O subsystem and only one I/O call delay of about one millisecond.

Beginning in SDK 2.11.0 the SOSendToMultiEx function is included. It functions similarly to SOSendToMulti except it allows you to send different data packets to different peers.

Maximizing Network Throughput

There are two primary strategies for increasing overall network bandwidth on the Wii U system: enabling high performance socket options and splitting transfer operations across multiple connections. Options that apply only to TCP or LibCurl are noted as such.

Enable high performance socket options

If using the socket API directly, the following options may improve network performance.

  1. Increase the internal buffer size

    By default, each TCP socket on the Wii U has two 8-kilobyte (KB) internal buffers, one for SOSend[To] operations and one for SORecv[From] operations. For UDP sockets, the default send and receive buffers are 1492 bytes and 4 KB respectively. Increasing these buffer sizes is the easiest way to improve overall performance. The maximum size for a buffer is 65535 bytes, unless the TCP window scaling option is already enabled, in which case the maximum size is 128 KB. Because the current buffer sizes are factored in during the initial TCP handshake, we recommend that you set your buffer sizes before you begin data transfer. Also, do not pass a buffer to SOSend[To] or SORecv[From] that is larger than the internal buffer. A single call never returns more data from the socket than the internal buffer can hold.

    NOTE:
    With UDP sockets, it is not necessary to increase the size of the send buffer. UDP packets are copied directly to the network stack on send. A larger send buffer size does not have any effect on performance.

    • Benefits:

      While some socket implementations support buffer sizes in the range of megabytes, the Wii U uses a more memory-efficient approach. Since the Wii U buffers are so small, any data left in them begins decreasing the window size of the TCP stream, which results in slower downloads. A larger buffer means that you may receive more data off the socket with each call, which dramatically increases performance. In UDP terms, a larger buffer size means that you may have more packets waiting on the socket before you call SORecv[From] without dropping any packets.

    • Pitfalls:

      The buffers used for sockets are allocated from an internal I/O heap, not from main system memory. The amount of space on this heap is limited. As you approach its upper bounds, you will begin to receive SO_ENOMEM and SO_ENOBUFS errors. There is no hard limit on what causes these errors, but if you have 4 sockets, all with 64 KB send and receive buffers, you will probably start to see them. The only solution here is to tune your send and receive buffers on a per-socket basis to exactly match your needs.

    • SOSocket Code:
      int ret;
      int bufSize = 64 * 1024 - 1; // 65535 bytes, the maximum without window scaling
      ret = SOSetSockOpt(sock, SOL_SOCKET, SO_RCVBUF, &bufSize, sizeof(bufSize));
      if(ret != 0)
        OSReport("Failed to set recv buffer size: %d\n", SOLastError());
      
      ret = SOSetSockOpt(sock, SOL_SOCKET, SO_SNDBUF, &bufSize, sizeof(bufSize));
      if(ret != 0)
        OSReport("Failed to set send buffer size: %d\n", SOLastError());
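
The sizing rules in this item reduce to simple arithmetic, sketched below. The two limits (65535 bytes, or 128 KB with window scaling) come from the text above. The function names clamp_sockbuf_size and io_heap_demand are hypothetical helpers, not SDK functions, and the size of the I/O heap itself is not stated in this document.

```c
#include <assert.h>
#include <stddef.h>

#define SOCKBUF_MAX_DEFAULT   65535u          /* limit without window scaling */
#define SOCKBUF_MAX_WINSCALE  (128u * 1024u)  /* limit with window scaling */

/* Clamps a requested internal buffer size to the limit described above. */
unsigned clamp_sockbuf_size(unsigned requested, int winscale_enabled)
{
    unsigned limit = winscale_enabled ? SOCKBUF_MAX_WINSCALE
                                      : SOCKBUF_MAX_DEFAULT;
    return requested > limit ? limit : requested;
}

/* Total bytes that a set of sockets would claim from the I/O heap,
 * counting one send buffer and one receive buffer per socket. */
unsigned long io_heap_demand(const unsigned *send_sizes,
                             const unsigned *recv_sizes, size_t count)
{
    unsigned long total = 0;
    size_t i;

    assert(send_sizes != NULL && recv_sizes != NULL);
    for (i = 0; i < count; i++)
        total += (unsigned long)send_sizes[i] + recv_sizes[i];
    return total;
}
```

Four sockets with 65535-byte send and receive buffers claim 524280 bytes, roughly 512 KB, which is the configuration the Pitfalls paragraph warns about. Summing the planned sizes this way before creating sockets makes it easier to decide which sockets actually need large buffers.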
      
  2. Turn on TCP selective acknowledgement

    Selective acknowledgment (SACK) allows the TCP stream to acknowledge discontinuous streams of data. This option is not supported on UDP, but is enabled by default for TCP sockets in SDK 2.08 and later.

    • Benefits:

      Suppose a server sends you 16 KB of data across 8 packets. If the first packet containing the first 2 KB were lost, selective acknowledgment allows the Wii U to tell the server to resend only the first 2 KB packet, rather than the entire 16 KB chunk of data. In situations with packet loss, this is highly beneficial.

    • Pitfalls:

      To be used, selective acknowledgment must be supported on the server as well as the Wii U. This should not be a problem because all popular TCP stacks support selective acknowledgement. Since TCP requires data to be delivered in order, use of SACK may put a strain on your buffer size because it has to store all out-of-order data delivered until the missing packet is retrieved. Until the buffer has all of the data in order, SORecv[From] cannot return any additional bytes.

    • SOSocket Code:
      int ret;
      int sackOn = 1;
      ret = SOSetSockOpt(sock, SOL_SOCKET, SO_TCPSACK, &sackOn, sizeof(sackOn));
      if(ret != 0)
        OSReport("Failed to enable TCP SACK: %d\n", SOLastError());
      
  3. Enable TCP window scaling

    Window scaling on the Wii U allows a TCP server to send up to 128 KB at a time to the Wii U – twice the default limit. This option is not supported on UDP sockets and may be enabled only before establishing a connection.

    • Benefits:

      In an environment with a large round-trip time, window scaling allows the server to send more data between received acknowledgements. This means that the server spends less time idling as it waits for the data it sent to be acknowledged, which helps with overall efficiency.

    • Pitfalls:

      The most common pitfall when using window scaling is forgetting to enable it before you set the socket's internal buffer size. If the window scaling option is not enabled and you attempt to set a buffer size of more than 65535 bytes, either you receive an error or the value is silently capped. It is also important to note that with 128 KB buffers, you quickly run out of available space on the I/O heap.

    • SOSocket Code:
      int ret;
      int winScale = 1;
      ret = SOSetSockOpt(sock, SOL_SOCKET, SO_WINSCALE, &winScale, sizeof(winScale));
      if(ret != 0)
        OSReport("Failed to enable window scaling: %d\n", SOLastError());
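
The benefit of a larger window can be estimated with the usual window-over-round-trip-time bound: a sender can have at most one receive window of unacknowledged data in flight per round trip. The sketch below computes that upper bound. The function name max_tcp_throughput is hypothetical, and the 100 ms round-trip time used in the note that follows is only an illustrative value.

```c
#include <assert.h>

/* Upper bound on TCP throughput in bytes per second for a given receive
 * window and round-trip time: at most one full window per round trip.
 * Ignores packet loss, slow start, and protocol overhead. */
unsigned long max_tcp_throughput(unsigned long window_bytes,
                                 unsigned long rtt_ms)
{
    assert(rtt_ms > 0);
    return window_bytes * 1000ul / rtt_ms;
}
```

With a 100 ms round trip, a 65535-byte window caps a single connection at 655350 bytes per second (about 640 KB/s), while a 128 KB window raises the cap to 1310720 bytes per second (1280 KB/s). On a link with a short round-trip time the window is rarely the limiting factor, so the gain is smaller.
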
  4. Enabling socket options via LibCurl

    You may also set the high performance socket options via LibCurl. You do so via a callback passed to curl_easy_setopt, like so:

    // The LibCurl socket option callback returns int; 0 tells LibCurl to proceed.
    static int curl_sockopt(void *pData, curl_socket_t sock, curlsocktype use)
    {
    	int ret;
    
    	int winScale = 1;
    	ret = SOSetSockOpt(sock, SOL_SOCKET, SO_WINSCALE, &winScale, sizeof(winScale));
    	if(ret != 0)
    		OSReport("SO_WINSCALE Error: %d\n", SOLastError());
    
    	int sackOn = 1;
    	ret = SOSetSockOpt(sock, SOL_SOCKET, SO_TCPSACK, &sackOn, sizeof(sackOn));
    	if(ret != 0)
    		OSReport("SO_TCPSACK Error: %d\n", SOLastError());
    
    	int bufSize = 128 * 1024;
    	ret = SOSetSockOpt(sock, SOL_SOCKET, SO_RCVBUF, &bufSize, sizeof(bufSize));
    	if(ret != 0)
    		OSReport("SO_RCVBUF Error: %d\n", SOLastError());
    
    	return 0;
    }
    
    // Call this when setting up curl options
    curl_easy_setopt(curlHandle, CURLOPT_SOCKOPTFUNCTION, curl_sockopt);

Socket options: performance results

Splitting Transfer Operations

While downloading or uploading a large amount of content, it may help to split the transfer operation among multiple connections. To do this, create multiple sockets by using the SOSocket API or by making multiple HTTP or HTTPS requests using the LibCurl API. This method of optimization applies only to data streaming, such as with TCP and LibCurl.

LibCurl Example

We have a 200 MB (209715200 bytes) file on a server. Instead of making one HTTP request for the entire file, we make two requests for different sections of the file.

If the client does not know the size of the file, it can request the resource with no body and only the header is returned. This allows us to read the header and know the size of the file. The server may not know the size of the resource, for example, if the content of the resource is generated. For the following example, we assume that the server always knows the size of the resource (file).

  1. Requesting the file size
    BOOL GetFileSize(unsigned* pSize, const char* url)
    {
        BOOL bRet = FALSE;
        CURL *curl = curl_easy_init();
        if(curl){
            double dSize = 0.0;
            curl_easy_setopt(curl, CURLOPT_URL, url);
            curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);
    
            // This is the KEY option that tells the server not
            // to send the body (the actual content).
            curl_easy_setopt(curl, CURLOPT_NOBODY, 1L);
    
            // Only trust the reported length if the request succeeded.
            // LibCurl reports a negative length when the size is unknown.
            if( curl_easy_perform(curl) != CURLE_OK ||
                curl_easy_getinfo(curl, CURLINFO_CONTENT_LENGTH_DOWNLOAD, &dSize) != CURLE_OK ||
                dSize < 0 )
                OSReport(" * Unknown content length!\n");
            else
            {
                *pSize = (unsigned)dSize;
                OSReport("*******Got File size! %u\n", *pSize);
                bRet = TRUE;
            }
    
            curl_easy_cleanup(curl);
        }
    
        return bRet;
    }
  2. Making requests for file chunks
    nChunk = nFileSize / nNumDownloads;
    nRemainder = nFileSize % nNumDownloads;
    nPrev = 0; // offset of the first byte of the next chunk
    
    CURL *curl_easy_handle;
    for(int easyCount = 0; easyCount < nNumDownloads; easyCount++){
      curl_easy_handle = curl_easy_init();
    
      curl_easy_handles[easyCount] = curl_easy_handle;
    
      curl_easy_setopt(curl_easy_handle, CURLOPT_URL, url);
      curl_easy_setopt(curl_easy_handle, CURLOPT_FOLLOWLOCATION, 1L);
    
      // Set improved socket options (callback defined in the previous section)
      curl_easy_setopt(curl_easy_handle, CURLOPT_SOCKOPTFUNCTION, curl_sockopt);
    
      // Spread the remainder over the first chunks, one extra byte each
      unsigned nInc;
      if( nRemainder > 0 ){
        nInc = nChunk + 1;
        --nRemainder;
      }else
        nInc = nChunk;
    
      // HTTP byte ranges are inclusive at both ends
      char pBuffer[32];
      snprintf(pBuffer, sizeof(pBuffer), "%u-%u", nPrev, nPrev + nInc - 1);
      curl_easy_setopt(curl_easy_handle, CURLOPT_RANGE, pBuffer);
    
      nPrev += nInc;
    
      OSReport("Chunk: %u  %s\n", nInc, pBuffer);
    }
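
The range arithmetic in this step can be checked in isolation. The sketch below computes the same inclusive byte ranges as a pure function with no LibCurl dependency. The name format_chunk_range and its parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Formats the inclusive HTTP byte range for chunk `index` out of `count`
 * chunks of a file of `file_size` bytes. The first (file_size % count)
 * chunks are one byte larger, so the ranges cover the file exactly.
 * Returns the chunk length in bytes. */
unsigned format_chunk_range(unsigned file_size, unsigned count,
                            unsigned index, char *out, size_t out_size)
{
    unsigned base, extra, start, length;

    assert(count > 0 && index < count && file_size >= count);
    base = file_size / count;
    extra = file_size % count;

    /* Bytes before this chunk: `index` chunks of `base` bytes, plus one
     * extra byte for each earlier chunk that received part of the
     * remainder. */
    start = index * base + (index < extra ? index : extra);
    length = base + (index < extra ? 1u : 0u);

    snprintf(out, out_size, "%u-%u", start, start + length - 1u);
    return length;
}
```

For the 209715200-byte file split into two downloads, this yields the ranges 0-104857599 and 104857600-209715199, each 104857600 bytes long.
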
  3. Processing requests
    // init a multi stack
    curl_multi_handle = curl_multi_init();
    
    // add the individual transfers
    for(int i = 0; i < curl_easy_handle_count; i++)
      curl_multi_add_handle(curl_multi_handle, curl_easy_handles[i]);
    
    CURLMcode status;
    int still_running;
    
    OSReport("Downloading...\n");
    do {
      status = curl_multi_perform(curl_multi_handle, &still_running);
    } while(status == CURLM_CALL_MULTI_PERFORM);
    
    while(still_running) {
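      // Wait up to one second when LibCurl suggests no timeout, so that
      // the loop never spins on select() with a zero timeout.
      struct timeval timeout = { 1, 0 };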
      int rc; // select() return code
    
      fd_set fdread, fdwrite, fdexcep;
      int maxfd = -1;
      long curl_timeo = -1;
    
      FD_ZERO(&fdread);
      FD_ZERO(&fdwrite);
      FD_ZERO(&fdexcep);
    
      curl_multi_timeout(curl_multi_handle, &curl_timeo);
      if(curl_timeo >= 0){
        timeout.tv_sec = curl_timeo / 1000;
        timeout.tv_usec = (curl_timeo % 1000) * 1000;
      }
    
      //Get file descriptors from the transfers
      curl_multi_fdset(curl_multi_handle, &fdread, &fdwrite, &fdexcep, &maxfd);
      rc = select(maxfd+1, &fdread, &fdwrite, &fdexcep, &timeout);
    
      switch(rc) {
        case -1: // select error
          still_running = 0;
          OSReport("ERROR: select() returned error\n");
        break;
        case 0:
        default: // timeout or readable/writable sockets
          do {
            status = curl_multi_perform(curl_multi_handle, &still_running);
          } while(status == CURLM_CALL_MULTI_PERFORM);
        break;
      }
    }
  4. Evaluating results
    for(unsigned h = 0; h < nNumDownloads; ++h)
    {
      // LibCurl writes these values through a pointer to long
      long nRespCode = 0;
      curl_easy_getinfo(curl_easy_handles[h], CURLINFO_RESPONSE_CODE, &nRespCode);
      OSReport("CURLINFO_RESPONSE_CODE: %ld\n", nRespCode);
    
      curl_easy_getinfo(curl_easy_handles[h], CURLINFO_HTTP_CONNECTCODE, &nRespCode);
      OSReport("CURLINFO_HTTP_CONNECTCODE: %ld\n", nRespCode);
    }
  5. Cleaning up
    for (unsigned i = 0; i < curl_easy_handle_count; i++) {
      curl_multi_remove_handle(curl_multi_handle, curl_easy_handles[i]);
      curl_easy_cleanup(curl_easy_handles[i]);
    }
    
    curl_multi_cleanup(curl_multi_handle);
    curl_global_cleanup();
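
The steps above do not show where the downloaded bytes go. One possible approach, sketched here under that assumption, is to give each transfer a write callback that copies its data into the correct region of a single file-sized buffer. The ChunkSink structure and the chunk_write function are hypothetical. The function matches the LibCurl write callback signature, so it could be registered with CURLOPT_WRITEFUNCTION, with a ChunkSink passed through CURLOPT_WRITEDATA.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-transfer state: where this chunk's bytes belong in
 * the reassembled file and how many have arrived so far. */
typedef struct {
    unsigned char *file_buf;  /* buffer large enough for the whole file */
    size_t         offset;    /* first byte of this chunk in the file */
    size_t         length;    /* size of this chunk in bytes */
    size_t         received;  /* bytes written so far */
} ChunkSink;

/* Copies incoming data to this chunk's region of the file buffer.
 * Returning fewer bytes than were offered makes LibCurl abort the
 * transfer, which is used here to reject data beyond the range. */
size_t chunk_write(char *data, size_t size, size_t nmemb, void *userdata)
{
    ChunkSink *sink = (ChunkSink *)userdata;
    size_t bytes = size * nmemb;

    assert(sink != NULL && sink->file_buf != NULL);
    if (bytes > sink->length - sink->received)
        return 0; /* server sent more than the requested range */

    memcpy(sink->file_buf + sink->offset + sink->received, data, bytes);
    sink->received += bytes;
    return bytes;
}
```

Because each transfer writes only inside its own range, the chunks may complete in any order. After all transfers finish, a chunk whose received count is less than its length indicates an incomplete download.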

Splitting transfer operations: performance results

The results from this test may be reproduced by using the sample demo/speed_test/speed_test.cpp.

Notes on Wi-Fi Testing

Many development environments have poor wireless performance because of high traffic and multiple routers competing for air time. This is quite dissimilar to the environment found in most homes. The only way to obtain maximum wireless throughput in a development environment is to conduct tests inside a Faraday cage. If you do not have one available, use the wireless channel with the lowest traffic for testing and ensure that the routers are optimally spaced to obtain the best results for your environment.

Wireless Channels and Congestion

The 802.11g standard that is implemented by the Wii U defines 14 channels in the 2.4 GHz band, and neighboring channels overlap heavily. Nintendo recommends that you use only channels 1, 6, or 11. The key reason for this recommendation is that communication to devices on overlapping channels cannot occur simultaneously, even between routers. For example, if you have three routers, one on channel 6, one on channel 11, and one on channel 8, the router on channel 8 must wait for any client communication with the routers on channels 6 and 11 to finish before it sends or receives data. This delay has a noticeable effect on performance.
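
The overlap in the example above can be checked from the channel center frequencies. Channels 1 through 13 are spaced 5 MHz apart starting at 2412 MHz, and channel 14 sits at 2484 MHz. The sketch below treats each channel as 20 MHz wide, which is a simplifying assumption. The function names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Center frequency in MHz of a 2.4 GHz Wi-Fi channel (1 to 14).
 * Channels 1 to 13 are spaced 5 MHz apart; channel 14 is at 2484 MHz. */
int channel_center_mhz(int channel)
{
    assert(channel >= 1 && channel <= 14);
    return channel == 14 ? 2484 : 2407 + 5 * channel;
}

/* Nonzero when two channels overlap, treating each channel as 20 MHz
 * wide (a simplifying assumption for this sketch). */
int channels_overlap(int a, int b)
{
    return abs(channel_center_mhz(a) - channel_center_mhz(b)) < 20;
}
```

Under this model, channel 8 overlaps both channel 6 and channel 11, while channels 1, 6, and 11 do not overlap each other. This matches the recommendation above.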

The factory default for most routers is channel 6. This means that channel 6 is probably the most congested channel in your environment. The best way to test this is to use a spectrum analyzer, which may be found online as freeware for a PC/Mac or as a cell phone application. A spectrum analyzer shows you the frequency ranges that have the least amount of traffic.

Physical AP Placement

Ideally, the signal radius of a router does not overlap that of any other router on the same channel. The signal radius of a router may be controlled by the power output of the antenna. A signal strength of about -90 dBm may be considered the cutoff for the radius of the antenna. As mentioned earlier, routers with overlapping signal radii on the same channel compete for bandwidth, because only one device on a single channel may communicate at a time.

Final Note

By using the available Wii U hardware and operating system specific network optimizations, you may obtain reasonable network performance that is adequate for modern low-latency gaming.

Appendix

The sample code in the demo performs a LibCurl multi-interface HTTP performance test. The demo may be found in $CAFE_ROOT/system/src/demo/speedtest/speedtest.cpp. The demo overview may be found here.

Revision History

2014/05/14 Updated test results, run on SDK 2.11.06.
2013/07/15 Converted PDF files to HTML.


CONFIDENTIAL