RedBeard's Dev Blog

GPU Profiling in XNA

Posted by redbeard on May 25, 2011

Everyone knows that when you have performance problems, you need to profile your application to find hot-spots and optimize them. XNA doesn’t give very good feedback about where time is spent: for CPU usage there are a number of tools and instrumentation samples which will tell you where time goes, but the GPU isn’t quite so blessed. You can use tools like PIX for Windows (PIX on Xbox is excellent but is only available to XDK developers), but most XNA developers care about Xbox performance, where the tools are more limited; built-in instrumentation is possibly your best bet on Xbox. On Windows there are also vendor-specific options such as NVPerfHUD, which can tell you things that are hidden from all other programs & APIs.

For in-game instrumentation, we have to deal with whatever public APIs are exposed, so what tools are available to us through XNA? D3D9 supports Timestamp queries, which signal when they complete and record a timestamp for comparison against other events. XNA 4.0 HiDef exposes only Occlusion Queries, which can apparently be hijacked to behave like Event queries: they also signal when complete, but without a recorded timestamp. That timestamp is essential for diagnosing where time is being spent on the GPU, so we need to generate it on the CPU instead. If we just issue a bunch of queries throughout the frame and don’t try to check them until the end of the frame or the beginning of the next, there’s a good chance that several of them will have become signalled before we check, and they’ll all get the same timestamp; so we need to poll the queries continuously, or at least frequently.

My approach, which I encapsulated in a helper class called GpuTrace:

  1. Use a queue to track all issued queries. The GPU will execute them in-order, so only the oldest one is worth checking for completion at any time.
  2. Track a named label with each issued query for diagnosis
  3. At frame start, start a stopwatch and issue an occlusion query (Begin and End in a pair)
  4. Issue an occlusion query before and after any major GPU milestone (Clear, scene.draw, post-proc); you can fence in a bunch of Draw calls with just one query
  5. After each query is issued, poll the front of the queue to see if it’s completed yet
  6. Whenever a query on the polling queue has completed, record the CPU stopwatch timestamp and put it aside, perhaps in a separate queue of timestamps
  7. At the end of the frame, spin-wait for all active queries to finish (perform the previous step on each)
  8. Examine the list of timestamps. The first one indicates the latency between issuing the query and seeing it complete. Deltas between two timestamps indicate the time spent by the GPU completing whatever activity was between them.
  • Keep in mind that adding all these queries, constantly polling them, and waiting for the GPU to finish at frame end will negatively impact your overall framerate, but the integrity of the data provided should be reliable.
  • You may also extend the occlusion query mechanism to separate the Begin/End calls, so the query will actually record the number of modified pixels during its execution.
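The steps above can be sketched roughly as follows. This is a minimal illustration of the approach, not my actual GpuTrace class: the class shape, field names, and calling pattern here are my own invention for the example, while OcclusionQuery with its Begin/End/IsComplete members is XNA's real API.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using Microsoft.Xna.Framework.Graphics;

// Sketch of the queue-and-poll scheme described above.
class GpuTraceSketch
{
    struct PendingQuery { public OcclusionQuery Query; public string Label; }

    GraphicsDevice m_Device;
    Queue<PendingQuery> m_Pending = new Queue<PendingQuery>();
    List<KeyValuePair<string, TimeSpan>> m_Timestamps = new List<KeyValuePair<string, TimeSpan>>();
    Stopwatch m_Timer = new Stopwatch();

    public GpuTraceSketch(GraphicsDevice device) { m_Device = device; }

    public void BeginFrame()
    {
        m_Timestamps.Clear();
        m_Timer.Reset();
        m_Timer.Start();
        Mark("FrameStart");
    }

    // Issue an occlusion query as a fence around a GPU milestone,
    // then poll the oldest outstanding query.
    public void Mark(string label)
    {
        OcclusionQuery q = new OcclusionQuery(m_Device); // pool these in real code to avoid garbage
        q.Begin();
        q.End();
        m_Pending.Enqueue(new PendingQuery { Query = q, Label = label });
        Poll(false);
    }

    // Queries complete in-order, so only the front of the queue matters.
    void Poll(bool block)
    {
        while (m_Pending.Count > 0)
        {
            if (!m_Pending.Peek().Query.IsComplete)
            {
                if (block) continue;  // spin-wait at end of frame
                return;
            }
            // Record the CPU-side timestamp the moment completion is observed.
            m_Timestamps.Add(new KeyValuePair<string, TimeSpan>(
                m_Pending.Dequeue().Label, m_Timer.Elapsed));
        }
    }

    public void EndFrame() { Mark("FrameEnd"); Poll(true); }
}
```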

Here is a screenshot of my in-game profilers, and another. The GPU profiling information gleaned from this approach is displayed in the upper right; the numbers displayed are the time from frame start until the noted event, and the time delta between an event and the previous one. Also visible on the left is a multi-threaded CPU profiler which I borrowed and cleaned up from here; my cleanup was mostly focused on garbage collection. At the bottom of the screen is a framerate graph, which will show spikes and instability where a simple FPS number hides them; I’d like to do more work on profiling instrumentation to diagnose framerate spikes in particular.

Posted in XNA | Leave a Comment »

Garbage Control

Posted by redbeard on May 25, 2011

Garbage collection in XNA can be a big deal if you care about framerate hitching, which is visually jarring and can cause input polling problems. After a long enough time, any per-frame allocation will eventually be subject to garbage collection, which will pause all threads and search memory for live and dead objects to make space for more allocations. The garbage collector only has an opportunity to run if you allocate memory, therefore you can control when garbage collection happens by only performing allocations during expected downtime, such as during loading screens. The same wisdom applies for writing real-time applications in native code, but .NET provides more of a safety net if you do things quick & dirty at first; the GC will clean up your mess whereas a native app might just crash when running out of memory. Reducing and optimizing your allocations can also improve your loading times and reduce your minimum-spec hardware requirements.
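One common way to apply this (the method names here are my own, illustrative placeholders) is to force a collection at the end of a loading screen, so the heap starts the gameplay session freshly compacted and a pause happens where nobody notices:

```csharp
using System;

// Trigger a collection at a moment when a pause is invisible,
// rather than letting the GC fire mid-gameplay.
void FinishLoading()
{
    LoadAllContent();               // hypothetical: your content-loading routine
    GC.Collect();                   // compact the heap now, during the loading screen
    GC.WaitForPendingFinalizers();  // let finalizers run before gameplay resumes
}
```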


The best tool I’ve used for diagnosing garbage allocation & collection is CLR Profiler. It has a few quirks: the allocation graph has some rendering bugs with the line traces, it has no option to “ignore this node”, and the histogram views don’t always allow enough resolution to see all the objects which have been allocated (i.e. if they total less than 1KB).

With that said, the “histogram by age” view is quite useful for finding per-frame allocations; start up your game and get it into the desired state, then just let it run for a couple of minutes with no adjustments. After running a while, open up the “histogram by age” view and see if any allocations are younger than 1 minute. An option in the right-click context menu will even show you a visual call-stack of how those allocations happened. Note that if you exit your application, it will probably generate a bunch of garbage on shutdown, so it’s probably safe to ignore very-young allocations so long as the middle ground is clear.

Another useful view is the “allocation graph”, which will show you the list of all major allocations by basic type if you scroll all the way to the right, and the visual call-stack of how they were conjured as you look to the left. This view can be a little misleading if you have large chunks of one-time allocations for scratch memory pools, and there is no option to ignore or exclude specific nodes, but anything that bubbles up to the top should warrant investigation.

Major Sources of Garbage

… as discovered in my current codebase, with suggested corrective actions.

  • String manipulation:
    • Never use String.Format, for multiple reasons: hidden StringBuilder allocation, value types get boxed, hidden ToString calls allocate strings, and the params arguments create an array.
    • Never concatenate strings with + operator.
    • Never append non-string values onto a StringBuilder; those overloads call ToString under the hood.
    • Never call StringBuilder.ToString; it allocates a string to return.
    • Do all string manipulation with a pre-allocated, capacity-capped StringBuilder, and use custom functions like itoa & ftoa for numeric conversion.
    • SpriteBatch can render StringBuilder objects without calling ToString.
    • I use a custom StringBuilderNG class (NG = no garbage) which wraps a standard StringBuilder and forwards only the “safe” methods which generate no garbage, and implements new methods for custom conversion of int & float values. This approach is more prone to bugs, but itoa and ftoa are relatively easy to implement.
  • DateTime.Now: replace with Stopwatch.Elapsed when used for profiling
  • params[] arguments: pre-allocate an array of appropriate size, and populate it immediately before the call. Don’t forget to null out the entries afterwards to avoid dangling references.
  • Value type boxing for IComparer<T> & IEnumerable<T>: implement CompareTo on the underlying type, and use an explicit value-type Enumerator like in List<T>. This also helps reduce virtual function calls.
  • Worker threads typically need scratch memory to operate on; this can be pre-allocated in a pool, and each thread can grab a chunk of scratch memory when it starts up and release it when it’s done.
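As an illustration of the custom numeric conversion mentioned above, an itoa-style append might look like the sketch below. This is not my actual StringBuilderNG code (which wraps rather than extends StringBuilder), just a minimal standalone version of the idea:

```csharp
using System.Text;

// Append an int to a StringBuilder without calling ToString (which allocates).
static void AppendInt(StringBuilder sb, int value)
{
    if (value < 0) { sb.Append('-'); value = -value; }  // note: overflows on int.MinValue

    // Find the highest power of ten, then emit digits left to right.
    int divisor = 1;
    while (value / divisor >= 10)
        divisor *= 10;
    while (divisor > 0)
    {
        sb.Append((char)('0' + (value / divisor) % 10));
        divisor /= 10;
    }
}
```

Appending characters one at a time stays within StringBuilder’s pre-allocated capacity, so no garbage is produced as long as the builder never grows.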

Other Resources

Posted in XNA | 3 Comments »

Deferred Shading in CubeWorld

Posted by redbeard on May 16, 2011

For my CubeWorld prototype, I wanted to try some screen-space effects like SSAO (screen-space ambient occlusion) and also compare the performance of deferred lighting versus standard forward-rendering lights; I’m also interested in just implementing a deferred renderer as I haven’t experimented with the concept before.

I found some good foundation code & explanation in a series of articles, which got me started with some directional and point light functionality. I made a few modifications here and there, such as combining multiple directional lights into a single pass and taking some liberties with the C# and shader code; I also used a procedural cube instead of a sphere mesh for my point light. I also found some good intro material in the NVidia deferred rendering presentation from “6800 Leagues Under the Sea”, which includes a few optimizations that can help (if you’re not using XNA; I’ll get to that below). The performance of the deferred lighting is quite good on my PC, although I haven’t tried it extensively on the Xbox.

After seeing the deferred shading in action, I wanted to make even more use of the G-Buffer for effects that can exploit it, and one of the primary effects I’m interested in is SSAO, because the cube world looks rather artificial with all the faces shaded relatively flatly. I implemented the SSAO shader described in an article; the code it provides is dense and somewhat unintuitive, but it works, and the rest of the article explains the concepts used. The article offered little guidance for tweaking the 4 input parameters such as “bias” and “scale”, but I found some numbers which appeared to work, and named them more intuitively for my internal API. I’m currently using only a 9×9 separable blur rather than the 15×15 suggested in the article. The effect works, but the screen-space random field is plain to see, and it seems to be more pronounced on distant geometry; I can probably do some more work to try and resolve those artifacts. A much more distracting artifact is the total loss of ambient occlusion at the edges of the screen in certain conditions; I’m not sure if there’s a reasonable solution for that. I may try some static AO calculations for each cube face to see if I can get stable results that way.

The overall flow of my deferred renderer, currently (1 or more “passes” per step below):

  1. Render all scene geometry into G-Buffer
  2. Generate noisy SSAO buffer at half-resolution
  3. Blur SSAO buffer horizontally and vertically at full-resolution
  4. Accumulate directional and point lights, one per pass
  5. Combine lighting, albedo, and SSAO into final image
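In XNA 4.0 terms, the flow above boils down to a sequence of SetRenderTarget(s) calls with a geometry or full-screen pass between each. The skeleton below is only a sketch: the render targets, effects, and helper methods (DrawScene, DrawFullScreenQuad, DrawLightVolume) are hypothetical names standing in for my actual code, while the GraphicsDevice calls are XNA's real API.

```csharp
using Microsoft.Xna.Framework;
using Microsoft.Xna.Framework.Graphics;

// Skeleton of the per-frame pass sequence listed above.
void DrawFrame(GraphicsDevice device)
{
    // 1. G-Buffer: albedo, normals, and depth bound as multiple render targets
    device.SetRenderTargets(m_AlbedoRT, m_NormalRT, m_DepthRT);
    DrawScene(m_GBufferEffect);

    // 2. Noisy SSAO at half resolution
    device.SetRenderTarget(m_SsaoRT);
    DrawFullScreenQuad(m_SsaoEffect);

    // 3. Separable blur: one axis per pass
    device.SetRenderTarget(m_SsaoBlurRT);
    DrawFullScreenQuad(m_BlurHorizontalEffect);
    device.SetRenderTarget(m_SsaoRT);
    DrawFullScreenQuad(m_BlurVerticalEffect);

    // 4. Additively accumulate directional and point lights, one per pass
    device.SetRenderTarget(m_LightRT);
    device.Clear(Color.Black);
    device.BlendState = BlendState.Additive;
    foreach (Light light in m_Lights)
        DrawLightVolume(light);

    // 5. Composite lighting * albedo * AO into the backbuffer
    device.SetRenderTarget(null);
    DrawFullScreenQuad(m_CompositeEffect);
}
```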

Some issues I ran into when implementing my deferred shading in XNA:

  • XNA does not allow you to separate the depth-buffer from a render-target, which means you cannot use the stencil optimization for light volumes discussed in the NVidia “6800 Leagues” presentation. That optimization lets you light-shade only the pixels which are within the light volume, rather than all the ones that may be covered by it but are too distant to be affected. It requires that you retain the depth buffer from the geometry pass, use it to depth-test and store stencil values for the light geometry, and then apply those stencil values to a different render-target, specifically the light accumulation buffer.
  • Xbox 360 has 10MB of framebuffer memory linked to the GPU, which works great if you’re rendering a single 1280×720 render-target and depth-buffer at 4 bytes per pixel each (about 7MB total). When you want 3 render-targets and a depth-buffer, you can either “tile” the framebuffer and re-draw the geometry multiple times, or you can drop the resolution until all the buffers fit; I opted for the latter, using 1024×576 (for 16:9). XNA doesn’t expose the ability to resolve the depth-buffer to a texture, which means you must include your own depth render-target in your G-Buffer; without that extra target, the resolution could be increased. On PC the memory limitation is lifted, but you still can’t read back depth via D3D9, so the extra buffer still applies.
  • I can see visible banding on my point lights; I’m not sure whether this is due to banding in the light buffer itself or in the final compositing. XNA 4.0 exposes the HdrBlendable format, which on Xbox uses a 10-bit floating-point value per component, but with only 7 bits of mantissa I’m not convinced it offers any reduced banding over 8-bit fixed-point components, just a different pattern.

Screenshots of my results:

  • Directional and point lights: screenshot (debug display shows albedo, depth, normals, and lighting)
  • SSAO random samples before blurring: screenshot (slightly more noisy than it should be, due to non-normalized random vectors)
  • SSAO after blurring: screenshot
  • Comparison images from before deferred shading was implemented: shot 1, shot 2

Other resources I came across while implementing these things:

Posted in CubeFortress, XNA | Leave a Comment »

Threading on Xbox360 with XNA

Posted by redbeard on May 15, 2011

I discovered recently that threads on the Xbox are not managed or scheduled across CPU cores automatically. You must call SetProcessorAffinity from within the target thread, and that thread will run only on the specific core (hardware thread) from then onwards. I wrote some helper code at the top of my worker-thread management function (m_Threads is the list of potentially active thread objects). My “do thread work here” implementation just spins waiting for work to arrive in a queue.

private static void DoThreadWork(object threadObj)
{
    Thread myThread = (Thread)threadObj;
#if XBOX
    // manually assign hardware thread affinity
    lock (m_Threads)
    {
        // avoid threads 0 & 2 (reserved for XNA), and thread 1 (main render thread)
        int[] cpuIds = new int[] { 5, 4, 3 };
        for (int n = 0; n < m_NumActiveThreads; ++n)
        {
            if (m_Threads[n] == myThread)
            {
                // pin this thread to one of the free hardware threads
                myThread.SetProcessorAffinity(cpuIds[n % cpuIds.Length]);
                break;
            }
        }
    }
#endif
    // do actual thread work here
}

Posted in XNA | Leave a Comment »

Year To Date

Posted by redbeard on May 15, 2011

  • March 2011: “Marksman: Long Range” released, see postmortem.
  • April 2011: Started working on TileTactics, a top-down tile-based tactical game concept inspired by X-COM. Got some basic functionality working: tile visibility for units, path-finding, some primitive UI state machine stuff. Put this project on the back burner because I wasn’t feeling terribly inspired by the overall vision: I didn’t want to make a straight-up X-COM clone, and I was concerned about the viability of the project on the Xbox Live marketplace.
  • April 2011:  Started on CubeWorld (working title), inspired by Minecraft to create a cubular world where the geometry is structured but random and infinite. This is my currently active project, so I’ll write some more articles about the details.

Posted in CubeFortress, Marksman, TileTactics, XNA | Leave a Comment »