RedBeard's Dev Blog

Archive for May, 2011

JIT Optimizations on Xbox

Posted by redbeard on May 27, 2011

I came to a sickening realization last night: the JIT optimization performed by the .NET compact framework on Xbox is pitifully weak. Some of the most basic optimizations I’ve come to expect from a decade of C++ programming and “trust the compiler” advice are not present. This realization has the potential to kill my CubeWorld project if it prevents me from reaching my perf goal on Xbox.

Posted in CubeFortress, XNA | 1 Comment »

Static Ambient Occlusion in CubeWorld

Posted by redbeard on May 26, 2011

I was pleased with the visual impact of screen-space ambient occlusion (SSAO) in my deferred shading system, but two artifacts were too big to ignore and made SSAO inappropriate for this project: 1) screen-space noise was quite noticeable despite the pseudo-Gaussian blurring step, and 2) the ambient occlusion disappears at the edges of the screen when the occluding surface moves off-screen. I wanted ambient occlusion that was more stable and perhaps a bit less expensive to render; after all, I’m just rendering a bunch of cubes!

For CubeWorld, static ambient occlusion appears to improve on the flaws of SSAO without too many drawbacks. For non-cubular geometry the ambient occlusion calculation can become exceedingly expensive, which is why SSAO was invented, but the mostly-static cubes allow for a relatively cheap discrete approximation. My implementation effectively samples the ambient occlusion term at each vertex and lets interpolation on the GPU smooth things out. For a visible face vertex there can be 0-3 adjacent cube volumes, so I assign an AO term which ranges from 0 to 1 in increments of 1/3. Results look good (although this particular screenshot is a bit dark because I toned down the ambient light and the camera-position point-light).
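
Concretely, the per-vertex term boils down to something like the sketch below (illustrative names only; the chunk lookups and coordinate plumbing are omitted). For each vertex of a visible face, the two edge-adjacent cells and the one diagonal cell in the layer the face points into can each be occupied; the cell directly in front of the face must be empty or the face wouldn’t be visible, hence 0-3.

// Per-vertex AO from the three possible occluders of a visible-face vertex.
// side1/side2 are the edge-adjacent cells, corner is the diagonal cell, all in
// the layer in front of the face. 1.0 means fully unoccluded.
float VertexAO(bool side1, bool side2, bool corner)
{
    int occupied = (side1 ? 1 : 0) + (side2 ? 1 : 0) + (corner ? 1 : 0);
    return 1.0f - occupied / 3.0f;   // 1, 2/3, 1/3, or 0
}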

Problems

Sticking point 1: I compute all my world geometry procedurally on worker threads, in 32^3 chunks of unit cubes. At the boundary between chunks there was no guarantee that neighboring data existed to compute the ambient occlusion neighborhood properly, so visual seams appeared on continuous surfaces. To solve this, I separated my world generation into two phases – cube data and vertex buffers – and reduced the visible area without adjusting the data area, so that a margin of one generated chunk exists around the boundaries of the visible chunk grid. This has the detrimental impact of either reducing my visual range or increasing my computational cost to keep the same visual range, because I now need an extra margin around everything. It would perhaps have been cleaner to generate a 1-cube margin around each chunk, but most of my chunk generation code is discontinuous and relies on a random number generator seeded from the chunk ID rather than the IDs of the unit cubes. As a side effect of making the cube neighborhood look across chunk boundaries, my hidden face removal is now more aggressive and perf is slightly better, since the game has less geometry to render.
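
In pseudocode, the two-phase split looks roughly like this (ChunkCoord, the radii, and the Ensure* helpers are illustrative names, not the actual implementation):

// Phase 1 generates cube data out to a one-chunk margin; phase 2 builds vertex
// buffers only where all neighboring cube data is guaranteed to exist.
void UpdateChunks(ChunkCoord center, int visibleRadius)
{
    int dataRadius = visibleRadius + 1;     // one extra ring of generated cube data

    for (int x = -dataRadius; x <= dataRadius; ++x)
        for (int z = -dataRadius; z <= dataRadius; ++z)
            EnsureCubeDataGenerated(center.Offset(x, z));

    // vertex buffers (AO + hidden-face removal) only for the visible interior
    for (int x = -visibleRadius; x <= visibleRadius; ++x)
        for (int z = -visibleRadius; z <= visibleRadius; ++z)
            EnsureVertexBuffersBuilt(center.Offset(x, z));
}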

Sticking point 2: When modifying the cubescape (adding or removing individual cubes), I was only updating the chunk in which the cube resides, but a modification can now affect adjacent chunks as well, both for ambient occlusion and hidden face removal. This is easily fixed by updating adjacent chunks whenever the modified cube lies anywhere on the chunk’s surface. I thought about adding an optimization to check whether the neighboring chunk would actually see any change, but haven’t bothered with that yet.
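
The fix amounts to something like this sketch (illustrative names; assumes 32^3 chunks with local cube coordinates in [0, 31]):

// Rebuild the home chunk, plus any neighbor that shares the face the modified
// cube sits on, since that neighbor's AO and hidden-face removal may change.
void OnCubeModified(ChunkCoord chunk, int x, int y, int z)
{
    MarkChunkDirty(chunk);

    if (x == 0)  MarkChunkDirty(chunk.Offset(-1, 0, 0));
    if (x == 31) MarkChunkDirty(chunk.Offset(+1, 0, 0));
    if (y == 0)  MarkChunkDirty(chunk.Offset(0, -1, 0));
    if (y == 31) MarkChunkDirty(chunk.Offset(0, +1, 0));
    if (z == 0)  MarkChunkDirty(chunk.Offset(0, 0, -1));
    if (z == 31) MarkChunkDirty(chunk.Offset(0, 0, +1));
}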

Vaguely related problems

  • I use the VPOS pixel-shader semantic with my deferred shaders to generate the texture coordinate used for looking up the corresponding texel for the currently rendering pixel. On the Xbox, VPOS behaves strangely if predicated tiling gets enabled, which is typically because you want a fat rendertarget due to MSAA, GBuffer, high resolution, etc. I imagine that the viewport transform or whatever feeds the VPOS semantic isn’t set quite right. I worked around this by disabling MSAA on my final rendertarget (since the underlying GBuffer textures are lower-resolution with no MSAA).
  • I was experimenting with occlusion culling to optimize my rendering in dense environments, but it appears to be essentially incompatible with deferred shading in XNA 4.0, due to one flaw: you cannot disable color writes while a floating-point rendertarget is bound.
    • Attempting to do so produces an exception with the text “XNA Framework HiDef profile does not support alpha blending or ColorWriteChannels when using rendertarget format Single”; I assume this is an oversight in the XNA API, because I’m unaware of any reason why floating-point rendertargets cannot support disabled color writes, even in MRT situations. The relevant device cap is COLORWRITEENABLEINDEPENDENTWRITEMASKS; XNA 4.0 requires a D3D10-capable video card, and that cap is reported on 100% of the D3D10 hardware in the caps listings I checked.
    • My workaround for this flaw involves un-binding the offending rendertarget before issuing my occlusion queries and re-binding it afterwards, when I want to render real geometry again; this causes rendertarget toggling several times per frame, which is not ideal (a sketch of the toggle follows this list).
    • However, my workaround doesn’t seem to work on Xbox; the rendertarget contents preservation flag appears to be broken, so my GBuffer gets filled with bad data.
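
For reference, the rendertarget toggling described above looks roughly like this (all field names, the OccludableObject type, and the pre-built ColorWriteChannels.None blend state are illustrative assumptions rather than my exact code):

// Drop the Single-format target so disabling color writes is legal, issue the
// queries against cheap bounding geometry, then re-bind the full G-Buffer.
void IssueOcclusionQueries(GraphicsDevice device)
{
    device.SetRenderTarget(albedoRT);               // non-Single target only
    device.BlendState = colorWritesDisabled;        // BlendState with ColorWriteChannels = None
    device.DepthStencilState = DepthStencilState.DepthRead;

    foreach (OccludableObject obj in potentiallyVisibleObjects)
    {
        obj.Query.Begin();
        obj.DrawBoundingBox(device);                // cheap proxy geometry
        obj.Query.End();
    }

    // This is the toggle mentioned above, and it relies on rendertarget
    // contents being preserved across the switch (which misbehaves on Xbox).
    device.SetRenderTargets(albedoRT, normalRT, depthRT);
    device.BlendState = BlendState.Opaque;
    device.DepthStencilState = DepthStencilState.Default;
}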

Posted in CubeFortress, XNA | Leave a Comment »

GPU Profiling in XNA

Posted by redbeard on May 25, 2011

Everyone knows that when you have performance problems, you need to profile your application to find hot-spots and optimize them. XNA doesn’t give very good feedback about where time is spent; for CPU usage there are a number of tools and instrumentation samples, but the GPU isn’t quite so blessed. You can use tools like PIX for Windows (PIX on Xbox is excellent but is only available to XDK developers), but most XNA developers care about Xbox performance, where the tools are more limited; built-in instrumentation is possibly your best bet there. On Windows there are also vendor-specific options such as NVPerfHUD, which can tell you things that are hidden from all other programs & APIs.

For in-game instrumentation, we have to deal with whatever public APIs are exposed, so what tools are available through XNA? D3D9 supports timestamp queries, which signal when they complete and record a timestamp for comparison against other events. XNA 4.0 HiDef exposes only the occlusion query, which can apparently be hijacked to behave like an event query: it signals when complete, but without recording a timestamp. That timestamp is essential for diagnosing where time is spent on the GPU, so we have to generate one on the CPU at the moment we observe the query complete. If we just issue a bunch of queries throughout the frame and don’t check them until the end of the frame or the beginning of the next, there’s a good chance that several will have become signalled before we look, and they’ll all get the same timestamp; so we need to poll the queries continuously, or at least frequently.

My approach, which I encapsulated in a helper class called GpuTrace (a rough sketch in code follows the list):

  1. Use a queue to track all issued queries. The GPU will execute them in-order, so only the oldest one is worth checking for completion at any time.
  2. Track a named label with each issued query for diagnosis
  3. At frame start, start a stopwatch and issue an occlusion query (Begin and End in a pair)
  4. Issue an occlusion query before and after any major GPU milestone (Clear, scene draw, post-processing); you can fence in a bunch of Draw calls with just one query
  5. After each query is issued, poll the front of the queue to see if it’s completed yet
  6. Whenever a query on the polling queue has completed, record the CPU stopwatch timestamp and put it aside, perhaps in a separate queue of timestamps
  7. At the end of the frame, spin-wait for all active queries to finish (perform the previous step on each)
  8. Examine the list of timestamps. The first one indicates the latency between issuing the query and seeing it complete. Deltas between two timestamps indicate the time spent by the GPU completing whatever activity was between them.
  • Keep in mind that adding all these queries, constantly polling them, and waiting for the GPU to finish at frame end will negatively impact your overall framerate, but the integrity of the data provided should be reliable.
  • You may also extend the occlusion query mechanism to separate the Begin/End calls, so the query will actually record the number of modified pixels during its execution.
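
Here is a rough sketch of what such a helper can look like (class and member names are illustrative, and in a real implementation the queries should be pooled and reused rather than allocated per call, to avoid garbage):

using System.Collections.Generic;
using System.Diagnostics;
using Microsoft.Xna.Framework.Graphics;

public class GpuTraceSketch
{
    struct Pending { public OcclusionQuery Query; public string Label; }
    public struct Marker { public string Label; public double Milliseconds; }

    readonly GraphicsDevice device;
    readonly Queue<Pending> pending = new Queue<Pending>();
    readonly List<Marker> markers = new List<Marker>();
    readonly Stopwatch frameTimer = new Stopwatch();

    public GpuTraceSketch(GraphicsDevice device) { this.device = device; }

    public void BeginFrame()
    {
        markers.Clear();
        frameTimer.Reset();
        frameTimer.Start();
        Mark("FrameStart");
    }

    // Issue an empty occlusion query; when it becomes signalled, the GPU has
    // reached this point in the command stream.
    public void Mark(string label)
    {
        OcclusionQuery q = new OcclusionQuery(device);
        q.Begin();
        q.End();
        pending.Enqueue(new Pending { Query = q, Label = label });
        Poll();
    }

    // Only the oldest query needs checking; the GPU completes them in order.
    public void Poll()
    {
        while (pending.Count > 0 && pending.Peek().Query.IsComplete)
        {
            Pending done = pending.Dequeue();
            markers.Add(new Marker { Label = done.Label,
                Milliseconds = frameTimer.Elapsed.TotalMilliseconds });
            done.Query.Dispose();
        }
    }

    // Spin-wait so every marker issued this frame gets a CPU timestamp.
    public void EndFrame()
    {
        Mark("FrameEnd");
        while (pending.Count > 0) Poll();
    }

    public IList<Marker> Markers { get { return markers; } }
}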

Here is a screenshot of my in-game profilers, and another. The GPU profiling information gleaned from this approach is displayed in the upper right. The numbers displayed are the time since frame start until the noted event, and the time delta between an event and the previous event. Also visible on the left is a multi-threaded CPU profiler which I have borrowed and cleaned up from here; my cleanup was mostly focused on garbage collection. At the bottom of the screen is a framerate graph, which will show spikes and instability where a simple FPS number hides them; I’d like to do more work on profiling instrumentation to diagnose framerate spikes in particular.

Posted in XNA | Leave a Comment »

Garbage Control

Posted by redbeard on May 25, 2011

Garbage collection in XNA can be a big deal if you care about framerate hitching, which is visually jarring and can cause input polling problems. After a long enough time, any per-frame allocation will eventually be subject to garbage collection, which will pause all threads and search memory for live and dead objects to make space for more allocations. The garbage collector only has an opportunity to run if you allocate memory, therefore you can control when garbage collection happens by only performing allocations during expected downtime, such as during loading screens. The same wisdom applies for writing real-time applications in native code, but .NET provides more of a safety net if you do things quick & dirty at first; the GC will clean up your mess whereas a native app might just crash when running out of memory. Reducing and optimizing your allocations can also improve your loading times and reduce your minimum-spec hardware requirements.
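
One simple pattern along those lines (an illustration of the principle rather than a prescription) is to force a collection right at the end of content loading, when a hitch costs nothing:

// Sketch: do the large one-time allocations behind the loading screen, then
// pay the collection cost immediately, while nobody can see the pause.
protected override void LoadContent()
{
    LoadAllAssets();    // illustrative placeholder for the real loading work
    GC.Collect();
}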

Profiling

The best tool I’ve used for diagnosing garbage allocation & collection is CLR Profiler. It has a few quirks: the allocation graph has some rendering bugs with the line traces, there is no option to “ignore this node”, and the histogram views don’t always allow enough resolution to see all the objects which have been allocated (i.e. if they’re less than 1 KB).

With that said, the “histogram by age” view is quite useful for finding per-frame allocations; start up your game and get it into the desired state, then just let it run for a couple of minutes with no adjustments. After running a while, open up the “histogram by age” view and see if any allocations are younger than 1 minute. An option in the right-click context menu will even show you a visual call-stack of how those allocations happened. Note that if you exit your application, it will probably generate a bunch of garbage on shutdown, so it’s probably safe to ignore very-young allocations so long as the middle ground is clear.

Another useful view is the “allocation graph”, which will show you the list of all major allocations by basic type if you scroll all the way to the right, and the visual call-stack of how they were conjured as you look to the left. This view can be a little misleading if you have large chunks of one-time allocations for scratch memory pools, and there is no option to ignore or exclude specific nodes, but anything that bubbles up to the top should warrant investigation.

Major Sources of Garbage

… as discovered in my current codebase, with suggested corrective actions.

  • String manipulation:
    • Never use String.Format, for multiple reasons: hidden StringBuilder allocation, value types get boxed, hidden ToString calls allocate strings, and the params arguments create an array.
    • Never concatenate strings with + operator.
    • Never append non-string values onto a StringBuilder; those overloads call ToString under the hood.
    • Never call StringBuilder.ToString, it allocates a string to return.
    • Do all string manipulation with a pre-allocated, capacity-capped StringBuilder, and use custom functions for numeric conversion (itoa/ftoa equivalents).
    • SpriteBatch can render StringBuilder objects without calling ToString.
    • I use a custom StringBuilderNG class (NG = no garbage) which wraps a standard StringBuilder, forwards only the “safe” methods which generate no garbage, and implements new methods for custom conversion of int & float values (a sketch follows this list). This approach is more prone to bugs, but itoa and ftoa are relatively easy to implement.
  • DateTime.Now: replace with Stopwatch.Elapsed when used for profiling
  • params[] arguments: pre-allocate an array of appropriate size, and populate it immediately before the call. Don’t forget to null out the entries afterwards to avoid dangling references.
  • Value-type boxing through IComparable<T>/IComparer<T> & IEnumerable<T>: implement CompareTo on the underlying type, and use an explicit value-type enumerator like the one List<T> exposes. This also helps reduce virtual function calls.
  • Worker threads typically need scratch memory to operate on; this can be pre-allocated in a pool, and threads can grab a chunk of scratch memory when they start up and release it when they’re done.
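
Here is a cut-down sketch of the StringBuilderNG idea (illustrative names; the real class forwards more of the safe methods and also handles floats, and this AppendInt deliberately ignores int.MinValue):

using System.Text;

public class StringBuilderNGSketch
{
    readonly StringBuilder sb;

    // capacity == maxCapacity, so the internal buffer can never grow/reallocate
    public StringBuilderNGSketch(int capacity) { sb = new StringBuilder(capacity, capacity); }

    // SpriteBatch.DrawString accepts a StringBuilder directly, so expose it
    public StringBuilder Builder { get { return sb; } }

    public void Clear() { sb.Length = 0; }
    public void Append(string s) { sb.Append(s); }  // copies chars, no allocation
    public void Append(char c) { sb.Append(c); }

    // "itoa": append an integer without boxing or calling ToString
    public void AppendInt(int value)
    {
        if (value < 0) { sb.Append('-'); value = -value; }
        int start = sb.Length;
        do { sb.Append((char)('0' + value % 10)); value /= 10; } while (value > 0);
        // digits came out least-significant first; reverse them in place
        for (int i = start, j = sb.Length - 1; i < j; ++i, --j)
        {
            char tmp = sb[i]; sb[i] = sb[j]; sb[j] = tmp;
        }
    }
}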


Posted in XNA | 3 Comments »

Deferred Shading in CubeWorld

Posted by redbeard on May 16, 2011

For my CubeWorld prototype, I wanted to try some screen-space effects like SSAO (screen-space ambient occlusion) and also compare the performance of deferred lighting versus standard forward-rendering lights; I’m also interested in just implementing a deferred renderer as I haven’t experimented with the concept before.

I found some good foundation code & explanation in the articles at http://www.catalinzima.com/tutorials/deferred-rendering-in-xna/, which got me started with some directional and point light functionality. I made a few modifications here and there, such as combining multiple directional lights into a single pass and taking some liberties with the C# and shader code; I also used a procedural cube instead of a sphere mesh for my point light. There is more good intro material in the NVidia deferred rendering presentation from “6800 Leagues Under the Sea”: http://developer.nvidia.com/presentations-6800-leagues-under-sea, which includes a few optimizations that can help (though not all of them are possible in XNA; I’ll get to that below). The performance of the deferred lighting is quite good on my PC, although I haven’t tried it extensively on the Xbox.

After seeing the deferred shading in action, I wanted to make even more use of the G-Buffer for effects that can exploit it, and one of the primary effects I’m interested in is SSAO, because the cube world looks rather artificial with all the faces shaded so flatly. I implemented the SSAO shader described in a gamedev.net article, which provides dense and somewhat unintuitive code, but it works, and the rest of the article explains the concepts used. The article offered little guidance for tweaking the 4 input parameters such as “bias” and “scale”, but I found some numbers which appeared to work, and named them more intuitively for my internal API. I’m currently using only a 9×9 separable blur rather than the 15×15 suggested in the article. The effect works, but the screen-space random field is plain to see, and it seems to be more pronounced on distant geometry; I can probably do some more work to resolve those artifacts. A much more distracting artifact is the total loss of ambient occlusion at the edges of the screen in certain conditions; I’m not sure there’s a reasonable solution for that. I may try some static AO calculations for each cube face to see if I can get stable results that way.

The overall flow of my deferred renderer, currently (1 or more “passes” per step below; a code sketch of the flow follows the list):

  1. Render all scene geometry into G-Buffer
  2. Generate noisy SSAO buffer at half-resolution
  3. Blur SSAO buffer horizontally and vertically at full-resolution
  4. Accumulate directional and point lights, one per pass
  5. Combine lighting, albedo, and SSAO into final image
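
In code, the frame looks roughly like the following sketch of the pass ordering only; all rendertarget, effect, and helper names are illustrative, and effect parameter binding, the fullscreen-quad helper, and the light-volume draw are assumed to exist:

using Microsoft.Xna.Framework;
using Microsoft.Xna.Framework.Graphics;

void DrawFrame(GraphicsDevice device)
{
    // 1. Geometry pass: fill the G-Buffer (albedo, normals, linear depth).
    device.SetRenderTargets(albedoRT, normalRT, depthRT);
    device.Clear(ClearOptions.Target | ClearOptions.DepthBuffer, Color.Black, 1f, 0);
    scene.DrawGeometry(gbufferEffect);

    // 2. Noisy SSAO at half resolution, sampling normals + depth.
    device.SetRenderTarget(ssaoRT);
    DrawFullscreenQuad(ssaoEffect);

    // 3. Separable blur at full resolution, horizontal then vertical.
    device.SetRenderTarget(ssaoBlurHRT);
    DrawFullscreenQuad(blurHEffect);
    device.SetRenderTarget(ssaoBlurVRT);
    DrawFullscreenQuad(blurVEffect);

    // 4. Light accumulation: additive blend, one pass per light.
    device.SetRenderTarget(lightRT);
    device.Clear(Color.Transparent);
    device.BlendState = BlendState.Additive;
    DrawFullscreenQuad(directionalLightsEffect);   // directional lights combined
    foreach (var light in pointLights)
        DrawLightVolume(light, pointLightEffect);  // cube proxy geometry

    // 5. Composite lighting, albedo, and SSAO into the back buffer.
    device.SetRenderTarget(null);
    device.BlendState = BlendState.Opaque;
    DrawFullscreenQuad(compositeEffect);
}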

Some issues I ran into when implementing my deferred shading in XNA:

  • XNA does not allow you to separate the depth-buffer from a render-target, which means you cannot use the stencil optimization for light volumes as discussed in the NVidia “6800 Leagues” presentation. That optimization shades only the pixels which are within the light volume, rather than all the pixels the volume covers in screen space but which are too distant to be affected. It requires that you retain the depth buffer from the geometry pass, use it to depth-test and store stencil values for the light geometry, and then use those stencil values while rendering into a different render-target, specifically the light accumulation buffer.
  • Xbox 360 has 10MB of framebuffer memory attached to the GPU, which works great if you’re rendering a single 1280×720 render-target and depth-buffer at 4 bytes each (about 7MB). When you want 3 rendertargets and a depth-buffer, you can either “tile” the framebuffer and re-draw the geometry multiple times, or you can drop the resolution until all the buffers fit; I opted for the latter, using 1024×576 (for 16:9; rough numbers follow this list). XNA doesn’t expose the ability to resolve the depth-buffer to a texture, which means you must include your own depth render-target in your G-Buffer, or else that target resolution could be increased. On PC the memory limitation is lifted, but you still can’t read back depth via D3D9, so the extra buffer still applies.
  • I can see visible banding on my point lights; I’m not sure whether this is due to banding in the light buffer itself or in the final compositing. XNA 4.0 exposes the HdrBlendable format, which on Xbox uses a 10-bit floating-point value per component, but with only 7 bits of mantissa I’m not convinced it offers any less banding than 8-bit fixed-point components, just a different pattern.
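
To put rough numbers on the eDRAM budget mentioned above (assuming 32-bit surfaces throughout): 1280 × 720 × 4 bytes is about 3.5 MB per surface, so one color target plus depth is roughly 7 MB, but three G-Buffer targets plus depth would need roughly 14 MB; at 1024 × 576 each surface is about 2.25 MB, so four surfaces total about 9 MB and fit within the 10 MB without predicated tiling.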

Screenshots of my results:

  • Directional and point lights: screenshot (debug display shows albedo, depth, normals, and lighting)
  • SSAO random samples before blurring: screenshot (slightly more noisy than it should be, due to non-normalized random vectors)
  • SSAO after blurring: screenshot
  • Comparison images from before deferred shading was implemented: shot 1, shot 2


Posted in CubeFortress, XNA | Leave a Comment »

Threading on Xbox360 with XNA

Posted by redbeard on May 15, 2011

I discovered recently that threads on the Xbox are not managed or scheduled across CPU cores automatically. You must call SetProcessorAffinity from within the target thread, and that thread will run only on the specific core (hardware thread) from then onwards. I wrote some helper code at the top of my worker-thread management function (m_Threads is the list of potentially active thread objects). My “do thread work here” implementation just spins waiting for work to arrive in a queue.

private static void DoThreadWork(object threadObj)
{
  Thread myThread = threadObj as Thread;
#if XBOX
  // manually assign hardware thread affinity
  lock (m_Threads)
  {
    // avoid threads 0 & 2 (reserved for XNA), and thread 1 (main render thread)
    int[] cpuIds = new int[] { 5, 4, 3 };
    for (int n = 0; n < m_NumActiveThreads; ++n)
    {
      if (m_Threads[n] == myThread)
      {
        int cpu = cpuIds[n % cpuIds.Length];
        myThread.SetProcessorAffinity(cpu);
        break;
      }
    }
  }
#endif
  // do actual thread work here
  ...
}

Posted in XNA | Leave a Comment »

Year To Date

Posted by redbeard on May 15, 2011

  • March 2011: “Marksman: Long Range” released, see postmortem.
  • April 2011: Started working on TileTactics, a top-down tile-based tactical game concept inspired by X-COM. Got some basic functionality working: tile visibility for units, path-finding, some primitive UI state machine stuff. Put this project on the back burner because I wasn’t feeling terribly inspired by the overall vision: I didn’t want to make a straight-up X-COM clone, and I was concerned about the viability of the project on the Xbox Live marketplace.
  • April 2011: Started on CubeWorld (working title), inspired by Minecraft to create a cubular world where the geometry is structured but random and infinite. This is my currently active project, so I’ll write some more articles about the details.

Posted in CubeFortress, Marksman, TileTactics, XNA | Leave a Comment »