Everyone knows that when you have performance problems, you need to profile your application to find hot spots and optimize them. XNA doesn’t give very good feedback about where time is spent. For CPU usage there are a number of tools and instrumentation samples that will tell you where the time goes, but the GPU isn’t quite so blessed. On Windows you can use PIX for Windows, or vendor-specific options such as NVPerfHUD, which can tell you things that are hidden from all other programs and APIs. Most XNA developers care about Xbox performance, though, and there the tools are more limited: PIX on Xbox is excellent but is only available to XDK developers, so built-in instrumentation is possibly your best bet.
For in-game instrumentation, we have to work with whatever public APIs are exposed, so what does XNA give us? D3D9 supports timestamp queries, which signal when they complete and record a timestamp for comparison against other events. XNA 4.0 HiDef exposes only occlusion queries, which can apparently be hijacked to behave like event queries: they also signal when complete, but without a recorded timestamp. That timestamp is essential for diagnosing where time is being spent on the GPU, so we need to generate it on the CPU instead. If we just issue a bunch of queries throughout the frame and don’t check them until the end of the frame or the beginning of the next, there’s a good chance that several of them will have signalled before we look, and they’ll all get the same timestamp. We therefore need to poll the queries continuously, or at least frequently.
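To make the "same timestamp" problem concrete, here is a small simulation. It is only an illustration in Python, not XNA code: the completion times are invented, and `observed_times` is a hypothetical helper that returns, for each query, the first CPU poll that could have seen it finished.

```python
def observed_times(completion_times, poll_times):
    """For each query, return the first poll time at or after its completion.

    completion_times: when the GPU actually finished each query (ms, made up).
    poll_times: when the CPU checked the queries (ms), in ascending order.
    """
    observed = []
    for done in completion_times:
        # The CPU can only stamp a query with the time of the poll that saw it.
        seen = next((t for t in poll_times if t >= done), None)
        observed.append(seen)
    return observed


# Hypothetical GPU completion times for three queries within one frame.
gpu_done = [2.0, 5.0, 9.0]

# Checking once at the end of a 16 ms frame: every query gets the same stamp.
end_of_frame_only = observed_times(gpu_done, [16.0])

# Polling once per millisecond: the stamps follow the real completion times.
frequent = observed_times(gpu_done, [float(t) for t in range(17)])

print(end_of_frame_only)
print(frequent)
```

The resolution of the recorded timestamps is bounded by how often the CPU polls, which is why the polling has to happen throughout the frame rather than once at the end.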
My approach, which I encapsulated in a helper class called GpuTrace, is as follows:
- Use a queue to track all issued queries. The GPU will execute them in-order, so only the oldest one is worth checking for completion at any time.
- Track a named label with each issued query for diagnosis.
- At frame start, start a stopwatch and issue an occlusion query (Begin and End as a pair).
- Issue an occlusion query before and after each major GPU milestone (Clear, scene.draw, post-proc). You can fence in a whole batch of Draw calls with just one query.
- After each query is issued, poll the front of the queue to see if it has completed yet.
- Whenever the query at the front of the queue has completed, record the CPU stopwatch timestamp and put it aside, perhaps in a separate queue of timestamps.
- At the end of the frame, spin-wait for all active queries to finish (performing the previous step on each).
- Examine the list of timestamps. The first one indicates the latency between issuing the query and seeing it complete. Deltas between two timestamps indicate the time spent by the GPU completing whatever activity was between them.
- Keep in mind that adding all these queries, constantly polling them, and waiting for the GPU to finish at frame end will hurt your overall framerate, but the timing data itself should be reliable.
- You can also extend the mechanism by separating the Begin and End calls so that they surround the draw calls. The query will then actually record the number of pixels that passed the depth test while it was active.
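The bookkeeping in the steps above can be sketched roughly as follows. This is not the actual GpuTrace class, just a minimal model of the queue-and-poll logic written in Python so it can run without a GPU. `GpuTraceSketch` and `FakeQuery` are hypothetical names, and the only thing assumed about a query object is an `is_complete` attribute, standing in for `OcclusionQuery.IsComplete` on a query that has already had Begin and End called.

```python
import time
from collections import deque


class GpuTraceSketch:
    """Tracks issued queries in order and stamps each one when it completes."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock        # CPU stopwatch source, in seconds
        self._pending = deque()    # (label, query) pairs, oldest first
        self._frame_start = 0.0
        self.timestamps = []       # (label, time since frame start)

    def begin_frame(self, query):
        self._pending.clear()
        self.timestamps = []
        self._frame_start = self._clock()
        self.mark("frame start", query)

    def mark(self, label, query):
        # Assumption: the caller has already issued the query on the GPU.
        self._pending.append((label, query))
        self.poll()

    def poll(self):
        # Queries complete in issue order, so only the front needs checking.
        now = self._clock() - self._frame_start
        while self._pending and self._pending[0][1].is_complete:
            label, _ = self._pending.popleft()
            self.timestamps.append((label, now))

    def end_frame(self):
        # Spin-wait until every outstanding query has been stamped.
        while self._pending:
            self.poll()
        return self.timestamps


class FakeQuery:
    """Test stand-in for a GPU query: reports complete after N checks."""

    def __init__(self, checks_needed):
        self._remaining = checks_needed

    @property
    def is_complete(self):
        if self._remaining > 0:
            self._remaining -= 1
            return False
        return True
```

In a real frame you would call `begin_frame` once, `mark` after each milestone such as Clear or the scene draw, and `end_frame` before presenting. Queries that complete between two polls share one timestamp, so each `mark` call doubles as a poll to keep the gaps short.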
Here is a screenshot of my in-game profilers, and another. The GPU profiling information gleaned from this approach is displayed in the upper right. The numbers shown are the time from frame start to the noted event, and the delta between each event and the previous one. Also visible on the left is a multi-threaded CPU profiler which I borrowed from here and cleaned up; my cleanup was mostly focused on garbage collection. At the bottom of the screen is a framerate graph, which shows spikes and instability that a simple FPS number hides. I’d like to do more work on profiling instrumentation to diagnose framerate spikes in particular.
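For reference, the two numbers shown per GPU event (time since frame start, and delta from the previous event) can be derived from the recorded timestamps along these lines. This is a sketch in Python with invented sample values: `with_deltas` is a hypothetical helper, and the figures are not measurements from my game.

```python
def with_deltas(timestamps):
    """Turn (label, ms since frame start) pairs into (label, since_start, delta).

    The first delta is measured from frame start, so it reflects the latency
    between issuing the first query and seeing it complete.
    """
    rows = []
    previous = 0.0  # frame start
    for label, since_start in timestamps:
        rows.append((label, since_start, since_start - previous))
        previous = since_start
    return rows


# Made-up timestamps in milliseconds, in completion order.
sample = [("frame start", 0.4), ("clear", 0.5), ("scene", 9.5), ("post-proc", 12.0)]

for label, since_start, delta in with_deltas(sample):
    print(f"{label:<12}{since_start:7.2f} ms{delta:7.2f} ms")
```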