RedBeard's Dev Blog

JIT Optimizations on Xbox

Posted by redbeard on May 27, 2011

I came to a sickening realization last night: the JIT optimization performed by the .NET Compact Framework on Xbox is pitifully weak. Some of the most basic optimizations I’ve come to expect from a decade of C++ programming and “trust the compiler” advice simply are not present. This realization has the potential to kill my CubeWorld project if it prevents me from reaching my perf goal on Xbox.

CubeWorld deals with a reasonable amount of memory to represent the game world, and this data is used both for collision detection and for creating rendered geometry. Each unit cube exists at a unique grid location and is described by a single byte at a unique location in an array (although it could be even smaller if I chose to pack data into sub-byte-sized pieces). These arrays of cubes are represented as “chunks” of 32*32*32 cubes, for a total of 32KB in a flat or multi-dimensional array. Chunks are generated dynamically as the player moves around the world, destroying chunks that are moving away and recycling their memory for chunks that are appearing on the horizon. I use a variety of methods to construct the shape of the world on this cubular canvas, but the essence of all of them involves iterating over the array and filling it with data. On my PC, which sports a Q6600 CPU and 4GB of RAM, I see fairly good performance in terms of both framerate and world generation; on Xbox the framerate is smooth but the world generation is horribly slow, despite using as many threads as feasible.
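In code, that storage boils down to something like this (a minimal sketch; the Chunk type and names here are illustrative, not my actual classes):

```csharp
using System;

public class Chunk
{
	// One byte per unit cube: 32 * 32 * 32 = 32768 bytes = 32KB per chunk.
	public const int Dimension = 32;
	public readonly byte[] Cubes = new byte[Dimension * Dimension * Dimension];

	// Each grid position (x, y, z) maps to a unique flat index.
	public byte Get(int x, int y, int z)
	{
		return Cubes[z + Dimension * (y + Dimension * x)];
	}

	public void Set(int x, int y, int z, byte val)
	{
		Cubes[z + Dimension * (y + Dimension * x)] = val;
	}
}
```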

Something is slow! Break out the profiler! I cobbled together a quick benchmark to see just how bad the world-generation code is. Wrapping the chunk generation in 10 iterations gave me an average time of 70-100 milliseconds per chunk; as a reference, my PC runs this benchmark at about 6 milliseconds, an order of magnitude faster (hardware: Q6600 vs. Xenon). Sure, the PC has branch prediction and instruction re-ordering, but surely there’s something to be gained. I broke it down even further, to the bare essence: iterating the array and filling it with data.

As you can see below, the SafeA function performs numerous iterations of the same task: walking the array in 3D and filling it with a constant value. I time this externally with a Stopwatch object, and feed it 1000 iterations. I don’t bother to wipe the cache, because I’m writing data here, not reading it. Ideally I can hit the cache all the time anyway, so I don’t really care whether this is exercising the cache; I care more about the syntactic and semantic sugar in the programming language which might be getting in the way. Let’s start with some nice clean code, with helper functions for clarity!

		private static int FlattenIndex(int x, int y, int z)
		{
			return z + ArrayDimension * (y + ArrayDimension * x);
		}

		private static void SetMyArray(byte[] theArray, int x, int y, int z, byte val)
		{
			int index = FlattenIndex(x, y, z);
			theArray[index] = val;
		}

		private static void SafeA(int iterations, byte[] testArray, byte val)
		{
			for (int n = 0; n < iterations; ++n)
			{
				for (int cx = 0; cx < ArrayDimension; ++cx)
				{
					for (int cy = 0; cy < ArrayDimension; ++cy)
					{
						for (int cz = 0; cz < ArrayDimension; ++cz)
						{
							SetMyArray(testArray, cx, cy, cz, val);
						}
					}
				}
			}
		}
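For reference, the external Stopwatch harness around these calls looks roughly like this (a minimal sketch; the per-call figure is simply the total elapsed time divided by the iteration count):

```csharp
using System;
using System.Diagnostics;

public static class Harness
{
	public const int ArrayDimension = 32;

	public static void TimeIt(string name, Action<int, byte[], byte> benchmark)
	{
		byte[] testArray = new byte[ArrayDimension * ArrayDimension * ArrayDimension];
		const int iterations = 1000;

		Stopwatch sw = Stopwatch.StartNew();
		benchmark(iterations, testArray, 0x2A);
		sw.Stop();

		// Average wall-clock time for one pass over the 32KB array.
		Console.WriteLine("{0}: {1:F3} ms per call",
			name, sw.Elapsed.TotalMilliseconds / iterations);
	}
}
```

Each variant below is then just a call like Harness.TimeIt("SafeA", SafeA).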

Timing result: 2.33 milliseconds. That’s 105MCycles from the Xenon CPU to fill a 32KB array. Overkill much? Let’s see if that helper function might not be getting inlined effectively…

		private static void SafeB(int iterations, byte[] testArray, byte val)
		{
			for (int n = 0; n < iterations; ++n)
			{
				for (int cx = 0; cx < ArrayDimension; ++cx)
				{
					for (int cy = 0; cy < ArrayDimension; ++cy)
					{
						for (int cz = 0; cz < ArrayDimension; ++cz)
						{
							int index = FlattenIndex(cx, cy, cz);
							testArray[index] = val;
						}
					}
				}
			}
		}

Timing result: 1.55 milliseconds. That’s a bit better, but if that function wasn’t getting inlined, maybe the other one isn’t either…

		private static void SafeC(int iterations, byte[] testArray, byte val)
		{
			for (int n = 0; n < iterations; ++n)
			{
				for (int cx = 0; cx < ArrayDimension; ++cx)
				{
					for (int cy = 0; cy < ArrayDimension; ++cy)
					{
						for (int cz = 0; cz < ArrayDimension; ++cz)
						{
							int index = cz + ArrayDimension * (cy + ArrayDimension * cx);
							testArray[index] = val;
						}
					}
				}
			}
		}

Timing result: 1.17 milliseconds, another noticeable improvement. Call me crazy, but I want to try doing away with that temporary variable and see what happens…

		private static void SafeD(int iterations, byte[] testArray, byte val)
		{
			for (int n = 0; n < iterations; ++n)
			{
				for (int cx = 0; cx < ArrayDimension; ++cx)
				{
					for (int cy = 0; cy < ArrayDimension; ++cy)
					{
						for (int cz = 0; cz < ArrayDimension; ++cz)
						{
							testArray[cz + ArrayDimension * (cy + ArrayDimension * cx)] = val;
						}
					}
				}
			}
		}

Timing result: 0.67 milliseconds! That’s almost twice as fast as with the temporary variable, and more than three times faster than the original naive code. I could probably improve on this a bit more by uglying it up even further: loop unrolling, manually hoisting loop invariants (cy is invariant within the cz loop), but I’m hoping these optimizations are actually implemented by the JITer (haha probably not). As a point of reference, running those same 4 benchmarks on my PC produced the same timing result for each one: 0.06 milliseconds; clearly the x86 JIT optimization is more robust.

Let’s try something drastic and switch to unsafe code, and see if perhaps the implicit array bounds-checking is getting in the way.

		private static void UnsafeB(int iterations, byte[] testArray, byte val)
		{
			for (int n = 0; n < iterations; ++n)
			{
				for (int cx = 0; cx < ArrayDimension; ++cx)
				{
					for (int cy = 0; cy < ArrayDimension; ++cy)
					{
						for (int cz = 0; cz < ArrayDimension; ++cz)
						{
							unsafe
							{
								fixed (byte* ptr = testArray)
								{
									ptr[cz + ArrayDimension * (cy + ArrayDimension * cx)] = val;
								}
							}
						}
					}
				}
			}
		}

Timing result: 1.45 milliseconds. That’s actually worse than the safe version, but there’s supposedly a fixed cost every time you enter an unsafe or fixed block, so let’s move those outside the loops and try again…

		private static void UnsafeC(int iterations, byte[] testArray, byte val)
		{
			for (int n = 0; n < iterations; ++n)
			{
				unsafe
				{
					fixed (byte* ptr = testArray)
					{
						for (int cx = 0; cx < ArrayDimension; ++cx)
						{
							for (int cy = 0; cy < ArrayDimension; ++cy)
							{
								for (int cz = 0; cz < ArrayDimension; ++cz)
								{
									ptr[cz + ArrayDimension * (cy + ArrayDimension * cx)] = val;
								}
							}
						}
					}
				}
			}
		}

Timing result: 0.61 milliseconds. That’s more like it, but it’s about the same as before we went unsafe. Now that we’re here, why not party on some pointer arithmetic and see what happens?

		private static void UnsafeD(int iterations, byte[] testArray, byte val)
		{
			for (int n = 0; n < iterations; ++n)
			{
				unsafe
				{
					fixed (byte* ptr = testArray)
					{
						byte* p = ptr;
						for (int cx = 0; cx < ArrayDimension; ++cx)
						{
							for (int cy = 0; cy < ArrayDimension; ++cy)
							{
								for (int cz = 0; cz < ArrayDimension; ++cz)
								{
									*p = val;
									++p;
								}
							}
						}
					}
				}
			}
		}

Timing result: 0.361 milliseconds. That’s a considerable improvement over the previous attempt, and also better than the safe version (although we didn’t try manually hoisting those loop invariants). Operating on bytes might be slow; the CPU probably prefers to write whole 32-bit integers to memory instead. Let’s try that!

		private static void UnsafeE(int iterations, byte[] testArray, byte val)
		{
			for (int n = 0; n < iterations; ++n)
			{
				unsafe
				{
					fixed (byte* ptr = testArray)
					{
						byte* p = ptr;
						for (int cx = 0; cx < ArrayDimension; ++cx)
						{
							for (int cy = 0; cy < ArrayDimension; ++cy)
							{
								for (int cz = 0; cz < ArrayDimension; cz += 4)
								{
									int ival = val | (val << 8) | (val << 16) | (val << 24);
									int* pi = (int*)p;
									*pi = ival;
									p += 4;
								}
							}
						}
					}
				}
			}
		}

Timing result: 0.26 milliseconds. Eh, that’s a bit better but not quite what I was expecting. By this point I’ve lost pretty much all trust in the JIT optimizer, so let’s manually hoist that braindead-obvious loop invariant for the ival calculation…

		private static void UnsafeF(int iterations, byte[] testArray, byte val)
		{
			for (int n = 0; n < iterations; ++n)
			{
				unsafe
				{
					int ival = val | (val << 8) | (val << 16) | (val << 24);
					fixed (byte* ptr = testArray)
					{
						byte* p = ptr;
						for (int cx = 0; cx < ArrayDimension; ++cx)
						{
							for (int cy = 0; cy < ArrayDimension; ++cy)
							{
								for (int cz = 0; cz < ArrayDimension; cz += 4)
								{
									int* pi = (int*)p;
									*pi = ival;
									p += 4;
								}
							}
						}
					}
				}
			}
		}

Timing result: 0.11 milliseconds. Twice as fast, but I can’t believe the C# compiler and JIT optimizer were really that dumb! Hell, I bet it’s not even unrolling the inner loop a little bit…

		private static void UnsafeG(int iterations, byte[] testArray, byte val)
		{
			for (int n = 0; n < iterations; ++n)
			{
				unsafe
				{
					int ival = val | (val << 8) | (val << 16) | (val << 24);
					fixed (byte* ptr = testArray)
					{
						byte* p = ptr;
						for (int cx = 0; cx < ArrayDimension; ++cx)
						{
							for (int cy = 0; cy < ArrayDimension; ++cy)
							{
								for (int cz = 0; cz < ArrayDimension; cz += 4)
								{
									int* pi = (int*)p;
									*pi = ival;

									++pi;
									*pi = ival;

									// cz advances by 8 in total per pass: this 4 plus
									// the 4 in the loop header, matching the 8 bytes written.
									cz += 4;
									p += 8;
								}
							}
						}
					}
				}
			}
		}

Timing result: 0.07 milliseconds. That’s a decent bump, I guess no unrolling was being done. Let’s just go overboard and fully unroll the inner loop…

		private static void UnsafeH(int iterations, byte[] testArray, byte val)
		{
			for (int n = 0; n < iterations; ++n)
			{
				unsafe
				{
					int ival = val | (val << 8) | (val << 16) | (val << 24);
					fixed (byte* ptr = testArray)
					{
						byte* p = ptr;
						for (int cx = 0; cx < ArrayDimension; ++cx)
						{
							for (int cy = 0; cy < ArrayDimension; ++cy)
							{
								// no z loop
								{
									int* pi = (int*)p;
									*pi = ival;

									++pi;
									*pi = ival;

									++pi;
									*pi = ival;

									++pi;
									*pi = ival;

									++pi;
									*pi = ival;

									++pi;
									*pi = ival;

									++pi;
									*pi = ival;

									++pi;
									*pi = ival;

									p += 32;
								}
							}
						}
					}
				}
			}
		}

Timing result: 0.04 milliseconds. Wow. This is where I stopped because it was 4 am, although there are at least a couple more obvious optimizations: starting out with an int* and incrementing that instead of repeatedly converting from byte*, and flattening the two outer loops into a single loop. To dig even deeper I’d like to get some SIMD going, or try 64-bit integer writes. All told, that’s about a 60-fold performance difference (6,000 percent) from the original naive safe code. On my PC this final benchmark runs at about 0.006 milliseconds, which is still a 10-fold improvement over the original safe version on the same hardware.
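For the record, the int*-all-the-way variant I have in mind would look something like this (an unbenchmarked sketch; it also happens to flatten the loops entirely):

```csharp
		private static void UnsafeI(int iterations, byte[] testArray, byte val)
		{
			for (int n = 0; n < iterations; ++n)
			{
				unsafe
				{
					int ival = val | (val << 8) | (val << 16) | (val << 24);
					fixed (byte* ptr = testArray)
					{
						// Cast once and stay in int-sized strides: no per-write
						// conversion from byte*, and no nested loops at all.
						int* pi = (int*)ptr;
						int* end = pi + testArray.Length / 4;
						while (pi < end)
						{
							*pi = ival;
							++pi;
						}
					}
				}
			}
		}
```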

Data and Resources: PC Results Summary, Xbox Results Summary, Source Code

Admittedly, I’ve just written a needlessly complicated memset, which probably has an even more optimized version available using SIMD and memory prefetching and all sorts of other specialized instructions, but I don’t really care about that. I have completely lost trust that C# or the .NET runtime has my back in terms of performance optimizations I’ve come to think of as obvious. For a really depressing thought, consider this: WP7 disallows use of the unsafe context. Good luck with that!

Back in CubeWorld, there are several options:

  1. Abandon all hope, ditch the project because Xbox .NET perf is FUBAR
  2. Forget Xbox and target PC instead.
  3. Hand-optimize all hot spots in the critical path, which might still not be enough to avoid option 1. Look at that optimized unsafe code up there: it’s uglier and more prone to bugs than optimized native C code, thanks to all the copy & paste with no C macros available. It’s also probably still slower than what an optimizing C compiler would emit given the same chunk of code.
  4. Try generating stuff on the GPU somehow, since the shader compiler does good optimization and the GPU is inherently parallel. Might have a large impact on my overall framerate or introduce significant hitching, and still might not avert option 1.

A relevant post from Shawn Hargreaves points out that my benchmarking methodology might be somewhat flawed, but it’s at least relatively consistent. Here’s an interesting post I found while investigating this topic; I’d love to see that in the wild, but the author commented in the replies that he’s been forced to halt work on that project. Perhaps I’ll try something similar with a [ForceInline] attribute or somesuch…

One Response to “JIT Optimizations on Xbox”

  1. Sorry you spent so much time learning the hard way… The .NET Compact Framework has long been criticized for its lack of JIT optimization. You can find articles about it predating the XNA Framework on Xbox, and the XNA Game Studio forums are full of people complaining, analyzing, suggesting optimizations, etc. The problem is not the C# compiler; you can rest assured that the C# compiler rarely does anything more than constant folding.

    In comparison to the C++ compiler, the NetCF JIT compiler generates horrible code. That doesn’t mean all hope is lost. You just have to identify what’s expensive and what isn’t. The most painful and ubiquitous thing it’s missing is inlining – particularly painful due to the high cost of function calls on Xenon. It also does some things extra-slowly, like interpreting virtual function calls.

    For your particular example, I think you could get some easy wins by specifically coding around the expensive things. Function calls are expensive (since there is no inlining), and unnecessary work is expensive (hidden work counts).

    In your original SafeA, the cost was in the two function calls, the calculation of the index, and the bounds checking on the array, which all happened each iteration.

    If you rewrote the three nested loops to be a single loop iterating from 0 to testArray.Length, I think you would see a dramatic improvement. I’m not certain, but even on NetCF that might be converted into a single call to memset. What it does is eliminate both the index computation and the implicit bounds checking.
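    The flat version I mean would be something like this (untested sketch):

```csharp
		private static void SafeFlat(int iterations, byte[] testArray, byte val)
		{
			for (int n = 0; n < iterations; ++n)
			{
				// One flat loop: no per-element index math, and iterating from 0
				// to testArray.Length gives the JIT its best chance to drop the
				// bounds check.
				for (int i = 0; i < testArray.Length; ++i)
				{
					testArray[i] = val;
				}
			}
		}
```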

    PS: The vector instructions are not exposed via NetCF and the data is not properly aligned for use with those instructions. I recommend you do not waste your time on that.
