So, we want to get some practical results about game performance on two different systems, and because some performance features don't work on one platform, they have to be turned off on the others to make the comparison "fair"?
Do I have to bend the explanation out of iron wire before you get it? Fine, let's do an OpenGL comparison: I'll write a software implementation, but I'll implement only ONE function, so that's all we'll benchmark. Result: my OpenGL implementation beats every other one out there! Woo! How about that? A flawed benchmark, you say? So how much do I have to implement before it stops being flawed? Oh, just the basic functionality, while anything that would actually make a difference goes unused?
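To make the reductio concrete, here's roughly what such a "benchmark" looks like (a hypothetical C sketch; my_glClear, the iteration count, and the whole setup are made up for illustration, not anyone's real driver code): a "software GL" that implements exactly one no-op function, and a timer wrapped around only that call. The number it prints is enormous and tells you nothing, which is exactly the point.

#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <time.h>

/* The entire "software OpenGL implementation": one function. */
static volatile unsigned int last_mask;  /* keeps the loop from being optimized away */

static void my_glClear(unsigned int mask)
{
    last_mask = mask;  /* clears nothing, just records the argument */
}

int main(void)
{
    const long iterations = 100000000L;  /* hammer the only function we bothered to write */
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < iterations; i++)
        my_glClear(0x4000);              /* GL_COLOR_BUFFER_BIT */
    clock_gettime(CLOCK_MONOTONIC, &end);

    double secs = (end.tv_sec - start.tv_sec)
                + (end.tv_nsec - start.tv_nsec) / 1e9;
    printf("%ld glClear calls in %.3f s -- fastest \"OpenGL\" ever!\n",
           iterations, secs);
    return 0;
}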
I can't see what your "fair" benchmark would actually tell us. What do the values mean? Really, tell me: what significance do those numbers have? Of what practical, or even theoretical, use are they to anyone?
Are we going to compare system bogomips next?