The last few days I’ve been working on a set of polygon routines. (Don’t bother asking for them — if I get clearance to put them up, I will.)
The initial algorithm used floating point math. I timed it, but did not write down the time. I converted it to pure integer and timed it — it was about 5 seconds. I wasn’t happy with the results, though… the rounding errors looked bad.
“Why not make it an ARMlet?” you are probably asking. Well, I realized the algorithm needed some tuning first.
So I changed it back to floating point and started optimizing. First, I preflighted the calculations that needed to be done only once per vertex, instead of once per vertex per screen row or once per vertex per pixel. Next, I hoisted the calculations that only had to be done once per screen row out of the per-pixel loop. To my surprise, I discovered that I no longer had any floating-point calculations per pixel, despite still having full floating-point accuracy. I cleaned up what I had and started counting instructions, then counting division and multiplication instructions in particular. Eventually, I had the code about as good as I could get it without dipping into assembler.
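To make the hoisting concrete, here is a minimal sketch of the same idea applied to a scanline fill. It is not the routine described in this post: the names (`Vertex`, `Edge`, `setup_edges`, `fill_convex`), the 16×16 one-byte-per-pixel buffer, and the restriction to convex polygons sampled at pixel centres are all assumptions made for illustration. The point is only where the floating-point work lands: one division per edge, one addition per edge per row, and nothing but integer stores per pixel.

```c
#include <assert.h>

#define W 16            /* hypothetical framebuffer width  */
#define H 16            /* hypothetical framebuffer height */
#define MAX_EDGES 16

typedef struct { float x, y; } Vertex;

/* Values computed once per edge, before any row is visited. */
typedef struct {
    int   y_start, y_end;  /* first row covered, one past the last row */
    float x;               /* x where the edge crosses the current row centre */
    float dx_per_row;      /* inverse slope: change in x per row */
} Edge;

/* Smallest integer >= f, without needing the math library. */
static int ceil_to_int(float f)
{
    int i = (int)f;                  /* truncates toward zero */
    return (f > (float)i) ? i + 1 : i;
}

/* Once per vertex: the only place a division happens. */
static int setup_edges(const Vertex *v, int n, Edge *edges)
{
    int count = 0;
    for (int i = 0; i < n; i++) {
        Vertex a = v[i], b = v[(i + 1) % n];
        if (a.y > b.y) { Vertex t = a; a = b; b = t; }
        int y0 = ceil_to_int(a.y - 0.5f);   /* rows whose centre lies */
        int y1 = ceil_to_int(b.y - 0.5f);   /* in [a.y, b.y)          */
        if (y0 < 0) y0 = 0;
        if (y1 > H) y1 = H;
        if (y0 >= y1) continue;             /* horizontal, or hits no row centre */
        float slope = (b.x - a.x) / (b.y - a.y);
        edges[count].y_start = y0;
        edges[count].y_end = y1;
        edges[count].dx_per_row = slope;
        edges[count].x = a.x + ((float)y0 + 0.5f - a.y) * slope;
        count++;
    }
    return count;
}

/* Fill a convex polygon into fb (W*H bytes, one byte per pixel). */
void fill_convex(const Vertex *v, int n, unsigned char *fb, unsigned char color)
{
    Edge edges[MAX_EDGES];
    assert(n >= 3 && n <= MAX_EDGES);
    int count = setup_edges(v, n, edges);

    for (int y = 0; y < H; y++) {
        /* Once per row: find the span, stepping each active edge by one add. */
        float xmin = 0.0f, xmax = 0.0f;
        int active = 0;
        for (int i = 0; i < count; i++) {
            Edge *e = &edges[i];
            if (y < e->y_start || y >= e->y_end) continue;
            if (!active || e->x < xmin) xmin = e->x;
            if (!active || e->x > xmax) xmax = e->x;
            active++;
            e->x += e->dx_per_row;
        }
        if (active < 2) continue;

        int left = ceil_to_int(xmin - 0.5f);
        int right = ceil_to_int(xmax - 0.5f);   /* exclusive */
        if (left < 0) left = 0;
        if (right > W) right = W;
        if (left >= right) continue;

        /* Once per pixel: integer-only work. */
        unsigned char *p = fb + y * W + left;
        for (int x = left; x < right; x++) *p++ = color;
    }
}
```

Calling `fill_convex` with the four corners (2,2), (10,2), (10,6), (2,6) on a zeroed buffer fills rows 2 through 5, columns 2 through 9. The span endpoints are still derived from floating-point edge positions, so nothing is lost in accuracy; the inner loop just never sees a float.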
Finally, I converted it to an ARMlet. (As an aside: the Metrowerks ARM tools are painfully buggy. Just selecting them in the Linker panel starts CodeWarrior crashing constantly. Because of this, it took much longer than it should have to finish the conversion.) When I was finished, the ARMlet took 0.04 seconds to run, while the fully tuned 680x0 code took 0.29 seconds: the 680x0 code took 7.25 times longer.
On a whim, I decided to paste in my integer-only code. (I couldn’t find the original floating-point code I had devised anymore; I didn’t bother committing it to source control once I got it working, since it was too slow to keep.) I kept the more efficient loops, only replacing the core calculation. The result took 2.1 seconds in 68k, and 0.83 seconds for the ARMlet. That means that if I had tuned only the loops and converted to an ARMlet, the code would be roughly 20 times slower than it is now. The untuned ARM code takes about 3 times as long as the tuned 68k code.
Granted, the tuning took more effort as I had to think about it more. But if I had to do just one, I was much better off with the tuning.
Rendering Performance

| code | time (s) |
|---|---|
| original integer in 68k | 2.10 |
| original integer in ARM | 0.83 |
| optimized floating-point in 68k | 0.29 |
| optimized floating-point in ARM | 0.04 |
(All times are on a Tungsten T3.)