Beta#3:

- "P96 new" mode only: RTG display depth is now completely virtual. The emulation converts to the real color space (configured in the display panel); 16->32, 8->32, 32->16 and 8->16 conversions are supported. 8-bit and 16-bit modes (without overlay) are now possible in windowed mode. I guess real 8-bit support can be removed because it is not very well supported anymore. 24-bit and 16-bit RGB (instead of 16-bit BGR) can also be implemented if some weird program doesn't support the other modes. NOTE: make sure you have 32-bit selected in the display panel if you want real 32-bit RTG mode (instead of 32-bit converted to 16-bit).
- "new mode" fixes and optimizations
- Chipset <-> RTG switching is now instant (no closing and reopening) if both modes have the same dimensions. (To do: different dimensions.)
- "Old P96 mode" will be removed soon because it is incompatible with the new features above (there are perhaps some optimizations to do first). Other possible new features include RTG filter support (for example, 2x support for small resolutions like 320x200).
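
To illustrate the kind of depth conversion described in the first item, here is a minimal sketch in C. It is not the emulator's actual code: the function names are hypothetical, the 16-bit layout is assumed to be plain RGB565 in host byte order, and the 32-bit layout is assumed to be 0x00RRGGBB. The real RTG pixel formats (including the BGR variant mentioned above) may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 16->32: expand RGB565 (assumed layout) to 0x00RRGGBB.
   The top bits of each channel are copied into the low bits so that
   full intensity (e.g. 31) maps to 255 rather than 248. */
uint32_t rgb565_to_xrgb8888(uint16_t p)
{
    uint32_t r = (p >> 11) & 0x1f;
    uint32_t g = (p >> 5) & 0x3f;
    uint32_t b = p & 0x1f;
    r = (r << 3) | (r >> 2);
    g = (g << 2) | (g >> 4);
    b = (b << 3) | (b >> 2);
    return (r << 16) | (g << 8) | b;
}

/* 32->16: drop the low bits of each channel. */
uint16_t xrgb8888_to_rgb565(uint32_t p)
{
    uint32_t r = (p >> 16) & 0xff;
    uint32_t g = (p >> 8) & 0xff;
    uint32_t b = p & 0xff;
    return (uint16_t)(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}

/* 8->32: palette lookup. The palette holds 256 entries already
   stored in the destination format. */
void convert_8_to_32(const uint8_t *src, uint32_t *dst, size_t n,
                     const uint32_t palette[256])
{
    assert(src != NULL && dst != NULL && palette != NULL);
    for (size_t i = 0; i < n; i++)
        dst[i] = palette[src[i]];
}

/* 8->16: same idea, with a palette pre-converted to 16-bit once,
   so no per-pixel color math is needed. */
void convert_8_to_16(const uint8_t *src, uint16_t *dst, size_t n,
                     const uint16_t palette16[256])
{
    assert(src != NULL && dst != NULL && palette16 != NULL);
    for (size_t i = 0; i < n; i++)
        dst[i] = palette16[src[i]];
}
```

In this sketch the 8-bit paths only need a table lookup per pixel, which is one plausible reason a virtual 8-bit mode on a 16-bit or 32-bit host display can stay cheap.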

New test results; it seems dual core is not the explanation after all.
AMD Athlon 2400+, AGP ATI Radeon 9700Pro, Nforce2 chipset

New:
.============= SPEEDRESULTS ==============.
| RectFill()................    3645 op/s |
| RectFill() Pattern........    1400 op/s |
| WritePixel().............. 1348548 op/s |
| WriteChunkyPixels().......    2652 op/s |
| WritePixelArray8()........    2649 op/s |
| WritePixelLine8().........  192255 op/s |
| DrawEllipse().............  111929 op/s |
| DrawCircle()..............  125293 op/s |
| Draw()....................   37982 op/s |
| Draw() Hor/Ver............   90725 op/s |
| ScrollRaster() X..........     143 op/s |
| ScrollRaster() Y..........     292 op/s |
| PutText().................   47151 op/s |
| BlitBitMap()..............   21492 op/s |
| BlitBitMapRastPort()......   20230 op/s |
| BitMapScale().............    2123 op/s |
|--------------- Intuition ---------------|
| OpenWindow()..............     473 op/s |
| MoveWindow()..............     873 op/s |
| SizeWindow()..............     423 op/s |
| CON-Output................     647 op/s |
| ScreenToFront()...........      50 op/s |
`=========================================´
Old:
.============= SPEEDRESULTS ==============.
| RectFill()................    2981 op/s |
| RectFill() Pattern........     608 op/s |
| WritePixel().............. 1432538 op/s |
| WriteChunkyPixels().......     426 op/s |
| WritePixelArray8()........     425 op/s |
| WritePixelLine8().........   72137 op/s |
| DrawEllipse().............   47304 op/s |
| DrawCircle()..............   46483 op/s |
| Draw()....................    4403 op/s |
| Draw() Hor/Ver............   14068 op/s |
| ScrollRaster() X..........     144 op/s |
| ScrollRaster() Y..........     227 op/s |
| PutText().................   24113 op/s |
| BlitBitMap()..............    6807 op/s |
| BlitBitMapRastPort()......    6292 op/s |
| BitMapScale().............     236 op/s |
|--------------- Intuition ---------------|
| OpenWindow()..............     246 op/s |
| MoveWindow()..............     688 op/s |
| SizeWindow()..............     195 op/s |
| CON-Output................     437 op/s |
| ScreenToFront()...........      50 op/s |
`=========================================´
Update beta#3:

- "Full-window" mode no longer reopens the window when switching modes
- Optimized blitter operations in 16-bit and 8-bit RTG modes
- RTG template-blit ("RectFill() Pattern") operations optimized (the pen value endian swap is now done only once per blit, not for every pixel)
- other misc fixes
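
The template-blit optimization above amounts to hoisting a loop-invariant byte swap out of the inner loop. A rough sketch in C follows; it is not the actual emulator code. The function names and parameters are hypothetical, and it assumes a 16-bit framebuffer whose byte order is the opposite of the host's, with a 1-bit-per-pixel template where the most significant bit of each byte is the leftmost pixel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Swap the two bytes of a 16-bit value. */
uint16_t swap16(uint16_t v)
{
    return (uint16_t)((v << 8) | (v >> 8));
}

/* Draw one row: every set bit in the template writes the
   already-swapped pen value; clear bits leave the pixel untouched. */
void template_row_16(uint16_t *dst, const uint8_t *mask, size_t width,
                     uint16_t pen_swapped)
{
    for (size_t x = 0; x < width; x++) {
        if (mask[x >> 3] & (0x80u >> (x & 7)))
            dst[x] = pen_swapped;
    }
}

/* Whole blit. Pitches are in elements (pixels for dst, bytes for mask).
   The pen is swapped once here instead of once per pixel in the row loop. */
void template_blit_16(uint16_t *dst, size_t dst_pitch,
                      const uint8_t *mask, size_t mask_pitch,
                      size_t width, size_t height, uint16_t pen)
{
    assert(dst != NULL && mask != NULL);
    const uint16_t pen_swapped = swap16(pen); /* once per blit */
    for (size_t y = 0; y < height; y++)
        template_row_16(dst + y * dst_pitch, mask + y * mask_pitch,
                        width, pen_swapped);
}
```

The pen value does not change during a blit, so swapping it inside the pixel loop repeats the same work for every pixel drawn. This may be part of the reason "RectFill() Pattern" improved from 608 to 1400 op/s in the results above, although the figures alone do not isolate this one change.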