Quote:
Original post by C0D1F1ED
A dual-core Pentium 4 at 3.4 GHz has 256-bit L2-cache buses, so we get 217.6 GB/s internal memory access. And let's not forget that this uses a highly efficient cache hierarchy, plus out-of-order execution, to deal with latency. The PPU has neither a cache nor out-of-order execution.
So first of all 256 GB/s ain't *a lot* faster. Their marketing even uses '2 Tb/s' just to make it sound more impressive. Secondly, since physics processing has to do random memory accesses this adds latency. So it's quite possible that even though the busses might be able to deliver 256 GB/s, actual memory bandwidth is much lower. Last but not least this is confirmed by the fact that 8 SIMD units at 400 MHz simply can't consume 256 GB/s. With one vector unit and one scalar unit each they need at most 64 GB/s, the rest is waste. And unless every instruction accesses memory the actual bandwidth usage is going to be even lower.
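Just so we're all looking at the same numbers, here's roughly where those figures come from. This is a back-of-the-envelope sketch under my own assumptions (a 256-bit L2 bus per core, 128-bit vector units and 32-bit scalar units, one access per cycle each; these are not published specs):
Code:
// Back-of-the-envelope check of the bandwidth figures quoted above.
// Assumptions are mine, not published specs.
#include <cstdio>

int main()
{
    // Dual-core Pentium 4: 2 cores * 32 bytes/cycle * 3.4 GHz
    double p4_l2_gbps = 2.0 * 32.0 * 3.4e9 / 1e9;             // ~217.6 GB/s

    // PPU: 8 SIMD units * (16-byte vector + 4-byte scalar) * 400 MHz
    double ppu_need_gbps = 8.0 * (16.0 + 4.0) * 400e6 / 1e9;  // ~64 GB/s

    std::printf("P4 L2 bandwidth     : %.1f GB/s\n", p4_l2_gbps);
    std::printf("PPU peak consumption: %.1f GB/s\n", ppu_need_gbps);
    return 0;
}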
Well, it doesn't really matter how fast the connection between the L2 cache and the core is, because that's all on the processor die. The bottleneck is getting the data into the L2 cache in the first place by prefetching it from main memory. And the P4's pipeline is so deep that any run of branch mispredictions forces the pipeline to be flushed and refilled, which is detrimental to performance. This is why, clock for clock, the Pentium 4 is not as fast as the Pentium-M, which is actually based on the Pentium 3 core. It's no secret that Intel has been trying to play down the Pentium-M's performance advantage and keep it in the mobile market, because it beats the P4 in both performance and energy consumption at a lower clock speed. So the clock speed myth is back in full swing, when the truth is that a higher clock speed doesn't mean higher performance. Why did the P4 design exist in the first place, then? Marketing. The deep pipeline made it possible to scale to higher clock speeds faster.
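To put the misprediction point in concrete terms, here's a minimal sketch (not a rigorous benchmark): the same branchy loop runs over random data and over sorted, i.e. predictable, data. On a deeply pipelined core like the P4 the random case is typically several times slower, purely because of the speculative work that gets thrown away:
Code:
// Same loop, predictable vs. unpredictable branch outcomes.
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <random>
#include <vector>

static long long sum_big(const std::vector<int>& v)
{
    long long sum = 0;
    for (int x : v)
        if (x >= 128)   // this is the branch the predictor has to guess
            sum += x;
    return sum;
}

static double time_ms(const std::vector<int>& v)
{
    auto t0 = std::chrono::steady_clock::now();
    volatile long long s = sum_big(v);   // volatile: keep the work
    (void)s;
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main()
{
    std::vector<int> random_data(1 << 22);
    std::mt19937 rng(42);
    for (int& x : random_data) x = rng() % 256;          // unpredictable

    std::vector<int> sorted_data = random_data;
    std::sort(sorted_data.begin(), sorted_data.end());   // predictable

    std::printf("random: %.2f ms   sorted: %.2f ms\n",
                time_ms(random_data), time_ms(sorted_data));
    return 0;
}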
Also, you don't need cache to get higher performance. If you look at the architecture of the PS2, it's philosophically the opposite of the PC approach. It gets rid of cache almost completely, but gives all the processing units a 10-channel DMA controller, so all the data is moved around in streaming form. The whole point of cache was to reduce idle time of the CPU, since the path between cache and CPU is faster than the path from main memory to CPU. Then prefetching was added on top to pull in data the CPU "may" use in the future and hide even more of the memory latency, but that brings in the problem of prediction misses. However, if you settle for a lower clock speed and feed the CPU data as fast as it can process it, you really don't need any cache at all. This is also why the P4 originally required RDRAM for maximum performance: only RDRAM had the bandwidth to feed the CPU data as fast as it could process it.
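You can see a pale PC-side echo of that streaming idea in software prefetching. Here's a minimal sketch using x86 SSE intrinsics (the look-ahead distance of 64 elements is an arbitrary choice of mine): walk an array linearly and ask the memory system to start fetching a bit ahead of where you are, so the core rarely stalls. A PS2-style DMA engine streams data into scratchpad memory instead, but the latency-hiding principle is the same:
Code:
// Linear walk over a big array with explicit prefetch hints.
#include <xmmintrin.h>   // _mm_prefetch, _MM_HINT_T0
#include <cstddef>

float sum_streaming(const float* data, std::size_t n)
{
    const std::size_t lookahead = 64;   // arbitrary: a few cache lines ahead
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
    {
        if (i + lookahead < n)
            _mm_prefetch(reinterpret_cast<const char*>(data + i + lookahead),
                         _MM_HINT_T0);  // hint: we'll need this element soon
        sum += data[i];
    }
    return sum;
}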
Quote:
Original post by C0D1F1ED
Quote:
the key quote is here
Quote:
But we were willing to sacrifice some game-play "feedback" in order to achieve great scalability (10K inter-colliding objects, for example, is where things really start to get interesting). We have solid game-play physics in our flagship Havok Physics product - so we wanted to come up with an add-on solution that game developers could use to layer on stunning effects that look and behave correctly
I haven't seen any Ageia demo yet that shows more 'feedback'. In fact I'm sure it's problematic for them. Legacy PCI offers only a fraction of the bandwidth of PCI-Express.
The point, I think, is that HavokFX is an "add-on" to the Havok engine itself, which means you don't really offload that much to the GPU; most of the work is still done on the CPU side. That, of course, has the nice property that you can easily turn it off. Ageia's solution, on the other hand, offloads the bulk of the calculations from the CPU entirely.
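In code terms I imagine the layering looking something like this. This is purely my own sketch of the separation; the class and type names are made up and have nothing to do with Havok's or Ageia's actual APIs:
Code:
// Hypothetical sketch: gameplay physics vs. an optional effects layer.
#include <vector>

struct RigidBody      { /* gameplay-relevant state: position, velocity, ... */ };
struct DebrisParticle { /* eye candy only: never feeds back into gameplay   */ };

class GameplayPhysics            // always runs on the CPU, drives game state
{
public:
    void step(float dt, std::vector<RigidBody>& bodies) { /* ... */ }
};

class EffectsPhysics             // the "add-on": offload to GPU/PPU, or skip
{
public:
    bool enabled = false;        // trivially switched off, gameplay unaffected
    void step(float dt, std::vector<DebrisParticle>& debris) { /* ... */ }
};

void tick(float dt, GameplayPhysics& game, EffectsPhysics& fx,
          std::vector<RigidBody>& bodies, std::vector<DebrisParticle>& debris)
{
    game.step(dt, bodies);       // results affect the game
    if (fx.enabled)
        fx.step(dt, debris);     // purely visual; safe to drop
}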
Quote:
Original post by C0D1F1ED
Anyway, I'm sure PhysX is a nice processor for physics processing, no doubt about that. But it's only going to be bought by hardcore gamers. For PhysX to be successful in the long run it needs much more market penetration. But very soon it will get serious competition from multi-core CPUs and DirectX 10 graphics cards that are unified and well-suited for GPGPU. And frankly I don't expect any actual game to use 10,000 objects. That's cool for a demo but pointless for gameplay. A few hundred or thousand pieces of debris from explosions can be handled perfectly well by a next-generation CPU/GPU. So my only point is that PhysX just has no future.
It shouldn't be forgotten that the graphics card market was started by hardcore gamers who dug deep into their pockets and bought the first Voodoo cards from 3dfx. I know, I was one of them.
And I really would like to restate my view that the whole concept of GPGPU is an oxymoron. It pretty much came out of the academic need for low-cost, fast processing, and that is pretty much where it should end. It really is just an academic exercise in seeing what we can force the GPU to do other than graphics, given that it has all this horsepower. Has anyone ever thought of building supercomputers from multiple Quad-SLI machines, based completely on GPUs? I don't think so. The whole reason it's called "General Purpose" is to attract people to the field in the hope of getting "general purpose" processing out of it. But in the end, the result is a specialized piece of hardware that does graphics. And to make it do "general purpose" work, we fool it into thinking it's doing graphics.
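To make that last point concrete, here's roughly what the trick looks like with the usual render-a-full-screen-quad approach. This is a generic illustration of mine, not any particular library's code: the data lives in floating-point textures, the "kernel" is a fragment shader, and "running" it means drawing one quad into a render target. The GLSL below just adds two "arrays" element-wise; the GL setup (FBO, textures, quad) is omitted for brevity:
Code:
// The classic GPGPU dodge: arrays become textures, the kernel becomes a
// fragment shader, and the output lands in a render-target texture.
const char* kAddKernel =
    "uniform sampler2D arrayA;                          \n"  // input array #1
    "uniform sampler2D arrayB;                          \n"  // input array #2
    "void main()                                        \n"
    "{                                                  \n"
    "    vec4 a = texture2D(arrayA, gl_TexCoord[0].st); \n"
    "    vec4 b = texture2D(arrayB, gl_TexCoord[0].st); \n"
    "    gl_FragColor = a + b;  // one pixel = one array element \n"
    "}                                                  \n";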
Also, be very careful about saying things like "10K objects or more is pointless for gameplay", or you'll run into the same trouble Bill Gates got into when, about 25 years ago, he supposedly said that nobody would ever need more than 640 KB of RAM. (I recall it was Bill.)