> I think one of the things that will eventually happen is the GPU will disappear when CPUs are powerful enough to handle all computation; after all, a GPU is just a CPU with a partially fixed pipeline.
GPUs these days have very little fixed-function pipeline left in them. With compute shaders, they're now basically an extremely wide SIMD, massively hyper-threaded, many-core RISC co-processor, with hierarchical and/or NUMA memory and an asynchronous DMA controller to boot.
CPUs are never going to replace such a thing, unless they become GPUs themselves, which won't happen, because then they wouldn't run legacy code efficiently.
Most of our code is stuck on the CPU because we all learned a particular way of writing software, and not enough people have re-learned how to engineer their software for other kinds of hardware architectures yet.
We still assume that it's a good abstraction for any bit of our code to be able to hold a pointer to any bit of our data, that RAM is one huge, flat array of data, and that the CPU can operate directly on that data.
In reality, CPU makers have done a tonne of magic behind the scenes to make that abstraction seem like it works: they mirror RAM into small caches (otherwise our software would be 1000x slower), they run statistical prediction algorithms to guess which bits of RAM to mirror at which times, they reorder our code to hide memory-access latencies, they speculatively start executing branches that might never be taken, and they insert invisible stalls and fences so that all this parallel guesswork still behaves exactly like the hypothetical, serial abstract machine that your C code was written for.
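To make the cache point concrete, here's a tiny C illustration (array size and types are arbitrary, purely for demonstration): both loops do identical work as far as the flat-RAM abstract machine is concerned, but the second one strides across memory in a way the hidden cache machinery can't help with, and typically runs several times slower on real hardware.

```c
#include <stddef.h>
#include <stdint.h>

#define N 4096

/* Walks memory in the order it's laid out: each cache line the CPU
 * quietly fetched from RAM gets fully used before moving on. */
uint64_t sum_row_major(const uint32_t (*m)[N])
{
    uint64_t sum = 0;
    for (size_t row = 0; row < N; row++)
        for (size_t col = 0; col < N; col++)
            sum += m[row][col];
    return sum;
}

/* Exactly the same additions, but strided across memory: each access
 * lands on a different cache line, so the hidden caching magic can do
 * far less for us and the loop runs many times slower. */
uint64_t sum_col_major(const uint32_t (*m)[N])
{
    uint64_t sum = 0;
    for (size_t col = 0; col < N; col++)
        for (size_t row = 0; row < N; row++)
            sum += m[row][col];
    return sum;
}
```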
Instead of wasting transistors on all that magic, the SPE design threw it away and spent those transistors on more cores. The cores didn't do magic behind the scenes; instead, they required programmers to write their code differently.

In place of a magic cache that sometimes makes RAM seem fast (when it works), they gave us an asynchronous DMA controller, which lets you perform a non-blocking memcpy between RAM and the core's local memory and then poll to see whether it has completed. You have to explicitly tell the CPU in advance that you need some data moved from RAM to local memory so that the CPU can operate on it, instead of having the CPU pretend it can operate on RAM directly (it can't) and running guesswork to manage a cache for you. They removed the statistical branch-prediction magic, and instead relied on the programmer annotating their if statements as statistically more likely to be true or false. And if you need to write some results back to RAM, you don't have the CPU dribble out each word one by one, waiting for them to land in cache; you kick off an async memcpy and keep doing useful work while the DMA hardware finishes the job in the background.
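For flavour, here's roughly what that explicit style looked like on an SPE, written against the Cell SDK's spu_mfcio.h intrinsics as best I remember them; treat the names and details as a sketch rather than gospel. It also shows the branch-annotation idea via GCC's __builtin_expect, which is the kind of hint those compilers leaned on in place of a hardware predictor.

```c
/* Stream a large array from main RAM through local store, overlapping each
 * chunk's DMA transfer with computation on the previous chunk (classic
 * double buffering). Sketch only; intrinsic names as I recall them. */
#include <spu_mfcio.h>

#define CHUNK 4096   /* bytes per DMA transfer; hardware max is 16KB */

static unsigned int buf[2][CHUNK / sizeof(unsigned int)]
    __attribute__((aligned(128)));   /* 128-byte alignment for full-speed DMA */

unsigned long long sum_stream(unsigned long long ram_ea, int nchunks)
{
    unsigned long long total = 0;
    int cur = 0;

    /* Explicitly ask the DMA engine for the first chunk; tag = buffer index. */
    mfc_get(buf[cur], ram_ea, CHUNK, cur, 0, 0);

    for (int i = 0; i < nchunks; i++) {
        int next = cur ^ 1;

        /* Start fetching chunk i+1 *before* touching the current data, so
         * the DMA controller streams it in while we crunch chunk i. */
        if (__builtin_expect(i + 1 < nchunks, 1))   /* programmer-written branch hint */
            mfc_get(buf[next], ram_ea + (unsigned long long)(i + 1) * CHUNK,
                    CHUNK, next, 0, 0);

        /* Block until the transfer for the *current* buffer has completed. */
        mfc_write_tag_mask(1 << cur);
        mfc_read_tag_status_all();

        for (unsigned j = 0; j < CHUNK / sizeof(unsigned int); j++)
            total += buf[cur][j];

        cur = next;
    }
    return total;
}
```

Writing results back works the same way with mfc_put: kick it off, keep computing, and only wait on its tag when you actually need that buffer again.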
The result was a CPU that required a radically different approach to writing software (arguably better, arguably worse, just different), but offered an insane amount of performance precisely because of that different design.
But we can't have such a thing because of inertia. We're stuck with CPUs continuing to emulate designs that we decided on in the 80's, because that's the hypothetical abstract machine that our programming languages are designed around.
GPUs on the other hand have inertia of their own. The popularity of computer graphics has ensured that every single PC now has a GPU inside it. Computer graphics people were happy to learn how to write their code differently in order to gain performance, giving the manufacturers unlimited freedom to experiment with different hardware designs. The result is a huge amount of innovation and actual advancement in processing technology, and in parallel software engineering knowledge.
Because GPUs have been so successful, and aren't going anywhere any time soon (CPU architectures are designed so differently that they can't compete for parallel workloads), everyone else now has a chance to learn how to write their programs in the ultra-wide-SIMD compute-shader pattern, if they care to.
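If "ultra-wide-SIMD compute-shader pattern" sounds mysterious, the heart of it is something like this toy C sketch (not any particular GPU API): keep your data in flat buffers and express the work as a kernel over a single index, so the hardware is free to run that kernel over thousands of elements at once.

```c
#include <stddef.h>

typedef struct {
    const float *pos_x, *pos_y;    /* structure-of-arrays inputs  */
    const float *vel_x, *vel_y;
    float *out_x, *out_y;          /* outputs, same flat layout   */
    float dt;
} particle_buffers;

/* One "thread" of the kernel: integrate exactly one particle, with no
 * pointers wandering off to arbitrary objects elsewhere in memory. */
static void integrate_kernel(const particle_buffers *b, size_t i)
{
    b->out_x[i] = b->pos_x[i] + b->vel_x[i] * b->dt;
    b->out_y[i] = b->pos_y[i] + b->vel_y[i] * b->dt;
}

/* The "dispatch": here it's a serial stand-in loop, but on a GPU the
 * hardware runs one thread per index across the whole buffer. */
void integrate_dispatch(const particle_buffers *b, size_t count)
{
    for (size_t i = 0; i < count; i++)
        integrate_kernel(b, i);
}
```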
In the last generation of game engines, we saw systems that had traditionally been single-threaded (and there were a lot of people preaching "games are inherently single-threaded, multi-core won't help!") get replaced by multi-core, NUMA-aware systems.
In the next generation of game engines, we're going to see many of these systems move off the CPU and over to the GPU compute hardware instead.
The fact is that, in terms of ops per joule, GPUs and processors such as the SPEs are far, far ahead of traditional CPUs by design. Note that the PS3's Cell CPU is, what, 8 years old now, but it still matches a modern Core i7 in terms of FLOPS thanks to a more efficient (but very un-CPU-like) design...
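Rough back-of-envelope behind that claim, using commonly quoted theoretical peak figures rather than measurements, so take the exact numbers with a grain of salt (a newer FMA-equipped i7 would come out higher):

```c
#include <stdio.h>

int main(void)
{
    /* Cell (2006): 8 SPEs x 3.2GHz x 4-wide single-precision FMA
     * = 8 flops per SPE per cycle.                                   */
    double cell_gflops = 8 * 3.2 * 4 * 2;      /* ~204.8 GFLOPS       */

    /* Sandy Bridge era quad-core i7 (~3.4GHz): AVX gives an 8-wide add
     * plus an 8-wide multiply per core per cycle = 16 flops/cycle.   */
    double i7_gflops = 4 * 3.4 * (8 + 8);      /* ~217.6 GFLOPS       */

    printf("Cell peak: %.1f GFLOPS, i7 peak: %.1f GFLOPS\n",
           cell_gflops, i7_gflops);
    return 0;
}
```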