Frame buffer speed, when does it matter?


I'd like to ask how important the speed of video memory is, and when it actually gets used. When creating resources we copy from RAM to VRAM, and that process is severely limited by the PCI-E bus. But when does memory speed come into the spotlight? When does HBM shine with its insane bandwidth/interface width?

Thanks.

Rendering to a texture/buffer uses up bandwidth. Reading from a texture/buffer uses bandwidth. Rendering/reading more data (e.g. rendering in higher resolution) uses more bandwidth. Rendering over existing texel data (aka overdraw) wastes excess bandwidth.

More memory bandwidth/speed means that you can render more things per second.

Memory bandwidth particularly is helpful when rendering to higher resolutions. Rendering to a 4K screen - or to the dual screens needed for VR - uses significantly more bandwidth than typical for 1080p gaming, simply on account of there being far more pixels (not just being drawn, but also being read/rewritten in all the post-processing stages).
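To put rough numbers on that (these are illustrative assumptions - one RGBA16F read plus one RGBA16F write per pass, ten full-screen passes - not measurements of any real game):

#include <cstdio>

// Back-of-the-envelope traffic for N full-screen post-processing passes,
// each reading one RGBA16F target and writing another (8 + 8 bytes per pixel).
// Purely illustrative - real pipelines mix formats, and caches/compression help.
int main()
{
    const long long bytesPerPixel = 8 + 8;   // one RGBA16F read + one RGBA16F write
    const long long passes        = 10;
    const long long res[][2]      = { {1920, 1080}, {3840, 2160} };

    for (const auto& r : res)
    {
        const long long pixels  = r[0] * r[1];
        const double mbPerFrame = double(pixels * bytesPerPixel * passes) / (1024.0 * 1024.0);
        std::printf("%lldx%lld: %.0f MB of traffic per frame, %.1f GB/s at 60 fps\n",
                    r[0], r[1], mbPerFrame, mbPerFrame * 60.0 / 1024.0);
    }
}

The 4x jump in pixel count from 1080p to 4K shows up directly in the bandwidth bill.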

Sean Middleditch – Game Systems Engineer – Join my team!

It is always in the spotlight and always used... it's a computing system, so it is always fetching memory. I'm not sure what you meant by frame buffer - an FBO? If you want to see the performance impact of GPU RAM speed, download a performance/overclocking tool, turn the memory speed down by about 500 MHz while in a game, and see what happens to the framerate.

If you're talking about system RAM, then no, that speed doesn't matter too much for interfacing with the GPU. At least comparing low-speed DDR3 vs high-speed DDR3, the benchmarks I've seen might show a 1 FPS bump, from 59 to 60.

NBA2K, Madden, Maneater, Killing Floor, Sims http://www.pawlowskipinball.com/pinballeternal

GPU ALU (computation) speeds keep getting faster and faster -- so if a shader was ALU-bottlenecked on an old GPU, that same shader would likely become memory-bottlenecked on a newer GPU with faster ALU processing -- so faster GPUs need faster RAM to keep up :)

Any shader that does a couple of memory fetches is potentially bottle-necked by memory.

Say for example that a memory fetch has an average latency of 1000 clock cycles, and a shader core can perform one math operation per cycle. If the shader core can juggle two thread(-groups) at once, then an optimal shader would only perform one memory fetch per 1000 math operations.

e.g. say the shader was [MATH*1000, FETCH, MATH*1000]: the core would start on thread-group #1, do 1000 cycles of ALU work, perform the fetch, and have to wait 1000 cycles for the result before doing the next 1000 cycles of work. While it's blocked here, though, it will switch to thread-group #2 and do its first block of 1000 ALU instructions. By the time it gets to thread-group #2's FETCH instruction (which forces it to wait out a 1000-cycle memory latency), the results of thread-group #1's fetch will have arrived from memory, so the core can switch back to thread-group #1 and perform its final 1000 ALU instructions. By the time it's finished doing that, thread-group #2's memory fetch will have completed, so it can go on to finish thread-group #2's final 1000 ALU instructions.

If a GPU vendor doubles the speed of their ALU processing unit -- e.g. it's now 2 ALU-ops per cycle, then it doesn't really make this shader go much faster:

The core initially does thread-group #1's first block of 1000 ALU instructions in just 500 cycles, but then hits the fetch, which will take 1000 cycles. So, as above, it switches over to thread-group #2 and performs its first block of 1000 ALU instructions in just 500 cycles... but now we're only 500 cycles into a 1000-cycle memory latency, so the core has to go idle for 500 cycles, waiting for thread-group #1's fetch to finish.

The GPU vendor would also have to halve their memory latency in order to double the speed of this particular shader.
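If it helps to see it laid out, here's a tiny C++ model of that timeline (the 1000-op blocks, the 1000-cycle latency and the two-thread-group round-robin are just the assumptions from the example above, not real hardware numbers):

#include <algorithm>
#include <cstdio>

// Total cycles to run [MATH*1000, FETCH, MATH*1000] on two thread-groups,
// where the core switches thread-groups whenever the current one is waiting
// on its fetch, given how many ALU ops it retires per cycle.
long long total_cycles(long long aluOpsPerCycle)
{
    const long long kAluOps   = 1000;                  // ops per MATH block
    const long long kLatency  = 1000;                  // cycles per FETCH
    const long long aluCycles = kAluOps / aluOpsPerCycle;

    long long t = 0;
    t += aluCycles;                          // TG#1: first MATH block
    const long long tg1Ready = t + kLatency; // TG#1's fetch result arrives here
    t += aluCycles;                          // TG#2: first MATH block (hides TG#1's fetch)
    const long long tg2Ready = t + kLatency; // TG#2's fetch result arrives here
    t = std::max(t, tg1Ready);               // possible idle: waiting on TG#1's fetch
    t += aluCycles;                          // TG#1: second MATH block
    t = std::max(t, tg2Ready);               // possible idle: waiting on TG#2's fetch
    t += aluCycles;                          // TG#2: second MATH block
    return t;
}

int main()
{
    std::printf("1 ALU op/cycle : %lld cycles\n", total_cycles(1)); // 4000, no idle time
    std::printf("2 ALU ops/cycle: %lld cycles\n", total_cycles(2)); // 2500, not 2000
}

Doubling the ALU rate only takes this shader from 4000 to 2500 cycles (1.6x), and the missing 500 cycles are exactly the exposed memory latency.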

Increasing memory speed is hard though. The trend is that processing speed improves 2x every 2 years, but memory speed improves 2x every 10 years... in which time processing speed has gotten 32x faster... so over a 10-year span, memory speed effectively becomes 16x slower relative to processing speed :o

Fancy new technologies like HBM aren't really bucking this trend; they're clawing to keep up with it.

So GPU vendors have other tricks up their sleeve to reduce observed memory latency, independent of the actual memory latency. In my above example, the observed memory latency is 0 cycles on the first GPU and 500 cycles on the second GPU, despite the actual memory latency being 1000 cycles in both cases. Adding more concurrent thread-groups allows the GPU to form a deep pipeline and keep the processing units busy while these high-latency memory fetches are in flight.

So as a GPU vendor increases their processing speed (at a rate of roughly 2x every 2 years), they also need to increase their memory speeds and/or the depth of their pipelining. As above, as an industry, we're not capable of improving memory at the same rate as we improve processing speeds... so GPU vendors are forced to improve memory speed when they can (when a fancy new technology comes out every 5 years), and increase pipelining and compression when they can't.

On that last point -- yep, GPUs also implement a lot of compression on either end of a memory bus in order to decrease the required bandwidth. E.g. DXT/BC texture formats don't just reduce the memory requirements for your game; they also make your shaders run faster as they're moving less data over the bus! Or more recently: it's pretty common for neighbouring pixels on the screen to have similar colours, so AMD GPUs have a compression algorithm that exploits this fact - to buffer/cache pixel shader output values and then losslessly block-compress them before they're written to GPU-RAM. Some GPUs even have hardware dedicated to implementing LZ77, JPEG, H264, etc...
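As a rough illustration of the bandwidth side of block compression (the sizes below are just the standard bits-per-texel of each format, ignoring mipmaps and any hardware framebuffer compression):

#include <cstdio>

// Storage (and therefore worst-case bus traffic) for a 2048x2048 texture in a
// few common formats. BC1/DXT1 packs each 4x4 block into 8 bytes, BC7 into 16.
int main()
{
    const long long w = 2048, h = 2048;
    const long long rgba8 = w * h * 4;      // 4 bytes per texel
    const long long bc1   = (w * h) / 2;    // 0.5 bytes per texel (8:1 vs RGBA8)
    const long long bc7   = w * h;          // 1 byte per texel   (4:1 vs RGBA8)
    std::printf("RGBA8: %lld MB, BC1: %lld MB, BC7: %lld MB\n",
                rgba8 >> 20, bc1 >> 20, bc7 >> 20);
}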

Besides hardware-implemented compression, compressing your own data has always been a big manual optimization. e.g. back on PS3/Xb360 games, I shaved a good number of milliseconds off the frame-time by changing all of our vertex attributes from 32-bit floats to a mixture of 16-bit float and 16/11/10/8-bit fixed-point values, reducing the vertex shader's memory bandwidth requirement by over half.
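A sketch of what that kind of attribute packing looks like on the CPU side (the exact packed formats below are made up for illustration - what's actually available depends on the platform/API):

#include <cstdint>
#include <cstdio>

// "Fat" vertex: every attribute stored as a 32-bit float.
struct VertexFat {
    float position[3];   // 12 bytes
    float normal[3];     // 12 bytes
    float uv[2];         //  8 bytes
    float color[4];      // 16 bytes
};                       // 48 bytes total

// Packed vertex: half floats and normalized integers (raw bit patterns on the CPU side).
struct VertexPacked {
    uint16_t position[4]; // 16-bit halfs, 4th component unused/padding:  8 bytes
    uint32_t normal;      // 10:10:10:2 signed-normalized:                4 bytes
    uint16_t uv[2];       // 16-bit halfs:                                4 bytes
    uint8_t  color[4];    // 8-bit unorm RGBA:                            4 bytes
};                        // 20 bytes total

int main()
{
    std::printf("fat: %zu bytes, packed: %zu bytes per vertex\n",
                sizeof(VertexFat), sizeof(VertexPacked));
}

48 bytes down to 20 is the kind of "over half" saving described above, and the GPU's vertex fetch hardware can typically expand these formats back to floats for free.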

In addition to what others have said, I'd like to add blending to the list, since at minimum it's a read-modify-write, which basically doubles the bandwidth requirements.
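In numbers, for a plain RGBA8 target (an illustrative sketch - real GPUs also batch and compress ROP traffic):

#include <cstdio>

// Per-pixel framebuffer traffic for an RGBA8 target, ignoring depth and compression.
int main()
{
    const int opaqueWrite = 4;       // write 4 bytes
    const int blended     = 4 + 4;   // read 4 bytes, blend, write 4 bytes
    std::printf("opaque: %d bytes/pixel, blended: %d bytes/pixel (%dx)\n",
                opaqueWrite, blended, blended / opaqueWrite);
}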

-potential energy is easily made kinetic-

Thanks everyone. I'm learning graphics from the very start and I'm having an extremely hard time finding basic resources that explain just exactly how it all works - all the actual details of the pipeline and the graphics workflow. I'm willing to buy any book, but I can't find anything that explains it at the level of graphics cores, caches, memory, memory controllers - what actually happens with the byte arrays containing textures, how schedulers work and what they do, core architecture, that kind of thing.

The reason I'm asking this (from a higher-level view) is that I'm interested in why HBM is beneficial and when it stops being so. I presumed that the interface width of the memory enables us to transfer more data in a shorter time, meaning that many post-processing effects that operate on fully composed images probably get faster with more bandwidth, since every read from memory into registers is faster, and every write too.

How much would the Fury line lose if it used GDDR5 instead of HBM?

I presumed that the interface width of the memory enables us to transfer more data in a shorter time

It enables you to transfer more data in the same time, not in a shorter time. It's a very important distinction.
Think of the problem as a truck travelling 500 km, a trip that takes 5 hours. The truck can only hold 1 tonne of cargo. If you use two trucks, you can send twice the amount of cargo, but it will still take 5 hours.

The reason I'm asking this (from a higher-level view) is that I'm interested in why HBM is beneficial and when it stops being so.

It depends on something we call the "bottleneck". A game that performs a lot of reads and writes may be bandwidth-limited, so memory with higher bandwidth will make it run faster.
But if another game spends most of its time executing math (which uses the ALUs Hodgman describes), then higher bandwidth won't do jack squat, because that's not the bottleneck.

Going back to the truck example:

You have to transfer 2 tonnes of cargo and you have one truck. This is your bottleneck. You need 5 hours to travel 500 km and deliver 1 tonne, then another 5 hours to get back and load the rest, then 5 hours more to travel the 500 km again. In total, the travelling took 15 hours with one truck.
If you use two trucks, you'll be done in 5 hours. Memory bandwidth and bus bandwidth behave more or less the same: you can send more data in the same amount of time. If you have a lot of data to send, doubling the amount you can transfer per trip lets you finish sooner - but only if that's the bottleneck. And you can never go below 5 hours for a single trip. (Why, you ask? Because GPUs can't send data faster than the speed of light.)

Now let's add the "ALU" to the example. Suppose all you have to send in the truck is a machine that weighs only 70 kg (that's 0.07 tonnes). However, disassembling the machine for transportation and loading it into the truck takes you 8 hours. The truck's journey then takes 5 hours. Total time = 13 hours.
You could use two trucks... but it will still take you 13 hours, because having an extra truck doesn't help you at all with disassembling the machine. What you need is an extra pair of hands, not another truck. The bottleneck here is disassembling the machine, not transportation.

In this example people = ALU; trucks = bandwidth.
More people = you can disassemble and load the machine into the truck faster.
More trucks = you can send more cargo per trip.

More ALU = you can do more math operations in the same amount of time.
More bandwidth = you can do more loads and stores from/to memory in the same amount of time.

So, to answer your question: does an increase of bandwidth make a game run faster? It depends.

Thanks everyone. I'm learning graphics from the very start and I'm having an extremely hard time finding basic resources that explain just exactly how it all works - all the actual details of the pipeline and the graphics workflow. I'm willing to buy any book, but I can't find anything that explains it at the level of graphics cores, caches, memory, memory controllers - what actually happens with the byte arrays containing textures, how schedulers work and what they do, core architecture, that kind of thing.

https://developer.nvidia.com/content/life-triangle-nvidias-logical-pipeline

http://fgiesen.wordpress.com/2011/07/09/a-trip-through-the-graphics-pipeline-2011-index/

http://s09.idav.ucdavis.edu/talks/02_kayvonf_gpuArchTalk09.pdf

http://www.cs.virginia.edu/~gfx/papers/pdfs/59_HowThingsWork.pdf

-potential energy is easily made kinetic-

I'm learning graphics from the very start and I'm having an extremely hard time finding basic resources that explain just exactly how it all works - all the actual details of the pipeline and the graphics workflow.

This is a good overview of the nitty gritty details that you don't really need to know :D
https://fgiesen.wordpress.com/2011/07/09/a-trip-through-the-graphics-pipeline-2011-index/

The reason I'm asking this (from a higher-level view) is that I'm interested in why HBM is beneficial and when it stops being so. I presumed that the interface width of the memory enables us to transfer more data in a shorter time, meaning that many post-processing effects that operate on fully composed images probably get faster with more bandwidth, since every read from memory into registers is faster, and every write too.

Most of the time you can ignore the actual name of the RAM - DDR3/GDDR5/HBM2/etc... - and just look at the performance stats.
e.g. The Wii U has 2GB of DDR3 with a maximum bandwidth of 12.8 GB/s. I don't really care that it's DDR3, but that 12.8 GB/s number is important. If you're running at 60 frames per second, that's 218.5 MB/frame...
So even though there's 2GB of RAM available, the bandwidth number tells us that we can only access around 10% of it in any one frame. And that's the theoretical max bandwidth, not the actual performance that your game will see. The max bandwidth can only be achieved if you write a program that does nothing but transfer data around. Real programs tend to alternate between doing some processing and doing some data transfers, and they have bottlenecks and stalls, etc, causing real performance to always fall short of theory.
Meanwhile, the Xbox360 has 512MB of GDDR with a maximum bandwidth of 22.4 GB/s -- or at 60Hz, that's 382.3 MB/frame -- so even though the WiiU has 4x more memory, the amount that it's actually able to touch in any one frame is less!
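The arithmetic behind those two figures, if you want to plug in your own numbers (this just mirrors the examples above, using the 1 GB = 1024 MB convention):

#include <cstdio>

// Convert a theoretical peak bandwidth into a per-frame budget.
double mbPerFrame(double gbPerSecond, double fps)
{
    return gbPerSecond * 1024.0 / fps;   // 1 GB = 1024 MB, to match the figures above
}

int main()
{
    std::printf("Wii U    (12.8 GB/s @ 60 Hz): %.1f MB/frame\n", mbPerFrame(12.8, 60.0));
    std::printf("Xbox 360 (22.4 GB/s @ 60 Hz): %.1f MB/frame\n", mbPerFrame(22.4, 60.0));
}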

For most purposes, just looking at that max bandwidth figure is enough to give you a ballpark performance metric of how much data you should be able to access. If you then work out how much data you need to access in order to implement your algorithms, you'll be able to tell if it's theoretically possible or not.

It's only when doing very low level optimizations that the other properties of RAM will be of importance to you. Two of them could be:
* the latency of a memory fetch -- including in a "cache miss" situation and a "cache hit" situation. The latency between a processor and RAM tends to be extremely high these days, as much as 1000 cycles, so there's usually a small cache bolted onto the processor to speed things up. If a value is already present in the cache, it might only take 10-100 cycles to fetch the data. Caches often have multiple levels of different sizes -- smaller caches closer to the processor, and larger caches further away and sometimes shared between a few processing cores.
One of the biggest areas of low-level code optimization these days is deliberately organizing your data (and the order of your processing operations) in such a way to maximize cache hits and avoid cache misses -- also known as Data oriented Design by some people.
* when doing even lower-level optimization, the actual physical structure of the RAM (or cache) can be important. Some memory systems have different memory buses for different physical parts of the RAM, e.g. Texture A is fetched over bus#1 and Texture B is fetched over bus#2. These buses may operate asynchronously and in parallel, so doing 10x fetches from Texture A might take twice as long as doing 5x fetches from Texture A plus 5x fetches from Texture B, even though the same total amount of data has been moved. Variations of this issue are "bank conflicts", "conflict misses" (for associative caches), or "false sharing" (for cache lines shared between cores) - there's a small false-sharing sketch after the link below.
Check out http://www.futurechips.org/chip-design-for-all/what-every-programmer-should-know-about-the-memory-system.html
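Since "false sharing" is one of those terms that's easier to see than to read about, here's a minimal CPU-side sketch (it assumes 64-byte cache lines; the actual timings vary by machine, and GPUs have their own equivalents of this kind of conflict):

#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

// Two counters that happen to live on the same cache line: each core's writes keep
// invalidating the other core's copy of that line ("false sharing").
struct Unpadded { std::atomic<long> a{0}, b{0}; };

// Giving each counter its own (assumed) 64-byte cache line avoids the ping-pong.
struct Padded   { alignas(64) std::atomic<long> a{0}; alignas(64) std::atomic<long> b{0}; };

template <class T>
double runMs(T& s)
{
    auto t0 = std::chrono::steady_clock::now();
    std::thread t1([&] { for (int i = 0; i < 20000000; ++i) s.a.fetch_add(1, std::memory_order_relaxed); });
    std::thread t2([&] { for (int i = 0; i < 20000000; ++i) s.b.fetch_add(1, std::memory_order_relaxed); });
    t1.join(); t2.join();
    return std::chrono::duration<double, std::milli>(std::chrono::steady_clock::now() - t0).count();
}

int main()
{
    Unpadded u; Padded p;
    std::printf("same cache line : %.1f ms\n", runMs(u));   // typically noticeably slower...
    std::printf("separate lines  : %.1f ms\n", runMs(p));   // ...than the same work on separate lines
}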

How much would the Fury line lose if it used GDDR5 instead of HBM?

It depends on the shaders that you're running on it :D
If the shaders are completely ALU bottlenecked, e.g. procedurally generating all the graphics with no need for memory, then there would probably be no performance difference!

What kinds of shaders can get memory-bandwidth bottlenecked at, let's say, 4K resolution?

This topic is closed to new replies.
