Texture data flow from CPU to GPU


I have a question about best practices with regards to how texture data flows through an engine from disk to CPU to GPU. At some point you have to provide the GPU a pointer to the compressed texture data so that it can be copied into VRAM. I want to optimize this pipeline as much as possible, but I need to know more about how this is commonly done. I will be loading the textures from a file where the data has been preprocessed into a BCn format.

Questions:

  1. Should I keep a copy of the texture around in CPU memory? (I'm thinking about cases where the device is lost and I need to upload it again.) Is it OK to discard the CPU memory buffer once it is uploaded to the GPU?
  2. Would I see any benefit from memory-mapped files, where I can directly provide the mapped pointer to the graphics API without having to allocate a buffer on the heap? This has the drawback of not allowing additional entropy-based compression on top of BCn, and requires that I load textures on the main thread (OpenGL). It's also cleaner architecture to load all textures into a memory buffer first (possibly async), then upload to the GPU later, but I have concerns about memory usage spikes (i.e. load 2 GB of textures into CPU memory, …other initialization code…, upload to GPU, then deallocate 2 GB of CPU textures).

I guess the same questions apply to vertex/index buffers as well.
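
For concreteness, here is roughly the straightforward flow I'm describing, as a sketch only (OpenGL, BC7 assumed; the width/height/mip metadata would come from my preprocessed file's header, and readWholeFile is just a trivial stand-in for my file IO):

#include <algorithm>
#include <cstdint>
#include <fstream>
#include <iterator>
#include <vector>
#include <GL/glew.h>

static std::vector<uint8_t> readWholeFile(const char* path)
{
    std::ifstream file(path, std::ios::binary);
    return std::vector<uint8_t>(std::istreambuf_iterator<char>(file), {});
}

GLuint uploadBC7Texture(const char* path, int width, int height, int mipCount)
{
    std::vector<uint8_t> blob = readWholeFile(path); // CPU-side copy

    GLuint tex = 0;
    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_2D, tex);

    size_t offset = 0;
    for (int mip = 0; mip < mipCount; ++mip)
    {
        int w = std::max(width  >> mip, 1);
        int h = std::max(height >> mip, 1);
        // BC7: 16 bytes per 4x4 block.
        GLsizei size = ((w + 3) / 4) * ((h + 3) / 4) * 16;
        glCompressedTexImage2D(GL_TEXTURE_2D, mip,
                               GL_COMPRESSED_RGBA_BPTC_UNORM_ARB, // core name: GL_COMPRESSED_RGBA_BPTC_UNORM
                               w, h, 0, size, blob.data() + offset);
        offset += size;
    }

    // Question 1: is it safe to let 'blob' be destroyed here, or should a copy be kept?
    return tex;
}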


Aressera said:
Should I keep a copy of the texture around in CPU memory? (I'm thinking about cases where the device is lost and I need to upload it again.) Is it OK to discard the CPU memory buffer once it is uploaded to the GPU?

Doing this can use up a lot of RAM. I tried this approach when developing device-lost handling, and it doubled the RAM usage of my 2D game from ~300 MB to ~700 MB. While I'm simply loading most textures at startup rather than on demand, a full 3D app would likely require even more RAM, even if it's smart about which textures are kept around. So, unless you don't mind adding potentially GBs of RAM to your application, I wouldn't do it.
Instead, for device-lost you can just load the texture data again. This should happen so infrequently that the additional overhead is negligible.
If you really need the data of a texture on the CPU, you could always copy it back via a staging texture and map that, or you could give your textures a flag like “Need Data on CPU”, similar to what Unity does, to keep the data around.
For the device-reset, for the runtime/build I store the offset into the pack-file where the texture data is stored. I then load the entire file once via memory-mapping (see below), then have each texture reload itself from the mapped-file pointer and its offset. I query all stats from the old ID3D11Texture object (which can be done even if the device is lost) and then just recreate that object internally in the renderer.
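
In rough pseudocode, the reload path looks like this (a simplified sketch, not my actual code; PackFile, Texture, Renderer and createTexture2D are made-up placeholders just to show the shape of it):

#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative placeholder for whatever backend call recreates the GPU object.
struct Renderer
{
    void* createTexture2D(uint32_t w, uint32_t h, uint32_t mips,
                          uint32_t format, const uint8_t* data);
};

struct PackFile
{
    const uint8_t* mappedBase = nullptr; // pointer returned by the OS file mapping
    size_t         mappedSize = 0;
};

struct Texture
{
    size_t   packOffset = 0; // where this texture's BCn payload lives in the pack file
    uint32_t width = 0, height = 0, mipCount = 0, format = 0; // queried from the old GPU object
    void*    gpuHandle = nullptr; // recreated GPU object (e.g. ID3D11Texture2D*)
};

void reloadAfterDeviceLost(const PackFile& pack, std::vector<Texture>& textures,
                           Renderer& renderer)
{
    for (Texture& tex : textures)
    {
        const uint8_t* data = pack.mappedBase + tex.packOffset;
        // Recreate the GPU object straight from the mapped bytes;
        // no long-lived CPU-side copy is kept.
        tex.gpuHandle = renderer.createTexture2D(tex.width, tex.height,
                                                 tex.mipCount, tex.format, data);
    }
}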

Aressera said:
Would I see any benefit from memory-mapped files, where I can directly provide the mapped pointer to the graphics API without having to allocate a buffer on the heap? This has the drawback of not allowing additional entropy-based compression on top of BCn, and requires that I load textures on the main thread (OpenGL). It's also cleaner architecture to load all textures into a memory buffer first (possibly async), then upload to the GPU later, but I have concerns about memory usage spikes (i.e. load 2 GB of textures into CPU memory, …other initialization code…, upload to GPU, then deallocate 2 GB of CPU textures).

Memory-mapped files could reduce memory-allocation overhead, however you would very likely still be IO-bound. I am using memory mapping in a lot of places in my engine, and I compared it to just fread-ing everything; the performance is very similar, with memory mapping maybe being slightly faster.
I can't tell you much about the situation with many GBs of textures since, as I've said, my own texture budget at the time was 300-400 MB. I do use memory-mapped files for this as well, but as I said, for those operations you will likely be bound by disk access. Even if you made 10 threads that all fread a different file, at some point you will be stalling your disk/IO bus. So unless you can actually create the resource on a thread (which, by the way, you can also do with OpenGL, see https://www.khronos.org/opengl/wiki/OpenGL_and_multithreading), threading the loads mostly just hides latency rather than removing the IO bottleneck.
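
For reference, the mapping itself is only a few calls (POSIX shown here as a sketch; on Windows the equivalent is CreateFile + CreateFileMapping + MapViewOfFile), and the returned pointer plus a per-mip offset can be handed straight to the upload call:

// Minimal POSIX sketch: map a preprocessed BCn file read-only.
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

const void* mapFileReadOnly(const char* path, size_t* outSize)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return nullptr;

    struct stat st;
    if (fstat(fd, &st) != 0)
    {
        close(fd);
        return nullptr;
    }

    void* base = mmap(nullptr, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); // the mapping stays valid after the descriptor is closed
    if (base == MAP_FAILED)
        return nullptr;

    *outSize = (size_t)st.st_size;
    return base;
    // The returned pointer (plus an offset) can be passed to glCompressedTexImage2D;
    // without a PBO bound the driver copies out of client memory before the call
    // returns, so the file can be unmapped once the uploads are done.
}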

Another approach is to create the GPU texture representation (this needs to be done anyway) and map the texture object's data store into the process address space. This provides ‘direct’ access to memory that can be populated with your texture data acquired by any means you choose, and has the advantage of not requiring an additional CPU-side allocation. However, the implementation is a little more involved, as you now have to worry about CPU-GPU synchronization, etc. (e.g. using pixel buffer objects (PBOs) for texture upload with OpenGL; granted, additional GPU memory is required for each PBO used, but a few PBOs can be re-used to upload the texture data).
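
A rough sketch of that PBO path in OpenGL (illustrative only; the data is assumed to be BC7, and fencing/PBO reuse are omitted for brevity):

#include <cstring>
#include <GL/glew.h>

// Sketch only: upload pre-compressed data through a pixel buffer object.
void uploadViaPBO(GLuint texture, const void* compressedData, GLsizei imageSize,
                  GLsizei width, GLsizei height)
{
    GLuint pbo = 0;
    glGenBuffers(1, &pbo);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
    glBufferData(GL_PIXEL_UNPACK_BUFFER, imageSize, nullptr, GL_STREAM_DRAW);

    // Map the buffer's data store into the process address space and fill it
    // (this could also be done on a worker thread that reads the file straight
    // into the mapped pointer).
    void* dst = glMapBuffer(GL_PIXEL_UNPACK_BUFFER, GL_WRITE_ONLY);
    std::memcpy(dst, compressedData, imageSize);
    glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);

    // With a PBO bound, the last parameter is an offset into the PBO, so the
    // texture copy is sourced from GPU-visible memory rather than client memory.
    glBindTexture(GL_TEXTURE_2D, texture);
    glCompressedTexImage2D(GL_TEXTURE_2D, 0,
                           GL_COMPRESSED_RGBA_BPTC_UNORM_ARB, // core: GL_COMPRESSED_RGBA_BPTC_UNORM
                           width, height, 0, imageSize, nullptr);

    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
    glDeleteBuffers(1, &pbo);
}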
The GPU won't be able to use your memory-mapped files ‘directly’, and in any case your memory-mapped file still needs to be backed by physical memory (it could be paged, but that is all handled by the VM system), so it doesn't completely eliminate memory allocation. Furthermore, think about the case where a certain page is not currently resident and the GPU goes to access it: the GPU would incur a stall waiting for the VM system to map that page and make it resident. I'd be curious to see the performance of this approach, though, if it's feasible.

Aressera said:
I want to optimize this pipeline as much as possible,

You stumbled upon our most ancient problem of all - you can't. That's why we have, for example, vertex buffer objects, transform and lighting, and shaders: so you move as little data to the GPU as humanly possible.

  1. I always kept a copy of the texture in system RAM. It's handy if you want to modify it.
  2. The communication with the GPU is far slower than you think. Think of your GPU as your Ethernet card: basically it's on the same bus with the same transfer speed.

You may want to check this topic.

If you are really concerned: when your engine uploads a texture, you can load the next texture on a background thread. Then you can speed things up a little bit.

(I hope by “mapping files” you weren't referring to giant uncompressed textures that would make your game multiple dozens of gigabytes big.)

Most random, generic people will have 1 GB video cards; they will not be happy if you try to upload 2 GB of textures. Do not try to solve the issue with raw muscle, megalomania, and shoddy malpractice. Be realistic when designing the game, and you will not hit these walls.

No offense, but you are focusing on something which is NOT the real issue, and you didn't put deep thought into the problem. For example, if you upload 2 GB of textures, that's roughly 2.7 GB once you generate mipmaps (a full mip chain adds about a third). Textures are not the only thing in memory: you will have your front buffer, the back buffer, the Z buffer, driver allocations, the vertex buffers, and various temp and command buffers, not to mention some bloat from the OS. You really should look at the whole picture instead of blindly jumping to something that makes no sense and will hurt you in the long term.

If you see a theoretical possibility of X, then divide it by 100 to be realistic and play it safe.

Geri said:
No offense, but you are focusing on something which is NOT the real issue, and you didn't put deep thought into the problem

Offense taken. I don't think it's your place to tell me what I can and cannot do with regards to the amount of assets I want to have in my “game” (which is at least a few years away from being at alpha stage). Given that things are so far out, I'm not worried about meeting some arbitrary min spec from years ago which you seem to be targeting in your own work. I'd rather target something advanced so that it will be good quality when the game is finished.

I mentioned in my OP that I will be using BCn compression on textures (if you had read that), so I won't be using big uncompressed textures. I am making a planetary terrain renderer which may have hundreds of different rock/sediment types that could be used to render a scene, and this can easily add up to 2 GB with PBR and 2048px images (roughly 12 MB per material with channel packing). I can guarantee you that AAA games from even 4 years ago (e.g. RDR2) use at least 2 GB for textures (I played on a 4 GB card and had some hiccups at higher settings).

It's also a bit presumptuous to assume I don't think about mipmaps. Since I'm using compressed texture formats, of course I already take that into account; the mip levels must be computed in advance. I'm not “blindly jumping into something that makes no sense”; I'm carefully considering my options before writing a bunch of code (e.g. by posting here), which is the opposite of what you say I'm doing.

Honestly, if you don't understand what I meant by memory mapped files, then you are in no position to provide advice on this topic.

Geri said:
The communication with the GPU is far slower than you think. Think of your GPU as your Ethernet card: basically it's on the same bus with the same transfer speed.

This is just plain wrong. PCIe 3.0, from 2010, has a bandwidth of 15.75 GB/s for a typical 16x GPU slot. Newer PCIe revisions go up to 242 GB/s. In comparison, the best consumer Ethernet cards do 10 Gb/s (1.25 GB/s), which is 12.6x slower than PCIe 3.0.
https://en.wikipedia.org/wiki/PCI_Express

Aressera said:
This is just plain wrong. PCIe 3.0, from 2010, has a bandwidth of 15.75 GB/s for a typical 16x GPU slot

You are googling theoretical PCIe bandwidths and thinking the malloc in your texture management code is what makes the texture upload slow.

Aressera said:
I don't think it's your place to tell me what I can and cannot do

If you are not seeking information and are just seeking self-justification, then you probably should not have opened this topic, and should have bought a parrot instead. That will repeat your thoughts for you.

Aressera said:
I can guarantee you that AAA games from even 4 years ago (e.g. RDR2) use at least 2GB for textures (I played on 4GB card and had some hiccups at higher settings).

You should play real games, not AAA kino flyby techdemos, to have a realistic view of the kind of graphics you can render. I am not surprised you aren't speaking from programming experience and are just quoting vague “I played this and that and that's how it's done” type nonsense.

Should I mention that 90% of people don't even have a video card in a PCIe slot… they don't have video memory… and they don't have a video card. They have an IGP in their CPUs. (And no, that still doesn't mean you have unlimited bandwidth to it from your CPU.)

Aressera said:
using BCn compression on textures, if you had read that, so I won't be using big uncompressed textures

I really hadn't read that part; here is my 1 cent on it: from 2008 to 2009 I used S3TC when uploading textures (letting OpenGL itself compress the textures with the original S3TC implementation). Later I switched to ARB texture compression in 2010 or so, and went with that till 2013. It barely made any difference; it didn't help memory utilization much, and the only occasion it did was on super low-end cards (which is the original intention of texture compression). Otherwise it just introduced more problems than it solved.

Aressera said:
I am making a planetary terrain renderer which may have hundreds of different rock/sediment types that could be used to render a scene

The terrain data, if it is that complex, will be just as big a problem as the textures themselves. You are trying to balance two giant speed hogs, and meanwhile you should understand that far-away objects will use mipmaps (if you allow them). Unused mipmap levels will be transferred out of the GPU and used levels back in, creating giant hangs, or the driver simply fails to do it and the entire thing will be dead slow if you fill up the video memory too much. Using texture compression introduces its own problems. First of all, even if you try to render a lot of texture data, there are only a given number of pixels on the screen; it's not going to make your thing look better after a certain point in quality.

Good luck with your project, but I don't think your approach will result in a widely usable product.

Geri said:
If you are not seeking information and are just seeking self-justification, then you probably should not have opened this topic, and should have bought a parrot instead.

You might notice that the other replies in this thread were much more helpful than yours. On the other hand, your attitude is quite condescending. You make assumptions about my abilities and background which are just not true. The information in your posts is either out of date or factually inaccurate. You can't even type properly and make tons of spelling/grammar mistakes. You didn't even bother to read my OP, and then responded as if you were talking to a beginner, which I most certainly am not, based on the advanced nature of my question. Please refrain from responding further to this post.

Aressera said:
You make assumptions about my abilities and background

Aressera said:
You can't even type properly and make tons of spelling/grammar mistakes

I haven't made any assumptions about your abilities. But now I understand your special ability, as you clearly pointed it out: speaking a language. Meanwhile, mine is making video games.

Don't worry, I am not interested in discussing this with you past this point. Good luck with your earth renderer :D

This is not going well.

Anyway.

With Vulkan, you can have one thread loading textures and meshes into the GPU while another thread renders, so you can use concurrency to overcome some loading performance issues. If you're doing a big-world system, where content is loaded dynamically during gameplay, that's a big help. (I'm currently struggling with getting the Rend3/WGPU/Vulkan stack to do this. It turns out lock conflicts in WGPU need to be fixed, and that's being done.)

If you just load everything into the GPU at startup, such parallelism and the complexity that goes with it are probably unnecessary. More info about the use case here would help.
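
The thread structure is roughly this (a sketch under stated assumptions: the Renderer type and its uploadTexture call are hypothetical stand-ins for whatever records the staging-buffer copy on a separate transfer queue; this is not the Rend3/WGPU API):

#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

// Hypothetical renderer interface: in a Vulkan-style backend this would record
// the staging copy on a dedicated transfer queue and hand the finished texture
// back to the render thread when its fence signals.
struct Renderer
{
    void uploadTexture(const std::string& path);
};

class LoaderThread
{
public:
    explicit LoaderThread(Renderer& r) : renderer(r), worker([this] { run(); }) {}

    ~LoaderThread()
    {
        { std::lock_guard<std::mutex> lock(mutex); quit = true; }
        wake.notify_one();
        worker.join();
    }

    // Called from gameplay/streaming code when new content comes into range.
    void request(std::string path)
    {
        { std::lock_guard<std::mutex> lock(mutex); pending.push(std::move(path)); }
        wake.notify_one();
    }

private:
    void run()
    {
        for (;;)
        {
            std::string path;
            {
                std::unique_lock<std::mutex> lock(mutex);
                wake.wait(lock, [this] { return quit || !pending.empty(); });
                if (quit)
                    return;
                path = std::move(pending.front());
                pending.pop();
            }
            // File IO and the staging copy happen here, off the render thread.
            renderer.uploadTexture(path);
        }
    }

    Renderer&               renderer;
    std::queue<std::string> pending;
    std::mutex              mutex;
    std::condition_variable wake;
    bool                    quit = false;
    std::thread             worker; // declared last so the members above are ready first
};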

Geri said:
Unused mipmap levels will be transferred out of the GPU and used levels back in, creating giant hangs, or the driver simply fails to do it and the entire thing will be dead slow if you fill up the video…

What systems actually do that kind of swapping in and out of GPU memory? I've had to code something like that, and will soon rewrite it for better performance. I think UE5 has that, as part of asset cache management, but it's well above the “driver” level. Standard GPU mipmapping speeds up texture fill, but doesn't save GPU memory space.
