Windows memory mapped file twice as slow as fread() when cached


Edit: clarified that this is about reading disk files with a memory mapped file, not shared-memory-only files (which may have been a source of confusion in the discussion).

I've heard that reading a disk file through a memory mapped file should be very fast when the file is already cached in OS memory, because the application can read directly from the page cache. But in a test I noticed an overhead of about 5000 CPU cycles per 4k page when it is first accessed (page fault etc.), which makes it twice as slow (2 GB/s) as simply fread()'ing the data into a buffer (4 GB/s), and 11 times as slow as simply reading from a large memory block (22 GB/s). When reading the memory mapped file a second time (without remapping it), it's as fast as reading from a memory block. I used a 100 MB file so that the CPU cache should not be involved.

Also, when a newly allocated 100 MB memory block is read for the first time (in which case malloc() presumably forwards to VirtualAlloc()), it suffers the same penalty (10x) as a memory mapped file.

Is there a way to overcome the overhead, or is it true that memory mapped files are slower even in the cached case, where they were supposed to be at their best? It looks like there's a page committing cost of some sort, and it's much higher than simply copying or zeroing memory. It's possible to call VirtualLock() to commit several pages at once, but the total performance remains the same. Flags passed to CreateFile or CreateFileMapping didn't seem to help either (e.g. FILE_FLAG_SEQUENTIAL_SCAN).
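Roughly, the test looks like the following minimal sketch (the file name, sizes and lack of error handling are placeholders, not my exact harness):

```cpp
#include <windows.h>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <cstring>

static double Seconds(LONGLONG ticks)
{
    LARGE_INTEGER freq;
    QueryPerformanceFrequency(&freq);
    return double(ticks) / double(freq.QuadPart);
}

// Touch every byte so each 4k page is faulted in; return a checksum so the
// compiler cannot optimise the reads away.
static uint64_t SumBytes(const unsigned char* p, size_t n)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; ++i) sum += p[i];
    return sum;
}

int main()
{
    const char* path = "test_100mb.bin";  // assumed test file, already in the OS cache

    HANDLE file = CreateFileA(path, GENERIC_READ, FILE_SHARE_READ, nullptr,
                              OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
    LARGE_INTEGER size;
    GetFileSizeEx(file, &size);

    HANDLE mapping = CreateFileMappingA(file, nullptr, PAGE_READONLY, 0, 0, nullptr);
    const unsigned char* view =
        (const unsigned char*)MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0);

    LARGE_INTEGER t0, t1, t2;
    QueryPerformanceCounter(&t0);
    uint64_t a = SumBytes(view, (size_t)size.QuadPart);  // first pass: page faults
    QueryPerformanceCounter(&t1);
    uint64_t b = SumBytes(view, (size_t)size.QuadPart);  // second pass: already mapped
    QueryPerformanceCounter(&t2);
    printf("mapped, 1st pass: %.3f s (%llu)\n",
           Seconds(t1.QuadPart - t0.QuadPart), (unsigned long long)a);
    printf("mapped, 2nd pass: %.3f s (%llu)\n",
           Seconds(t2.QuadPart - t1.QuadPart), (unsigned long long)b);

    // fread() baseline into a buffer that has already been touched once,
    // so the copy target doesn't pay its own first-touch cost during the timing.
    unsigned char* buf = (unsigned char*)malloc((size_t)size.QuadPart);
    memset(buf, 0, (size_t)size.QuadPart);
    FILE* f = fopen(path, "rb");
    QueryPerformanceCounter(&t0);
    fread(buf, 1, (size_t)size.QuadPart, f);
    QueryPerformanceCounter(&t1);
    printf("fread:            %.3f s\n", Seconds(t1.QuadPart - t0.QuadPart));

    fclose(f);
    free(buf);
    UnmapViewOfFile((LPCVOID)view);
    CloseHandle(mapping);
    CloseHandle(file);
    return 0;
}
```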


Memory-mapped files are tricky to handle; it depends on the OS you are working on as well as the hardware you use. In general they should be faster than simply reading from disk, but there are a few circumstances you need to handle:

  • Alignment is important because the OS has to fit the file into a set of (mostly 4k) pages.
  • The disk cache might become the bottleneck here; if you have a lot of cache misses on disk, the OS is slow in copying anything from disk to RAM.
  • RAM overflow can cause page swaps more often than free memory would. It can happen that files are read into memory and the pages are then written back out to disk.

To handle memory mapped I/O properly, we align our files to 64k. This way a file fits into an OS page and also fills the usual disk cache, which prevents sector swaps and other processes reading in between. If you are on Windows, there is also the view, not just the memory mapped file handle, that can cause issues. On Unix-based systems the memory mapped file is what you requested, but on Windows you can map the whole file and then open several views into different offsets and sizes of the same file.
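As a rough sketch of that Windows view detail (not production code): the file offset handed to MapViewOfFile has to sit on the system allocation granularity, which is typically 64 KB, so arbitrary offsets have to be rounded down first.

```cpp
#include <windows.h>

// Map 'bytes' of an existing file mapping starting at an arbitrary 'offset'.
// The offset passed to MapViewOfFile must be a multiple of the allocation
// granularity (usually 65536), so we round down and skip the slack afterwards.
void* MapChunk(HANDLE mapping, unsigned __int64 offset, SIZE_T bytes)
{
    SYSTEM_INFO si;
    GetSystemInfo(&si);  // si.dwAllocationGranularity is typically 64 KB

    unsigned __int64 alignedOffset = offset - (offset % si.dwAllocationGranularity);
    SIZE_T slack = (SIZE_T)(offset - alignedOffset);

    unsigned char* view = (unsigned char*)MapViewOfFile(
        mapping, FILE_MAP_READ,
        (DWORD)(alignedOffset >> 32), (DWORD)(alignedOffset & 0xFFFFFFFF),
        bytes + slack);

    // Note: UnmapViewOfFile later needs 'view' (the real base), not view + slack.
    return view ? view + slack : nullptr;
}
```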

How did you access the files? Creating a memory mapped file handle and opening a view should be enough here to obtain a binary pointer to the data in RAM. If you are reading through a stream, for example, there is always some buffer involved so that it doesn't request byte by byte from disk.

Thanks for your answer. For reading the files, I used CreateFile, CreateFileMapping and MapViewOfFile and let the OS decide the memory region. I did obtain a pointer to RAM and I believe it does indeed read from the OS page cache, but the problem seems to be that there's some overhead from the OS when the application accesses each mapped page for the first time, even if there's no disk access involved (and even when just allocating a new memory block with malloc, the overhead is almost exactly the same).

Other people have run into this as well (e.g. https://randomascii.wordpress.com/2014/12/10/hidden-costs-of-memory-allocation/, though it talks mainly about memory allocation and not memory mapped files). It seems like there wouldn't be a way around this overhead, but I'm a bit puzzled if that's the case, because people claim memory mapped files are faster than fread() when cached. It would seem, however, that fread() is faster, because it doesn't have to map pages into application memory, unless I'm mistaken?

To handle memory mapped I/O properly, we align our files to 64k

By this, do you mean that you align data inside the file so that you usually access stuff within each 64k page at the same time (e.g. file offsets 0..64k, 64k..128k), or something else?

There are many gains, but you must build your system to take advantage of it.

If you are reading the file, parsing it as you go, processing it into a usable asset, that's not going to see a big benefit from memory mapping. If you're linearly scanning the data, especially if you're scanning it repeatedly, you're entering bad scenarios.

The biggest benefit comes when your data is fully processed and ready to go in an in-memory format: just map the data into memory and use it as the final content. No parsing, no processing, no decoding. This can avoid reading data you aren't immediately using, make better use of the cache, make better use of scatter/gather operations on the disk, and gain all the other benefits you've likely read about. But doing so requires preprocessing your data into that final format.

That preprocessing is where memory boundaries must be considered. The final format needs to match the data patterns, the hardware memory patterns, and the disk access patterns, hence the boundary sizes. But once you've preprocessed the data into that format, you simply map a view to the data you need and use it directly. Often you don't even need to memcpy or similar; you just use the data in its final form.
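For illustration, with a hypothetical preprocessed mesh format (the struct and field names here are made up, not any particular engine's layout) "no parsing" can literally mean casting the mapped pointer:

```cpp
#include <cstdint>

#pragma pack(push, 1)
struct MeshHeader                 // hypothetical build-tool output, fixed layout
{
    uint32_t magic;               // e.g. 'MSH1'
    uint32_t vertexCount;
    uint32_t indexCount;
    uint32_t vertexDataOffset;    // byte offset from the start of the file blob
    uint32_t indexDataOffset;
};
#pragma pack(pop)

// 'fileBase' is the pointer returned by MapViewOfFile for the asset file.
// Assumes the build step wrote the vertex data at a suitably aligned offset.
inline const float* GetVertices(const uint8_t* fileBase)
{
    const MeshHeader* h = reinterpret_cast<const MeshHeader*>(fileBase);
    return reinterpret_cast<const float*>(fileBase + h->vertexDataOffset);
}
```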

As for the article on the 'hidden costs of memory allocation', those are among the reasons games rely on large memory pools allocated when the game starts. Enormous chunks and large pages can be used, you don't pay the security-driven zeroing cost while performance matters, and you can use lightweight allocation schemes rather than the heavy systems required by the OS, etc. In fact, on most game consoles the startup functions routinely allocate 100% of the memory on the hardware to ensure the OS isn't incurring those costs in libraries or system functions.
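A minimal sketch of such a startup pool, assuming a simple linear (bump) allocator on top of one VirtualAlloc'd region (the pool size, alignment and large-page fallback are placeholder choices, not a production allocator):

```cpp
#include <windows.h>
#include <cstdint>

struct LinearPool
{
    uint8_t* base = nullptr;
    size_t   capacity = 0;
    size_t   used = 0;

    bool Init(size_t bytes)
    {
        // Try large pages first (fewer TLB misses); this needs SeLockMemoryPrivilege
        // and a size that is a multiple of GetLargePageMinimum(), so fall back to
        // regular 4k pages if it fails.
        base = (uint8_t*)VirtualAlloc(nullptr, bytes,
                                      MEM_RESERVE | MEM_COMMIT | MEM_LARGE_PAGES,
                                      PAGE_READWRITE);
        if (!base)
            base = (uint8_t*)VirtualAlloc(nullptr, bytes,
                                          MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
        capacity = base ? bytes : 0;
        used = 0;
        return base != nullptr;
    }

    // 'align' must be a power of two. During gameplay this is just pointer
    // arithmetic: no kernel call, no per-allocation page faults once the pool
    // has been touched.
    void* Alloc(size_t bytes, size_t align = 16)
    {
        size_t p = (used + align - 1) & ~(align - 1);
        if (!base || p + bytes > capacity) return nullptr;
        used = p + bytes;
        return base + p;
    }
};
```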

Ok, thanks, that makes sense. Anyway, what I'm primarily puzzled about is:

  • After memory mapping a large file that is already in the disk cache (Windows 10, 64-bit), why is reading it for the first time 10x slower than reading regular RAM (that is not in the CPU cache), or than reading it a second time?
  • After allocating a large memory block (e.g. 100MB), why is reading/writing it for the first time 10x slower than on subsequent passes (without the CPU cache being involved in either case)? Memory zeroing alone doesn't explain it, because the first access is also about 10x slower than simply zeroing the memory.
  • Because the two cases above have the same performance, I wonder if they share the same underlying cause (e.g. page fault overhead), and whether there's anything that would help. I tried VirtualLock() to fault in several pages at once (roughly as sketched below), but it wasn't any faster.
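What I tried was roughly along these lines (a sketch, not my exact code; the working-set bump is there because VirtualLock fails once you go past the default quota):

```cpp
#include <windows.h>

// Pre-fault an entire mapped view (or freshly allocated block) in one call.
// 'view'/'viewSize' are assumed to come from MapViewOfFile or VirtualAlloc.
bool PrefaultWithVirtualLock(void* view, SIZE_T viewSize)
{
    // Allow the process to lock at least 'viewSize' bytes, plus some slack.
    SIZE_T minWs = viewSize + 16 * 1024 * 1024;
    SIZE_T maxWs = viewSize + 64 * 1024 * 1024;
    if (!SetProcessWorkingSetSize(GetCurrentProcess(), minWs, maxWs))
        return false;

    // This faults in every page in the range in a single call instead of one
    // fault per first access.
    if (!VirtualLock(view, viewSize))
        return false;

    // Only the side effect of paging everything in is wanted, not the lock
    // itself, so release it again.
    VirtualUnlock(view, viewSize);
    return true;
}
```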

My primary use case is in the development environment, to make game restarts and build tools faster; that is, to avoid spending 0.5s per GB loaded from the disk cache, plus per GB of memory allocated, just because of unnecessary page mapping overhead. For the end user it's maybe not such a concern, because they'll have to wait longer anyway for the data to be read off the disk (and they don't have to restart as often as a developer does).


As I mentioned, this can be caused by several circumstances, but my guess is that it's down to disk caching of your data. Acquiring a page to be mapped into memory doesn't mean it's instantly filled with data; it's read from disk on demand when you access it for the first time. Another cause might be your RAM. If you have a lot of stuff in RAM and/or a small amount of RAM, the OS schedules pages out to disk and back, and reading data from disk into a page only for it to be scheduled back out to disk is definitely not a gain here.

Puffin said:
By this, do you mean that you align data inside the file so that you usually access stuff within each 64k page at the same time (e.g. file offsets 0..64k, 64k..128k), or something else?

Yes, that's true. Our build tools have a puzzle algorithm that takes several asset data sets and tries to place them, from largest to smallest, so that each 64k chunk of our linear asset file is filled properly. One asset can occupy multiple consecutive chunks, so loading it is just a window of N chunks without an offset, while smaller assets always have an offset within a chunk. The rest of the space is filled with throwaway padding bytes.
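For illustration only (not our actual tool), that kind of packing could be sketched as a largest-first, first-fit pass over 64k chunks:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct Asset     { uint64_t id; uint64_t size; };
struct Placement { uint64_t assetId; uint64_t fileOffset; };

constexpr uint64_t kChunk = 64 * 1024;

std::vector<Placement> PackAssets(std::vector<Asset> assets)
{
    // Largest assets first, so big blocks claim whole chunks before the
    // small ones fill in the gaps.
    std::sort(assets.begin(), assets.end(),
              [](const Asset& a, const Asset& b) { return a.size > b.size; });

    std::vector<uint64_t> chunkUsed;   // bytes used in each 64k chunk so far
    std::vector<Placement> out;

    for (const Asset& a : assets)
    {
        if (a.size >= kChunk)
        {
            // Large asset: give it its own run of whole chunks on a boundary.
            uint64_t start  = chunkUsed.size() * kChunk;
            uint64_t chunks = (a.size + kChunk - 1) / kChunk;
            for (uint64_t i = 0; i < chunks; ++i) chunkUsed.push_back(kChunk);
            out.push_back({ a.id, start });
            continue;
        }
        // Small asset: first chunk where it still fits.
        size_t c = 0;
        while (c < chunkUsed.size() && chunkUsed[c] + a.size > kChunk) ++c;
        if (c == chunkUsed.size()) chunkUsed.push_back(0);
        out.push_back({ a.id, c * kChunk + chunkUsed[c] });
        chunkUsed[c] += a.size;
    }
    return out;   // any remaining space per chunk would be written as padding
}
```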

The real benefit comes from accessing the data multithreaded: having a resource manager that loads, locks and unloads chunks and then processes asset requests asynchronously is a real performance boost, even in production code.

I also used this, for example, for a database system that addresses data pages in a database of up to 100 TB without any performance impact.

No, I think the file is already cached in memory by the OS in the test, as I said, because I'm reading a 100MB file repeatedly on a system with 32 GB of RAM, and it's not accessing the disk either (except for the very first time after restarting the machine, and then of course it's slow once, but that's not the issue here). Also, alignment shouldn't matter when reading 100MB sequentially.

So the question in my previous post is still open (edit: bolded the "already in the disk cache").


The file may already be in the disk cache, but that doesn't necessarily mean the OS is using the same physical memory for file mapping. My guess is that the initial performance hit is the copy from the cache to the mapped physical memory. Not as slow as reading it from disk, but still not very fast as it's going through the same page fault and I/O mechanisms underneath.

You'll get the biggest benefits from memory-mapped files once you have lots of processes accessing the same files, as once each page is in physical memory it won't need to be read from disk/cache again and that I/O cost goes to nothing (provided it isn't paged out). I recently worked on a tool that required random access to a few GB worth of data on disk... in and of itself not a large feat, but for various reasons several hundred instances of this tool needed to run at once on any given machine. It would have been impossible to accomplish that task without memory-mapped I/O, which is where it really shines.

And to add to @Zipster's answer: you can't ever know what the OS does in the background unless you coded it yourself or use something Unix-related where the source code is openly accessible. Restart your test on a clean machine, freshly installed without any updates or internet access and with only the needed tools installed (not even Visual Studio), and then see whether this still happens. That's the only way to make sure no other processes are interrupting your tests and to get clean measurement data.

My guess is that the initial performance hit is the copy from the cache to the mapped physical memory. Not as slow as reading it from disk, but still not very fast as it's going through the same page fault and I/O mechanisms underneath.

At first I also thought that maybe it's copying from the cache to another memory block, but that wouldn't explain the performance hit alone. I actually tested this to be sure, and it's ~3x faster to copy a large memory block to a new location and read it again than to read from a newly-opened but cached memory mapped file. I also tested that the slowdown occurs even if the file is already open and mapped at another logical memory address. And it's often been claimed that memory mapped files usually map the application's address space onto the cache memory directly, without copies; otherwise the hundreds-of-tool-instances case wouldn't work either.

