Windows memory mapped file twice as slow as fread() when cached


Puffin said:
After memory-mapping a large file that is already in the disk cache (Windows 10, 64-bit), why is reading it for the first time ~10x slower than reading regular RAM (that is not in the CPU cache), or than reading it a second time? Similarly, after allocating a large memory block (e.g. 100 MB), why is the first read/write pass ~10x slower than subsequent passes (with the data outside the CPU cache in both cases)? This isn't explained by memory zeroing alone, because it's 10x slower than simple zeroing as well. Since the two cases show the same performance, I wonder if they share the same underlying cause (e.g. page-fault overhead), and whether anything would help? I tried VirtualLock() to map several pages at once, but it wasn't any faster.

I think I see the confusion.

Memory-mapped file access does not go through the cache manager. CreateFile() and similar disk reads use a cache that is independent of mapping; the hints you mention affect only that cache.

The cache difference is actually a reason to choose file memory mapping, because it doesn't influence or pollute the rest of the system's disk cache. A mapped file can also be used to share data between two processes, and the cache manager would severely complicate that.

If you're going to reuse the data, leave the mapping open until you're actually done with it. The same goes for reading a file multiple times: don't do that; try to read a file only once.

There may still be OS-level and hardware-level caches beyond the cache manager that CreateFile() uses, which further complicate the timings. But even so, CreateFileMapping() and CreateFile() use very different and quite independent mechanisms to cache and control access to data.
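As a sketch of that "open once, keep it open" pattern (illustrative names, not from this thread): map the file a single time, hand out the same view pointer for every read, and tear down only when truly done.

	struct MappedFile {
		HANDLE file = INVALID_HANDLE_VALUE;
		HANDLE mapping = NULL;
		void *view = nullptr;
		size_t size = 0;
	};

	bool openMapped(MappedFile &mf, const char *path) {
		mf.file = CreateFileA(path, GENERIC_READ, FILE_SHARE_READ, NULL,
		                      OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
		if (mf.file == INVALID_HANDLE_VALUE) return false;
		LARGE_INTEGER sz;
		if (!GetFileSizeEx(mf.file, &sz)) return false;
		mf.size = (size_t)sz.QuadPart;
		mf.mapping = CreateFileMappingA(mf.file, NULL, PAGE_READONLY, 0, 0, NULL);
		if (!mf.mapping) return false; // CreateFileMapping signals failure with NULL.
		mf.view = MapViewOfFile(mf.mapping, FILE_MAP_READ, 0, 0, 0);
		return mf.view != nullptr;
	}

	void closeMapped(MappedFile &mf) { // Call only when done with the data.
		if (mf.view) UnmapViewOfFile(mf.view);
		if (mf.mapping) CloseHandle(mf.mapping);
		if (mf.file != INVALID_HANDLE_VALUE) CloseHandle(mf.file);
	}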





Umm, no, I don't think that's it either :). I actually knew before running the first test that the sequential-scan flag probably doesn't affect memory-mapped files, but I tried everything to be sure, and I mentioned a few things so that people wouldn't think I'd made a beginner's mistake and concentrate on hunting for one instead of the actual issue. And even if I had been mistaken about the flags, the cached memory-mapped file would still be 10x slower than expected, regardless of whether the caching mechanism is called the "cache manager" or something else.

Anyway, I tried to rule out a testing error before asking the question, and it's unlikely that there is one (although it's never 100% certain), especially because other people have reported similar findings, e.g. the link I posted.

Summary / conclusion

Here's a quick summary for those interested in the original question (this thread got a little sidetracked although there's valid info in it about memory mapped files in general):

The observation:

  • When reading bulk data through a memory-mapped file from a file that is already cached in memory by Windows, performance is surprisingly low: ~10x slower than reading from main memory.
  • When reading the same data a second time without closing the file/mapping and reopening it, it is as fast as reading from memory.
  • If the file is closed and immediately reopened, it is slow again.
  • fread() performs ~2x better in this case.
  • Reading a newly malloc'd large memory block (with size at least 0.5-1.0 MB, which makes malloc go through VirtualAlloc()) is equally slow: 10x slower than reading already-allocated memory.

Why this occurs:

  • The time is spent when each 4 kB page is touched for the first time: the first access to a page takes a soft page fault that maps it into the process, and this per-page overhead dominates the read (verified by profiling later in this thread).
  • Newly allocated memory behaves the same way, since its demand-zero pages are also mapped in on first touch, which is presumably why both cases show the same performance.

Why is this an issue:

  • When restarting a game in a development environment, each GB loaded from files (or from the disk cache) and each GB of memory allocated costs an extra ~0.5 seconds, so the restart can become several seconds longer.
  • For the end user it's less of an issue, because a few extra seconds at game startup are not a big deal (unless the game keeps reopening files mid-game too).
  • There are other potential situations where the performance of reading large data must be as good as possible.


Workarounds:

  • Avoid closing the file if you're going to read from it again.
  • You can use fread() to read bulk data faster, but note the drawbacks:
      • You might need to keep another copy of the data in memory (in addition to the OS cache), at least if it isn't uploaded to the GPU, which increases memory usage and reduces cache effectiveness.
      • If the issue gets fixed in a Windows update, things might turn around and memory-mapped files become faster.
  • For memory allocation, simply allocate large buffers at startup.
  • Prefetching the mapped view might also help; see the sketch below (untested here).
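On Windows 8 and later, PrefetchVirtualMemory() asks the OS to populate a region in bulk instead of taking one soft fault per page. Whether it helps in this scenario is untested in this thread; a minimal sketch, assuming ptr and size come from MapViewOfFile()/GetFileSizeEx() as in the test program below:

	WIN32_MEMORY_RANGE_ENTRY range;
	range.VirtualAddress = ptr;  // Start of the mapped view.
	range.NumberOfBytes = size;  // Bytes to populate.
	if (!PrefetchVirtualMemory(GetCurrentProcess(), 1, &range, 0)) {
		// Not fatal: fall back to ordinary demand paging.
	}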

Below is the source code for a test that reproduces the results (VS 2017, Windows 10, 64-bit). Configure a semi-large file that fits in the disk cache but not in the CPU cache (e.g. 100 MB). Let VS cool down for a while after compiling so it doesn't affect the results (watch the Task Manager).

Here's the output of a test run. If the tool hasn't been run earlier on the same file, the first round of the first test will show lower performance as the file is read from disk.

File size: 88.6977 MB
memory mapped file: 2.30145 GB/s
memory mapped file: 2.27313 GB/s
memory mapped file: 2.29355 GB/s
memory mapped file: 2.30721 GB/s
memory mapped file: 2.31215 GB/s
fread() to reused buffer: 4.38877 GB/s
fread() to reused buffer: 4.18842 GB/s
fread() to reused buffer: 4.30015 GB/s
fread() to reused buffer: 4.31134 GB/s
fread() to reused buffer: 4.47193 GB/s
fread() to allocated buffer: 2.4394 GB/s
fread() to allocated buffer: 2.41685 GB/s
fread() to allocated buffer: 2.45084 GB/s
fread() to allocated buffer: 2.41161 GB/s
fread() to allocated buffer: 2.33594 GB/s
fread() stream: 5.7588 GB/s
fread() stream: 5.57706 GB/s
fread() stream: 5.74989 GB/s
fread() stream: 5.8003 GB/s
fread() stream: 5.83995 GB/s
read from reused memory block: 22.6261 GB/s
read from reused memory block: 22.4287 GB/s
read from reused memory block: 22.2942 GB/s
read from reused memory block: 20.2503 GB/s
read from reused memory block: 19.8635 GB/s
read from allocated memory block: 2.20512 GB/s
read from allocated memory block: 2.2249 GB/s
read from allocated memory block: 2.2074 GB/s
read from allocated memory block: 2.24883 GB/s
read from allocated memory block: 2.2159 GB/s


#include <windows.h>
#include <string>
#include <iostream>
#include <functional>


using namespace std;


string path = "TODO"; // TODO: path to a ~100 MB file
int64_t dummySum = 0;


double getTimeSeconds() {
	LARGE_INTEGER currentTime;
	LARGE_INTEGER frequency;
	QueryPerformanceCounter(&currentTime);
	QueryPerformanceFrequency(&frequency);
	return (double)currentTime.QuadPart / frequency.QuadPart;
}


size_t processData(void *data, size_t size) {
	int64_t sum = 0;
	int64_t count = size >> 3; // The last few bytes are not processed if not divisible by 8.
	for (int64_t i = 0; i < count; i++) {
		sum += ((int64_t *)data)[i];
	}
	dummySum += sum;
	return size;
}


size_t getFileSize() {
	HANDLE fileHandle = CreateFileA(path.c_str(), GENERIC_READ, FILE_SHARE_READ, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
	if (fileHandle == INVALID_HANDLE_VALUE) throw exception("Error opening file");
	LARGE_INTEGER fileSize;
	if (!GetFileSizeEx(fileHandle, &fileSize)) throw exception("Error getting file size");
	size_t size = fileSize.QuadPart;
	if (!CloseHandle(fileHandle)) throw exception("Error closing file");
	return size;
}


size_t readMemoryMappedFile() {
	HANDLE fileHandle = CreateFileA(path.c_str(), GENERIC_READ, FILE_SHARE_READ, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
	if (fileHandle == INVALID_HANDLE_VALUE) throw exception("Error opening file");


	LARGE_INTEGER fileSize;
	if (!GetFileSizeEx(fileHandle, &fileSize)) throw exception("Error getting file size");
	size_t size = fileSize.QuadPart;


	HANDLE mappingHandle = CreateFileMappingA(fileHandle, NULL, PAGE_READONLY, 0, 0, NULL);
	if (!mappingHandle) throw exception("Error creating file mapping"); // CreateFileMapping returns NULL on failure.


	void *ptr = MapViewOfFile(mappingHandle, FILE_MAP_READ, 0, 0, 0);
	if (!ptr) throw exception("Error mapping view of file");


	processData(ptr, size);


	if (!UnmapViewOfFile(ptr)) throw exception("Error unmapping view of file");
	if (!CloseHandle(mappingHandle)) throw exception("Error closing file mapping");
	if (!CloseHandle(fileHandle)) throw exception("Error closing file");
	return size;
}


size_t readFileWithFread(void *buffer, int64_t chunkSize) {
	FILE *file;
	if (fopen_s(&file, path.c_str(), "rb") != 0) throw exception("Error opening file");


	setvbuf(file, NULL, _IONBF, 0); // Seems to make no difference.


	_fseeki64(file, 0, SEEK_END);
	int64_t size = _ftelli64(file);
	_fseeki64(file, 0, SEEK_SET);


	if (chunkSize == -1) { // Chunk not used.
		int64_t bytesRead = fread_s(buffer, size, sizeof(BYTE), size, file);
		if (bytesRead < size) throw exception("Error reading file");
		processData(buffer, size);
	}
	else {
		for (int64_t offset = 0; offset < size;) {
			int64_t bytesToRead = min(chunkSize, size - offset);
			int64_t bytesRead = fread_s(buffer, bytesToRead, sizeof(BYTE), bytesToRead, file);
			if (bytesRead < bytesToRead) throw exception("Error reading file");
			processData(buffer, bytesToRead);
			offset += bytesToRead;
		}
	}
	
	fclose(file);
	return size;
}


void runTest(string name, const function<int64_t()> &task) {
	for (int round = 0; round < 5; round++) {
		double startTime = getTimeSeconds();
		int64_t bytesProcessed = task();
		double timeElapsed = getTimeSeconds() - startTime;
		cout << name << ": " << ((double)bytesProcessed / (1024 * 1024 * 1024) / timeElapsed) << " GB/s" << endl;
	}
}


int main() {
	size_t size = getFileSize();
	void *reusableBuffer = malloc(size);
	processData(reusableBuffer, size); // Warm up the reusable buffer, because the first access is slow.


	cout << "File size: " << (size / 1048576.0) << " MB" << endl;
	
	runTest("memory mapped file", [&]() {
		return readMemoryMappedFile();
	});


	runTest("fread() to reused buffer", [&]() {
		return readFileWithFread(reusableBuffer, -1);
	});


	runTest("fread() to allocated buffer", [&]() {
		void *temp = malloc(size);
		int64_t bytes = (int64_t)readFileWithFread(temp, -1);
		free(temp); // Free the temporary buffer before returning.
		return bytes;
	});


	runTest("fread() stream", [&]() {
		return readFileWithFread(reusableBuffer, 256 * 1024);
	});


	runTest("read from reused memory block", [&]() {
		return processData(reusableBuffer, size);
	});


	runTest("read from allocated memory block", [&]() {
		void *temp = malloc(size);
		processData(temp, size);
		free(temp);
		return size;
	});


	if (dummySum == -5) cout << "test" << endl; // Make sure compiler doesn't remove "unnecessary" code.
	return 0;
}




Puffin said:
observation: When reading bulk data from a file that is already cached in memory by Windows, using a memory mapped file, the performance is surprisingly slow: ~10x slower than reading from main memory.
memory mapped file: 2.30145 GB/s
memory mapped file: 2.27313 GB/s
memory mapped file: 2.29355 GB/s
memory mapped file: 2.30721 GB/s
memory mapped file: 2.31215 GB/s

Your observation there is in error. It is not "already cached in memory by Windows".

Repeating: memory-mapped files do not use the cache. This is intentional, as-designed behavior, in part because memory-mapped files are a method for IPC. If somebody added a cache for them, the IPC behavior of memory-mapped files would break. Every time you map the file you're going to pay the same cost; it will never be cached.

That is precisely what I would expect to see.

The benefit of memory-mapped files is not that the file is cached. The benefit is that the data is placed directly into memory so it can be used immediately, without any other processing, copying, or additional buffers involved.

Puffin said:
  • When reading the same data another time without closing the file/mapping and reopening it, it is as fast as reading from memory.
  • If the file is closed and immediately reopened, it is slow again.

Yes, this is the expected and well-documented behavior of the file cache. Open files and keep them open while you use them; if you close them, the cache isn't maintained. Open a file with the random-access flag if you intend to jump around or repeatedly re-read data; open it with the sequential flag if you intend to read linearly through it only once.

Don't re-open and re-read data. That's a symptom of bad programming on your end, not a flaw in the file system.


Regarding other parts of your code, you need to learn how to accurately profile. The correct tool is a profiler. Then you can see EXACTLY which operation is taking the time. Is it the memory allocator? The disk I/O? The scan of the data? Your code doesn't accurately measure where the performance issue is located.

Using a profiler you can state with certainty where the time is spent, potentially with nanosecond precision. There would be none of this guesswork about whether it is spent in VirtualAlloc() or in some other function.

Edit: I ran one more test, because the claim above that memory-mapped files don't use a cache sounded unbelievable, and it confirmed that they do use one, even when the file is closed and reopened. The test takes a list of ~4-6 GB files and runs in rounds: on the first round it reads file 0 only, on the next round files 0 and 1, then 0, 1, 2, and so on. Each time a file is read for the first time (here from an HDD), it is very slow, about 0.15 GB/s. When it's closed and reopened, it runs at the ~2 GB/s familiar from the earlier tests. This continues until there are 7 files in the loop, at which point the computer's 32 GB of memory is no longer sufficient to hold all the files, and each file becomes slow on first read again. (Each file is also read 5 times in a row, but that's less important.) See the output below.

Regarding profiling: I verified that the time is spent when touching each 4 kB page for the first time, as I mentioned earlier. The test tool I posted is a simplified version; I actually ran more tests, but the tool already shows the main observation.

Now, I must point out that while I really appreciate people trying to help, this thread has been littered with hasty speculation, which makes it difficult to spot the real information among the noise. So if anyone has further input on this matter, especially if someone still suspects the test must be wrong, please spend a few minutes thinking open-mindedly about your conclusions and whether you might have fallen prey to cognitive biases, and preferably verify your claims before posting.
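The code for this second test wasn't posted; below is a minimal sketch of its round structure (an assumption, not the original code), reusing runTest() and readMemoryMappedFile() from the earlier listing by reassigning the global path:

	vector<string> paths = { /* ~4-6 GB files, as described above */ };
	for (size_t count = 1; count <= paths.size(); count++) {
		cout << "Testing with " << count << " files" << endl;
		for (size_t i = 0; i < count; i++) {
			path = paths[i]; // Global consumed by readMemoryMappedFile().
			runTest("read file " + to_string(i), [&]() {
				return (int64_t)readMemoryMappedFile(); // runTest() repeats this 5 times.
			});
		}
	}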

Testing with 1 files, size: 3.8453 GB
read file 0: 0.161158 GB/s
read file 0: 2.27503 GB/s
read file 0: 2.30143 GB/s
read file 0: 2.31184 GB/s
read file 0: 2.25945 GB/s
Testing with 2 files, size: 7.91278 GB
read file 0: 2.25269 GB/s
read file 0: 2.27018 GB/s
read file 0: 2.27609 GB/s
read file 0: 2.28093 GB/s
read file 0: 2.28808 GB/s
read file 1: 0.128606 GB/s
read file 1: 2.23911 GB/s
read file 1: 2.27903 GB/s
read file 1: 2.2653 GB/s
read file 1: 2.2759 GB/s
Testing with 3 files, size: 11.9622 GB
read file 0: 2.28108 GB/s
read file 0: 2.26874 GB/s
read file 0: 2.27172 GB/s
read file 0: 2.29835 GB/s
read file 0: 2.28801 GB/s
read file 1: 2.30061 GB/s
read file 1: 2.26819 GB/s
read file 1: 2.30155 GB/s
read file 1: 2.28455 GB/s
read file 1: 2.32111 GB/s
read file 2: 0.149842 GB/s
read file 2: 2.2574 GB/s
read file 2: 2.22513 GB/s
read file 2: 2.17424 GB/s
read file 2: 2.24978 GB/s
Testing with 4 files, size: 16.014 GB
read file 0: 2.25161 GB/s
read file 0: 2.28478 GB/s
read file 0: 2.29664 GB/s
read file 0: 2.29917 GB/s
read file 0: 2.27871 GB/s
read file 1: 2.27172 GB/s
read file 1: 2.26835 GB/s
read file 1: 2.29466 GB/s
read file 1: 2.25102 GB/s
read file 1: 2.27446 GB/s
read file 2: 2.27668 GB/s
read file 2: 2.25621 GB/s
read file 2: 2.2744 GB/s
read file 2: 2.28667 GB/s
read file 2: 2.27271 GB/s
read file 3: 0.157595 GB/s
read file 3: 2.18707 GB/s
read file 3: 2.20015 GB/s
read file 3: 2.18768 GB/s
read file 3: 2.19486 GB/s
Testing with 5 files, size: 20.3426 GB
read file 0: 2.23111 GB/s
read file 0: 2.28261 GB/s
read file 0: 2.27678 GB/s
read file 0: 2.28707 GB/s
read file 0: 2.30106 GB/s
read file 1: 2.30272 GB/s
read file 1: 2.27189 GB/s
read file 1: 2.27431 GB/s
read file 1: 2.28354 GB/s
read file 1: 2.27452 GB/s
read file 2: 2.27336 GB/s
read file 2: 2.27045 GB/s
read file 2: 2.29339 GB/s
read file 2: 2.27447 GB/s
read file 2: 2.27552 GB/s
read file 3: 2.1899 GB/s
read file 3: 2.20295 GB/s
read file 3: 2.19281 GB/s
read file 3: 2.18778 GB/s
read file 3: 2.20614 GB/s
read file 4: 0.15085 GB/s
read file 4: 2.19638 GB/s
read file 4: 2.18467 GB/s
read file 4: 2.17909 GB/s
read file 4: 2.1748 GB/s
Testing with 6 files, size: 24.6711 GB
read file 0: 2.22972 GB/s
read file 0: 2.27863 GB/s
read file 0: 2.29089 GB/s
read file 0: 2.27362 GB/s
read file 0: 2.29447 GB/s
read file 1: 2.28626 GB/s
read file 1: 2.29658 GB/s
read file 1: 2.28483 GB/s
read file 1: 2.2758 GB/s
read file 1: 2.2757 GB/s
read file 2: 2.27985 GB/s
read file 2: 2.2837 GB/s
read file 2: 2.28713 GB/s
read file 2: 2.29311 GB/s
read file 2: 2.27372 GB/s
read file 3: 2.21585 GB/s
read file 3: 2.20784 GB/s
read file 3: 2.19334 GB/s
read file 3: 2.19479 GB/s
read file 3: 2.21352 GB/s
read file 4: 2.17858 GB/s
read file 4: 2.17944 GB/s
read file 4: 2.189 GB/s
read file 4: 2.17812 GB/s
read file 4: 2.18117 GB/s
read file 5: 0.158468 GB/s
read file 5: 2.27109 GB/s
read file 5: 2.26732 GB/s
read file 5: 2.26839 GB/s
read file 5: 2.27911 GB/s
Testing with 7 files, size: 29.6941 GB
read file 0: 0.148311 GB/s
read file 0: 2.2711 GB/s
read file 0: 2.28392 GB/s
read file 0: 2.27789 GB/s
read file 0: 2.28026 GB/s
read file 1: 0.128761 GB/s
read file 1: 2.27595 GB/s
read file 1: 2.27344 GB/s
read file 1: 2.27581 GB/s
read file 1: 2.26198 GB/s
read file 2: 0.149518 GB/s
read file 2: 2.19985 GB/s
read file 2: 2.19653 GB/s
read file 2: 2.19432 GB/s
read file 2: 2.19065 GB/s
read file 3: 0.15761 GB/s
read file 3: 2.15646 GB/s
read file 3: 2.15891 GB/s
read file 3: 2.13342 GB/s
read file 3: 2.18116 GB/s
read file 4: 0.151039 GB/s
read file 4: 2.2713 GB/s
read file 4: 2.24722 GB/s
read file 4: 2.24693 GB/s
read file 4: 2.2563 GB/s
read file 5: 0.158242 GB/s
read file 5: 2.2579 GB/s
read file 5: 2.27351 GB/s
read file 5: 2.27731 GB/s
read file 5: 2.26775 GB/s
read file 6: 0.162105 GB/s
read file 6: 2.26674 GB/s
read file 6: 2.26743 GB/s
read file 6: 2.25805 GB/s
read file 6: 2.273 GB/s
Testing with 8 files, size: 35.587 GB
read file 0: 0.14821 GB/s
read file 0: 2.16696 GB/s
read file 0: 2.1757 GB/s
read file 0: 2.16752 GB/s
read file 0: 2.16797 GB/s
read file 1: 0.128421 GB/s
read file 1: 2.19937 GB/s
read file 1: 2.18311 GB/s
read file 1: 2.20003 GB/s
read file 1: 2.19402 GB/s




I'd suggest the SEC_LARGE_PAGES flag, but it's apparently not supported for files.

My only other thought is to use multiple threads to split the per-page cost across different CPU cores.
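A sketch of that idea (untested in this thread), assuming ptr and size come from MapViewOfFile() as in the test program: each thread touches one byte per 4 kB page of its slice, so the soft page faults are taken in parallel.

	#include <cstddef>
	#include <thread>
	#include <vector>

	void touchPagesParallel(void *ptr, size_t size, unsigned threadCount) {
		std::vector<std::thread> threads;
		size_t chunk = size / threadCount;
		for (unsigned t = 0; t < threadCount; t++) {
			char *begin = (char *)ptr + t * chunk;
			char *end = (t == threadCount - 1) ? (char *)ptr + size : begin + chunk;
			threads.emplace_back([begin, end]() {
				volatile char sink = 0;
				for (char *p = begin; p < end; p += 4096)
					sink += *p; // One read per page forces the page-in.
			});
		}
		for (auto &th : threads) th.join();
	}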

How it gets cached is a bit different.

To begin with, as you likely know, a mapped view is a virtual memory area backed by the file system rather than by the normal paging system. In that regard, the process just treats it as virtual memory, and the VM mapping works its magic.

With normal memory areas, when you read or write the memory, the normal virtual memory system will ensure the correct memory (the virtual memory mapping) is paged in, transferring data to and from the drive (from the paging files) if necessary to ensure the virtual memory access appears correct to the program.

With memory mapped file views, when you read or write the memory, exactly the same thing happens. The system ensures the correct memory (a mapped view of the file) is paged in, transferring data to and from the drive (from the data file) if necessary to ensure the virtual memory access appears correct to the program.

Notice the two are identical. The only difference is that the virtual memory addresses are backed by a data file instead of the paging file.

The file system treats memory-mapped files as potentially shared objects. Reads and writes are managed through the virtual memory system. When data needs to be loaded due to a VM miss, the OS loads it from the mapped file. When the OS wants to reclaim unused memory, it can purge blocks. When multiple processes share the memory, the view is shared between them. Critically, there is only a single copy of the data in use.

In contrast, files opened through CreateFile() or fopen() or similar use the Windows cache manager. When you read or write through the file handle, a chunk of data is loaded from disk into the system cache managed by the OS (e.g., a 256 KB block is reserved within the system cache, then 256 KB is loaded from disk). The system cache may have many thousands of blocks loaded from various files. Next, the system duplicates the data -- a second copy -- into your process's address space as the data you read. Often that data is then parsed and stored -- a third copy -- before the read buffer is released.

The Windows file system cache is managed on a file object basis as a virtual block cache. When you close the file, the cache is released, and it cannot be reused for another process loading the file.

You might not be as familiar with those caches, so I'll explain them too. There are two approaches: a logical block cache and a virtual block cache. A logical block cache mirrors what is on the physical disk; the disk's logical block addresses are loaded and kept around. Because a logical block cache represents what is physically on the disk at that location, it can be shared and reused. Windows' virtual block cache is instead based on blocks of a file, regardless of their location on the disk. Because the virtual block cache represents only the data, other systems can modify the underlying details of a file, enabling features like shadow files. (This lets backup systems open a shadow copy of a file that remains valid even while another program writes to it; the virtual blocks are what matter, not the logical blocks as they exist on the physical disk.)

All of this means that if you open a file, close it, then open it again, you don't get any caching benefit from the Windows cache manager. The benefit comes when you keep a file open and use random access to jump around, or when doing many small reads rather than large block reads.
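As a hypothetical sketch of the access pattern being described (the offset and buffer are illustrative, not from the thread): keep one handle open with the random-access hint and issue many small reads against it.

	HANDLE h = CreateFileA(path.c_str(), GENERIC_READ, FILE_SHARE_READ, NULL,
	                       OPEN_EXISTING, FILE_FLAG_RANDOM_ACCESS, NULL);
	if (h != INVALID_HANDLE_VALUE) {
		char smallBuffer[4096];
		LARGE_INTEGER pos;
		pos.QuadPart = someOffset; // someOffset: illustrative only.
		SetFilePointerEx(h, pos, NULL, FILE_BEGIN);
		DWORD bytesRead = 0;
		ReadFile(h, smallBuffer, sizeof(smallBuffer), &bytesRead, NULL);
		// ...more small reads at other offsets; the touched blocks stay
		// cached for this file object as long as the handle remains open...
		CloseHandle(h);
	}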

Many hard drives have their own cache mechanisms, and they typically implement logical block caches. SSDs tend to have smaller caches, and NVMe disks are different as well.


So putting those all together...

Because you opened a memory-mapped view of a file, traversed it once, closed the view, then opened it again and repeated the process, you should expect to see exactly the same performance every time. Nothing gets cached.

If one program had opened a memory-mapped view of a file and left it open, and a second program then opened another memory-mapped view of the same file, the second view would be much faster, potentially near-instant, because the memory is already mapped into RAM.

When you closed the memory-mapped view of the file, in addition to the virtual memory being released, the virtual blocks it was using were discarded, because file blocks don't survive the file object being closed. The next time you opened it, new virtual blocks were used.

It is possible that other caches were involved, such as a cache on the disk drive itself, or caches by the disk drive controller, which stored the logical blocks from the disk for faster access. That would be independent of what Windows or your program were doing.

When you opened a copy of the file with fread() or CreateFile(), blocks of the file were loaded into the Windows file system cache. As you read, file blocks were loaded probably as fast as they could be copied, and those reads triggered more memory allocations. Then you allocated new blocks of memory and made a second copy of the data for your process, which you used.

When you closed the file objects, the Windows file system cache discarded the old virtual blocks because they were no longer in use. The next time you opened the file, new virtual blocks were used.

Just like with the mapping, it is possible that other caches were involved on the disk drive or the drive's controller, but they were independent of Windows and your program.


Each of those memory allocations was a secondary act that could have had a performance impact. Exactly how long the allocations took isn't measured in your example; a profiler could tell you. As you pointed out with your link, memory allocations take a variable amount of time depending on what the OS needs to do. In the worst case it must reclaim virtual address space and wipe the memory clean for security purposes before assigning it to your process and letting your memory allocator subdivide it for use. In the best case it already has some clean memory assigned to your process, so the only work is in your own memory allocator. Frequently it must search the global allocation tables for an already-zeroed large block, assign that block from the system pool to your process, then perform the allocation from the block within your process.
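A minimal sketch of measuring that first-touch cost directly (an addition, not part of the thread's test program), reusing getTimeSeconds() from the listing above:

	size_t bytes = 100 * 1024 * 1024;
	char *mem = (char *)VirtualAlloc(NULL, bytes, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
	if (mem) {
		double t0 = getTimeSeconds();
		for (size_t i = 0; i < bytes; i += 4096) mem[i] = 1; // First touch: demand-zero soft faults.
		double t1 = getTimeSeconds();
		for (size_t i = 0; i < bytes; i += 4096) mem[i] = 2; // Second touch: pages already mapped.
		double t2 = getTimeSeconds();
		cout << "first touch: " << (t1 - t0) << " s, second touch: " << (t2 - t1) << " s" << endl;
		VirtualFree(mem, 0, MEM_RELEASE);
	}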






Regarding large pages: I thought of that too, and it indeed didn't work out; I also got the impression that large pages are a special feature that may have other issues.

Regarding multithreading: some sources say that the Windows page-mapping path is single-threaded and that the situation gets even worse with multiple threads; others say this was fixed in the Windows 10 Creators Update (2017). I haven't verified either claim yet, but it's something to check before relying on it.

Regarding caching: I specifically meant that the claim that memory-mapped files don't stay in main memory at all after all instances of the file are closed is incorrect, as shown both by the tests and by several other points I mentioned earlier. The suggestion that it could be the on-disk cache is incorrect as well, because that cache is far too small to hold 30 GB and would be unlikely to give the same performance on different disks. You mentioned that opening a second view of a file that's already open should be fast, but I tested that earlier too, and that assumption is also incorrect: reading from the second view is just as slow as from the first. The file is already in main memory and available to both the first view (cached from earlier accesses) and the second one, but the page-mapping overhead applies to both. You also made claims about copying, zeroing, and such, but those were discussed and refuted earlier.

For anyone seeing this thread and interested in the original subject and the end result, see my earlier post with the big "summary / conclusion" title. The most relevant information is there, with some details in the first post too.

Memory-mapped files aren't some drop-in replacement for conventional file I/O that will magically make things perform better or make your code simpler. There is a proper way to use them that you have to specifically design for. Any contrived test case that compares them apples-to-apples with conventional I/O is incorrectly designed and will lead to misguided metrics and conclusions.

I would advise anyone reading this thread to do their own research and learn about how and when memory-mapped files are useful.

