Sometimes you run across problems that arise at the intersection of several areas of responsibility. These issues are particularly challenging to resolve because the experts in any one area are often unaware of the implications across the entire system. In these cases, it is important to have a generalist who can track the issue across the components and build an understanding of the problem as it relates to the system as a whole.
In a recent case, we were installing an application at a client site. In these types of engagements, we normally have the client IT group set up the required hardware environment subject to our specifications. Our specifications are fairly accommodating so that the client can use their preferred suppliers and provision the hardware according to their own policies and procedures.
In this particular case, we validated the infrastructure and installed the application (Documentum with the xPlore full-text indexer). We migrated the client’s data into the Documentum repository and started to build the new full-text index with xPlore. This is normally a lengthy process (several hours or even tens of hours), so once we confirmed that the indexer was running, we left it, intending to check on it later. When we came back, we found that the indexer had run out of memory and crashed. This is not an entirely unknown occurrence, especially when there is a large amount of content to index, so we tweaked the memory configuration and started the indexing again.
Once again, when we returned, the indexer had run out of memory and crashed. We had the IT group add more RAM to the server and started the process again. It seemed to be running fine at first, but when we came back to it later, it had again run out of memory and crashed. Looking at the Windows memory allocation, it certainly didn’t look like the application was using too much RAM. We loaded some diagnostic tools onto the server and discovered that the problem wasn’t really the xPlore application. The problem was that Windows was gradually increasing the size of the non-paged pool, eventually consuming all of the RAM and leaving no memory available for the applications.
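For readers who want to watch for the same symptom, here is a minimal sketch of how non-paged pool growth could be tracked over time. It is not the diagnostic tooling we used on site. It assumes the Win32 `GetPerformanceInfo` call in `psapi.dll`, reached through Python's `ctypes`; the function names `nonpaged_pool_bytes`, `growth_per_hour`, and `looks_like_leak`, and the 5%-of-RAM-per-hour threshold, are our own illustrative choices.

```python
import ctypes
import sys


class PerformanceInformation(ctypes.Structure):
    # Mirrors the Win32 PERFORMANCE_INFORMATION structure.
    # The Kernel* and Commit* counters are expressed in pages, not bytes.
    _fields_ = [
        ("cb", ctypes.c_uint32),
        ("CommitTotal", ctypes.c_size_t),
        ("CommitLimit", ctypes.c_size_t),
        ("CommitPeak", ctypes.c_size_t),
        ("PhysicalTotal", ctypes.c_size_t),
        ("PhysicalAvailable", ctypes.c_size_t),
        ("SystemCache", ctypes.c_size_t),
        ("KernelTotal", ctypes.c_size_t),
        ("KernelPaged", ctypes.c_size_t),
        ("KernelNonpaged", ctypes.c_size_t),
        ("PageSize", ctypes.c_size_t),
        ("HandleCount", ctypes.c_uint32),
        ("ProcessCount", ctypes.c_uint32),
        ("ThreadCount", ctypes.c_uint32),
    ]


def nonpaged_pool_bytes():
    """Return the current non-paged pool size in bytes (Windows only)."""
    if sys.platform != "win32":
        raise OSError("GetPerformanceInfo is only available on Windows")
    info = PerformanceInformation()
    info.cb = ctypes.sizeof(info)
    if not ctypes.windll.psapi.GetPerformanceInfo(ctypes.byref(info), info.cb):
        raise ctypes.WinError()
    return info.KernelNonpaged * info.PageSize


def growth_per_hour(samples):
    """Least-squares slope of (seconds, bytes) samples, in bytes per hour."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_b = sum(b for _, b in samples) / n
    num = sum((t - mean_t) * (b - mean_b) for t, b in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("samples must span more than one point in time")
    return num / den * 3600


def looks_like_leak(samples, total_ram_bytes, fraction_per_hour=0.05):
    """Flag growth faster than a chosen fraction of RAM per hour.

    The 5% default is an arbitrary illustrative threshold, not a
    vendor recommendation.
    """
    return growth_per_hour(samples) > fraction_per_hour * total_ram_bytes
```

On a Windows server, one could call `nonpaged_pool_bytes()` every few minutes during an indexing run, collect the readings as `(elapsed_seconds, bytes)` pairs, and pass them to `looks_like_leak` together with the installed RAM. The trend functions are plain arithmetic and run on any platform.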
As it turns out, the client had installed Windows Server 2016 and had formatted the disks as ReFS (because ReFS is “better”). One of the features of ReFS is that it performs lazy writes of file attributes, which is to say that it writes the content of the file to disk immediately but commits the attributes of the file to disk later, when the disk is not busy. In the interim, the attributes are cached in, you guessed it, non-paged pool. Normally this is not a problem, but when an application creates very high sustained disk I/O loads (a full-text indexer, for example, but also backup programs), the disk never becomes idle long enough to write out the attribute cache, and the cache in non-paged pool grows until it consumes all available RAM. This is a known bug in the Windows Server 2016 implementation of ReFS. It has been fixed in Windows Server 2019, but the fix has not been back-ported to Server 2016.
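To make the failure mode concrete, here is a toy model of the behaviour described above. It is a deliberately simplified sketch, not a model of the actual ReFS implementation: the function `simulate_attribute_cache`, the 1 KB entry size, and the 500 writes per busy tick are all made-up values chosen only to show the shape of the curve.

```python
def simulate_attribute_cache(io_load, entry_bytes=1024, writes_per_busy_tick=500):
    """Toy model of a lazily flushed attribute cache.

    io_load is an iterable of booleans, one per time tick:
    True means the disk is busy, False means it is idle.
    While the disk is busy, attribute entries accumulate in memory.
    On an idle tick, the whole cache is flushed to disk.
    Returns the cache size in bytes after each tick.
    """
    cached = 0
    history = []
    for busy in io_load:
        if busy:
            cached += writes_per_busy_tick * entry_bytes
        else:
            cached = 0  # idle tick: everything pending is written out
        history.append(cached)
    return history


# Bursty workload: idle gaps let the cache drain, so it stays bounded.
bursty = simulate_attribute_cache([True, True, False] * 100)

# Sustained workload (an indexer or a backup): no idle ticks, so the
# cache only ever grows.
sustained = simulate_attribute_cache([True] * 300)
```

In this model `max(bursty)` stays at two ticks' worth of entries however long the run lasts, while `sustained[-1]` grows linearly with the length of the run, which matches the pattern we saw: fine at first, out of memory hours later.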
We had the client reformat the drives as NTFS and the problem went away. We then had the client re-provision the server with the original memory specification, and everything still worked fine. The full-text indexing completed without error.
Long story short, we’ve changed our hardware specifications to specifically exclude ReFS drives in Windows Server 2016 implementations.
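A rule like this is easy to turn into an automated pre-installation check. The sketch below is one hypothetical way to do that, assuming the OS name and the list of volumes have already been collected by whatever inventory tool is in use. The `Volume` class and the `spec_violations` function are illustrative names of our own, not part of any Documentum or Windows tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Volume:
    mount_point: str   # e.g. "D:"
    filesystem: str    # e.g. "NTFS" or "ReFS", as reported by the inventory


def spec_violations(os_name, volumes):
    """Return the mount points that break the 'no ReFS on Server 2016' rule.

    os_name is the product name string reported by the server, for
    example "Windows Server 2016 Standard". Other OS versions are not
    restricted by this rule, so they yield an empty list.
    """
    if "Server 2016" not in os_name:
        return []
    return [v.mount_point for v in volumes if v.filesystem.upper() == "REFS"]


inventory = [Volume("C:", "NTFS"), Volume("D:", "ReFS"), Volume("E:", "refs")]
problems = spec_violations("Windows Server 2016 Standard", inventory)
```

With the sample inventory above, `problems` lists the two ReFS volumes, and the same inventory on Windows Server 2019 produces no violations.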