NFS Coding HOWTO

Copyright © 2007-2011 Timo Sirainen <[email protected]>.

Abstract

This HOWTO describes how NFS differs from POSIX-compliant filesystems from an application programmer's point of view. I had to learn about most of these issues by trial and error, a lot of googling and reading kernel sources to figure out why things worked the way they did. I hope this HOWTO will make life easier for other people.

This HOWTO mainly applies to Linux, FreeBSD and Solaris, where I have the most experience. My experience has been mainly with NFS version 3, but this HOWTO also contains some preliminary points about NFS version 4.

I still cannot call myself an NFS expert, so this HOWTO may contain mistakes or misunderstandings. Corrections, contributions and other feedback are welcome.

NFS Design

Up to version 3, NFS is a stateless protocol. There is no "open file" step when accessing files; instead, all files are accessed directly by their file handle. File handles are looked up using a LOOKUP command. To avoid doing these lookups constantly, NFS clients cache them internally. See RFC 1813 [1] for more details.

NFS version 4 is a stateful protocol. The NFS server knows which NFS clients have a file open. NFSv4 also supports delegations, which allow an NFS client to update data locally until the NFS server requests it back for another client. I have heard that this is supposed to fix most of the caching problems discussed below, but I have yet to see such an NFS implementation.

File Deletions and ESTALE Errors

This chapter applies only to NFS versions older than 4.

Once a file's link count drops to zero (i.e. the file is deleted), POSIX-compatible filesystems still let the processes that have it open access the file. NFS does not support this, so the NFS client's kernel has to fake it. If the kernel knows that there are open file descriptors for the file when it is being unlinked, it renames the file to .nfs* instead of unlinking it, where * is a unique identifier. This is called a "silly rename". After all processes have closed the file, the .nfs* file is deleted. This means that if the kernel crashes before deleting the file, the file is never deleted automatically.

Silly renaming works as long as there is only one NFS client accessing the file. An NFS client cannot know if other clients have the file open, so it is possible that the file gets deleted from the NFS server while some clients still have it open. This causes all future operations on the file to fail with an ESTALE error. This includes calls to read(), write(), fstat() and so on.

It is understandable that operations accessing open files fail with ESTALE, but it is a bit unexpected that even operations accessing unopened files, for example open() and stat(), can fail with ESTALE. This happens if the kernel tries to use a cached file handle to access the file, but the file has since been deleted. The operation will most likely work as expected after the kernel has refreshed the file handle. This should no longer happen in Linux v2.6.12 and later[2], but FreeBSD 6.2 still returns these errors.

You can avoid handling ESTALE for open() (etc.) throughout your code by creating a wrapper. For example for open():

  1. Call open(). If it succeeds or fails with anything other than ESTALE, return the result.
  2. Flush the attribute cache for the parent directory (see the Attribute Caching section below).
  3. stat() the parent directory to see if it still exists. If stat() fails, fail with ESTALE.
  4. Go back to step 1, trying at most n times in total.

Attribute Caching

The attribute cache holds everything in struct stat, so stat() and fstat() calls can be answered from the cache. If you need to see a file's latest size or mtime (or other fields), you need to flush the file's attribute cache before the stat() call.

Note that if the file handle is cached, stat() returns information for that cached file (so the result is the same as for fstat()). If you need to stat() the latest file with the given file name, flush the file handle cache first.

A few possibilities to flush a file's attribute cache:

File Handle Caching

Directories cache the mapping from file names to file handles. The most common problems with this are:

A few ways to flush the file handle cache:

Read Caching

When data is read from a file, the kernel may cache it so that future reads can return the cached data directly. This read cache is flushed when the kernel notices that the file's mtime has changed.

Keeping track of file size changes belongs to attribute cache handling. If the kernel has cached a file's size, but more data has since been appended to the file, trying to read() the new data results in EOF unless you flush the attribute cache first.
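
A minimal sketch of reading data that another client may have appended. The function name nfs_read_fresh() is made up for this example. It assumes that an extra open()+close() of the same path flushes the attribute cache, which does not hold on every system.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Reads from an already open fd after asking the client to revalidate
   the file's cached attributes (including its size). */
ssize_t nfs_read_fresh(const char *path, int fd, void *buf, size_t size,
                       off_t offset)
{
    /* Assumption: open()+close() of the same file flushes the attribute
       cache on this platform. */
    int tmp_fd = open(path, O_RDONLY);

    if (tmp_fd == -1)
        return -1;
    close(tmp_fd);

    /* Without the flush above this could return 0 (EOF) even though the
       server already has more data. */
    return pread(fd, buf, size, offset);
}
```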

If the NFS server's filesystem supports nanosecond or microsecond resolution on timestamps, you most likely do not need to worry about read cache flushing. Just flush the attribute cache to get the kernel to notice the mtime change. With Linux you must use the open()+close() method for this attribute cache flush; other methods do not work. If the filesystem supports only one-second resolution, you could still use utime() after writes to make sure the mtime always grows.
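
One possible way to do the utime() trick is sketched below. The function name write_and_bump_mtime() is made up for this example, and it treats a short write as an error to keep the code small. Note that with many writes per second the mtime runs ahead of the real clock.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>
#include <utime.h>

/* Writes data at the given offset and makes sure the file's mtime is
   larger afterwards than it was before the write. */
int write_and_bump_mtime(const char *path, off_t offset,
                         const void *data, size_t len)
{
    struct stat before, after;
    struct utimbuf ut;
    ssize_t ret;
    int fd;

    fd = open(path, O_WRONLY | O_CREAT, 0600);
    if (fd == -1)
        return -1;
    if (fstat(fd, &before) < 0) {
        close(fd);
        return -1;
    }
    ret = pwrite(fd, data, len, offset);
    if (ret < 0 || (size_t)ret != len || fsync(fd) < 0) {
        close(fd);
        return -1;
    }
    if (close(fd) < 0)
        return -1;

    if (stat(path, &after) < 0)
        return -1;
    if (after.st_mtime > before.st_mtime)
        return 0; /* the write already moved mtime forward */

    /* Same second as the previous change: push mtime forward by hand. */
    ut.actime = after.st_atime;
    ut.modtime = before.st_mtime + 1;
    return utime(path, &ut);
}
```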

Successfully locking a file with fcntl() flushes the read cache on Linux. This requires that locks are working, so either you need to have lockd running or the filesystem needs to be mounted with nolock. FreeBSD does not flush the read cache with fcntl() locks.
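
A minimal sketch of reading under an fcntl() lock. The names lock_whole_file() and read_under_lock() are made up for this example. Whether the lock really flushes the read cache depends on the client OS, as described above.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Sets (F_RDLCK, F_WRLCK) or clears (F_UNLCK) a lock on the whole file. */
static int lock_whole_file(int fd, short type)
{
    struct flock fl;

    memset(&fl, 0, sizeof(fl));
    fl.l_type = type;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0; /* 0 means "until end of file" */
    return fcntl(fd, F_SETLKW, &fl);
}

/* Takes a shared lock, reads, and unlocks. fd must be open for reading. */
ssize_t read_under_lock(int fd, void *buf, size_t size, off_t offset)
{
    ssize_t ret;
    int saved_errno;

    if (lock_whole_file(fd, F_RDLCK) < 0)
        return -1;
    ret = pread(fd, buf, size, offset);
    saved_errno = errno;
    (void)lock_whole_file(fd, F_UNLCK);
    errno = saved_errno;
    return ret;
}
```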

It is possible to bypass read caching completely for a file by opening it with the O_DIRECT flag. However, this does not work on all systems. Linux supports it only if the kernel was compiled with CONFIG_NFS_DIRECTIO. FreeBSD requires setting the vfs.nfs.nfs_directio_enable sysctl. Solaris has supported it since v9 Update 5 (the mount -o forcedirectio option has been available since v2.6).
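
Because O_DIRECT support varies, code using it probably wants a fallback. The sketch below is one way to do that. The names open_uncached(), alloc_direct_buffer() and DIRECT_ALIGN are made up for this example. It assumes that an unsupported O_DIRECT shows up as EINVAL and that 4096-byte alignment is enough for direct I/O buffers, neither of which is guaranteed everywhere.

```c
#ifndef _GNU_SOURCE
#  define _GNU_SOURCE /* makes glibc expose O_DIRECT */
#endif
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#define DIRECT_ALIGN 4096 /* assumed buffer alignment for direct I/O */

/* Opens path for reading, with O_DIRECT if possible. *direct_r is set
   to 1 if the read cache is bypassed, 0 if we fell back to normal I/O. */
int open_uncached(const char *path, int *direct_r)
{
    int fd;

#ifdef O_DIRECT
    fd = open(path, O_RDONLY | O_DIRECT);
    if (fd != -1) {
        *direct_r = 1;
        return fd;
    }
    if (errno != EINVAL)
        return -1;
    /* Assumption: EINVAL means O_DIRECT is not supported here. */
#endif
    *direct_r = 0;
    return open(path, O_RDONLY);
}

/* Direct I/O usually needs aligned buffers (and aligned sizes/offsets). */
void *alloc_direct_buffer(size_t size)
{
    void *buf = NULL;

    if (posix_memalign(&buf, DIRECT_ALIGN, size) != 0)
        return NULL;
    return buf;
}
```

If *direct_r comes back as 0, the caller has to flush caches by one of the other methods described in this HOWTO.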

Write Caching

The kernel may cache writes to a file locally until fsync() or fdatasync() is called. The write cache is also flushed when closing the file, which explains why close() may return a failure. If you care about knowing a file's exact mtime, remember that it may be changed by close() if you did not call fsync() first.
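
A minimal sketch of a write path that takes this into account. The function name write_synced() is made up for this example. It calls fsync() before looking at the mtime and also checks the return value of close().

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Writes the whole buffer to path, flushes it to the server and returns
   the file's mtime as seen after the flush. */
int write_synced(const char *path, const void *data, size_t len,
                 time_t *mtime_r)
{
    const char *p = data;
    struct stat st;
    ssize_t ret;
    int fd, saved_errno;

    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd == -1)
        return -1;
    while (len > 0) {
        ret = write(fd, p, len);
        if (ret < 0) {
            if (errno == EINTR)
                continue;
            goto fail;
        }
        p += ret;
        len -= (size_t)ret;
    }
    /* Flush first, so that a later close() should not change the mtime. */
    if (fsync(fd) < 0 || fstat(fd, &st) < 0)
        goto fail;
    *mtime_r = st.st_mtime;
    /* close() can still fail, so its result is checked as well. */
    if (close(fd) < 0)
        return -1;
    return 0;
fail:
    saved_errno = errno;
    close(fd);
    errno = saved_errno;
    return -1;
}
```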

Opening a file with the O_DSYNC or O_SYNC flag causes writes to the file to be sent to the server directly. Setting these flags later with fcntl() does not work in Linux though.
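
Since the flag cannot be reliably added afterwards, it is simplest to decide at open() time. A small sketch, where open_for_sync_writes() and SYNC_WRITE_FLAG are made-up names:

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Prefer O_DSYNC where it exists, otherwise fall back to O_SYNC. */
#ifdef O_DSYNC
#  define SYNC_WRITE_FLAG O_DSYNC
#else
#  define SYNC_WRITE_FLAG O_SYNC
#endif

/* The sync flag is given to open() itself instead of being set later
   with fcntl(F_SETFL). */
int open_for_sync_writes(const char *path)
{
    return open(path, O_WRONLY | O_CREAT | SYNC_WRITE_FLAG, 0600);
}
```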

Other Issues

Memory mapped files work more or less well, but the error handling is not very nice. If a read fails for any reason while a page is being faulted, the process gets killed by a SIGBUS signal.

Linux's link(2) manual page says that link() may return EEXIST even though it created the link. This is apparently quite rare nowadays, since servers support a duplicate request cache [3]. FreeBSD (at least up to v6.2) has another problem though: its link() may return success even though the operation actually failed with EEXIST [3]. Both of these issues can be worked around by calling fstat() for the file after link() and checking whether its link count increased.
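
A sketch of that workaround is shown below. The function name nfs_safe_link() is made up for this example, and it uses stat() on the path rather than fstat() on an open descriptor. It assumes that nobody else is creating or removing links to the same file at the same time, since that would change the link count as well.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* link() wrapper that trusts the file's link count more than link()'s
   return value. */
int nfs_safe_link(const char *oldpath, const char *newpath)
{
    struct stat before, after;
    int ret, saved_errno;

    if (stat(oldpath, &before) < 0)
        return -1;
    ret = link(oldpath, newpath);
    saved_errno = errno;
    if (stat(oldpath, &after) < 0)
        return -1;

    if (after.st_nlink > before.st_nlink)
        return 0; /* the link was created, whatever link() returned */

    /* The link count did not grow, so the link was not created. */
    errno = ret < 0 ? saved_errno : EEXIST;
    return -1;
}
```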

Linux's open(2) manual page says O_EXCL does not work with NFS, but it actually does work if both the NFS client (Linux v2.6.6+[4]) and the server support it. Apparently it is commonly implemented nowadays, so it should be quite safe to use in new systems. Unfortunately there is no easy way to check if it is safe or not.
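
For reference, this is the usual O_EXCL pattern that only works over NFS when both sides support it. The function name try_create_lock_file() is made up for this example.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns 1 if we created the file, 0 if it already existed and -1 on
   other errors. Only one creator wins, provided that both the NFS client
   and the server implement O_EXCL. */
int try_create_lock_file(const char *path)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0644);

    if (fd == -1)
        return errno == EEXIST ? 0 : -1;
    if (close(fd) < 0)
        return -1;
    return 1;
}
```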

NFS (even v4) does not support O_APPEND. The kernel fakes it by writing to what it thinks is the current end-of-file offset. That may easily cause the file to get corrupted when several clients append to it.
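
One possible alternative to O_APPEND is to append while holding an exclusive fcntl() lock. The sketch below uses made-up names (set_file_lock(), nfs_locked_append()). It assumes that every writer uses the same locking and that taking the lock refreshes the cached file size, which the cache tester results below suggest for Linux and Solaris but not for FreeBSD.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Sets (F_WRLCK) or clears (F_UNLCK) a lock on the whole file. */
static int set_file_lock(int fd, short type)
{
    struct flock fl;

    memset(&fl, 0, sizeof(fl));
    fl.l_type = type;
    fl.l_whence = SEEK_SET; /* l_start = 0, l_len = 0: the whole file */
    return fcntl(fd, F_SETLKW, &fl);
}

/* Appends data to fd (opened for writing without O_APPEND). */
int nfs_locked_append(int fd, const void *data, size_t len)
{
    const char *p = data;
    struct stat st;
    off_t offset;
    ssize_t ret;
    int result = -1, saved_errno;

    if (set_file_lock(fd, F_WRLCK) < 0)
        return -1;
    if (fstat(fd, &st) == 0) {
        offset = st.st_size; /* end of file while we hold the lock */
        while (len > 0) {
            ret = pwrite(fd, p, len, offset);
            if (ret <= 0) {
                if (ret < 0 && errno == EINTR)
                    continue;
                break;
            }
            p += ret;
            offset += ret;
            len -= (size_t)ret;
        }
        /* Send the data to the server before anyone else gets the lock. */
        if (len == 0 && fsync(fd) == 0)
            result = 0;
    }
    saved_errno = errno;
    (void)set_file_lock(fd, F_UNLCK);
    errno = saved_errno;
    return result;
}
```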

flock() may or may not work. Recent Linux (v2.6.12+[4]), FreeBSD and Solaris (v2.5.1+) kernels emulate it using fcntl() locks. With older kernels it locks the file only locally.

O_EXCL works with NFS version 3+ if both the client and the server support it. Probably all products (modern ones at least) that support NFSv3 also support O_EXCL. It is not possible to test this, however, and if one of them does not support it, the result is pretty much the same as if you did not use O_EXCL.

NFSv4

Some preliminary points about NFSv4:

NFS Cache Tester

NFS cache tester[6] can be used to easily test how caches can be flushed in different operating systems. Successful flushing methods are listed below for some operating systems:

Note that some results may be different depending on whether the NFS server supports nanosecond or microsecond timestamp resolution. This is because the NFS client may flush its caches when noticing that mtime or ctime changed.

Linux

Results are the same with both NFSv3 and NFSv4.

Timestamp resolution: seconds

Testing file attribute cache..
Attr cache flush open+close: OK
Attr cache flush close+open: OK
Attr cache flush fchown(uid, -1): OK
Attr cache flush fchmod(mode): OK
Attr cache flush chown(uid, -1): OK
Attr cache flush chmod(mode): OK
Attr cache flush fcntl(shared): OK
Attr cache flush fcntl(exclusive): OK
Attr cache flush flock(shared): OK
Attr cache flush flock(exclusive): OK

Testing data cache..
Data cache flush fcntl(shared): OK
Data cache flush fcntl(exclusive): OK
Data cache flush flock(shared): OK
Data cache flush flock(exclusive): OK
Data cache flush O_DIRECT: OK

Testing write flushing..
Write flush open+close: OK
Write flush close+open: OK
Write flush fchown(uid, -1): OK
Write flush fchmod(mode): OK
Write flush chown(uid, -1): OK
Write flush chmod(mode): OK
Write flush dup+close: OK
Write flush fcntl(shared): OK
Write flush fcntl(exclusive): OK
Write flush flock(shared): OK
Write flush flock(exclusive): OK
Write flush fsync(): OK
Write flush O_DIRECT: OK

Testing file handle cache..
File handle cache flush fchown(uid, -1): OK
File handle cache flush fchmod(mode): OK
File handle cache flush chown(uid, -1): OK
File handle cache flush chmod(mode): OK
File handle cache flush rmdir(parent dir): OK

Testing negative file handle cache..
Negative file handle cache flush fchown(uid, -1): OK
Negative file handle cache flush fchmod(mode): OK
Negative file handle cache flush chown(uid, -1): OK
Negative file handle cache flush chmod(mode): OK
Negative file handle cache flush rmdir(parent dir): OK

FreeBSD 6.2-STABLE

Results are the same with second and microsecond timestamp resolution.

Testing file attribute cache..
Attr cache flush open+close: OK
Attr cache flush close+open: OK
Attr cache flush fchown(-1, -1): OK
Attr cache flush fchown(uid, -1): OK
Attr cache flush fchmod(mode): OK
Attr cache flush chown(-1, -1): OK
Attr cache flush chown(uid, -1): OK
Attr cache flush chmod(mode): OK
Attr cache flush rmdir(): OK

Testing data cache..
Data cache flush O_DIRECT: OK

Testing write flushing..
Write flush open+close: OK
Write flush close+open: OK
Write flush fsync(): OK
Write flush fcntl(O_SYNC): OK

Testing file handle cache..
File handle cache flush rmdir(): OK
File handle cache flush rmdir(parent dir): OK

Testing negative file handle cache..
Negative file handle cache flush no caching: OK

Solaris 9, 10

Results are the same with both NFSv3 and NFSv4.

Negative file handle cache returns OK for everything if timestamps have microsecond resolution.

Testing file attribute cache..
Attr cache flush open+close: OK
Attr cache flush close+open: OK
Attr cache flush fchown(-1, -1): OK
Attr cache flush fchown(uid, -1): OK
Attr cache flush fchmod(mode): OK
Attr cache flush chown(-1, -1): OK
Attr cache flush chown(uid, -1): OK
Attr cache flush chmod(mode): OK
Attr cache flush fcntl(shared): OK
Attr cache flush fcntl(exclusive): OK

Testing data cache..
Data cache flush fcntl(shared): OK
Data cache flush fcntl(exclusive): OK

Testing write flushing..
Write flush open+close: OK
Write flush close+open: OK
Write flush fchown(-1, -1): OK
Write flush fchown(uid, -1): OK
Write flush fchmod(mode): OK
Write flush chown(-1, -1): OK
Write flush chown(uid, -1): OK
Write flush chmod(mode): OK
Write flush fcntl(shared): OK
Write flush fcntl(exclusive): OK
Write flush fsync(): OK
Write flush fcntl(O_SYNC): OK

Testing file handle cache..
File handle cache flush rmdir(parent dir): OK

Testing negative file handle cache..
Negative file handle cache flush rmdir(parent dir): OK

References