Copyright © 2007-2011 Timo Sirainen <[email protected]>.
This HOWTO describes how NFS behaves differently from POSIX-compliant filesystems, from an application programmer's point of view. I had to learn about most of these issues by trial and error, a lot of googling and by looking at kernel sources to figure out why things worked the way they did. I hope this HOWTO will make life easier for other people.
This HOWTO mainly applies to Linux, FreeBSD and Solaris, where I have had the most experience. My experience has been mainly with NFS version 3, but this HOWTO also contains some preliminary points about NFS version 4.
I still cannot call myself an NFS expert though, so it is possible that this HOWTO has some mistakes or misunderstandings. Corrections, contributions and other feedback are welcome.
Up to version 3 NFS is a stateless protocol. There is no "open file" step when accessing files, but instead all files can be accessed directly by their file handle. The file handles are looked up using a LOOKUP command. To avoid doing these lookups constantly, NFS clients cache them internally. See RFC 1813 [1] for more details.
NFS version 4 is a stateful protocol. The NFS server knows which NFS clients have a file open. NFSv4 also supports delegations, which allow an NFS client to update data locally until the NFS server requests it for another client. I have heard that this is supposed to fix most of the caching problems discussed below, but I have yet to see such an NFS implementation.
This chapter applies only to NFS versions older than 4.
Once a file's link count drops to zero (i.e. the file is deleted), in POSIX-compatible filesystems the file can still be accessed by the processes that have it open. NFS does not support this, so the NFS client's kernel has to fake it. If the kernel knows that there are open file descriptors for the file when it is being unlinked, instead of unlinking the file it renames it to .nfs* where * is a unique identifier. This is called a "silly rename". After all processes have closed the file, the .nfs* file is deleted. This means that if the kernel crashes before deleting the file, the file will never be deleted automatically.
Silly renaming works as long as there is only one NFS client accessing the file. An NFS client cannot know if other clients have the file open, so it is possible that the file gets deleted from the NFS server while some clients still have it open. This causes all future operations on the file to fail with an ESTALE error. This includes calls to read(), write(), fstat() and so on.
It is understandable that operations accessing open files fail with ESTALE, but it is a bit unexpected that even operations accessing unopened files can fail with ESTALE, for example open() and stat(). This happens if the kernel tries to use a cached file handle to access the file, but the file has since been deleted. The operation will most likely work as expected after the kernel has refreshed the file handle. This should no longer happen in Linux v2.6.12 and later [2], but FreeBSD 6.2 still returns these errors.
You can avoid handling ESTALE for open() (etc.) all around in your code by creating a wrapper. For example, for open():
1. Call open(). If it returns anything other than ESTALE, return from the function.
2. stat() the parent directory to see if it still exists. If it returns anything except success, fail with ESTALE.
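The wrapper steps above could be sketched in C roughly as follows. This is only a sketch under stated assumptions: the names nfs_safe_open() and parent_dir_exists() and the retry limit of 10 are hypothetical, and retrying open() after a successful stat() of the parent directory is an assumed final step that the list above does not spell out.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical retry limit; the text does not specify one. */
#define NFS_ESTALE_RETRY_COUNT 10

/* Returns 1 if the parent directory of path can be stat()ed, 0 otherwise. */
static int parent_dir_exists(const char *path)
{
    const char *slash = strrchr(path, '/');
    struct stat st;
    size_t len;
    char *dir;
    int ok;

    if (slash == NULL)
        return stat(".", &st) == 0;   /* relative name in current dir */
    if (slash == path)
        return stat("/", &st) == 0;   /* file directly under the root */

    len = (size_t)(slash - path);
    dir = malloc(len + 1);
    if (dir == NULL)
        return 0;
    memcpy(dir, path, len);
    dir[len] = '\0';
    ok = stat(dir, &st) == 0;
    free(dir);
    return ok;
}

/* open() wrapper that retries on ESTALE.
   Returns a file descriptor, or -1 with errno set. */
int nfs_safe_open(const char *path, int flags, mode_t mode)
{
    for (int i = 0; i < NFS_ESTALE_RETRY_COUNT; i++) {
        int fd = open(path, flags, mode);
        if (fd != -1 || errno != ESTALE)
            return fd;  /* success, or an error other than ESTALE */

        /* ESTALE: give up if the parent directory is gone too. */
        if (!parent_dir_exists(path))
            break;
        /* Assumption: the parent exists, so retrying may succeed. */
    }
    errno = ESTALE;
    return -1;
}
```

Callers would use nfs_safe_open() wherever they would otherwise call open(), and would see ESTALE only when the parent directory cannot be accessed or the retries run out.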
The attribute cache caches everything in struct stat, so stat() and fstat() calls can be answered from the cache. If you need to see a file's latest size or mtime (or other fields), you need to flush the file's attribute cache before the stat() call.
Note that if the file handle is cached, stat() returns information for that cached file (so the result is the same as for fstat()). If you need to stat() the latest file with the given file name, flush the file handle cache first.
A few possibilities to flush a file's attribute cache:
- open()+close() or close()+open() should work everywhere. The only problem with this is that it loses all the fcntl locks you may have for the file (closing a file descriptor drops all fcntl locks for that file in the process, even those held through different file descriptors).
- chown() or fchown() the file with (uid_t)-1 as the UID and (gid_t)-1 as the GID.
- stat() the file to see its current owner and then chown() or fchown() the file to the same UID (with (gid_t)-1 as the GID). Note that the attribute cache is flushed only if the function returns success.
- fcntl() locking flushes the attribute cache with Linux and Solaris, but not with FreeBSD.
Directories cache the mapping from file names to file handles. The most common problems with this are:
- stat() returns the new file's information and not the opened file's.
- fstat() fails with ESTALE.
A few ways to flush the file handle cache:
- chown() the directory to its current owner. The file handle cache is flushed if the call returns successfully.
- rmdir() the parent directory. ENOTEMPTY means the cache is flushed. Trying to rmdir() the current directory fails with EINVAL and doesn't flush the cache.
- rmdir() either the parent directory or the file under it. ENOTEMPTY, ENOTDIR and EACCES failures mean the cache is flushed, but ENOENT does not flush it. FreeBSD does not cache negative entries, so they do not have to be flushed.
When data is read from a file, the kernel may cache it so that future reads can return the cached data directly. This read cache is flushed when the kernel notices that the file's mtime has changed.
Keeping track of file size changes belongs to attribute cache handling. If the kernel has cached a file's size, but more data has since been appended to the file, trying to read() the new data results in EOF unless you flush the attribute cache first.
If the NFS server's filesystem supports nanosecond or microsecond resolution on timestamps, you most likely do not need to worry about read cache flushing. Just flush the attribute cache to get the kernel to notice the mtime change. With Linux you must use the open()+close() method for this attribute cache flush; other methods do not work. If the filesystem supports only one second resolution, you could still use utime() after writes to make sure the mtime always grows.
Locking a file successfully with fcntl() flushes the read cache with Linux. This requires that locks are working, so either you need to have lockd running or the filesystem needs to be mounted with nolock. FreeBSD does not flush the read cache with fcntl() locks.
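A minimal sketch of taking and releasing a whole-file shared lock with fcntl() is shown below. The function names nfs_lock_shared() and nfs_unlock() are hypothetical, and whether taking the lock flushes the read cache depends on the operating system as described above.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Take a shared (read) lock on the whole file, waiting if needed.
   The descriptor must be open for reading.
   Returns 0 on success, -1 on error with errno set. */
int nfs_lock_shared(int fd)
{
    struct flock fl;
    int ret;

    memset(&fl, 0, sizeof(fl));
    fl.l_type = F_RDLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;  /* 0 means "until end of file" */

    do {
        ret = fcntl(fd, F_SETLKW, &fl);
    } while (ret == -1 && errno == EINTR);
    return ret;
}

/* Release any lock this process holds on the whole file. */
int nfs_unlock(int fd)
{
    struct flock fl;

    memset(&fl, 0, sizeof(fl));
    fl.l_type = F_UNLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;
    return fcntl(fd, F_SETLK, &fl);
}
```

A reader would call nfs_lock_shared() before read() and nfs_unlock() afterwards. Remember that closing any descriptor for the file in the same process also drops the lock.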
It is possible to bypass read caching completely for a file by opening it with the O_DIRECT flag. However, this does not work on all systems. Linux supports it only if the kernel was compiled with CONFIG_NFS_DIRECTIO. FreeBSD requires setting the vfs.nfs.nfs_directio_enable sysctl. Solaris has supported it since v9 Update 5 (but mount -o forcedirectio since v2.6).
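Because O_DIRECT is not available everywhere, a program may want to fall back to a normal open(). The following is a sketch only: the function name nfs_open_uncached() is hypothetical, and treating EINVAL as "O_DIRECT not supported here" is an assumption based on how Linux reports it. Some systems may also impose buffer alignment requirements on direct I/O.

```c
/* On Linux/glibc, O_DIRECT is only declared with _GNU_SOURCE. */
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <fcntl.h>

/* Open an existing file, bypassing the read cache when possible.
   flags must not contain O_CREAT (no mode argument is passed).
   *direct_r is set to 1 if O_DIRECT is in effect, 0 otherwise.
   Returns a file descriptor, or -1 with errno set. */
int nfs_open_uncached(const char *path, int flags, int *direct_r)
{
#ifdef O_DIRECT
    int fd = open(path, flags | O_DIRECT);
    if (fd != -1) {
        *direct_r = 1;
        return fd;
    }
    /* Assumption: EINVAL means O_DIRECT is unsupported here. */
    if (errno != EINVAL)
        return -1;
#endif
    *direct_r = 0;
    return open(path, flags);
}
```

When *direct_r comes back as 0, the caller still has to flush the read cache by one of the other methods.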
The kernel may cache writes to a file locally until fsync() or fdatasync() is called. The write cache is also flushed when closing the file, which explains why close() may return a failure. If you care about knowing a file's exact mtime, remember that it may be changed by close() if you did not call fsync().
Opening a file with the O_DSYNC or O_SYNC flag causes writes to the file to be sent to the server directly. Changing the flag later with fcntl() does not work in Linux though.
Memory mapped files work more or less well, but the error handling is not very nice. If a read fails for any reason while a page is being faulted, the process gets killed by a SIGBUS signal.
Linux's link(2) manual page says that link() may return EEXIST even though it created the file. This is apparently quite a rare problem nowadays, since servers support a duplicate request cache [3].
FreeBSD (at least up to v6.2) has another problem though: its link() may return success even though the operation actually failed with EEXIST [3].
Both of these issues can be worked around by calling fstat() for the file after link() and checking whether its link count increased.
Linux's open(2) manual page says O_EXCL does not work with NFS, but it actually does work if both the NFS client (Linux v2.6.6+ [4]) and the server support it. Apparently it is commonly implemented nowadays, so it should be quite safe to use in new systems. Unfortunately there is no easy way to check whether it is safe or not.
NFS (even v4) does not support O_APPEND. The kernel fakes it by writing to what it thinks is the current end-of-file offset. That may easily cause the file to get corrupted.
flock() may or may not work. Recent Linux (v2.6.12+ [4]), FreeBSD and Solaris (v2.5.1+) kernels emulate it using fcntl() locks. With older kernels it locks the file only locally.
O_EXCL works with NFS version 3+ if both the client and the server support it. Probably all products (modern ones at least) that support NFSv3 also support O_EXCL. It is not possible to test this, however, and if one of them doesn't support it, the result is pretty much the same as if you didn't use O_EXCL.
Some preliminary points about NFSv4:
- ESTALE errors.
- ESTALE errors are gone, but caches behave exactly as with NFSv3, except for one bug in v2.6.23 [5].
- ESTALE errors are gone, but caching behaves the same as NFSv3. Since Solaris is supposed to have the best NFS implementation, I must be misunderstanding something about delegations.
NFS cache tester [6] can be used to easily test how caches can be flushed in different operating systems. Successful flushing methods are listed below for some operating systems:
Note that some results may be different depending on whether the NFS server supports nanosecond or microsecond timestamp resolution. This is because the NFS client may flush its caches when noticing that mtime or ctime changed.
Results are the same with both NFSv3 and NFSv4.
timestamps resolution: seconds
Testing file attribute cache..
Attr cache flush open+close: OK
Attr cache flush close+open: OK
Attr cache flush fchown(uid, -1): OK
Attr cache flush fchmod(mode): OK
Attr cache flush chown(uid, -1): OK
Attr cache flush chmod(mode): OK
Attr cache flush fcntl(shared): OK
Attr cache flush fcntl(exclusive): OK
Attr cache flush flock(shared): OK
Attr cache flush flock(exclusive): OK
Testing data cache..
Data cache flush fcntl(shared): OK
Data cache flush fcntl(exclusive): OK
Data cache flush flock(shared): OK
Data cache flush flock(exclusive): OK
Data cache flush O_DIRECT: OK
Testing write flushing..
Write flush open+close: OK
Write flush close+open: OK
Write flush fchown(uid, -1): OK
Write flush fchmod(mode): OK
Write flush chown(uid, -1): OK
Write flush chmod(mode): OK
Write flush dup+close: OK
Write flush fcntl(shared): OK
Write flush fcntl(exclusive): OK
Write flush flock(shared): OK
Write flush flock(exclusive): OK
Write flush fsync(): OK
Write flush O_DIRECT: OK
Testing file handle cache..
File handle cache flush fchown(uid, -1): OK
File handle cache flush fchmod(mode): OK
File handle cache flush chown(uid, -1): OK
File handle cache flush chmod(mode): OK
File handle cache flush rmdir(parent dir): OK
Testing negative file handle cache..
Negative file handle cache flush fchown(uid, -1): OK
Negative file handle cache flush fchmod(mode): OK
Negative file handle cache flush chown(uid, -1): OK
Negative file handle cache flush chmod(mode): OK
Negative file handle cache flush rmdir(parent dir): OK
Same result with second and microsecond resolution.
Testing file attribute cache..
Attr cache flush open+close: OK
Attr cache flush close+open: OK
Attr cache flush fchown(-1, -1): OK
Attr cache flush fchown(uid, -1): OK
Attr cache flush fchmod(mode): OK
Attr cache flush chown(-1, -1): OK
Attr cache flush chown(uid, -1): OK
Attr cache flush chmod(mode): OK
Attr cache flush rmdir(): OK
Testing data cache..
Data cache flush O_DIRECT: OK
Testing write flushing..
Write flush open+close: OK
Write flush close+open: OK
Write flush fsync(): OK
Write flush fcntl(O_SYNC): OK
Testing file handle cache..
File handle cache flush rmdir(): OK
File handle cache flush rmdir(parent dir): OK
Testing negative file handle cache..
Negative file handle cache flush no caching: OK
Results are the same with both NFSv3 and NFSv4.
Negative file handle cache returns OK for everything if timestamps have microsecond resolution.
Testing file attribute cache..
Attr cache flush open+close: OK
Attr cache flush close+open: OK
Attr cache flush fchown(-1, -1): OK
Attr cache flush fchown(uid, -1): OK
Attr cache flush fchmod(mode): OK
Attr cache flush chown(-1, -1): OK
Attr cache flush chown(uid, -1): OK
Attr cache flush chmod(mode): OK
Attr cache flush fcntl(shared): OK
Attr cache flush fcntl(exclusive): OK
Testing data cache..
Data cache flush fcntl(shared): OK
Data cache flush fcntl(exclusive): OK
Testing write flushing..
Write flush open+close: OK
Write flush close+open: OK
Write flush fchown(-1, -1): OK
Write flush fchown(uid, -1): OK
Write flush fchmod(mode): OK
Write flush chown(-1, -1): OK
Write flush chown(uid, -1): OK
Write flush chmod(mode): OK
Write flush fcntl(shared): OK
Write flush fcntl(exclusive): OK
Write flush fsync(): OK
Write flush fcntl(O_SYNC): OK
Testing file handle cache..
File handle cache flush rmdir(parent dir): OK
Testing negative file handle cache..
Negative file handle cache flush rmdir(parent dir): OK