pNFSserver —
NFS
Version 4.1 Parallel NFS Protocol Server
A set of FreeBSD servers may be configured to provide a
pnfs(4) service. One FreeBSD system needs to be
configured as a MetaData Server (MDS) and at least one additional FreeBSD
system needs to be configured as one or more Data Servers (DS)s.
These FreeBSD systems are configured to be NFSv4.1 servers, see
nfsd(8) and
exports(5) if you are not familiar with
configuring a NFSv4.1 server.
The DS(s) need to be configured as NFSv4.1 server(s), with a top level exported
directory used for storage of data files. This directory must be owned by
“root” and would normally have a mode of “700”.
Within this directory there needs to be additional directories named
ds0,...,dsN (where N is 19 by default) also owned by “root” with
mode “700”. These are the directories where the data files are
stored. The following command can be run by root when in the top level
exported directory to create these subdirectories.
jot -w ds 20 0 | xargs mkdir -m 700
Note that “20” is the default and can be set to a larger value on
the MDS as shown below.
The top level exported directory used for storage of data files must be exported
to the MDS with the “maproot=root sec=sys” export options so
that the MDS can create entries in these subdirectories. It must also be
exported to all pNFS aware clients, but these clients do not require the
“maproot=root” export option and this directory should be
exported to them with the same options as used by the MDS to export file
system(s) to the clients.
It is possible to have multiple DSs on the same FreeBSD system, but each of
these DSs must have a separate top level exported directory used for storage
of data files and each of these DSs must be mountable via a separate IP
address. Alias addresses can be set on the DS server system for a network
interface via
ifconfig(8) to create these
different IP addresses. Multiple DSs on the same server may be useful when
data for different file systems on the MDS are being stored on different file
system volumes on the FreeBSD DS system.
The MDS must be a separate FreeBSD system from the FreeBSD DS system(s) and NFS
clients. It is configured as a NFSv4.1 server with file system(s) exported to
clients. However, the “-p” command line argument for
nfsd is used to indicate that it is running as
the MDS for a pNFS server.
The DS(s) must all be mounted on the MDS using the following mount options:
nfsv4,minorversion=1,soft,retrans=2
so that they can be defined as DSs in the “-p” option. Normally
these mounts would be entered in the
fstab(5) on
the MDS. For example, if there are four DSs named nfsv4-data[0-3], the
fstab(5) lines might look like:
nfsv4-data0:/ /data0 nfs rw,nfsv4,minorversion=1,soft,retrans=2 0 0
nfsv4-data1:/ /data1 nfs rw,nfsv4,minorversion=1,soft,retrans=2 0 0
nfsv4-data2:/ /data2 nfs rw,nfsv4,minorversion=1,soft,retrans=2 0 0
nfsv4-data3:/ /data3 nfs rw,nfsv4,minorversion=1,soft,retrans=2 0 0
The
nfsd(8) command line option “-p”
indicates that the NFS server is a pNFS MDS and specifies what DSs are to be
used.
For the above
fstab(5) example, the
nfsd(8) nfs_server_flags line in your
rc.conf(5) might look like:
nfs_server_flags="-u -t -n 128 -p nfsv4-data0:/data0,nfsv4-data1:/data1,nfsv4-data2:/data2,nfsv4-data3:/data3"
This example specifies that the data files should be distributed over the four
DSs and File layouts will be issued to pNFS enabled clients. If issuing
Flexible File layouts is desired for this case, setting the sysctl
“vfs.nfsd.default_flexfile” non-zero in your
sysctl.conf(5) file will make the
pNFSserver do that.
Alternately, this variant of “nfs_server_flags” will specify that
two way mirroring is to be done, via the “-m” command line
option.
nfs_server_flags="-u -t -n 128 -p nfsv4-data0:/data0,nfsv4-data1:/data1,nfsv4-data2:/data2,nfsv4-data3:/data3 -m 2"
With two way mirroring, the data file for each exported file on the MDS will be
stored on two of the DSs. When mirroring is enabled, the server will always
issue Flexible File layouts.
It is also possible to specify which DSs are to be used to store data files for
specific exported file systems on the MDS. For example, if the MDS has
exported two file systems “/export1” and
“/export2” to clients, the following variant of
“nfs_server_flags” will specify that data files for
“/export1” will be stored on nfsv4-data0 and nfsv4-data1,
whereas the data files for “/export2” will be store on
nfsv4-data2 and nfsv4-data3.
nfs_server_flags="-u -t -n 128 -p nfsv4-data0:/data0#/export1,nfsv4-data1:/data1#/export1,nfsv4-data2:/data2#/export2,nfsv4-data3:/data3#/export2"
This can be used by system administrators to control where data files are stored
and might be useful for control of storage use. For this case, it may be
convenient to co-locate more than one of the DSs on the same FreeBSD server,
using separate file systems on the DS system for storage of the respective
DS's data files. If mirroring is desired for this case, the “-m”
option also needs to be specified. There must be enough DSs assigned to each
exported file system on the MDS to support the level of mirroring. The above
example would be fine for two way mirroring, but four way mirroring would not
work, since there are only two DSs assigned to each exported file system on
the MDS.
The number of subdirectories in each DS is defined by the
“vfs.nfs.dsdirsize” sysctl on the MDS. This value can be
increased from the default of 20, but only when the
nfsd(8) is not running and after the additional
ds20,... subdirectories have been created on all the DSs. For a service that
will store a large number of files this sysctl should be set much larger, to
avoid the number of entries in a subdirectory from getting too large.
Once operational, NFSv4.1 FreeBSD client mounts done with the
“pnfs” option should do I/O directly on the DSs. The clients
mounting the MDS must be running the
nfscbd
daemon for pNFS to work. Set
in the
rc.conf(5) on these clients. Non-pNFS aware
clients or NFSv3 mounts will do all I/O RPCs on the MDS, which acts as a proxy
for the appropriate DS(s).
Since the data is separated from the metadata, the simple way to back up a pNFS
service is to do so from an NFS client that has the service mounted on it. If
you back up the MDS exported file system(s) on the MDS, you must do it in such
a way that the “system” namespace extended attributes get backed
up.
When a mirrored DS fails, it can be disabled one of three ways:
1 - The MDS detects a problem when trying to do proxy operations on the DS. This
can take a couple of minutes after the DS failure or network partitioning
occurs.
2 - A pNFS client can report an I/O error that occurred for a DS to the MDS in
the arguments for a LayoutReturn operation.
3 - The system administrator can perform the
pnfsdskill(8) command on the MDS to
disable it. If the system administrator does a
pnfsdskill(8) and it fails with
ENXIO (Device not configured) that normally means the DS was already disabled
via #1 or #2. Since doing this is harmless, once a system administrator knows
that there is a problem with a mirrored DS, doing the command is recommended.
Once a system administrator knows that a mirrored DS has malfunctioned or has
been network partitioned, they should do the following as root/su on the MDS:
# pnfsdskill <mounted-on-path-of-DS>
# umount -N <mounted-on-path-of-DS>
Note that the <mounted-on-path-of-DS> must be the exact mounted-on path
string used when the DS was mounted on the MDS.
Once the mirrored DS has been disabled, the pNFS service should continue to
function, but file updates will only happen on the DS(s) that have not been
disabled. Assuming two way mirroring, that implies the one DS of the pair
stored in the “pnfsd.dsfile” extended attribute for the file on
the MDS, for files stored on the disabled DS.
The next step is to clear the IP address in the “pnfsd.dsfile”
extended attribute on all files on the MDS for the failed DS. This is done so
that, when the disabled DS is repaired and brought back online, the data files
on this DS will not be used, since they may be out of date. The command that
clears the IP address is
pnfsdsfile(8) with the
“-r” option.
For example:
# pnfsdsfile -r nfsv4-data3 yyy.c
yyy.c: nfsv4-data2.home.rick ds0/207508569ff983350c000000ec7c0200e4c57b2e0000000000000000 0.0.0.0 ds0/207508569ff983350c000000ec7c0200e4c57b2e0000000000000000
replaces nfsv4-data3 with an IPv4 address of 0.0.0.0, so that nfsv4-data3 will
not get used.
Normally this will be called within a
find(1)
command for all regular files in the exported directory tree and must be done
on the MDS. When used with
find(1), you will
probably also want the “-q” option so that it won't spit out the
results for every file. If the disabled/repaired DS is nfsv4-data3, the
commands done on the MDS would be:
# cd <top-level-exported-dir>
# find . -type f -exec pnfsdsfile -q -r nfsv4-data3 {} ;
There is a problem with the above command if the file found by
find(1) is renamed or unlinked before the
pnfsdsfile(8) command is done on it. This should
normally generate an error message. A simple unlink is harmless but a
link/unlink or rename might result in the file not having been processed under
its new name. To check that all files have their IP addresses set to 0.0.0.0
these commands can be used (assuming the
sh(1)
shell):
# cd <top-level-exported-dir>
# find . -type f -exec pnfsdsfile {} ; | sed "/nfsv4-data3/!d"
Any line(s) printed require the
pnfsdsfile(8) with
“-r” to be done again. Once this is done, the replaced/repaired
DS can be brought back online. It should have empty ds0,...,dsN directories
under the top level exported directory for storage of data files just like it
did when first set up. Mount it on the MDS exactly as you did before disabling
it. For the nfsv4-data3 example, the command would be:
# mount -t nfs -o nfsv4,minorversion=1,soft,retrans=2 nfsv4-data3:/ /data3
Then restart the nfsd to re-enable the DS.
Now, new files can be stored on nfsv4-data3, but files with the IP address
zeroed out on the MDS will not yet use the repaired DS (nfsv4-data3). The next
step is to go through the exported file tree on the MDS and, for each of the
files with an IPv4 address of 0.0.0.0 in its extended attribute, copy the file
data to the repaired DS and re-enable use of this mirror for it. This command
for copying the file data for one MDS file is
pnfsdscopymr(8) and it will also normally be used
in a
find(1). For the example case, the commands
on the MDS would be:
# cd <top-level-exported-dir>
# find . -type f -exec pnfsdscopymr -r /data3 {} ;
When this completes, the recovery should be complete or at least nearly so. As
noted above, if a link/unlink or rename occurs on a file name while the above
find(1) is in progress, it may not get copied. To
check for any file(s) not yet copied, the commands are:
# cd <top-level-exported-dir>
# find . -type f -exec pnfsdsfile {} ; | sed "/0.0.0.0/!d"
If this command prints out any file name(s), these files must have the
pnfsdscopymr(8) command done on them to complete
the recovery.
# pnfsdscopymr -r /data3 <file-path-reported>
If this commmand fails with the error
“pnfsdscopymr: Copymr failed for file <path>: Device not
configured”
repeatedly, this may be caused by a Read/Write layout that has not been
returned. The only way to get rid of such a layout is to restart the
nfsd(8).
All of these commands are designed to be done while the pNFS service is running
and can be re-run safely.
For a more detailed discussion of the setup and management of a pNFS service
see:
http://people.freebsd.org/~rmacklem/pnfs-planb-setup.txt
nfsv4(4),
pnfs(4),
exports(5),
fstab(5),
rc.conf(5),
sysctl.conf(5),
nfscbd(8),
nfsd(8),
nfsuserd(8),
pnfsdscopymr(8),
pnfsdsfile(8),
pnfsdskill(8)
The
pNFSserver command first appeared in
FreeBSD 12.0.
Since the MDS cannot be mirrored, it is a single point of failure just as a non
pNFS server is. For non-mirrored configurations, all FreeBSD systems used in
the service are single points of failure.