zpoolconcepts —
overview of ZFS storage pools
A "virtual device" describes a single device or a collection of
devices organized according to certain performance and fault characteristics.
The following virtual devices are supported:
- disk
- A block device, typically located under
/dev. ZFS can use individual slices or
partitions, though the recommended mode of operation is to use whole
disks. A disk can be specified by a full path, or it can be a shorthand
name (the relative portion of the path under
/dev). A whole disk can be specified by
omitting the slice or partition designation. For example,
sda is equivalent to
/dev/sda. When given a whole disk, ZFS
automatically labels the disk, if necessary.
- file
- A regular file. The use of files as a backing store is
strongly discouraged. It is designed primarily for experimental purposes,
as the fault tolerance of a file is only as good as the file system on
which it resides. A file must be specified by a full path.
- mirror
- A mirror of two or more devices. Data is replicated in an
identical fashion across all components of a mirror. A mirror with
N disks of size
X can hold
X bytes and can
withstand N-1 devices failing without
losing data.
- raidz, raidz1, raidz2, raidz3
- A variation on RAID-5 that allows for better distribution
of parity and eliminates the RAID-5 “write hole” (in which
data and parity become inconsistent after a power loss). Data and parity
are striped across all disks within a raidz group.
A raidz group can have single, double, or triple parity, meaning that the
raidz group can sustain one, two, or three failures, respectively, without
losing any data. The raidz1 vdev type
specifies a single-parity raidz group; the
raidz2 vdev type specifies a double-parity
raidz group; and the raidz3 vdev type
specifies a triple-parity raidz group. The
raidz vdev type is an alias for
raidz1.
A raidz group with N disks of
size X with
P parity disks can hold
approximately (N-P)*X
bytes and can withstand
P devices failing without
losing data. The minimum number of devices in a raidz group is one
more than the number of parity disks. The recommended number is between 3
and 9 to help increase performance.
- draid, draid1, draid2, draid3
- A variant of raidz that provides integrated distributed hot
spares which allows for faster resilvering while retaining the benefits of
raidz. A dRAID vdev is constructed from multiple internal raidz groups,
each with D data devices
and P parity
devices. These groups are distributed over all of the children in
order to fully utilize the available disk performance.
Unlike raidz, dRAID uses a fixed stripe width (padding as necessary with
zeros) to allow fully sequential resilvering. This fixed stripe width
significantly affects both usable capacity and IOPS. For example, with the
default D=8 and
4kB disk sectors the minimum
allocation size is 32kB. If using
compression, this relatively large allocation size can reduce the
effective compression ratio. When using ZFS volumes and dRAID, the default
of the volblocksize property is increased to
account for the allocation size. If a dRAID pool will hold a significant
amount of small blocks, it is recommended to also add a mirrored
special vdev to store those blocks.
With regard to I/O, performance is similar to raidz since for any read all
D data disks must be
accessed. Delivered random IOPS can be reasonably approximated as
floor((N-S)/(D+P))*single_drive_IOPS.
Like raidz, a dRAID can have single-, double-, or triple-parity. The
draid1, draid2,
and draid3 types can be used to specify the
parity level. The draid vdev type is an alias
for draid1.
A dRAID with N disks of
size X, D
data disks per redundancy group,
P parity level, and
S distributed hot spares can
hold approximately (N-S)*(D/(D+P))*X
bytes and can withstand
P devices failing without losing data.
- draid[<parity>][:<data>d][:<children>c][:<spares>s]
- A non-default dRAID configuration can be specified by
appending one or more of the following optional arguments to the
draid keyword:
- parity
- The parity level (1-3).
- data
- The number of data devices per redundancy group. In
general, a smaller value of D
will increase IOPS, improve the compression
ratio, and speed up resilvering at the expense of total usable
capacity. Defaults to 8,
unless N-P-S
is less than
8.
- children
- The expected number of children. Useful as a
cross-check when listing a large number of devices. An error is
returned when the provided number of children differs.
- spares
- The number of distributed hot spares. Defaults to
zero.
- spare
- A pseudo-vdev which keeps track of available hot spares for
a pool. For more information, see the
Hot Spares section.
- log
- A separate intent log device. If more than one log device
is specified, then writes are load-balanced between devices. Log devices
can be mirrored. However, raidz vdev types are not supported for the
intent log. For more information, see the
Intent Log section.
- dedup
- A device dedicated solely for deduplication tables. The
redundancy of this device should match the redundancy of the other normal
devices in the pool. If more than one dedup device is specified, then
allocations are load-balanced between those devices.
- special
- A device dedicated solely for allocating various kinds of
internal metadata, and optionally small file blocks. The redundancy of
this device should match the redundancy of the other normal devices in the
pool. If more than one special device is specified, then allocations are
load-balanced between those devices.
For more information on special allocations, see the
Special
Allocation Class section.
- cache
- A device used to cache storage pool data. A cache device
cannot be configured as a mirror or raidz group. For more information, see
the Cache Devices
section.
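The capacity and IOPS approximations given above for mirror, raidz, and dRAID vdevs can be sketched as small shell functions. This is only an illustration: the function names are hypothetical, sizes are in whatever unit is used for X, and integer arithmetic truncates the result.

```shell
#!/bin/sh
# Approximate usable capacity, in the same unit as the per-disk size X.
# All function names are illustrative; integer arithmetic only.

# mirror: N disks of size X hold X.
mirror_capacity() {  # args: N X
    echo "$2"
}

# raidz: approximately (N-P)*X.
raidz_capacity() {   # args: N P X
    echo $(( ($1 - $2) * $3 ))
}

# dRAID: approximately (N-S)*(D/(D+P))*X; multiply before dividing
# so that integer truncation happens only once, at the end.
draid_capacity() {   # args: N D P S X
    echo $(( ($1 - $4) * $2 * $5 / ($2 + $3) ))
}

# dRAID random IOPS: floor((N-S)/(D+P)) * single_drive_IOPS.
draid_iops() {       # args: N D P S single_drive_IOPS
    echo $(( ($1 - $4) / ($2 + $3) * $5 ))
}
```

For example, raidz_capacity 6 2 4 prints 16 (six disks of size 4 in a raidz2 group), and draid_iops 11 4 2 1 100 prints 100, because only one full redundancy group fits across the ten non-spare children.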
Virtual devices cannot be nested, so a mirror or raidz virtual device can only
contain files or disks. Mirrors of mirrors (or other combinations) are not
allowed.
A pool can have any number of virtual devices at the top of the configuration
(known as “root vdevs”). Data is dynamically distributed across
all top-level devices to balance data among devices. As new virtual devices
are added, ZFS automatically places data on the newly available devices.
Virtual devices are specified one at a time on the command line, separated by
whitespace. Keywords like
mirror
and raidz are used to
distinguish where a group ends and another begins. For example, the following
creates a pool with two root vdevs, each a mirror of two disks:
# zpool create mypool mirror sda sdb mirror sdc sdd
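A dRAID vdev is introduced on the command line in the same way, except that its optional arguments are folded into the keyword itself. The following sketch assembles such a keyword; draid_spec is a hypothetical helper, not part of zpool, and the pool and disk names are placeholders. The command is only printed, not run.

```shell
#!/bin/sh
# Build a dRAID vdev keyword from its optional arguments.
# draid_spec is a hypothetical helper, not part of zpool.
draid_spec() {  # args: parity data children spares ("" omits an argument)
    spec="draid$1"
    if [ -n "$2" ]; then spec="$spec:${2}d"; fi
    if [ -n "$3" ]; then spec="$spec:${3}c"; fi
    if [ -n "$4" ]; then spec="$spec:${4}s"; fi
    echo "$spec"
}

# Print (do not run) a create command for 11 placeholder disks:
# double parity, 4 data devices per group, 1 distributed spare.
echo zpool create mypool "$(draid_spec 2 4 11 1)" \
    sda sdb sdc sdd sde sdf sdg sdh sdi sdj sdk
```

Passing the children count gives the cross-check described for the children argument: zpool returns an error if the number of listed devices differs.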
ZFS supports a rich set of mechanisms for handling device failure and data
corruption. All metadata and data is checksummed, and ZFS automatically
repairs bad data from a good copy when corruption is detected.
In order to take advantage of these features, a pool must make use of some form
of redundancy, using either mirrored or raidz groups. While ZFS supports
running in a non-redundant configuration, where each root vdev is simply a
disk or file, this is strongly discouraged. A single case of bit corruption
can render some or all of your data unavailable.
A pool's health status is described by one of three states:
online,
degraded,
or faulted. An online
pool has all devices operating normally. A degraded pool is one in which one
or more devices have failed, but the data is still available due to a
redundant configuration. A faulted pool has corrupted metadata, or one or more
faulted devices, and insufficient replicas to continue functioning.
The health of the top-level vdev, such as a mirror or raidz device, is
potentially impacted by the state of its associated vdevs, or component
devices. A top-level vdev or component device is in one of the following
states:
- DEGRADED
- One or more top-level vdevs is in the degraded state
because one or more component devices are offline. Sufficient replicas
exist to continue functioning.
One or more component devices is in the degraded or faulted state, but
sufficient replicas exist to continue functioning. The underlying
conditions are as follows:
- The number of checksum errors exceeds acceptable
levels and the device is degraded as an indication that something may
be wrong. ZFS continues to use the device as necessary.
- The number of I/O errors exceeds acceptable levels.
The device could not be marked as faulted because there are
insufficient replicas to continue functioning.
- FAULTED
- One or more top-level vdevs is in the faulted state because
one or more component devices are offline. Insufficient replicas exist to
continue functioning.
One or more component devices is in the faulted state, and insufficient
replicas exist to continue functioning. The underlying conditions are as
follows:
- The device could be opened, but the contents did not
match expected values.
- The number of I/O errors exceeds acceptable levels
and the device is faulted to prevent further use of the device.
- OFFLINE
- The device was explicitly taken offline by the
zpool offline
command.
- ONLINE
- The device is online and functioning.
- REMOVED
- The device was physically removed while the system was
running. Device removal detection is hardware-dependent and may not be
supported on all platforms.
- UNAVAIL
- The device could not be opened. If a pool is imported when
a device was unavailable, then the device will be identified by a unique
identifier instead of its path since the path was never correct in the
first place.
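As a rough illustration of acting on these states, the sketch below filters a list of pools down to those that are not ONLINE. The function name unhealthy_pools is hypothetical. It assumes tab-separated "name, health" input such as that produced by zpool list -H -o name,health.

```shell
#!/bin/sh
# Print the pools whose health is anything other than ONLINE.
# Input: lines of "name<TAB>health" on standard input.
# unhealthy_pools is an illustrative name, not a ZFS command.
unhealthy_pools() {
    awk -F '\t' '$2 != "ONLINE" { print $1 ": " $2 }'
}

# Typical use on a system with ZFS installed:
#   zpool list -H -o name,health | unhealthy_pools
```

An empty result means every listed pool reported ONLINE. For the per-device detail behind a DEGRADED or FAULTED pool, zpool status remains the place to look.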
Checksum errors represent events where a disk returned data that was expected to
be correct, but was not. In other words, these are instances of silent data
corruption. The checksum errors are reported in
zpool status and
zpool events. When a
block is stored redundantly, a damaged block may be reconstructed (e.g. from
raidz parity or a mirrored copy). In this case, ZFS reports the checksum error
against the disks that contained damaged data. If a block is unable to be
reconstructed (e.g. due to 3 disks being damaged in a raidz2 group), it is not
possible to determine which disks were silently corrupted. In this case,
checksum errors are reported for all disks on which the block is stored.
If a device is removed and later re-attached to the system, ZFS attempts to
bring the device online automatically. Device attachment detection is hardware-dependent
and might not be supported on all platforms.
ZFS allows devices to be associated with pools as “hot spares”.
These devices are not actively used in the pool, but when an active device
fails, it is automatically replaced by a hot spare. To create a pool with hot
spares, specify a
spare vdev with any number of
devices. For example,
# zpool create pool mirror sda sdb spare sdc sdd
Spares can be shared across multiple pools, and can be added with the
zpool add command
and removed with the
zpool
remove command. Once a spare replacement is
initiated, a new
spare vdev is created within the
configuration that will remain there until the original device is replaced. At
this point, the hot spare becomes available again if another device fails.
If a pool has a shared spare that is currently being used, the pool cannot be
exported since other pools may use this shared spare, which may lead to
potential data corruption.
Shared spares add some risk. If the pools are imported on different hosts, and
both pools suffer a device failure at the same time, both could attempt to use
the spare at the same time. This may not be detected, resulting in data
corruption.
An in-progress spare replacement can be cancelled by detaching the hot spare. If
the original faulted device is detached, then the hot spare assumes its place
in the configuration, and is removed from the spare list of all active pools.
The draid vdev type provides distributed hot
spares. These hot spares are named after the dRAID vdev they're a part of
(draid1-2-3 specifies spare 3 of vdev 2, which is a single parity dRAID) and
may only be used by that dRAID vdev. Otherwise, they behave the same as normal
hot spares.
Spares cannot replace log devices.
The ZFS Intent Log (ZIL) satisfies POSIX requirements for synchronous
transactions. For instance, databases often require their transactions to be
on stable storage devices when returning from a system call. NFS and other
applications can also use
fsync(2) to ensure data
stability. By default, the intent log is allocated from blocks within the main
pool. However, it might be possible to get better performance using separate
intent log devices such as NVRAM or a dedicated disk. For example:
# zpool create pool sda sdb log sdc
Multiple log devices can also be specified, and they can be mirrored. See the
EXAMPLES section for an example
of mirroring multiple log devices.
Log devices can be added, replaced, attached, detached and removed. In addition,
log devices are imported and exported as part of the pool that contains them.
Mirrored devices can be removed by specifying the top-level mirror vdev.
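The add and remove operations for a mirrored log might look like the following sketch. The run wrapper and both function names are hypothetical, and the commands are printed rather than executed unless DRY_RUN=0 is set. The name "mirror-1" is an assumption: the actual top-level vdev name should be taken from zpool status.

```shell
#!/bin/sh
# Dry-run wrapper (hypothetical helper): prints the command by default.
# Set DRY_RUN=0 to execute it (requires root and real devices).
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then echo "$*"; else "$@"; fi
}

# Add a mirrored pair of log devices to an existing pool.
add_mirrored_log() {     # args: pool dev1 dev2
    run zpool add "$1" log mirror "$2" "$3"
}

# Remove a mirrored log by naming its top-level mirror vdev, for example
# "mirror-1" as listed by zpool status (the name varies per pool).
remove_log_mirror() {    # args: pool mirror-vdev
    run zpool remove "$1" "$2"
}
```

With the default dry run, add_mirrored_log pool sdd sde prints the zpool add command it would issue, which makes it easy to review before running it for real.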
Devices can be added to a storage pool as “cache devices”. These
devices provide an additional layer of caching between main memory and disk.
For read-heavy workloads, where the working set size is much larger than what
can be cached in main memory, using cache devices allows much more of this
working set to be served from low latency media. Using cache devices provides
the greatest performance improvement for random-read workloads of mostly
static content.
To create a pool with cache devices, specify a
cache vdev with any number of devices. For
example:
# zpool create pool sda sdb cache sdc sdd
Cache devices cannot be mirrored or part of a raidz configuration. If a read
error is encountered on a cache device, that read I/O is reissued to the
original storage pool device, which might be part of a mirrored or raidz
configuration.
The content of the cache devices is persistent across reboots and restored
asynchronously in L2ARC when the pool is imported (persistent L2ARC). This can
be disabled by setting l2arc_rebuild_enabled=0.
For cache devices smaller than 1GB, ZFS does not write the metadata structures
required for rebuilding the L2ARC, to avoid wasting space. This can be changed
with l2arc_rebuild_blocks_min_l2size. The cache device header (512B) is updated
even if no metadata structures are written. Setting l2arc_headroom=0 will result
in scanning the full-length ARC lists for cacheable content to be written in
L2ARC (persistent ARC). If a cache device is added with zpool add, its label
and header will be overwritten and its contents will not be restored in L2ARC,
even if the device was previously part of the pool. If a cache device is
onlined with zpool online, its contents will be restored in L2ARC.
This is useful in case of memory pressure where the contents of the cache
device are not fully restored in L2ARC. The user can off- and online the cache
device when there is less memory pressure in order to fully restore its
contents to L2ARC.
Before starting critical procedures that include destructive actions (like
zfs destroy), an
administrator can checkpoint the pool's state and in the case of a mistake or
failure, rewind the entire pool back to the checkpoint. Otherwise, the
checkpoint can be discarded when the procedure has completed successfully.
A pool checkpoint can be thought of as a pool-wide snapshot and should be used
with care as it contains every part of the pool's state, from properties to
vdev configuration. Thus, certain operations are not allowed while a pool has
a checkpoint, specifically vdev removal/attach/detach, mirror splitting, and
changing the pool's GUID. Adding a new vdev is supported, but in the case of a
rewind it will have to be added again. Finally, users of this feature should
keep in mind that scrubs in a pool that has a checkpoint do not repair
checkpointed data.
To create a checkpoint for a pool:
# zpool checkpoint pool
To later rewind the pool to its checkpointed state, first export it and then
rewind it during import:
# zpool export pool
# zpool import --rewind-to-checkpoint pool
To discard the checkpoint from a pool:
# zpool checkpoint -d pool
Dataset reservations (controlled by the
reservation
and refreservation
properties) may be unenforceable while a checkpoint exists, because the
checkpoint is allowed to consume the dataset's reservation. Finally, data that
is part of the checkpoint but has been freed in the current state of the pool
won't be scanned during a scrub.
Allocations in the special class are dedicated to specific block types. By
default this includes all metadata, the indirect blocks of user data, and any
deduplication tables. The class can also be provisioned to accept small file
blocks.
A pool must always have at least one normal (non-dedup/-special) vdev before
other devices can be assigned to the special class. If the special class
becomes full, then allocations intended for it will spill back into the normal
class.
Deduplication tables can be excluded from the special class by unsetting the
zfs_ddt_data_is_special ZFS module parameter.
Inclusion of small file blocks in the special class is opt-in. Each dataset can
control the size of small file blocks allowed in the special class by setting
the special_small_blocks property to a nonzero value. See
zfsprops(7) for more info on this property.
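Putting the pieces of this section together, a special vdev could be added and one dataset opted in to small-block placement roughly as follows. The run wrapper and setup_special are hypothetical helpers, and the pool, dataset, device names, and the 32K threshold are placeholders. The commands are only printed unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Dry-run wrapper (hypothetical helper): prints the command by default.
# Set DRY_RUN=0 to execute it (requires root and real devices).
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then echo "$*"; else "$@"; fi
}

# Add a mirrored special vdev, then allow blocks up to the given size
# from one dataset into the special class.
setup_special() {  # args: pool dataset threshold dev1 dev2
    run zpool add "$1" special mirror "$4" "$5"
    run zfs set "special_small_blocks=$3" "$1/$2"
}
```

A mirror is used here so that the special vdev has redundancy, in line with the advice that it should match the redundancy of the other devices in the pool.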