Wednesday, July 23, 2008

What Is ZFS?

Description:

The Solaris ZFS file system is a revolutionary new file system that fundamentally changes the way file systems are administered, with features and benefits not found in any other file system available today. ZFS has been designed to be robust, scalable, and simple to administer.

ZFS Pooled Storage

ZFS uses the concept of storage pools to manage physical storage. Historically, file systems were constructed on top of a single physical device. To address multiple devices and provide for data redundancy, the concept of a volume manager was introduced to provide the image of a single device so that file systems would not have to be modified to take advantage of multiple devices. This design added another layer of complexity and ultimately prevented certain file system advances, because the file system had no control over the physical placement of data on the virtualized volumes.

ZFS eliminates volume management altogether. Instead of forcing you to create virtualized volumes, ZFS aggregates devices into a storage pool. The storage pool describes the physical characteristics of the storage (device layout, data redundancy, and so on) and acts as an arbitrary data store from which file systems can be created. File systems are no longer constrained to individual devices, allowing them to share space with all file systems in the pool. You no longer need to predetermine the size of a file system, as file systems grow automatically within the space allocated to the storage pool. When new storage is added, all file systems within the pool can immediately use the additional space without additional work. In many ways, the storage pool acts as a virtual memory system: when a memory DIMM is added to a system, the operating system doesn't force you to invoke commands to configure the memory and assign it to individual processes. All processes on the system automatically use the additional memory.
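
For example, growing a pool is a single command. A minimal sketch, assuming an existing pool named tank and a spare disk c2t0d0 (the device name is hypothetical):

# zpool add tank c2t0d0

Every file system in the pool can use the new space immediately; no resize or newfs step is required.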

Transactional Semantics

ZFS is a transactional file system, which means that the file system state is always consistent on disk. Traditional file systems overwrite data in place, which means that if the machine loses power, for example, between the time a data block is allocated and when it is linked into a directory, the file system will be left in an inconsistent state. Historically, this problem was solved through the use of the fsck command. This command was responsible for going through and verifying the file system state, attempting to repair any inconsistencies in the process. This problem caused great pain to administrators, and the fsck command was never guaranteed to fix all possible problems. More recently, file systems have introduced the concept of journaling. The journaling process records actions in a separate journal, which can then be replayed safely if a system crash occurs. This process introduces unnecessary overhead, because the data needs to be written twice, and often results in a new set of problems, such as when the journal can't be replayed properly.

With a transactional file system, data is managed using copy-on-write semantics. Data is never overwritten, and any sequence of operations is either entirely committed or entirely ignored. This mechanism means that the file system can never be corrupted through accidental loss of power or a system crash, so no need for an fsck equivalent exists. While the most recently written pieces of data might be lost, the file system itself will always be consistent. In addition, synchronous data (written using the O_DSYNC flag) is always guaranteed to be written before returning, so it is never lost.

Checksums and Self-Healing Data

With ZFS, all data and metadata is checksummed using a user-selectable algorithm. Traditional file systems that do provide checksumming have performed it on a per-block basis, out of necessity due to the volume management layer and traditional file system design. The traditional design means that certain failure modes, such as writing a complete block to an incorrect location, can result in properly checksummed data that is actually incorrect. ZFS checksums are stored in a way such that these failure modes are detected and can be recovered from gracefully. All checksumming and data recovery is done at the file system layer, and is transparent to applications.
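
The checksum algorithm is an ordinary dataset property that you can change at any time. A minimal sketch, assuming a pool named tank already exists:

# zfs set checksum=sha256 tank
# zfs get checksum tank
NAME             PROPERTY     VALUE      SOURCE
tank             checksum     sha256     local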

In addition, ZFS provides for self-healing data. ZFS supports storage pools with varying levels of data redundancy, including mirroring and a variation on RAID-5. When a bad data block is detected, ZFS fetches the correct data from another replicated copy, and repairs the bad data, replacing it with the good copy.
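
You can also ask ZFS to verify every block in a pool on demand. Scrubbing traverses all data, checks each block against its checksum, and repairs any bad copies it finds from a redundant copy. A minimal sketch, assuming a pool named tank:

# zpool scrub tank
# zpool status -v tank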

Unparalleled Scalability

ZFS has been designed from the ground up to be the most scalable file system ever. The file system itself is 128-bit, allowing for 256 quadrillion zettabytes of storage. All metadata is allocated dynamically, so no need exists to pre-allocate inodes or otherwise limit the scalability of the file system when it is first created. All the algorithms have been written with scalability in mind. Directories can have up to 2^48 (256 trillion) entries, and no limit exists on the number of file systems or the number of files that can be contained within a file system.

ZFS Snapshots

A snapshot is a read-only copy of a file system or volume. Snapshots can be created quickly and easily. Initially, snapshots consume no additional space within the pool.

As data within the active dataset changes, the snapshot consumes space by continuing to reference the old data. As a result, the snapshot prevents the data from being freed back to the pool.
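
Creating a snapshot is a single command. A minimal sketch, assuming a file system named tank/home/bonwick (created later in this chapter) and a hypothetical snapshot name of friday:

# zfs snapshot tank/home/bonwick@friday
# zfs list -t snapshot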

Simplified Administration

Most importantly, ZFS provides a greatly simplified administration model. Through the use of a hierarchical file system layout, property inheritance, and automatic management of mount points and NFS share semantics, ZFS makes it easy to create and manage file systems without needing multiple commands or editing configuration files. You can easily set quotas or reservations, turn compression on or off, or manage mount points for numerous file systems with a single command. Devices can be examined or repaired without having to understand a separate set of volume manager commands. You can take an unlimited number of instantaneous snapshots of file systems, and you can back up and restore individual file systems.
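
For example, backing up and restoring a file system is a matter of sending and receiving a snapshot stream. A minimal sketch, assuming a snapshot named tank/home/bonwick@friday exists and /backup has enough space (both names are hypothetical):

# zfs send tank/home/bonwick@friday > /backup/bonwick.zfs
# zfs receive tank/home/bonwick2 < /backup/bonwick.zfs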

ZFS manages file systems through a hierarchy that allows for this simplified management of properties such as quotas, reservations, compression, and mount points. In this model, file systems become the central point of control. File systems themselves are very cheap (equivalent to a new directory), so you are encouraged to create a file system for each user, project, workspace, and so on. This design allows you to define fine-grained management points.

ZFS Terminology

This section describes the basic terminology used throughout this book:

checksum

A 256-bit hash of the data in a file system block. The checksum capability can range from the simple and fast fletcher2 (the default) to cryptographically strong hashes such as SHA256.

clone

A file system whose initial contents are identical to the contents of a snapshot.

For information about clones, see ZFS Clones.
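
A minimal sketch of creating a clone, assuming a snapshot named tank/home/bonwick@friday exists (the names are hypothetical):

# zfs clone tank/home/bonwick@friday tank/home/bonwick_clone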

dataset

A generic name for the following ZFS entities: clones, file systems, snapshots, or volumes.

Each dataset is identified by a unique name in the ZFS namespace. Datasets are identified using the following format:

pool/path[@snapshot]

pool

Identifies the name of the storage pool that contains the dataset

path

Is a slash-delimited path name for the dataset object

snapshot

Is an optional component that identifies a snapshot of a dataset
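
For example, tank/home/bonwick identifies the file system bonwick under home in the pool tank, and tank/home/bonwick@friday identifies a snapshot of that file system (the snapshot name is hypothetical).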

For more information about datasets, see Chapter 5, Managing ZFS File Systems.

file system

A dataset that contains a standard POSIX file system.

For more information about file systems, see Chapter 5, Managing ZFS File Systems.

mirror

A virtual device that stores identical copies of data on two or more disks. If any disk in a mirror fails, any other disk in that mirror can provide the same data.

pool

A logical group of devices describing the layout and physical characteristics of the available storage. Space for datasets is allocated from a pool.

For more information about storage pools, see Chapter 4, Managing ZFS Storage Pools.

RAID-Z

A virtual device that stores data and parity on multiple disks, similar to RAID-5. For more information about RAID-Z, see RAID-Z Storage Pool Configuration.
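
A minimal sketch of creating a RAID-Z pool, assuming three available disks (the device names are hypothetical):

# zpool create tank raidz c1t0d0 c1t1d0 c2t0d0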

resilvering

The process of transferring data from one device to another device is known as resilvering. For example, if a mirror component is replaced or taken offline, the data from the up-to-date mirror component is copied to the newly restored mirror component. This process is referred to as mirror resynchronization in traditional volume management products.

For more information about ZFS resilvering, see Viewing Resilvering Status.
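
Resilvering starts automatically when a device is replaced. A minimal sketch, assuming the disk c1t1d0 in the pool tank is being swapped for c2t0d0 (the device names are hypothetical):

# zpool replace tank c1t1d0 c2t0d0
# zpool status tank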

snapshot

A read-only image of a file system or volume at a given point in time.

For more information about snapshots, see ZFS Snapshots.

virtual device

A logical device in a pool, which can be a physical device, a file, or a collection of devices.

For more information about virtual devices, see Virtual Devices in a Storage Pool.

volume

A dataset used to emulate a physical device in order to support legacy file systems.
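
Volumes are created with the -V option of zfs create. A minimal sketch, assuming a pool named tank:

# zfs create -V 5gb tank/vol

The volume is then available as the device /dev/zvol/dsk/tank/vol.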

ZFS Component Naming Requirements

Each ZFS component must be named according to the following rules:

  • Empty components are not allowed.

  • Each component can only contain alphanumeric characters in addition to the following four special characters:

    • Underscore (_)

    • Hyphen (-)

    • Colon (:)

    • Period (.)

  • Pool names must begin with a letter, except that the beginning sequence c[0-9] is not allowed. In addition, pool names that begin with mirror, raidz, or spare are not allowed, as these names are reserved.

  • Dataset names must begin with an alphanumeric character.
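
For example, tank and data-pool.1 are acceptable pool names, while 1pool (does not begin with a letter), c1pool (begins with the sequence c[0-9]), and mirrorpool (begins with mirror) are not. All names here are hypothetical.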

ZFS Hardware and Software Requirements and Recommendations

Make sure you review the following hardware and software requirements and recommendations before attempting to use the ZFS software:

  • A SPARC™ or x86 system that is running the Solaris™ Nevada release, build 27 or later.

  • The minimum disk size is 128 Mbytes. The minimum amount of disk space required for a storage pool is approximately 64 Mbytes.

  • Currently, the minimum amount of memory recommended to install a Solaris system is 512 Mbytes. However, for good ZFS performance, at least 1 Gbyte of memory is recommended.

  • If you create a mirrored disk configuration, multiple controllers are recommended.

Creating a Basic ZFS File System

ZFS administration has been designed with simplicity in mind. Among the goals of the ZFS design is to reduce the number of commands needed to create a usable file system. When you create a new pool, a new ZFS file system is created and mounted automatically.

The following example illustrates how to create a storage pool named tank and a ZFS file system named tank in one command. Assume that the whole disk /dev/dsk/c1t0d0 is available for use.

# zpool create tank c1t0d0 

The new ZFS file system, tank, can use as much of the disk space on c1t0d0 as needed, and is automatically mounted at /tank.

# mkfile 100m /tank/foo
# df -h /tank
Filesystem             size   used  avail capacity  Mounted on
tank                    80G   100M    80G     1%    /tank

Within a pool, you will probably want to create additional file systems. File systems provide points of administration that allow you to manage different sets of data within the same pool.

The following example illustrates how to create a file system named fs in the storage pool tank. Assume that the whole disk /dev/dsk/c1t0d0 is available for use.

# zpool create tank c1t0d0
# zfs create tank/fs

The new ZFS file system, tank/fs, can use as much of the disk space on c1t0d0 as needed, and is automatically mounted at /tank/fs.

# mkfile 100m /tank/fs/foo
# df -h /tank/fs
Filesystem             size   used  avail capacity  Mounted on
tank/fs                 80G   100M    80G     1%    /tank/fs

In most cases, you will probably want to create and organize a hierarchy of file systems that matches your organizational needs. For more information about creating a hierarchy of ZFS file systems, see Creating a ZFS File System Hierarchy.

Creating a ZFS Storage Pool

The previous example illustrates the simplicity of ZFS. The remainder of this chapter demonstrates a more complete example similar to what you would encounter in your environment. The first tasks are to identify your storage requirements and create a storage pool. The pool describes the physical characteristics of the storage and must be created before any file systems are created.

Identifying Storage Requirements

  1. Determine available devices.

    Before creating a storage pool, you must determine which devices will store your data. These devices must be disks of at least 128 Mbytes in size, and they must not be in use by other parts of the operating system. The devices can be individual slices on a preformatted disk, or they can be entire disks that ZFS formats as a single large slice.

    For the storage example used in Creating the ZFS Storage Pool, assume that the whole disks /dev/dsk/c1t0d0 and /dev/dsk/c1t1d0 are available for use.

    For more information about disks and how they are used and labeled, see Using Disks in a ZFS Storage Pool.

  2. Choose data replication.

    ZFS supports multiple types of data replication, which determines what types of hardware failures the pool can withstand. ZFS supports nonredundant (striped) configurations, as well as mirroring and RAID-Z (a variation on RAID-5).

    For the storage example used in Creating the ZFS Storage Pool, basic mirroring of two available disks is used.

    For more information about ZFS replication features, see Replication Features of a ZFS Storage Pool.

Creating the ZFS Storage Pool

  1. Become root or assume an equivalent role with the appropriate ZFS rights profile.

    For more information about the ZFS rights profiles, see ZFS Rights Profiles.

  2. Pick a pool name.

    The pool name is used to identify the storage pool when you are using the zpool or zfs commands. Most systems require only a single pool, so you can pick any name that you prefer, provided it satisfies the naming requirements outlined in ZFS Component Naming Requirements.

  3. Create the pool.

    For example, create a mirrored pool that is named tank.

    # zpool create tank mirror c1t0d0 c1t1d0 

    If one or more devices contain another file system or are otherwise in use, the command cannot create the pool; a representative example appears after this procedure.

    For more information about creating storage pools, see Creating a ZFS Storage Pool.

    For more information about how device usage is determined, see Detecting In-Use Devices.

  4. View the results.

    You can determine if your pool was successfully created by using the zpool list command.

    # zpool list
    NAME                  SIZE    USED   AVAIL    CAP  HEALTH     ALTROOT
    tank                   80G    137K     80G     0%  ONLINE     -

    For more information about viewing pool status, see Querying ZFS Storage Pool Status.
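
As noted in step 3, zpool create protects existing data: if a device still contains another file system, the pool is not created. A representative sketch of what this looks like (exact messages vary by release; the -f option overrides the check if you are sure the data is expendable):

# zpool create tank mirror c1t0d0 c1t1d0
invalid vdev specification
use '-f' to override the following errors:
/dev/dsk/c1t0d0s0 contains a ufs filesystem.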

Creating a ZFS File System Hierarchy

After creating a storage pool to store your data, you can create your file system hierarchy. Hierarchies are simple yet powerful mechanisms for organizing information. They are also very familiar to anyone who has used a file system.

ZFS allows file systems to be organized into arbitrary hierarchies, where each file system has only a single parent. The root of the hierarchy is always the pool name. ZFS leverages this hierarchy by supporting property inheritance so that common properties can be set quickly and easily on entire trees of file systems.

Determining the ZFS File System Hierarchy

  1. Pick the file system granularity.

    ZFS file systems are the central point of administration. They are lightweight and can be created easily. A good model to use is a file system per user or project, as this model allows properties, snapshots, and backups to be controlled on a per-user or per-project basis.

    Two ZFS file systems, bonwick and billm, are created in Creating ZFS File Systems.

    For more information on managing file systems, see Chapter 5, Managing ZFS File Systems.

  2. Group similar file systems.

    ZFS allows file systems to be organized into hierarchies so that similar file systems can be grouped. This model provides a central point of administration for controlling properties and administering file systems. Similar file systems should be created under a common name.

    For the example in Creating ZFS File Systems, the two file systems are placed under a file system named home.

  3. Choose the file system properties.

    Most file system characteristics are controlled by using simple properties. These properties control a variety of behavior, including where the file systems are mounted, how they are shared, if they use compression, and if any quotas are in effect.

    For the example in Creating ZFS File Systems, all home directories are mounted at /export/zfs/user, are shared by using NFS, and have compression enabled. In addition, a quota of 10 Gbytes on bonwick is enforced.

    For more information about properties, see ZFS Properties.

Creating ZFS File Systems

  1. Become root or assume an equivalent role with the appropriate ZFS rights profile.

    For more information about the ZFS rights profiles, see ZFS Rights Profiles.

  2. Create the desired hierarchy.

    In this example, a file system that acts as a container for individual file systems is created.

    # zfs create tank/home 

    Next, individual file systems are grouped under the home file system in the pool tank.

  3. Set the inherited properties.

    After the file system hierarchy is established, set up any properties that should be shared among all users:

    # zfs set mountpoint=/export/zfs tank/home
    # zfs set sharenfs=on tank/home
    # zfs set compression=on tank/home
    # zfs get compression tank/home
    NAME             PROPERTY     VALUE      SOURCE
    tank/home        compression  on         local

    For more information about properties and property inheritance, see ZFS Properties.

  4. Create the individual file systems.

    Note that the file systems could have been created and then the properties could have been changed at the home level. All properties can be changed dynamically while file systems are in use.

    # zfs create tank/home/bonwick
    # zfs create tank/home/billm

    These file systems inherit their property settings from their parent, so they are automatically mounted at /export/zfs/user and are NFS shared. You do not need to edit the /etc/vfstab or /etc/dfs/dfstab file.

    For more information about creating file systems, see Creating a ZFS File System.

    For more information about mounting and sharing file systems, see Mounting and Sharing ZFS File Systems.

  5. Set the file system-specific properties.

    In this example, user bonwick is assigned a quota of 10 Gbytes. This property places a limit on the amount of space he can consume, regardless of how much space is available in the pool.

    # zfs set quota=10G tank/home/bonwick

  6. View the results.

    View available file system information by using the zfs list command:

    # zfs list
    NAME                   USED  AVAIL  REFER  MOUNTPOINT
    tank                  92.0K  67.0G   9.5K  /tank
    tank/home             24.0K  67.0G     8K  /export/zfs
    tank/home/billm          8K  67.0G     8K  /export/zfs/billm
    tank/home/bonwick        8K  10.0G     8K  /export/zfs/bonwick

    Note that the user bonwick only has 10 Gbytes of space available, while the user billm can use the full pool (67 Gbytes).

    For more information about viewing file system status, see Querying ZFS File System Information.

    For more information about how space is used and calculated, see ZFS Space Accounting.