Userland Disk I/O
APIs
File I/O
In database land, most databases open(2) their WAL and data files with O_DIRECT so that write(2)/writev(2)/pwritev(2) perform unbuffered IO, maintain their own page cache, and use fdatasync() for durability. Doing so gives the most control over what data is maintained in the page cache and allows directly modifying cached data, while O_DIRECT skips the kernel’s page cache when reading data from or writing data to disk. O_SYNC/O_DSYNC allow a single write() with O_DIRECT to be equivalent to a write() followed by an fsync()/fdatasync(). In the Linux world, the existence of O_DIRECT is surprisingly controversial, and Linus has some famous rants on the subject illustrating the OS/DB world view mismatch.
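As a rough sketch of that pattern (the file name is an example, and the hard-coded 4096-byte alignment is an assumption; real code should query the filesystem and device for the required alignment):

    /* Sketch: open a data file for direct, synchronized IO and overwrite one
     * block, bypassing the kernel page cache.  Assumes 4096-byte blocks. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void) {
        /* O_DIRECT: bypass the kernel page cache.  O_DSYNC: each write() also
         * provides data-integrity completion, like an implicit fdatasync(). */
        int fd = open("datafile", O_RDWR | O_CREAT | O_DIRECT | O_DSYNC, 0644);
        if (fd < 0)
            return 1;

        /* O_DIRECT requires the buffer, offset, and length to be aligned,
         * typically to the logical block size. */
        void *buf;
        if (posix_memalign(&buf, 4096, 4096) != 0)
            return 1;
        memset(buf, 0x42, 4096);

        /* Overwrite block 0.  On success, the data is durable when pwrite()
         * returns, with no separate fdatasync() needed. */
        if (pwrite(fd, buf, 4096, 0) != 4096)
            return 1;

        free(buf);
        close(fd);
        return 0;
    }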
There are some notable examples of databases that rely on buffered IO and the kernel page cache (e.g. RocksDB, LMDB). Relying on the kernel’s page cache can be polite in the context of an embedded database meant to be used within another application and co-exist with many other applications on a user’s computer. Leaving the caching decisions to the kernel means that more memory for the page cache can be easily granted when the system has the memory to spare, and can be reclaimed when more available memory is needed.
If using buffered IO, preadv2/pwritev2’s flags can be helpful. pwritev2() has also gained support for multi-block atomic writes, which is conditional on filesystem and drive support[1].
[1]: Drive support means a drive for which Atomic Write Unit Power Fail (awupf) in nvme-cli id-ctrl
returns something greater than zero. I’ve never actually seen a drive support this though.
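A minimal sketch of those per-IO flags on a buffered file descriptor, assuming a reasonably recent kernel (the file name is an example; the atomic-write flag isn’t shown since it depends on kernel, filesystem, and drive support):

    /* Sketch: per-IO flags with preadv2()/pwritev2() on a buffered fd. */
    #define _GNU_SOURCE
    #include <errno.h>
    #include <fcntl.h>
    #include <string.h>
    #include <sys/uio.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("datafile", O_RDWR | O_CREAT, 0644);
        if (fd < 0)
            return 1;

        char block[4096];
        memset(block, 0x42, sizeof(block));
        struct iovec iov = { .iov_base = block, .iov_len = sizeof(block) };

        /* RWF_DSYNC makes just this write behave as if the fd were opened with
         * O_DSYNC, so it is durable on return without a separate fdatasync(). */
        if (pwritev2(fd, &iov, 1, 0, RWF_DSYNC) != (ssize_t)sizeof(block))
            return 1;

        /* RWF_NOWAIT: return EAGAIN instead of blocking if the data is not
         * already in the page cache, handy on latency-sensitive threads. */
        if (preadv2(fd, &iov, 1, 0, RWF_NOWAIT) < 0 && errno == EAGAIN) {
            /* fall back to issuing the read from a background thread */
        }

        close(fd);
        return 0;
    }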
Regardless of using buffered or unbuffered IO, it’s wise to be mindful of the extra cost of appending to a file versus overwriting a pre-allocated block within a file. Appending to a file changes the file’s size, which incurs additional filesystem metadata operations. Instead, consider using fallocate(2) to extend the file in larger chunks[2]. Note that depending on the filesystem, fallocate(FALLOC_FL_ZERO_RANGE) "preferably" converts the range into unwritten extents. The range will not be physically zeroed out on the device, and writes will still need to update metadata to mark the extent as written. Use of the default 0 mode is recommended. Heed this guidance especially when using O_DIRECT, as Clarifying Direct I/O Semantics largely focuses on how O_DIRECT writes which require metadata modifications are a complicated and not well specified topic.
[2]: I once benchmarked the difference between appending one block every write versus writing over pre-allocated blocks as about a 40% throughput difference.
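A sketch of preallocating in chunks with the default mode (the 64 MiB chunk size and the helper are arbitrary choices for illustration):

    /* Sketch: grow a file's allocation in large chunks so that subsequent
     * appends become overwrites of already-allocated space. */
    #define _GNU_SOURCE
    #include <fcntl.h>

    #define CHUNK_BYTES (64UL << 20)  /* 64 MiB, an arbitrary example */

    /* Extend the file's allocation (and size) to the next CHUNK_BYTES
     * boundary past needed_size, using the default mode 0 recommended above. */
    int preallocate_to(int fd, off_t needed_size) {
        off_t target = ((needed_size / CHUNK_BYTES) + 1) * CHUNK_BYTES;
        return fallocate(fd, 0, 0, target);
    }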
Directly invoking write() performs synchronous IO. Event-driven non-blocking IO is increasingly popular for concurrency and for thread-per-core architectures, and there are a number of different ways to do asynchronous IO in Linux. Prefer them in the following order: io_uring(7) > io_submit(2) > aio(7) > epoll(7) > select(2). The number of operations one can issue asynchronously decreases rapidly the further one gets from io_uring. For example, io_uring supports an async fallocate(2), but aio doesn’t; aio supports an async fsync(), but epoll doesn’t. A library which issues synchronous filesystem calls on background threads, like libeio, will be needed to fill in support where it’s missing. For the utmost performance, one can use SPDK, but it is particularly unfriendly to use. Understanding Modern Storage APIs[3] has a nice comparison of SPDK vs io_uring vs aio.
[3]: Diego Didona, Jonas Pfefferle, Nikolas Ioannou, Bernard Metzler, and Animesh Trivedi. 2022. Understanding modern storage APIs: a systematic study of libaio, SPDK, and io_uring. In Proceedings of the 15th ACM International Conference on Systems and Storage (SYSTOR '22), Association for Computing Machinery, New York, NY, USA, 120–127. [scholar]
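As a taste of the io_uring path, here is a sketch using the liburing helper library that links a write to a following fdatasync-style fsync, so the flush only runs after the write completes (the file name is an example, and error handling is trimmed):

    /* Sketch: a linked write + fdatasync pair with liburing. */
    #include <fcntl.h>
    #include <liburing.h>
    #include <string.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("datafile", O_RDWR | O_CREAT, 0644);
        if (fd < 0)
            return 1;

        struct io_uring ring;
        if (io_uring_queue_init(8, &ring, 0) != 0)
            return 1;

        static char block[4096];
        memset(block, 0x42, sizeof(block));

        /* Queue the write, and link the following SQE so it starts only
         * after this one succeeds. */
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_write(sqe, fd, block, sizeof(block), 0);
        sqe->flags |= IOSQE_IO_LINK;

        /* Queue an fdatasync-equivalent fsync. */
        sqe = io_uring_get_sqe(&ring);
        io_uring_prep_fsync(sqe, fd, IORING_FSYNC_DATASYNC);

        io_uring_submit(&ring);

        /* Reap both completions. */
        for (int i = 0; i < 2; i++) {
            struct io_uring_cqe *cqe;
            io_uring_wait_cqe(&ring, &cqe);
            io_uring_cqe_seen(&ring, cqe);
        }

        io_uring_queue_exit(&ring);
        close(fd);
        return 0;
    }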
For commentary on each of these asynchronous IO frameworks, the libev source code is a treasure which catalogs all the caveats in a leading rant comment in each source file:
- libev/ev_linuxaio.c — highly recommended reading
- libev/ev_iouring.c — "overall, the API itself is, I dare to say, not a total trainwreck."
On macOS, the options for asynchronous IO are incredibly limited. The aio_* calls from aio(7) will work, but the io_* ones are Linux-specific. Otherwise, use of libeio or a similar threadpool-based asynchronous IO framework is recommended.
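A sketch of a single asynchronous write with the portable aio_* calls, polling for completion (the file name is an example; on Linux, older glibc may need -lrt):

    /* Sketch: one asynchronous write via POSIX aio(7), which also works on
     * macOS.  Completion is polled here; signals or threads work too. */
    #include <aio.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("datafile", O_WRONLY | O_CREAT, 0644);
        if (fd < 0)
            return 1;

        static char block[4096];
        memset(block, 0x42, sizeof(block));

        struct aiocb cb;
        memset(&cb, 0, sizeof(cb));
        cb.aio_fildes = fd;
        cb.aio_buf    = block;
        cb.aio_nbytes = sizeof(block);
        cb.aio_offset = 0;

        if (aio_write(&cb) != 0)
            return 1;

        /* Wait for the operation to finish. */
        const struct aiocb *list[1] = { &cb };
        while (aio_error(&cb) == EINPROGRESS)
            aio_suspend(list, 1, NULL);

        return aio_return(&cb) == (ssize_t)sizeof(block) ? 0 : 1;
    }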
On Windows, I/O Completion Ports is the canonical way to perform asynchronous IO.
Durability
fsync(2) is the core primitive for making data durable, by which we mean "writes completed before fsync() began will continue to exist on disk even if you rip out the power cable after fsync() completes". fsync() is the beloved function because the alternatives have caveats. sync(2) applies all buffered writes to all disks, not just the ones performed as part of the database’s operations. sync_file_range(2) allows only ranges of a file to have their buffered changes forced to disk, but it is non-standard and only provides durability on ext4 and xfs[4].
[4]: The sync_file_range()
manpage states "This system call does not flush disk write caches and thus does not provide any data integrity on systems with volatile disk write caches." However, testing done by the BonsaiDB author confirmed that FUA bits are set only on ext4 and xfs, but not btrfs or zfs. See discussion within Sled and RocksDB for further reasons to be cautious.
The methods of ensuring data durably reaches disk split between those that ensure File Integrity (fsync() and O_SYNC) and those that ensure Data Integrity (fdatasync() and O_DSYNC). For their definitions, we look to the man pages:
O_SYNC provides synchronized I/O file integrity completion, meaning write operations will flush data and all associated metadata to the underlying hardware. O_DSYNC provides synchronized I/O data integrity completion, meaning write operations will flush data to the underlying hardware, but will only flush metadata updates that are required to allow a subsequent read operation to complete successfully. Data integrity completion can reduce the number of disk operations that are required for applications that don’t need the guarantees of file integrity completion.
To understand the difference between the two types of completion, consider two pieces of file metadata: the file last modification timestamp (st_mtime) and the file length. All write operations will update the last file modification timestamp, but only writes that add data to the end of the file will change the file length. The last modification timestamp is not needed to ensure that a read completes successfully, but the file length is. Thus, O_DSYNC would only guarantee to flush updates to the file length metadata (whereas O_SYNC would also always flush the last modification timestamp metadata).
To reiterate, if you issue a write to a file using O_DIRECT
which appends to the file, and then call fdatasync()
, the appended data will be durable and readable once fdatasync()
returns, as the filesystem metadata change to increase the length of the file is required to have also been made durable.
However, using fsync()
correctly and minimally is still not easy.
Dan Luu’s File Consistency page and links therein provide a nice overview of some of the challenges. On the Complexity of Crafting Crash-Consistent Applications[5] takes an even deeper look. The general rules to be aware of are:
[5]: Thanumalayan Sankaranarayana Pillai, Vijay Chidambaram, Ramnatthan Alagappan, Samer Al-Kiswany, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. 2014. All File Systems Are Not Created Equal: On the Complexity of Crafting Crash-Consistent Applications. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), USENIX Association, Broomfield, CO, 433–448. [scholar]
- To write into a new file, first fsync() the file, then fsync() the containing directory.
- If using the rename()-is-atomic trick, again first fsync() the file, then rename(), then fsync() the directory (see the sketch below).[6]
- On the first open of a mutable file, call fsync(), as a previous incarnation of the process might have crashed and left non-durable changes in the file.

[6]: Except there’s this one report of rename() not being atomic on the Windows Subsystem for Linux, so who knows.
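Sketching the rename()-is-atomic sequence in full (the paths and the helper are made up for illustration, and error handling is abbreviated):

    /* Sketch: write a temp file, fsync it, rename it over the target, then
     * fsync the directory so the rename itself is durable. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int atomic_replace(const char *dir, const char *tmp_path,
                       const char *final_path, const void *data, size_t len) {
        int fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return -1;
        if (write(fd, data, len) != (ssize_t)len || fsync(fd) != 0) {
            close(fd);
            return -1;
        }
        close(fd);

        /* rename() atomically swaps the temp file into place... */
        if (rename(tmp_path, final_path) != 0)
            return -1;

        /* ...but the new directory entry is only durable once the directory
         * itself has been fsync()ed. */
        int dirfd = open(dir, O_RDONLY | O_DIRECTORY);
        if (dirfd < 0)
            return -1;
        int rc = fsync(dirfd);
        close(dirfd);
        return rc;
    }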
fsync() can fail, and improper handling of that error started fsyncgate, as folk noticed that PostgreSQL, among other databases, handled it incorrectly. This received further examination in Can Applications Recover from fsync Failures?[7].
[7]: Anthony Rebello, Yuvraj Patel, Ramnatthan Alagappan, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. 2021. Can Applications Recover from fsync Failures? ACM Trans. Storage 17, 2 (June 2021). [scholar]
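In the spirit of those findings, a minimal sketch of one defensible policy: treat an fsync() failure as unrecoverable for the affected file rather than retrying (the helper and the abort() are illustrative choices, not a prescription from the paper):

    /* Sketch: if fsync() fails, the affected dirty pages may already have
     * been dropped or marked clean, so retrying the fsync() proves nothing.
     * One policy is to crash and recover from the WAL on restart. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    void fsync_or_die(int fd) {
        if (fsync(fd) != 0) {
            /* Do not retry: the failed writeback cannot be assumed to still
             * be pending in the page cache. */
            perror("fsync");
            abort();
        }
    }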
On macOS, note that the functions and promises to enforce durability are different from those on other Unix platforms, because macOS intentionally violated the standards. See Darwin’s Deceptive Durability for the overview. And continuing its quest to be as difficult as possible to write a database on, macOS uniquely does not support O_DIRECT
, and thus one must invoke fcntl(F_NOCACHE)
to get equivalent behavior.
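A sketch of the macOS-specific calls (helper names are made up; F_FULLFSYNC is the fcntl which, as that article discusses, actually asks the drive to flush its volatile cache, since plain fsync() on macOS does not):

    /* Sketch for macOS: F_NOCACHE approximates O_DIRECT, and F_FULLFSYNC
     * requests a real flush to stable storage. */
    #include <fcntl.h>
    #include <unistd.h>

    int open_mac_datafile(const char *path) {
        int fd = open(path, O_RDWR | O_CREAT, 0644);
        if (fd < 0)
            return -1;
        /* Ask the kernel not to cache reads/writes on this fd. */
        if (fcntl(fd, F_NOCACHE, 1) == -1) {
            close(fd);
            return -1;
        }
        return fd;
    }

    int mac_full_fsync(int fd) {
        /* Flushes the drive's volatile write cache as well. */
        return fcntl(fd, F_FULLFSYNC);
    }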
On BSD flavors, UFS specifically does not issue a volatile drive cache flush as part of fsync()
, as UFS relies on softupdates for consistency. Matthew Dillon from DragonflyBSD land has a well-written complaint about this, and my attempt at reviewing the current UFS code seems to agree that there’s no BIO_FLUSH
issued. This is also confirmed on the postgres mailing list where Thomas Munro additionally points out O_SYNC
/O_DSYNC
are not forced through the volatile disk cache on UFS as well. There is no workaround to get powersafe durability, other than using ZFS or the native ext3/4 support.
On Windows, FlushFileBuffers()
[8] is equivalent to fsync()
, and NtFlushBuffersFileEx(FLUSH_FLAGS_FILE_DATA_SYNC_ONLY)
is equivalent to fdatasync()
. For files opened with _open()
, call _commit()
instead.
[8]: Except this terrifying note on the reliability of FlushFileBuffers saying "Fortunately, nearly all drivers in the Windows 7 era respect [the command to force changes to disk]. (There are a few stragglers that still ignore it.)"
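A small sketch of the two fsync()-style calls mentioned above (helper names are made up; the NtFlushBuffersFileEx variant is omitted):

    /* Sketch for Windows: force buffered data out to disk. */
    #include <io.h>       /* _commit */
    #include <windows.h>  /* FlushFileBuffers */

    /* For files opened with CreateFile(): roughly fsync(). */
    BOOL durable_flush(HANDLE file) {
        return FlushFileBuffers(file);
    }

    /* For files opened with _open(): use _commit() instead. */
    int durable_commit(int fd) {
        return _commit(fd);
    }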
Lastly, Force Unit Access (commonly abbreviated as FUA) is the term that the SCSI/SATA/NVMe specifications use for "please force this data to be on non-volatile storage", so if you’re ever trying to google durability related things, adding "FUA" will get you better answers.
Filesystems
Prefer XFS if you can. It benchmarks well overall, and it handles well a number of special cases that are important for databases.
Filesystems maintain metadata about how blocks are associated with files, and
optimizing around this will lead to lower latency. Ext4 and XFS both can
aggregate contiguous blocks in a file into a single extent, reducing the
metadata overhead. This encourages appending to files in large chunks at a time
(or using fallocate to extend the file before performing a series of small
appends). The desire to maintain large extents also discourages excessive use of some filesystem metadata calls: fine-grained use of FALLOC_FL_PUNCH_HOLE, for example, is an easy way to continuously fragment extents.
Large files incur large metadata, and so it’s often a good idea to incrementally truncate down a large file before unlinking it; otherwise the entire metadata traversal and deletion will be performed synchronously with the unlink.
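A sketch of that incremental teardown (the 1 GiB step size and the helper are arbitrary choices):

    /* Sketch: shrink a large file in steps before unlinking it, so the
     * extent teardown is spread over many small truncates instead of one
     * big synchronous unlink. */
    #include <sys/stat.h>
    #include <unistd.h>

    int incremental_unlink(const char *path) {
        struct stat st;
        if (stat(path, &st) != 0)
            return -1;
        const off_t step = 1L << 30;  /* 1 GiB per truncate, as an example */
        off_t size = st.st_size;
        while (size > 0) {
            size = size > step ? size - step : 0;
            if (truncate(path, size) != 0)
                return -1;
        }
        return unlink(path);
    }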
How the storage device is attached to the system changes the number of parallel operations it can possibly support. (And the range is wide: SATA NCQ supports 32 concurrent requests, NVMe supports 65k.) If you submit more than this, there’s implicit queuing that happens in the kernel, and userspace only sees increased latencies. Theoretically ionice(1) and ioprio_set(2) offer some control over how requests are prioritized in that queue, but I’ve never really noticed ionice make a difference.
It’s possible to open a raw block device and entirely bypass the filesystem. Doing so requires that all reads and writes be 4k aligned and a multiple of 4k in size. It also requires reimplementing everything that comes for free with a filesystem: free block tracking, disk space usage reporting, snapshot-based backup/restore, application logging, drive health testing. Anecdotally, I’ve heard that the advantage of all of this is an ~10% speedup, so not a tradeoff that’s often worth the cost. But for easy experimentation and testing of direct block storage access, a loopback device (losetup(8)) allows mounting a file as a block device. I’d highly recommend using xNVMe if you’re looking to directly interact with NVMe block storage.
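For poking at block devices directly, a sketch of querying a device’s size and logical block size, which is the alignment its direct IO would need (the loop device path is an example):

    /* Sketch: query a block device's capacity and logical block size before
     * doing aligned IO against it. */
    #include <fcntl.h>
    #include <linux/fs.h>   /* BLKGETSIZE64, BLKSSZGET */
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("/dev/loop0", O_RDONLY);
        if (fd < 0)
            return 1;

        uint64_t bytes = 0;
        int logical_block = 0;
        if (ioctl(fd, BLKGETSIZE64, &bytes) != 0 ||
            ioctl(fd, BLKSSZGET, &logical_block) != 0)
            return 1;

        printf("%llu bytes, %d-byte logical blocks\n",
               (unsigned long long)bytes, logical_block);
        close(fd);
        return 0;
    }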
Kernel Things
Be aware of IO schedulers. The general advice is to prefer mq-deadline or none for SSDs (SATA or NVMe), as the drives are fast enough that spending extra effort on scheduling generally isn’t worthwhile.
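To check which scheduler a device is currently using, one can read its sysfs queue attribute; a sketch, with the device name as an example (the active scheduler is shown in brackets, e.g. "[none] mq-deadline kyber bfq"):

    /* Sketch: print the available and active IO schedulers for one device. */
    #include <stdio.h>

    int main(void) {
        FILE *f = fopen("/sys/block/nvme0n1/queue/scheduler", "r");
        if (!f)
            return 1;
        char line[256];
        if (fgets(line, sizeof(line), f))
            fputs(line, stdout);
        fclose(f);
        return 0;
    }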
If using buffered IO, vm.dirty_ratio controls when Linux will start writing modified pages to disk.
You can periodically scrape /proc/diskstats
to self-report on disk metrics.
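A sketch of such a scraper, following the documented /proc/diskstats field order and printing only a few of the fields:

    /* Sketch: read /proc/diskstats and report reads, writes, and time spent
     * doing IO per device. */
    #include <stdio.h>

    int main(void) {
        FILE *f = fopen("/proc/diskstats", "r");
        if (!f)
            return 1;

        unsigned int major, minor;
        char dev[64];
        unsigned long long rd_ios, rd_merges, rd_sectors, rd_ms;
        unsigned long long wr_ios, wr_merges, wr_sectors, wr_ms;
        unsigned long long in_flight, io_ms, weighted_ms;

        while (fscanf(f,
                      "%u %u %63s %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu",
                      &major, &minor, dev,
                      &rd_ios, &rd_merges, &rd_sectors, &rd_ms,
                      &wr_ios, &wr_merges, &wr_sectors, &wr_ms,
                      &in_flight, &io_ms, &weighted_ms) == 14) {
            printf("%s: %llu reads, %llu writes, %llu ms doing IO\n",
                   dev, rd_ios, wr_ios, io_ms);
            /* Newer kernels append discard and flush fields; skip the rest
             * of the line so the next iteration starts at the next device. */
            int c;
            while ((c = fgetc(f)) != '\n' && c != EOF)
                ;
        }
        fclose(f);
        return 0;
    }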