Userland Disk I/O

[Figure: mental model]

File I/O

In database land, most databases open(2)Linux their WAL and data files with O_DIRECT so that write(2)Linux/writev(2)Linux/pwritev(2)Linux perform unbuffered IO, maintain their own page cache, and use fdatasync() for durability. Doing so gives the most control over what data is maintained in the page cache and allows directly modifying cached data, while O_DIRECT skips the kernel’s page cache when reading data from or writing data to disk. O_SYNC/O_DSYNC allow a single write() with O_DIRECT to be equivalent to a write() followed by an fsync()/fdatasync(). In the Linux world, the existence of O_DIRECT is surprisingly controversial, and Linus has some famous rants on the subject illustrating the OS/DB world view mismatch.
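As a minimal sketch of that pattern: an O_DIRECT | O_DSYNC write with an aligned buffer. The file name, the 4KiB alignment, and the terse error handling are illustrative assumptions, not anything prescribed above.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    /* O_DSYNC makes this write behave like write() + fdatasync(). */
    int fd = open("wal.log", O_WRONLY | O_CREAT | O_DIRECT | O_DSYNC, 0644);
    if (fd < 0) return 1;

    /* O_DIRECT requires the buffer, offset, and length to be aligned,
     * typically to the logical block size (assumed 4096 here). */
    void *buf;
    if (posix_memalign(&buf, 4096, 4096) != 0) return 1;
    memset(buf, 'A', 4096);

    /* Once this returns, the data (and the metadata needed to read it
     * back) has reached the device. */
    if (pwrite(fd, buf, 4096, 0) != 4096) return 1;

    free(buf);
    close(fd);
    return 0;
}
```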

There are some notable examples of databases that rely on buffered IO and the kernel page cache (e.g. RocksDB, LMDB). Relying on the kernel’s page cache can be polite in the context of an embedded database meant to be used within another application and co-exist with many other applications on a user’s computer. Leaving the caching decisions to the kernel means that more memory for the page cache can be easily granted when the system has memory to spare, and reclaimed when it’s needed elsewhere. If using buffered IO, preadv2()/pwritev2()’s flags can be helpful. pwritev2() has also gained support for multi-block atomic writes, which is conditional on filesystem and drive support[1]. [1]: Drive support means a drive for which Atomic Write Unit Power Fail (awupf) in nvme-cli id-ctrl returns something greater than zero. I’ve never actually seen a drive support this though.
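As a sketch of the kind of per-call flags available (the path, buffer sizes, and fallback behavior here are illustrative assumptions): RWF_DSYNC makes a single buffered write durable on its own, and RWF_NOWAIT turns an uncached read into an immediate EAGAIN instead of a block.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void) {
    int fd = open("data.db", O_RDWR | O_CREAT, 0644);  /* illustrative path */
    if (fd < 0) return 1;

    char wbuf[512] = {0};
    struct iovec wiov = { .iov_base = wbuf, .iov_len = sizeof(wbuf) };
    /* Behaves like write() + fdatasync(), but only for this one write. */
    if (pwritev2(fd, &wiov, 1, 0, RWF_DSYNC) < 0) perror("pwritev2");

    char rbuf[512];
    struct iovec riov = { .iov_base = rbuf, .iov_len = sizeof(rbuf) };
    /* Fails with EAGAIN instead of blocking if the data isn't already in
     * the page cache; a caller could then hand the read to another thread. */
    if (preadv2(fd, &riov, 1, 0, RWF_NOWAIT) < 0 && errno == EAGAIN)
        fprintf(stderr, "not cached; would queue a blocking read\n");

    close(fd);
    return 0;
}
```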

Directly invoking write() performs synchronous IO. Event-driven non-blocking IO is increasingly popular for concurrency and for thread-per-core architectures, and there are a number of different ways to do asynchronous IO in Linux. Prefer them in the following order: io_uring(7)Linux > aio(7)Linux > epoll(7)Linux > select(2)Linux. The set of operations one can issue asynchronously shrinks rapidly the further one gets from io_uring. For example, io_uring supports an async fallocate(2)Linux, but aio doesn’t; aio supports async fsync(), and epoll doesn’t. A library which issues synchronous filesystem calls on background threads, like libeio, will be needed to fill in support where it’s missing. For the utmost performance, one can use SPDK, but it is particularly unfriendly to use. Understanding Modern Storage APIs[2] has a nice comparison of SPDK vs io_uring vs aio. [2]: Diego Didona, Jonas Pfefferle, Nikolas Ioannou, Bernard Metzler, and Animesh Trivedi. 2022. Understanding modern storage APIs: a systematic study of libaio, SPDK, and io_uring. In Proceedings of the 15th ACM International Conference on Systems and Storage (SYSTOR '22), Association for Computing Machinery, New York, NY, USA, 120–127. [scholar]
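A minimal io_uring sketch using liburing (link with -luring): one write is submitted and its completion reaped. The queue depth, file name, and buffer size are illustrative.

```c
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    struct io_uring ring;
    if (io_uring_queue_init(32, &ring, 0) < 0) return 1;

    int fd = open("file.bin", O_WRONLY | O_CREAT, 0644);
    if (fd < 0) return 1;

    char buf[4096];
    memset(buf, 'x', sizeof(buf));

    /* Queue a write at offset 0; the kernel performs it asynchronously. */
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_write(sqe, fd, buf, sizeof(buf), 0);
    io_uring_submit(&ring);

    /* Block until the completion arrives; cqe->res holds the write's result. */
    struct io_uring_cqe *cqe;
    if (io_uring_wait_cqe(&ring, &cqe) < 0) return 1;
    printf("write returned %d\n", cqe->res);
    io_uring_cqe_seen(&ring, cqe);

    close(fd);
    io_uring_queue_exit(&ring);
    return 0;
}
```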

For commentary on each of these asynchronous IO frameworks, the libev source code is a treasure which catalogs all of their caveats in a leading rant comment in each source file.

Durability

fsync(2)Linux is the core primitive for making data durable, by which we mean "writes completed before fsync() began will continue to exist on disk even if you rip out the power cable after fsync() completes". fsync() is the one to prefer because the alternatives all have caveats. sync(2)Linux applies all buffered writes to all disks, not just the ones performed as part of the database’s operations. sync_file_range(2)Linux allows only ranges of a file to have their buffered changes forced to disk, but is non-standard, and only provides durability on ext4 and xfs[3]. [3]: The sync_file_range() manpage states "This system call does not flush disk write caches and thus does not provide any data integrity on systems with volatile disk write caches." However, testing done by the BonsaiDB author confirmed that FUA bits are set only on ext4 and xfs, but not btrfs or zfs. See discussion within Sled and RocksDB for further reasons to be cautious.
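For completeness, here is what a sync_file_range() call looks like; given the caveat above, treat it as write-back scheduling rather than a durability guarantee. The helper name and flag combination are just one plausible usage.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/types.h>

/* Force write-back of one range of a file. WAIT_BEFORE + WRITE + WAIT_AFTER
 * waits for any in-flight write-back of the range, submits it, and waits for
 * it to complete. Whether the disk's write cache is also flushed depends on
 * the filesystem (see footnote [3]). */
int flush_range(int fd, off_t off, off_t len) {
    return sync_file_range(fd, off, len,
                           SYNC_FILE_RANGE_WAIT_BEFORE |
                           SYNC_FILE_RANGE_WRITE |
                           SYNC_FILE_RANGE_WAIT_AFTER);
}
```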

For each method of ensuring data durably reaches disk, there’s a split between the methods that ensure File Integrity (fsync() and O_SYNC) and those that ensure Data Integrity (fdatasync() and O_DSYNC). For their definitions, we look to the man pages:

O_SYNC provides synchronized I/O file integrity completion, meaning write operations will flush data and all associated metadata to the underlying hardware. O_DSYNC provides synchronized I/O data integrity completion, meaning write operations will flush data to the underlying hardware, but will only flush metadata updates that are required to allow a subsequent read operation to complete successfully. Data integrity completion can reduce the number of disk operations that are required for applications that don’t need the guarantees of file integrity completion.

To understand the difference between the two types of completion, consider two pieces of file metadata: the file last modification timestamp (st_mtime) and the file length. All write operations will update the last file modification timestamp, but only writes that add data to the end of the file will change the file length. The last modification timestamp is not needed to ensure that a read completes successfully, but the file length is. Thus, O_DSYNC would only guarantee to flush updates to the file length metadata (whereas O_SYNC would also always flush the last modification timestamp metadata).

— open(2)Linux

To reiterate: if you issue a write using O_DIRECT which appends to the file, and then call fdatasync(), the appended data will be durable and readable once fdatasync() returns, because the filesystem metadata change to increase the length of the file is required to have also been made durable.

However, using fsync() correctly and minimally is still not easy. Dan Luu’s File Consistency page and links therein provide a nice overview of some of the challenges. On the Complexity of Crafting Crash-Consistent Applications[4] takes an even deeper look. The general rules to be aware of are: [4]: Thanumalayan Sankaranarayana Pillai, Vijay Chidambaram, Ramnatthan Alagappan, Samer Al-Kiswany, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. 2014. All File Systems Are Not Created Equal: On the Complexity of Crafting Crash-Consistent Applications. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), USENIX Association, Broomfield, CO, 433–448. [scholar]

fsync() can fail, and improperly handling that error started fsyncgate, as folks noticed that PostgreSQL, among other databases, handled it incorrectly. This received further examination in Can Applications Recover from fsync Failures?[5]. [5]: Anthony Rebello, Yuvraj Patel, Ramnatthan Alagappan, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. 2021. Can Applications Recover from fsync Failures? ACM Trans. Storage 17, 2 (June 2021). [scholar]
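A common takeaway is to treat a failed fsync() as fatal rather than retrying it, since the kernel may have already dropped the dirty pages whose write-back failed and a later "successful" fsync() proves nothing. A sketch (the helper name and the recovery strategy described in the comment are assumptions, not prescriptions from the paper):

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

void fsync_or_die(int fd) {
    if (fsync(fd) != 0) {
        /* Retrying here is unsafe; recover by crashing and replaying from
         * the last state known to be durable (e.g. the WAL). */
        fprintf(stderr, "fsync failed: %s\n", strerror(errno));
        abort();
    }
}
```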

Lastly, note that the functions and promises used to enforce durability on macOS differ from those on other Unix platforms, because macOS intentionally violated the standards. See Darwin’s Deceptive Durability for the overview.
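Concretely, on macOS fsync() does not force data out of the drive’s write cache; fcntl(F_FULLFSYNC) is the call that does. A small sketch (the fallback to plain fsync() when F_FULLFSYNC is unsupported is a choice, not a requirement):

```c
#include <fcntl.h>
#include <unistd.h>

/* On Darwin, ask for a full flush through to stable storage; elsewhere (or
 * on filesystems that reject F_FULLFSYNC) fall back to a regular fsync(). */
int full_fsync(int fd) {
#ifdef F_FULLFSYNC
    if (fcntl(fd, F_FULLFSYNC) == 0)
        return 0;
#endif
    return fsync(fd);
}
```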

Filesystems

Prefer XFS if you can. It benchmarks well overall, and it gracefully handles a number of special cases that matter to databases.

Filesystems maintain metadata about how blocks are associated with files, and optimizing around this will lead to lower latency. Ext4 and XFS can both aggregate contiguous blocks in a file into a single extent, reducing the metadata overhead. This encourages appending to files in large chunks at a time (or using fallocate to extend the file before performing a series of small appends). Maintaining large extents also means avoiding excessive use of some filesystem metadata calls; fine-grained use of FALLOC_FL_PUNCH_HOLE, for example, is an easy way to continuously fragment extents. Large files accumulate a lot of metadata, so it’s often a good idea to incrementally truncate a large file down before unlinking it, otherwise the entire metadata traversal and deletion will be performed synchronously with the unlink.
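A sketch of two of those patterns, preallocating space before a series of small appends and truncating a file down in chunks before unlinking it (the helper names and chunk size are illustrative):

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Reserve space up front so small appends don't each extend the file. */
int preallocate(int fd, off_t bytes) {
    return fallocate(fd, 0, 0, bytes);
}

/* Shrink the file a chunk at a time so the final unlink has little
 * metadata left to tear down synchronously. */
int truncate_then_unlink(const char *path, int fd, off_t chunk) {
    struct stat st;
    if (fstat(fd, &st) != 0) return -1;
    for (off_t size = st.st_size; size > 0; ) {
        size = size > chunk ? size - chunk : 0;
        if (ftruncate(fd, size) != 0) return -1;
    }
    return unlink(path);
}
```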

How the storage device is attached to the system changes the number of parallel operations it can possibly support. (And the range is wide: SATA NCQ supports 32 concurrent requests, NVMe supports 65k.) If you submit more than this, there’s implicit queuing that happens in the kernel. Theoretically ionice(1)Linux and ioprio_set(2)Linux offer some control over how requests are prioritized in that queue, but I’ve never really noticed ionice make a difference.

It’s possible to open a raw block device and entirely bypass the filesystem. Doing so requires that all reads and writes be 4k aligned and a multiple of 4k in size. It also requires reimplementing everything that comes for free with a filesystem: free block tracking, disk space usage reporting, snapshot-based backup/restore, application logging, drive health testing. Anecdotally, I’ve heard that the advantage of all of this is an ~10% speedup, so not a tradeoff that’s often worth the cost. But for easy experimentation and testing of direct block storage access, a loopback device (losetup(8)Linux) allows mounting a file as a block device. I’d highly recommend using xNVMe if you’re looking to directly interact with NVMe block storage.
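For experimentation, a raw (or loopback) block device can be read like any other file descriptor as long as the O_DIRECT alignment rules are respected, and BLKGETSIZE64 reports its capacity. The device path and 4KiB constants below are illustrative:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/fs.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void) {
    /* e.g. a loopback device previously created with losetup. */
    int fd = open("/dev/loop0", O_RDONLY | O_DIRECT);
    if (fd < 0) return 1;

    uint64_t dev_bytes = 0;
    if (ioctl(fd, BLKGETSIZE64, &dev_bytes) != 0) return 1;
    printf("device size: %llu bytes\n", (unsigned long long)dev_bytes);

    /* Aligned buffer, aligned offset, multiple-of-4k length. */
    void *buf;
    if (posix_memalign(&buf, 4096, 4096) != 0) return 1;
    if (pread(fd, buf, 4096, 0) != 4096) return 1;  /* read the first block */

    free(buf);
    close(fd);
    return 0;
}
```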

Kernel Things

Be aware of IO schedulers. The general advice is to prefer mq-deadline or none for SSDs (SATA or NVMe), as the drives are fast enough that heavier scheduling overhead generally isn’t worthwhile.
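The active scheduler for a device is exposed (and settable) through sysfs. A sketch, with nvme0n1 as a stand-in device name and root privileges assumed for the write:

```c
#include <stdio.h>

int main(void) {
    char line[256];

    /* The active scheduler is shown in brackets, e.g. "[none] mq-deadline". */
    FILE *f = fopen("/sys/block/nvme0n1/queue/scheduler", "r");
    if (f && fgets(line, sizeof(line), f))
        printf("current: %s", line);
    if (f) fclose(f);

    /* Selecting a scheduler is just writing its name back. */
    f = fopen("/sys/block/nvme0n1/queue/scheduler", "w");
    if (f) {
        fputs("none", f);
        fclose(f);
    }
    return 0;
}
```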

If using buffered IO, vm.dirty_background_ratio controls when Linux starts writing modified pages to disk in the background, and vm.dirty_ratio controls the point at which processes doing writes are throttled and made to flush dirty pages themselves.

You can periodically scrape /proc/diskstats to self-report on disk metrics.
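A sketch of such scraping (the choice of which fields to report is arbitrary): each /proc/diskstats line is major, minor, and device name followed by at least eleven counters, and the sector counts are always in 512-byte units.

```c
#include <stdio.h>

int main(void) {
    FILE *f = fopen("/proc/diskstats", "r");
    if (!f) return 1;

    char line[512], name[64];
    unsigned long long reads, rsect, writes, wsect;
    while (fgets(line, sizeof(line), f)) {
        /* Fields after the name: reads completed, reads merged, sectors read,
         * ms reading, writes completed, writes merged, sectors written, ... */
        if (sscanf(line, "%*u %*u %63s %llu %*llu %llu %*llu %llu %*llu %llu",
                   name, &reads, &rsect, &writes, &wsect) == 5) {
            printf("%s: %llu reads (%llu MiB), %llu writes (%llu MiB)\n",
                   name, reads, rsect * 512 / (1024 * 1024),
                   writes, wsect * 512 / (1024 * 1024));
        }
    }
    fclose(f);
    return 0;
}
```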