Darwin’s Deceptive Durability

A reminder that macOS does not respect the usual ways of making data durable on disk.

fsync does not fsync

As per the fsync manpage on darwin:

Note that while fsync() will flush all data from the host to the drive (i.e. the "permanent storage device"), the drive itself may not physically write the data to the platters for quite some time and it may be written in an out-of-order sequence.

Specifically, if the drive loses power or the OS crashes, the application may find that only some or none of their data was written. The disk drive may also re-order the data so that later writes may be present, while earlier writes are not.

This is not a theoretical edge case. This scenario is easily reproduced with real world workloads and drive power failures.

For applications that require tighter guarantees about the integrity of their data, Mac OS X provides the F_FULLFSYNC fcntl. The F_FULLFSYNC fcntl asks the drive to flush all buffered data to permanent storage. Applications, such as databases, that require a strict ordering of writes should use F_FULLFSYNC to ensure that their data is written in the order they expect. Please see fcntl(2) for more detail.

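Note that fcntl(fd, F_FULLFSYNC) is a one-shot call that does what fsync() is expected to do on other platforms. It is not a one-time setting that turns later fsync() calls into durable ones, so it must be issued every time durability is required.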
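As a rough illustration of the point above, a portable wrapper might look like the following. This is a minimal sketch, not a definitive implementation: the name durable_sync is hypothetical, and falling back to fsync() when F_FULLFSYNC fails (for example on a file system that does not support it) is an assumption about what the application is willing to tolerate, since the fallback gives the weaker guarantee described in the man page.

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: flush everything written to fd to stable storage.
 * Returns 0 on success, -1 on error with errno set. */
static int durable_sync(int fd)
{
#if defined(F_FULLFSYNC)
    /* macOS: ask the drive to flush its own cache as well. */
    if (fcntl(fd, F_FULLFSYNC) == 0)
        return 0;
    /* Assumption: if F_FULLFSYNC is unavailable for this file, a plain
     * fsync() is better than nothing, although it is NOT durable against
     * power loss on darwin. A stricter caller could return -1 here. */
    return fsync(fd);
#else
    /* Other platforms: fsync() is the durability primitive. */
    return fsync(fd);
#endif
}
```

The wrapper has to be called after every batch of writes that must survive a power failure; nothing about the file descriptor changes between calls.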

O_DSYNC does not O_DSYNC

Instead of writing data into the page cache and then calling fsync() to make it durable, one can request full durability for each write() call by passing O_DIRECT | O_DSYNC to open(). However, darwin silently downgrades this to only O_DIRECT. There is no workaround at open() time. One must call fcntl(fd, F_FULLFSYNC) after the writes instead.
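Because the open() flag cannot be relied on, per-write durability has to be emulated by hand. The sketch below shows one way to do that, assuming a regular file and a blocking descriptor. The name write_durably is hypothetical, and the choice to loop on short writes and retry on EINTR is an illustrative design decision rather than anything the platform prescribes.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write the whole buffer, then force it to stable
 * storage before returning, approximating what O_DSYNC is meant to give.
 * Returns 0 on success, -1 on error with errno set. */
static int write_durably(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;          /* interrupted before writing: retry */
            return -1;
        }
        if (n == 0) {              /* no progress: report instead of spinning */
            errno = EIO;
            return -1;
        }
        p += n;                    /* short write: advance and continue */
        len -= (size_t)n;
    }

#if defined(F_FULLFSYNC)
    /* macOS: flush through the drive cache. */
    return fcntl(fd, F_FULLFSYNC) == 0 ? 0 : -1;
#else
    /* Other platforms: fsync() stands in for the per-write flush. */
    return fsync(fd);
#endif
}
```

Each call pays the cost of a full drive flush, so callers that can batch several writes before a single flush will usually be much faster than ones that flush per write.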

The underlying libuv call for fdatasync() here is actually using F_FULLFSYNC on macOS (I contributed that at the time to libuv).

You can see the difference between FDATASYNC + O_DIRECT and O_DSYNC + O_DIRECT.

And the latter, on macOS, is no different from only O_DIRECT (!)

In other words, O_DSYNC on macOS also doesn't flush past the disk's own cache.

It's as durable as fcntl(fd, F_NOCACHE, 1). i.e. Not durable at all. ;)

— Joran Dirk Greef (@jorandirkgreef) June 2, 2022