Disaggregated OLTP Systems

These are notes prepared from an informal presentation on the various cloud-native disaggregated OLTP RDBMS designs that have been getting published.

Amazon Aurora

Read these two papers together, and don’t stop to try to understand all the fine details of log consistency across replicas or the commit and recovery protocols in the first paper. That material is covered in more detail (and with diagrams!) in the second paper. I’d almost suggest reading the second paper first.

Alexandre Verbitski, Anurag Gupta, Debanjan Saha, Murali Brahmadesam, Kamal Gupta, Raman Mittal, Sailesh Krishnamurthy, Sandor Maurice, Tengiz Kharatishvili, and Xiaofeng Bao. 2017. Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases. In Proceedings of the 2017 ACM International Conference on Management of Data (SIGMOD/PODS'17), ACM, 1041–1052. [scholar]
Alexandre Verbitski, Anurag Gupta, Debanjan Saha, James Corey, Kamal Gupta, Murali Brahmadesam, Raman Mittal, Sailesh Krishnamurthy, Sandor Maurice, Tengiz Kharatishvilli, and Xiaofeng Bao. 2018. Amazon Aurora: On Avoiding Distributed Consensus for I/Os, Commits, and Membership Changes. In Proceedings of the 2018 International Conference on Management of Data (SIGMOD/PODS '18), ACM. [scholar]

As the first of the disaggregated OLTP papers, these two introduce the motivation for wanting to build a system like Aurora. Previously, one would just run an unmodified MySQL on top of EBS, and if you look at the data being transferred, the same data is sent in different forms multiple times: the redo log, the binlog, the page write, and the double-write buffer write are all essentially doing the same work of sending a tuple from MySQL to storage.

[Figure: MySQL on EBS (RDS architecture)]

Thus, Aurora introduces using the write-ahead log as the protocol between compute and storage in a disaggregated system. The page writes, double-write buffer, etc. are all removed and made the responsibility of the storage after it receives the write-ahead log. The papers we’re looking at all reference this model with some form of the phrase "the log is the database" as part of their design.

[Figure: Aurora architecture]

The major idea they present is that the network then becomes the bottleneck in the system, and the smart storage is able to meaningfully offload work from the compute node: applying WAL updates to produce page modifications, MVCC cleanup, checkpointing, etc.

So I think the easiest way to get started is to zoom in on a single storage node:

[Figure: Aurora storage node]

It involves the following steps (a rough sketch of the pipeline follows the list):

  1. receive log record and add to an in-memory queue,

  2. persist record on disk and acknowledge,

  3. organize records and identify gaps in the log since some batches may be lost,

  4. gossip with peers to fill in gaps,

  5. coalesce log records into new data pages,

  6. periodically stage log and new pages to S3,

  7. periodically garbage collect old versions, and finally

  8. periodically validate CRC codes on pages.
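
A minimal sketch of how I picture that pipeline, assuming a simple per-segment loop; all names and structure here are mine, since the paper only describes the steps in prose:

```python
# Hypothetical sketch of an Aurora-style storage segment's log pipeline.
# Names and structure are my own; steps 6-8 are noted but stubbed out.
from dataclasses import dataclass

@dataclass
class LogRecord:
    lsn: int
    page_id: int
    payload: bytes

class StorageSegment:
    def __init__(self):
        self.log = {}         # lsn -> record; stand-in for an fsync'd on-disk append
        self.pages = {}       # page_id -> list of applied payloads (materialized versions)
        self.applied_lsn = 0  # highest LSN up to which the log is gap-free and applied

    def receive(self, record: LogRecord):
        # Steps 1-2: enqueue/persist the record, then acknowledge immediately.
        self.log[record.lsn] = record
        return ("ACK", record.lsn)

    def find_gaps(self, up_to: int):
        # Step 3: batches can be lost, so look for holes in the LSN sequence.
        return [lsn for lsn in range(self.applied_lsn + 1, up_to + 1) if lsn not in self.log]

    def gossip_fill(self, peers, up_to: int):
        # Step 4: ask peer segments in the protection group for missing records.
        for lsn in self.find_gaps(up_to):
            for peer in peers:
                if lsn in peer.log:
                    self.log[lsn] = peer.log[lsn]
                    break

    def coalesce(self):
        # Step 5: apply contiguous log records to data pages.
        while self.applied_lsn + 1 in self.log:
            rec = self.log[self.applied_lsn + 1]
            self.pages.setdefault(rec.page_id, []).append(rec.payload)
            self.applied_lsn += 1
        # Steps 6-8 (stage to S3, GC old versions, CRC scrub) would run as
        # periodic background tasks and are omitted here.
```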

Storage nodes are used as part of a quorum, and the classic "tolerate loss of 1 AZ + 1 machine" goal means 6-node quorums with |W|=4 and |R|=3. The quorum means that transient single-node failures (either accidental network blips or intentional node upgrades) are handled seamlessly. However, these aren’t traditional majority quorums: the primary is an elected sole writer, which transforms the majority quorum into something more consistent, and page server quorums are also reconfigured on suspected failure. This is a replication design that doesn’t even fit cleanly into my Data Replication Design Spectrum blog post.
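
The quorum arithmetic is simple enough to check directly; a quick sanity check of the 6/4/3 configuration (my own illustration, not anything from the paper):

```python
# Sanity-checking Aurora's 6-copy quorum: 2 copies in each of 3 AZs.
V, W, R = 6, 4, 3

assert W + R > V   # a read quorum always intersects the latest write quorum
assert W > V / 2   # two write quorums always intersect (no split-brain writes)

# Lose an entire AZ (2 copies): 4 remain, so both writes (W=4) and reads (R=3) proceed.
assert V - 2 >= W
# Lose an AZ plus one more machine: 3 remain. Reads (and thus repair) still work, so
# no data is lost, but new writes stall until the quorum is repaired/reconfigured.
assert V - 3 >= R and V - 3 < W
```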

Each log write contains its LSN, and also includes the last PSN sent to the storage group. Not every write is a transaction commit. There’s a whole discussion of Storage Consistency Points in the second paper that digs into the exact relationships between the Volume Complete LSN, the Consistency Point LSNs, the Segment Complete LSN, and the Protection Group Complete LSN. The overall point to take away is that recovering a consistent snapshot from a highly partitioned log is hard.

There’s a recovery flow to follow when the primary fails. A new primary must contact every storage group to find the highest LSN below which all log records are known, and then recover to the minimum of those per-group LSNs, but again, that’s a summary, because the reality seems complicated. However, the work of applying the redo logs to properly recover to that LSN is now parallelized across many storage nodes, leading to a faster recovery.
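
As a rough illustration of that min-of-maxes computation (a deliberate simplification of the VCL machinery described in the second paper):

```python
# Simplified sketch: each protection group reports the highest LSN below which it has
# every log record (no gaps); the volume recovers to the minimum of those values.
def recovery_point(group_complete_lsns: list[int]) -> int:
    return min(group_complete_lsns)

# e.g. three protection groups complete through LSNs 1042, 1040, and 1045:
assert recovery_point([1042, 1040, 1045]) == 1040
# Everything above 1040 is truncated, and each group applies redo up to 1040 in parallel.
```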

As there’s only one nominated writer for the quorum of nodes, the writer knows which nodes in the quorum have accepted writes up to which LSN. This means that reads don’t need to be quorum reads; the primary is free to send a read request to just one of the at-least-four nodes that it knows should have the correct data.
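
In other words, a read is routed to a single replica the primary already knows is caught up past the read’s LSN. A hedged sketch, with hypothetical names:

```python
# Hypothetical read routing: the sole writer tracks how far each storage node has acked,
# so it can pick one node that is provably caught up instead of doing a quorum read.
def pick_read_replica(acked_lsn_by_node: dict[str, int], read_lsn: int) -> str:
    candidates = [node for node, lsn in acked_lsn_by_node.items() if lsn >= read_lsn]
    return candidates[0]  # a real system would also weigh latency and load

acked = {"az1-a": 120, "az1-b": 118, "az2-a": 121, "az2-b": 90, "az3-a": 119, "az3-b": 121}
assert pick_read_replica(acked, read_lsn=119) in {"az1-a", "az2-a", "az3-a", "az3-b"}
```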

Read-only replicas consume the redo log stream from the primary and apply it to cached pages only. Uncached data comes from the storage groups. S3 is used for backups.

Discussion

Is this a trade of decreasing the amount of work on writes at the cost of increasing the amount of work on reads?

Moving storage over the network does add some cost: reconstructing full pages at arbitrary versions isn’t cheap, and while MySQL could apply a WAL entry directly to the buffer-cached page, the storage node might have to fetch the old page from disk first. But much of the work is work that MySQL would otherwise be doing itself: finding old versions of tuples by chaining through the undo log, fuzzy checkpointing, etc. So while fetching pages from disk over a network is slower than fetching them locally, there’s a good argument that it lets MySQL focus more on query execution and transaction processing than on storage management.

Microsoft Socrates

Panagiotis Antonopoulos, Alex Budovski, Cristian Diaconu, Alejandro Hernandez Saenz, Jack Hu, Hanuma Kodavalla, Donald Kossmann, Sandeep Lingam, Umar Farooq Minhas, Naveen Prakash, Vijendra Purohit, Hugh Qu, Chaitanya Sreenivas Ravella, Krystyna Reisteter, Sheetal Shrotri, Dixin Tang, and Vikram Wakade. 2019. Socrates: The New SQL Server in the Cloud. In Proceedings of the 2019 International Conference on Management of Data (SIGMOD/PODS '19), ACM. [scholar]

The paper spends some time talking about the previous DR architecture, its relevant behavior and features, and its shared-nothing design. There’s also a decent amount of discussion about adapting a pre-existing RDBMS to the new architecture. It’s overall a very realistic discussion of making major architectural changes to a large, pre-existing product, but I’m not going to focus on either topic, as that’s not my focus here.

The architecture of Socrates is well illustrated in the paper:

[Figure: Socrates architecture]
[Figure: Socrates XLOG service]

Their major design decisions are:

The primary has a recoverable buffer pool to minimize the impact of failures. A buffer pool on SSD might seem silly, but otherwise a cold start means dumping gigabytes’ worth of page fetches at the Page Servers, with terrible performance until the working set is back in cache. It’s implemented by modeling the extended buffer pool as an in-memory table in Hekaton.
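
My mental model of that recoverable cache, as a hedged sketch (all naming is hypothetical; the durable mapping table is the part that would live in Hekaton):

```python
# Sketch: a buffer pool whose contents survive a process restart because the
# page -> SSD-slot mapping is itself stored in a recoverable in-memory table.
class ResilientBufferPool:
    def __init__(self, durable_mapping: dict):
        # durable_mapping stands in for the Hekaton table: page_id -> (ssd_slot, page_lsn).
        # It survives restarts, so the node reopens its cache instead of starting cold.
        self.mapping = durable_mapping
        self.ssd = {}  # ssd_slot -> page bytes; stand-in for the local SSD cache file

    def put(self, page_id: int, page_lsn: int, data: bytes):
        slot = page_id  # trivial slot assignment, purely for the sketch
        self.ssd[slot] = data
        self.mapping[page_id] = (slot, page_lsn)

    def get(self, page_id: int):
        if page_id not in self.mapping:
            return None  # miss: fall through to a GetPage@LSN request to a Page Server
        slot, _page_lsn = self.mapping[page_id]
        return self.ssd.get(slot)
```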

There is a separate XLOG service which is responsible for the WAL. The primary sends the log to the LZ (the durable "landing zone") and to XLOG in parallel. XLOG buffers received WAL segments until the primary informs it the segments are durable in the LZ, at which point they’re forwarded on to the page servers. It also has a local cache, and moves log segments to blob storage over time.
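
A sketch of how I read that hand-off (hypothetical names; the pending/durable split is my phrasing of the paper’s description):

```python
# Sketch of the XLOG hand-off: buffer log blocks until the primary confirms they are
# durable in the landing zone, and only then fan them out to the page servers.
class XLogService:
    def __init__(self, page_servers):
        self.pending = {}   # block_id -> log block, received but not yet known durable
        self.cache = {}     # local cache of recent blocks; older ones tier off to blob storage
        self.page_servers = page_servers  # assumed to expose apply_log(block_id, block)

    def receive_block(self, block_id: int, block: bytes):
        # Sent by the primary in parallel with its own write to the landing zone.
        self.pending[block_id] = block

    def mark_durable(self, block_id: int):
        # The primary says the LZ write is durable, so it's now safe to propagate.
        block = self.pending.pop(block_id)
        self.cache[block_id] = block
        for page_server in self.page_servers:
            page_server.apply_log(block_id, block)
```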

Page servers don’t store all pages. They have a large (and persistent) cache, but some pages live only on XStore. They’re working on offloading bulk loading, index creation, DB reorgs, deep page repair, and table scans to Page Servers as well.

The GetPage@LSN RPC serves the page at a version that’s at least the specified LSN. Page servers thus aren’t required to materialize pages at arbitrary versions, and can keep only the most recent. B-tree traversals from replicas sometimes need to restart if a leaf page is at a newer LSN than its parent expected.
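
A sketch of that contract as I understand it (illustrative only; names are mine):

```python
# Sketch of GetPage@LSN semantics: catch the store up to at least the requested LSN,
# then return whatever (single, most recent) version of the page is held.
class PageServerSketch:
    def __init__(self):
        self.pages = {}        # page_id -> (page_lsn, payload); only the newest version kept
        self.pending_log = []  # [(lsn, page_id, payload)] in LSN order, not yet applied

    def apply_pending(self, up_to_lsn: int):
        while self.pending_log and self.pending_log[0][0] <= up_to_lsn:
            rec_lsn, page_id, payload = self.pending_log.pop(0)
            self.pages[page_id] = (rec_lsn, payload)

    def get_page_at_lsn(self, page_id: int, lsn: int):
        # The returned page may reflect an LSN *newer* than requested; a caller such as
        # a replica's B-tree traversal must be prepared to restart if that surprises it.
        self.apply_pending(up_to_lsn=lsn)
        return self.pages.get(page_id)
```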

What’s the major difference between Socrates and Aurora? Aurora partitions the WAL across page servers. Socrates has a centralized WAL service.

Discussion

Socrates feels like a very modern object-storage-based database, in the WarpStream or turbopuffer kind of way, especially for a 2019 paper. This architecture is also the closest to Neon’s.

The extended buffer pool / "Resilient Cache" on the primary sounds like a really complicated mmap() implementation.

Would VM migration keep the cache?

Probably not? This raised an interesting point: trying to bin-pack SQL Server instances across a fleet of hosts seems difficult, especially with them all being tied to a persistent local cache. Azure SQL Database is sold in vCPU and DTU models, which seem to be more reservation-based, so maybe there isn’t an overly high degree of churn?

Are the caches actually local SSD or are they Azure Managed Disks?

Consensus was that the paper pretty strongly implies they are actual local SSDs.

Alibaba PolarDB

Wei Cao, Yingqiang Zhang, Xinjun Yang, Feifei Li, Sheng Wang, Qingda Hu, Xuntao Cheng, Zongzhi Chen, Zhenjun Liu, Jing Fang, Bo Wang, Yuhui Wang, Haiqing Sun, Ze Yang, Zhushi Cheng, Sen Chen, Jian Wu, Wei Hu, Jianwei Zhao, Yusong Gao, Songlu Cai, Yunyang Zhang, and Jiawang Tong. 2021. PolarDB Serverless: A Cloud Native Database for Disaggregated Data Centers. In Proceedings of the 2021 International Conference on Management of Data (SIGMOD/PODS '21), ACM, 2477–2489. [scholar]

Consider also reading the PolarFS paper, as it is referenced a bit. TL;DR: it used RDMA and fast SSDs to build a fast filesystem which shards blocks across Raft instances optimized for block storage.

As broad context, Alibaba is really about spending money on fancy hardware. I had talked about this a bit in Modern Database Hardware, but Alibaba seems to be more than happy to solve difficult software problems by spending money on hardware. Notably, Alibaba has RDMA deployed internally, seemingly to the same extent that Microsoft does, except Microsoft seems to keep a fallback-to-TCP option for most of their stack, while Alibaba seems comfortable building services that critically depend on RDMA’s primitives.

Thus, much of the PolarDB Serverless paper is about leveraging a multi-tenant scale-out memory pool built via RDMA. This also makes it a disaggregated-memory database! As a direct consequence, memory and CPU can be scaled independently, and the evaluation shows elastically changing the amount of memory allocated to a PolarDB tenant.

However, implementing a page cache over RDMA isn’t trivial, and a solid portion of the paper is spent on the exact details of managing latches on remote memory pages and navigating B-tree traversals. Specifically, B-tree operations which change the structure of the tree require significant care. Recovery also has to deal with the fact that the remote buffer cache still holds partial execution state from the failed RW node, so the new RW node has to release latches in the shared memory pool and throw away pages which were partially modified.
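
A sketch of what that failover cleanup implies, under my own naming (the actual criteria for which pages get discarded are more involved in the paper):

```python
# Sketch: when a new RW node takes over, the shared remote memory pool still holds
# latches and possibly half-written pages from the dead writer. Clean those up and
# keep the rest of the (still valid) cache, rather than starting cold.
from dataclasses import dataclass

@dataclass
class RemotePage:
    page_id: int
    latch_owner: str | None
    partially_modified: bool

def clean_shared_pool_after_failover(pool: list[RemotePage], failed_node: str) -> list[RemotePage]:
    survivors = []
    for page in pool:
        if page.latch_owner == failed_node:
            page.latch_owner = None  # release latches the dead writer was holding
        if page.partially_modified:
            continue                 # evict: re-read this page from the page stores during redo
        survivors.append(page)
    return survivors
```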

They offer an architecture diagram:

[Figure: PolarDB Serverless architecture]

However, there are a few things I think it doesn’t represent well:

There are a couple of optimizations that they specifically call out. Read-only nodes don’t acquire latches in the buffer pool unless the RW node says it has modified the B-tree structure since the read-only node’s last access. They also implement a specific optimization for indexes: a prefetching index probe operation. Fetching keys from the index will generate prefetches to load the pointed-to data pages from the page servers, under the assumption that they’ll be immediately requested as part of SQL execution anyway.
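
A sketch of that prefetching probe (hypothetical API; the real optimization lives inside the B-tree code):

```python
# Sketch: while scanning index entries, kick off asynchronous fetches for the data
# pages they point at, on the assumption the executor will ask for them next anyway.
def index_probe_with_prefetch(index_entries, page_cache, page_store):
    results = []
    for key, data_page_id in index_entries:
        if data_page_id not in page_cache:
            page_store.prefetch(data_page_id)  # hypothetical async GetPage-style request
        results.append((key, data_page_id))
    # By the time the executor dereferences these entries, the pages are (hopefully) cached.
    return results
```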

What’s the major difference between PolarDB and Socrates? Socrates has SSD persisted caches. PolarDB has a persistent distributed memory cache.

Discussion

They still undersold the RDMA difficulty. Someone who has worked with it previously commented that there are all sorts of issues around racing reads and writes, and getting group membership and shard movement right is doubly hard: in both cases, an uninformed client can still do one-sided RDMA reads from a server it thinks is still part of a replication group and/or still has the shard it wants.

Huawei Taurus

Alex Depoutovitch, Chong Chen, Jin Chen, Paul Larson, Shu Lin, Jack Ng, Wenlin Cui, Qiang Liu, Wei Huang, Yong Xiao, and Yongjun He. 2020. Taurus Database: How to be Fast, Available, and Frugal in the Cloud. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (SIGMOD/PODS '20), ACM, 1463–1478. [scholar]

The entire "Background and Related Work" section is a great read. They set up excellent and concise comparisons against the same systems we’ve discussed above. In very short summary: PolarFS (not PolarDB Serverless) uses a filesystem abstraction without smart storage and thus loses efficiency, Aurora uses 6-node quorums for both logs and pages, which over-promises on durability and availability respectively, and Socrates added too much complexity with its four-tier Compute/XLOG/Page Server/XSTORE architecture.

[Figure: Taurus architecture]

In Taurus’s Log Store, WAL segments are written to a fixed-size, append-only, synchronously replicated storage object called a PLog (Part of a Log?). In a deployment, there are hundreds of Log Servers; three are chosen to form a PLog. All three must ack a write, otherwise a new PLog is allocated. (It’s reconfiguration-based replication!) The database WAL is an ordered collection of PLogs, itself stored in a PLog. Metadata PLogs are chained as a linked list.
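
A sketch of that write path and its reconfigure-on-failure step (names and structure are mine):

```python
# Sketch: a PLog is replicated to three Log Servers, and every append must be acked by
# all three. If any replica fails to ack, the writer seals the PLog and allocates a new
# one on three healthy servers, instead of waiting on a quorum.
import random

class LogServer:
    def __init__(self, healthy: bool = True):
        self.healthy = healthy
        self.stored = []

    def append(self, record: bytes) -> bool:
        if not self.healthy:
            return False
        self.stored.append(record)
        return True

class LogStoreClient:
    def __init__(self, log_servers: list[LogServer]):
        self.log_servers = log_servers             # the global pool (hundreds of nodes)
        self.plog_chain = [self._allocate_plog()]  # the WAL is an ordered list of PLogs

    def _allocate_plog(self):
        healthy = [s for s in self.log_servers if s.healthy]
        return {"replicas": random.sample(healthy, 3), "records": []}

    def append(self, record: bytes):
        plog = self.plog_chain[-1]
        acks = [replica.append(record) for replica in plog["replicas"]]
        if not all(acks):
            # A replica is slow or down: seal this PLog, reconfigure onto a new one, retry.
            self.plog_chain.append(self._allocate_plog())
            return self.append(record)
        plog["records"].append(record)

client = LogStoreClient([LogServer() for _ in range(6)])
client.append(b"wal segment 1")
```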

The Page Stores behave roughly the same: they accept logs and serve versioned pages. Page Stores are notified of the oldest LSN which might still be requested, and must be able to answer what the highest LSN they can serve is.

Taurus abstracts most of the logic of dealing with Log Stores and Page Stores into a Storage Abstraction Layer, which manages the mapping of WAL segments to PLogs and slices to Page Stores. The paper describes the read and write flow in detail, but it didn’t feel notably different from any of the previously discussed systems.

[Figure: Taurus write path]

For anyone who is against reconfiguration-based replication because of the "unavailability" while reconfiguring to a new set of available replicas, you’ll hate the "comparison with quorum replication". They argue that their probability of write unavailability is effectively zero, as essentially all of the Log Stores or Page Stores in their global pool of nodes would have to be unavailable before a new PLog or slice couldn’t be allocated. This argument both is and isn’t true.

Both recovery and replication to read-only replicas are discussed in decent detail, but neither felt notably different. I do appreciate the level of detail in illustrating how recovery works, though; it was more pleasant to go through here than in some other papers. Replication to read-only replicas has just been about applying log records to cached pages in every system thus far. They do mention separating notifying replicas that there were WAL changes published (and where to find them) from actually serving that data from Log Servers, so that the primary isn’t responsible for the network bandwidth of broadcasting WAL changes. The Page Stores also gossip the data so that Log Servers aren’t entirely taxed for network bandwidth either.

Page Stores are append-only on disk, with a lock-free hashtable mapping (page, version) to a slot in the log. The hashtable is periodically saved to storage to bound recovery time. Page Stores have their own buffer pool, which mostly exists to avoid IO during the lookup of the previous page version when applying a WAL entry. There’s an interesting tidbit that LFU is a better cache replacement policy for second-level caches.
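
A sketch of that lookup structure (the lock-free part is elided; names are mine):

```python
# Sketch: an append-only page store. Incoming log records are appended to the log, and
# an in-memory map records where each (page, version) lives, so that a read at an LSN
# can find the newest version at or below it.
import bisect
from collections import defaultdict

class PageStoreSketch:
    def __init__(self):
        self.log = []                      # append-only slots of (page_id, lsn, delta)
        self.versions = defaultdict(list)  # page_id -> sorted list of (lsn, slot)

    def apply_log_record(self, page_id: int, lsn: int, delta: bytes):
        slot = len(self.log)
        self.log.append((page_id, lsn, delta))
        self.versions[page_id].append((lsn, slot))  # log arrives in LSN order per page

    def read_page(self, page_id: int, lsn: int):
        entries = self.versions[page_id]
        i = bisect.bisect_right(entries, (lsn, float("inf"))) - 1
        if i < 0:
            return None  # no version at or below this LSN
        # A real Page Store reconstructs the page from a base version plus deltas; here
        # we just return the newest log entry at or below the requested LSN.
        return self.log[entries[i][1]]
```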

What’s the major difference between Taurus and others? Reconfiguration-based replication!

Huawei Taurus Multi-Master

Alex Depoutovitch, Chong Chen, Per-Ake Larson, Jack Ng, Shu Lin, Guanzhu Xiong, Paul Lee, Emad Boctor, Samiao Ren, Lengdong Wu, Yuchen Zhang, and Calvin Sun. 2023. Taurus MM: Bringing Multi-Master to the Cloud. Proceedings of the VLDB Endowment 16, 12 (August 2023), 3488–3500. [scholar]

This is, admittedly, mostly an excuse to discuss multi-master designs within disaggregated OLTP. Aurora had multi-master implemented, which they’ve since retired. Socrates was against multi-master. PolarDB mentioned that its global page cache means it could support it, but that such work was out of scope for the paper. So TaurusDB is our chance to look at this design.

[Figure: Taurus multi-master architecture]

Multi-master means concurrent modifications, and naively that means the LSN is now a vector clock. The paper introduces a clock type that’s a hybrid between a vector clock and a scalar Lamport clock. Basically, for server i, clock[i] is a Lamport clock and the rest of the entries form a vector clock. This has the effect of advancing the server’s own clock faster, as it’s effectively a counter of causally related global events rather than just local events. At times when causality is already known, like operations serialized by contending on a lock, Taurus uses the scalar clock. Logs and pages are locally recorded with a scalar clock, and sent to the Log Service with a vector clock. Page reads are done with a scalar clock.
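
A sketch of how I understand the hybrid clock’s update rules (this is my reading of the prose, not code from the paper):

```python
# Sketch of a hybrid vector/Lamport clock: the entry for this node behaves like a
# Lamport clock (it jumps past anything causally observed), while the other entries
# behave like an ordinary vector clock.
class HybridClock:
    def __init__(self, me: int, n_nodes: int):
        self.me = me
        self.clock = [0] * n_nodes

    def local_event(self) -> list[int]:
        self.clock[self.me] += 1
        return list(self.clock)

    def observe(self, remote: list[int]) -> list[int]:
        # Lamport rule for our own entry: advance past everything we've seen, so
        # clock[me] counts causally related global events, not just local ones.
        self.clock[self.me] = max(self.clock[self.me], max(remote)) + 1
        # Vector rule for everyone else's entries.
        for j, value in enumerate(remote):
            if j != self.me:
                self.clock[j] = max(self.clock[j], value)
        return list(self.clock)
```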

The other side of concurrent modifications is that page locking can no longer be done locally in RAM on one primary replica. So the paper next discusses locking. Locks are held globally in a Global Lock Manager at page granularity with the usual Shared/eXclusive locking scheme. Once a master has a page lock, it can grant equal or lesser row locks. Pages can be unlocked and returned to the GLM if another master requests the page, but the rows will stay locked. (Imagine wanting exclusive locks on different rows in the same page.) The Global Lock Manager would also be responsible for deadlock detection.
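
A sketch of that page-then-row hierarchy on a single master (my own framing of the prose):

```python
# Sketch: the Global Lock Manager grants page locks to masters; a master holding a page
# lock can grant equal-or-weaker row locks locally. Returning the page lock to the GLM
# (because another master wants the page) does not release the row locks already held.
class MasterLockState:
    LOCK_ORDER = {"S": 0, "X": 1}

    def __init__(self):
        self.page_locks = {}    # page_id -> "S" or "X", as granted by the GLM
        self.row_locks = set()  # (page_id, row_id, mode) granted locally

    def lock_row(self, page_id: int, row_id: int, mode: str = "S") -> bool:
        page_mode = self.page_locks.get(page_id)
        if page_mode is None or self.LOCK_ORDER[page_mode] < self.LOCK_ORDER[mode]:
            return False  # need an equal-or-stronger page lock from the GLM first
        self.row_locks.add((page_id, row_id, mode))
        return True

    def surrender_page(self, page_id: int):
        # Another master asked for this page: hand the page lock back, keep the row locks.
        self.page_locks.pop(page_id, None)
```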

Note the introduction of another component: the Global Slice Manager. Sharding pages across servers is a decision that no master is allowed to make locally, so the responsibility of sharding data was moved to a global component.

In comparison against Aurora Multi-Master, it’s noted that Aurora pushed resolving conflicts between masters to the storage layer. In the evaluation, the two designs perform similarly when there’s no data sharing, but the Taurus design performs much better as data sharing increases.

Discussion

MariaDB Xpand actually did something similar to this, but they never wrote about it, and the project was shut down by MariaDB.

Multi-master is also useful for upgrades, as it gives one a way to do a rolling upgrade to a new database binary and incrementally shift transactions over. However, having two databases live at different versions means one also has to get upgrade/downgrade testing done well.

Who needs multi-master? Aurora dropped their own multi-master support, and the rumor was it wasn’t getting heavily used. Is there actually a desire for this? Are there enough customers maxing out the single writer of their disaggregated OLTP database with excessive writes that it’s worthwhile to invest in all the complexity that multi-master brings?


See discussion of this page on Reddit, HN, and lobsters.