Disaggregated OLTP Systems

These notes were first prepared for an informal presentation on the various cloud-native disaggregated OLTP RDBMS designs that have been getting published, and that presentation cherry-picked one paper per notable design decision. For the papers covered then, I’ve included a summary of the discussion we had after each paper. This page is being actively extended to cover all disaggregated OLTP papers, even when two vendors published papers on similar designs. ("Actively" meaning as of 2025-02-09.)

If you’re looking for a quick overview of the disaggregated OLTP space, just read the papers with a ★.

Amazon

Aurora ★

Read these two papers together, and don’t try to stop to understand all the fine details about log consistency across replicas and commit or recovery protocols in the first paper. That material is covered in more detail (and with diagrams!) in the second paper. I’d almost suggest reading the second paper first.

Alexandre Verbitski, Anurag Gupta, Debanjan Saha, Murali Brahmadesam, Kamal Gupta, Raman Mittal, Sailesh Krishnamurthy, Sandor Maurice, Tengiz Kharatishvili, and Xiaofeng Bao. 2017. Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases. In Proceedings of the 2017 ACM International Conference on Management of Data (SIGMOD/PODS'17), ACM, 1041–1052. [scholar]
Alexandre Verbitski, Anurag Gupta, Debanjan Saha, James Corey, Kamal Gupta, Murali Brahmadesam, Raman Mittal, Sailesh Krishnamurthy, Sandor Maurice, Tengiz Kharatishvilli, and Xiaofeng Bao. 2018. Amazon Aurora: On Avoiding Distributed Consensus for I/Os, Commits, and Membership Changes. In Proceedings of the 2018 International Conference on Management of Data (SIGMOD/PODS '18), ACM. [scholar]

As the first of the disaggregated OLTP papers, they introduce the motivation for wanting to build a system like Aurora. Previously, one would just run an unmodified MySQL on top of EBS, and when looking at the amount of data transferred, the same data was being sent in different forms multiple times. The redo log, the binlog, the page write, and the double-write buffer write are all essentially doing the same work of sending a tuple from MySQL to storage.

MySQL on EBS
RDSArchitecture

Thus, Aurora introduces using the write-ahead log as the protocol between the compute and the storage in a disaggregated system. The page writes, double-write buffer, etc. are all removed and made the responsibility of the storage after receiving the write-ahead log. The papers we’re looking at all reference this model with some form of the phrase "the log is the database" as part of their design.

Aurora Architecture
AuroraArchitecture

The major idea they present is that the network then becomes the bottleneck in the system, and that the smart storage is able to meaningfully offload the work of turning WAL updates into page modifications, handling MVCC cleanup, checkpointing, etc.

So I think the easiest way to get started is to zoom in on a single storage node:

Aurora Storage Node
AuroraStorageNode

It involves the following steps:

  1. receive log record and add to an in-memory queue,

  2. persist record on disk and acknowledge,

  3. organize records and identify gaps in the log since some batches may be lost,

  4. gossip with peers to fill in gaps,

  5. coalesce log records into new data pages,

  6. periodically stage log and new pages to S3,

  7. periodically garbage collect old versions, and finally

  8. periodically validate CRC codes on pages.
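Steps 3 and 4 are where most of the subtlety lives. A minimal sketch of the gap detection, assuming each log record carries a backlink to the previous record’s LSN; the structure and field names here are illustrative, not Aurora’s:

#include <stdint.h>
#include <stddef.h>

typedef struct {
    uint64_t lsn;
    uint64_t prev_lsn;   /* backlink written by the primary */
} LogRecord;

/* records[] is sorted by lsn; returns the highest LSN below which the chain
 * is unbroken. Anything above it must be filled in by gossiping with peers. */
uint64_t segment_complete_lsn(const LogRecord *records, size_t n) {
    uint64_t complete = 0;
    for (size_t i = 0; i < n; i++) {
        if (i > 0 && records[i].prev_lsn != records[i - 1].lsn)
            break;                    /* a record is missing in between */
        complete = records[i].lsn;
    }
    return complete;
}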

Storage nodes are used as part of a quorum, and the classic "tolerate loss of 1 AZ + 1 machine" means 6-node quorums with |W|=4 and |R|=3. The quorum means that transient single-node failures (either accidental network blips or intentional node upgrades) are handled seamlessly. However, these aren’t traditional majority quorums. The Primary is an elected sole writer, which transforms majority quorums into something more consistent. Page server quorums are also reconfigured on suspected failure. This is a replication design that doesn’t even fit cleanly into my Data Replication Design Spectrum blog post.
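The quorum arithmetic behind the 6/4/3 choice is worth spelling out; a minimal check of the intersection properties (my framing, not code from the paper):

#include <assert.h>

int main(void) {
    const int N = 6, W = 4, R = 3;   /* two copies in each of three AZs */
    assert(W + R > N);   /* any read quorum overlaps any write quorum   */
    assert(W + W > N);   /* two conflicting writes can't both succeed   */
    /* Losing one AZ (2 nodes) leaves 4, enough to keep writing;
     * losing one AZ plus one more node leaves 3, enough to keep reading. */
    assert(N - 2 >= W);
    assert(N - 3 >= R);
    return 0;
}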

Each log write contains its LSN, and also includes the last PSN sent to the storage group. Not every write is a transaction commit. There’s a whole discussion of Storage Consistency Points in the second paper to dig into the exact relationships between the Volume Complete LSN and the Consistency Point LSN and the Segment Complete LSN and a Protection Group Complete LSN. The overall point to get here is that trying to recover a consistent snapshot from a highly partitioned log is hard.

There’s a recovery flow to follow when the primary fails. A new primary must contact every storage group to find the highest LSN below which all log records are known, and then recover to min(max LSN per group), but again, that’s a summary, because the reality seems more complicated. However, the work of then applying the redo logs to properly recover to that LSN is now parallelized across many storage nodes, leading to a faster recovery.
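A sketch of that recovery-point computation, under my simplified reading that each storage group reports a single gap-free LSN:

#include <stdint.h>
#include <stdio.h>

/* Each storage group reports the highest LSN below which it has every log
 * record; the volume recovers to the minimum of those answers. */
uint64_t volume_recovery_lsn(const uint64_t *group_complete_lsn, int n_groups) {
    uint64_t recover_to = UINT64_MAX;
    for (int i = 0; i < n_groups; i++)
        if (group_complete_lsn[i] < recover_to)
            recover_to = group_complete_lsn[i];
    return recover_to;
}

int main(void) {
    uint64_t groups[] = {1042, 998, 1010};   /* made-up per-group answers */
    printf("recover to LSN %llu\n",
           (unsigned long long)volume_recovery_lsn(groups, 3));
    return 0;
}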

As there’s only one nominated writer for the quorum of nodes, the writer knows which nodes in the quorum have accepted writes up to what version. This means that reads don’t need to be quorum reads, the primary is free to send read requests only to one of the at-least-four nodes that it knows should have the correct data.

Read-only replicas consume the redo log stream from the primary, and apply it to cached pages only. Uncached data comes from storage groups. S3 is used for backups.

Discussion

Is this a trade of decreasing the amount of work on writes at the cost of increasing the amount of work on reads?

Moving storage over the network does add some cost; reconstructing full pages at arbitrary versions isn’t cheap, and while MySQL could apply the WAL entry directly to the buffer-cached page, the storage node might have to fetch the old page from disk. But much of the work is work that MySQL would otherwise be doing: finding old versions of tuples by chaining through the undo log, fuzzy checkpointing, etc. So while fetching pages from disk over a network is slower than fetching them locally, there’s a good argument that it lets MySQL focus more on query execution and transaction processing than on storage management.

Aurora Serverless

Bradley Barnhart, Marc Brooker, Daniil Chinenkov, Tony Hooper, Jihoun Im, Prakash Chandra Jha, Tim Kraska, Ashok Kurakula, Alexey Kuznetsov, Grant McAlister, Arjun Muthukrishnan, Aravinthan Narayanan, Douglas Terry, Bhuvan Urgaonkar, and Jiaming Yan. 2024. Resource Management in Aurora Serverless. Proceedings of the VLDB Endowment 17, 12 (August 2024), 4038–4050. [scholar]

This paper describes the transition from their naive Aurora Serverless v1 (ASv1) to Aurora Serverless v2 (ASv2). It covers both the product dimensions of billing and end-user experiences, and the internal technical parts of how to orchestrate scaling up/down, managing load, and transferring user workloads with minimal disruption. ASv1 relied upon relaunching a database instance in order to change its scale. A multi-tenant proxy frontend was created to allow sessions to be transferred to a rapidly restarted database instance. This session transfer was incomplete (temporary tables couldn’t be transferred), disruptive (due to transient unavailability), and inelastic, as paying the cost of a restart only made sense for large (power of 2) instance size changes. The goal of ASv2 was to be able to scale faster, less disruptively, and to better track a cyclical workload.

Customers buy Aurora Serverless in units of Aurora Capacity Units (ACUs), which is a combination of 2GB RAM + 0.25 vCPU + an undefined amount of networking and block device throughput. Users define a ceiling and floor in ACU of what they wish for their database to scale up or down to, and then Aurora Serverless tries to autoscale to approximate fully elastic, usage-driven pricing.

Resource management in Aurora Serverless is split into fleet-wide, inter-host rebalancing and host-local, in-place scaling.

AuroraServerlessArchitecture

Instance Managers gather resource usage information for database instances on a host, and work within the host’s resource limits to scale instances up or down to meet the resource needs. The Fleet Manager controls database instance to host assignment. Hosts' resources are oversubscribed, and when hosts are under resource pressure (at a critical level for CPU, allocated RAM, network, or disk throughput), the Fleet Manager will assign temporary ACU limits and live migrate database instances to redistribute heat across the cluster and relieve the resource pressure. The scale-up rate is limited by the Instance Manager to give the Fleet Manager time to react. The Fleet Manager will not live migrate from hosts which are deemed not to have the available network bandwidth to sustain an out-migration. New database instances are placed assuming minimum ACU usage. The Fleet Manager also adjusts the size of the fleet according to predicted and actual demand.

The Fleet Manager must choose which instance to move, and which host to move it to. Choosing an instance is a three-step process: remove any ineligible instances, compute a preference score (e.g. don’t move frequently moved instances, prefer instances that have ack’d a heartbeat recently), and compute a numerical score (how much of the host’s resources will be freed up, combined with what fraction of its own resources the instance leaves unused). Instances with equal preference scores are tiebroken by numerical score. Target host selection proceeds similarly: ineligible hosts are removed, a preference score is computed (fault tolerance distribution, no recent migration failures), and then a numerical score (best-fit binpacking score, and most utilized resource percentage). In the evaluation, they show that this three-phase approach does a better job of distributing load across the fleet than a baseline of just best-fit, with less instance movement.
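A minimal sketch of that filter / preference / numerical-score selection; the scoring inputs are placeholders for the signals described above, not the paper’s exact formulas:

#include <stddef.h>

typedef struct {
    int    eligible;     /* phase 1: e.g. not already mid-migration          */
    int    preference;   /* phase 2: e.g. penalize recently moved instances  */
    double numerical;    /* phase 3: e.g. resources freed on the hot host    */
} Candidate;

/* Returns the index of the best candidate, or -1 if nothing is eligible.
 * The same shape applies to instance selection and target host selection. */
int pick_candidate(const Candidate *c, size_t n) {
    int best = -1;
    for (size_t i = 0; i < n; i++) {
        if (!c[i].eligible)
            continue;
        if (best < 0 ||
            c[i].preference > c[best].preference ||
            (c[i].preference == c[best].preference &&
             c[i].numerical > c[best].numerical))
            best = (int)i;
    }
    return best;
}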

Database instances are wrapped in VMs for security reasons, and thus resource elasticity must be done in cooperation with the guest OS of each VM. Every VM is of the same 128 ACU maximum instance size. This relies on Nitro’s SR-IOV support for efficient virtualized IO. Memory elasticity required a number of changes: memory can be offlined to prevent it from being used for page cache and so that Linux doesn’t keep a page table entry around for every page, cold pages are swapped out, and 4KB pages are coalesced into 2MB-sized free pages which can be reclaimed by the hypervisor. Memory scales up based on the desired buffer pool size over the past 30 seconds, and down over the past 60 seconds. CPU scales up based on the P50 over the past 30 seconds, and down based on the P70 over the past 60 seconds. Scaling up is done using the maximum of the two, scaling down uses the minimum.
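One way to read those scale-up/scale-down rules, as a sketch (the signal names and the idea of expressing everything in ACUs are my assumptions):

/* cpu_up_30s / mem_up_30s: CPU P50 and desired buffer pool over the last 30s.
 * cpu_dn_60s / mem_dn_60s: CPU P70 and desired buffer pool over the last 60s.
 * All expressed as ACU targets. Scale up aggressively, scale down cautiously. */
double next_acu_target(double current_acus,
                       double cpu_up_30s, double mem_up_30s,
                       double cpu_dn_60s, double mem_dn_60s) {
    double up = cpu_up_30s > mem_up_30s ? cpu_up_30s : mem_up_30s;   /* max */
    double dn = cpu_dn_60s < mem_dn_60s ? cpu_dn_60s : mem_dn_60s;   /* min */
    if (up > current_acus) return up;   /* react quickly to rising demand     */
    if (dn < current_acus) return dn;   /* shed capacity only when both agree */
    return current_acus;
}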

Microsoft Socrates ★

Panagiotis Antonopoulos, Alex Budovski, Cristian Diaconu, Alejandro Hernandez Saenz, Jack Hu, Hanuma Kodavalla, Donald Kossmann, Sandeep Lingam, Umar Farooq Minhas, Naveen Prakash, Vijendra Purohit, Hugh Qu, Chaitanya Sreenivas Ravella, Krystyna Reisteter, Sheetal Shrotri, Dixin Tang, and Vikram Wakade. 2019. Socrates: The New SQL Server in the Cloud. In Proceedings of the 2019 International Conference on Management of Data (SIGMOD/PODS '19), ACM. [scholar]

The paper spends some time talking about the previous DR architecture, its relevant behavior and features, and its shared-nothing design. There’s also a decent amount of discussion about adapting a pre-existing RDBMS to the new architecture. It’s overall a very realistic discussion of making major architectural changes to a large, pre-existing product, but I’m not going to focus on either, as this is only a disaggregated OLTP overview.

The architecture of Socrates is well illustrated in the paper:

Socrates Architecture
SocratesArchitecture
Socrates XLOG Service
SocratesXLOG

Their major design decisions are:

The primary has a recoverable buffer pool to minimize the impact of failures, by modeling the buffer pool as a table in an in-memory storage engine. A buffer pool on SSD might seem silly, but otherwise a cold start means dumping gigabytes’ worth of page fetches at Page Servers, with terrible performance until the working set is back in cache. Concretely, the extended buffer pool is implemented as an in-memory table in Hekaton.

There is a separate XLOG service which is responsible for the WAL. The primary sends the log to the landing zone (LZ) and to XLOG in parallel. XLOG buffers received WAL segments until the primary informs it the segments are durable in the LZ, at which point they’re forwarded on to the page servers. It also has a local cache, and moves log segments to blob storage over time.

Page servers don’t store all pages. They have a large (and persistent) cache, but some pages live only on XStore. They’re working on offloading bulk loading, index creation, DB reorgs, deep page repair, and table scans to Page Servers as well.

The GetPage@LSN RPC serves the page at a version that’s at least the specified LSN. Page servers thus aren’t required to materialize pages at arbitrary versions, and can keep only the most recent. B-tree traversals from replicas sometimes need to restart if a leaf page has a newer LSN than its parent.
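A sketch of what that contract lets a page server do; the names and the 8KB page size (SQL Server’s) are mine:

#include <stdint.h>

typedef struct {
    uint64_t applied_lsn;          /* redo applied up to this LSN */
    unsigned char data[8192];      /* 8KB SQL Server page         */
} Page;

/* Assumed helper that replays buffered log records for this page. */
void apply_redo_until(Page *p, uint64_t lsn);

/* GetPage@LSN may return any version >= requested_lsn, so the page server
 * only ever rolls a page forward and never reconstructs old versions. */
const Page *get_page_at_lsn(Page *p, uint64_t requested_lsn) {
    if (p->applied_lsn < requested_lsn)
        apply_redo_until(p, requested_lsn);
    return p;
}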

What’s the major difference between Socrates and Aurora? Aurora partitions the WAL across page servers. Socrates has a centralized WAL service.

Discussion

For a 2019 paper, Socrates feels like a very modern object-storage-based database, in the WarpStream or turbopuffer kind of way. This architecture is also the closest to Neon’s.

The extended buffer pool / "Resilient Cache" on the primary sounds like a really complicated mmap() implementation.

Would VM migration keep the cache?

Probably not? This raised an interesting point that trying to binpack SQL Server instances across a fleet of instances seems difficult, especially with them all being tied to a persistent cache. Azure SQL Database is sold in vCPU and DTU models, which seem to be more reservation based, so maybe there isn’t an overly high degree of churn?

Are the caches actually local SSD or are they Azure Managed Disks?

Consensus was that it seemed pretty strongly implied that they were actually SSD.

Alibaba

As broad context, Alibaba is really about spending money on fancy hardware. I had talked about this a bit in Modern Database Hardware, but Alibaba’s papers quickly illustrate that they’re more than happy to solve difficult software problems by spending significant stacks of money on very modern hardware. Notably, Alibaba has RDMA deployed internally, seemingly to the same extent that Microsoft does, except Microsoft seems to keep a fallback-to-TCP option for most of their stack, while Alibaba seems comfortable building services that critically depend on RDMA’s primitives.

PolarFS

Wei Cao, Zhenjun Liu, Peng Wang, Sen Chen, Caifeng Zhu, Song Zheng, Yuhui Wang, and Guoqing Ma. 2018. PolarFS: an ultra-low latency and failure resilient distributed file system for shared storage cloud database. Proceedings of the VLDB Endowment 11, 12 (August 2018), 1849–1862. [scholar]

Alibaba took an unusual first step in building a disaggregated OLTP database. Instead of spending their effort building a separate pageserver and modifying the database to request pages from it and offload recovery to it, they invested effort into just building a sufficiently fast distributed filesystem. A year after the paper was published, Alibaba open-sourced PolarFS as ApsaraDB/PolarDB-FileSystem on GitHub (and PolarDB as ApsaraDB/PolarDB-for-PostgreSQL, with the PolarFS usage included), and so I’ve sprinkled links to it in the summary.

In terms of architectural components: libpfs is the client library that exposes a POSIX-like filesystem API, PolarSwitch is a process run on the same host which redirects I/O requests from applications to ChunkServers, ChunkServers are deployed on storage nodes to serve I/O requests, and PolarCtrl is the control plane. PolarCtrl’s metadata about the system is stored in a MySQL instance. The only necessary modifications to PolarDB were to port the filesystem calls to libpfs.

PolarFS Architecture
PolarFSArchitecture

The libpfs API is given as:

int     pfs_mount(const char *volname, int host_id)
int     pfs_umount(const char *volname)
int     pfs_mount_growfs(const char *volname)

int     pfs_creat(const char *volpath, mode_t mode)
int     pfs_open(const char *volpath, int flags, mode_t mode)
int     pfs_close(int fd)
ssize_t pfs_read(int fd, void *buf, size_t len)
ssize_t pfs_write(int fd, const void *buf, size_t len)
off_t   pfs_lseek(int fd, off_t offset, int whence)
ssize_t pfs_pread(int fd, void *buf, size_t len, off_t offset)
ssize_t pfs_pwrite(int fd, const void *buf, size_t len, off_t offset)
int     pfs_stat(const char *volpath, struct stat *buf)
int     pfs_fstat(int fd, struct stat *buf)
int     pfs_posix_fallocate(int fd, off_t offset, off_t len)
int     pfs_unlink(const char *volpath)
int     pfs_rename(const char *oldvolpath, const char *newvolpath)
int     pfs_truncate(const char *volpath, off_t len)
int     pfs_ftruncate(int fd, off_t len)
int     pfs_access(const char *volpath, int amode)

int     pfs_mkdir(const char *volpath, mode_t mode)
DIR*    pfs_opendir(const char *volpath)
struct dirent *pfs_readdir(DIR *dir)
int     pfs_readdir_r(DIR *dir, struct dirent *entry,
                      struct dirent **result)
int     pfs_closedir(DIR *dir)
int     pfs_rmdir(const char *volpath)
int     pfs_chdir(const char *volpath)
int     pfs_getcwd(char *buf)

This API has a few interesting subtleties, and you can see it in the OSS repo in pfsd_sdk.h. The VFS layer implemented for Postgres is in polar_fd.h, which is a slight superset of the API given in pfsd_sdk.h. I’m assuming the lack of a pfs_fsync() means all pfs_pwrite()s are immediately durable, and though pfsd_fsync() exists in pfsd_sdk.h, it has a comment of /* mock */ over it. Postgres is a known user of sync_file_range(), which I’m assuming is equally no-op’d. Volumes are mounted, and are dynamically growable or shrinkable, but most filesystems generally don’t cope well with being dynamically resized. There is both direct IO and buffered IO support, even though the API doesn’t indicate it.

The given API describes PolarFS’s file system layer which maps directories and files down onto blocks within the mounted volume. The contents of a directory or the blocks associated with a file are written as blocks, with a root block holding the root directory’s metadata. To transactionally update a set of blocks (so that read replicas see a consistent filesystem), there is a journal file which serves as a WAL for file system updates, and libpfs implements disk paxos to coordinate between replicas who is allowed to write into the journal.

The storage layer provides interfaces to manage and access volumes for the file system layer. A volume is divided into 10GB chunks, which are distributed across ChunkServers. The large chunk size was chosen to minimize metadata overhead so that it’s practical to maintain the entire chunk-to-server mapping in memory in PolarCtrl. Each ChunkServer manages ~10TB of chunks, so this still offers a reasonable ratio for practical load balancing across ChunkServers. Within a ChunkServer, each chunk is divided into 64KB blocks which are allocated and mapped on demand. Each chunk thus needs 640KB of metadata to map chunk LBAs to block locations, or 640MB for all 1,000 chunks per server.
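The arithmetic behind those numbers, assuming roughly 4-byte mapping entries (which is what the 640KB figure implies):

10 GB chunk / 64 KB blocks   = 163,840 blocks per chunk
163,840 entries × ~4 B each  ≈ 640 KB of mapping metadata per chunk
10 TB per server / 10 GB     = 1,000 chunks, so 1,000 × 640 KB ≈ 640 MB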

PolarFS Write Path
PolarFSWritePath

PolarSwitch is a daemon that runs alongside any application using libpfs. Libpfs forwards IO requests over a shared memory ring buffer to PolarSwitch, and PolarSwitch then divides the IO requests into per-chunk requests, consults its in-memory mapping of chunk-to-server, and sends out the requests. Completions are reported via another shared ring buffer (similar to io_uring). The reasoning for maintaining this as a separate daemon isn’t given, but I’m assuming it was forced by using RDMA as the network transport, which means that either only one process can use the NIC, or, in the case of vNICs, only a fixed number of processes that’s smaller than the number of instances they wish to run per host.

ChunkServers run on the disaggregated storage servers, with one ChunkServer per SSD on a dedicated CPU core. (Which implies they have SSDs that are at least 10TB in size?) Each chunk contains a WAL which is kept on a 3D XPoint SSD (aka Intel Optane). Replication across ChunkServers is done using ParallelRaft, a Raft variant optimized to permit out-of-order completions. SPDK is used to maximize IOPS per core, and is why each ChunkServer gets a dedicated core so that it may poll indefinitely. Likely due to the large chunk and total data size, ChunkServers are given a reasonably high tolerance for being offline.

PolarCtrl is the control plane deployed on a dedicated set of machines. It manages membership and liveness for ChunkServers, maintaining volume and chunk-to-server mappings, assigning of chunks to ChunkServers, and distributing metadata to PolarSwitch instances.

Raft serializes all operations into a log, and commits them in order only. This causes write requests serialized later in the log to wait for all previous writes to be committed before their own response can be sent out, and it caused throughput to drop by half as write concurrency was raised from 8 to 32. As a result, Raft was altered to allow out-of-order acknowledgements from replicas and out-of-order commit responses back to clients, and to permit holes in the Raft log. They detail the effect that this had on leader election and replica catchup. This novel variant effectively transforms Raft into generalized multi-paxos, and no explanation was given as to why they didn’t just implement that directly rather than adapting Raft into it.
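The reason out-of-order completion is safe for a block store is that two writes only need to be ordered if they touch the same blocks. A hedged sketch of that conflict check (the paper tracks recently written LBAs in a "look behind buffer"; the structures here are illustrative):

#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

typedef struct {
    uint64_t start_lba;
    uint64_t n_blocks;
} Extent;

static bool overlaps(Extent a, Extent b) {
    return a.start_lba < b.start_lba + b.n_blocks &&
           b.start_lba < a.start_lba + a.n_blocks;
}

/* missing[] describes block ranges touched by log entries not yet received.
 * An entry may be applied and acknowledged out of order only if its result
 * cannot depend on any of the missing writes. */
bool can_apply_out_of_order(Extent entry, const Extent *missing, size_t n) {
    for (size_t i = 0; i < n; i++)
        if (overlaps(entry, missing[i]))
            return false;
    return true;
}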

Disk snapshots are supported by PolarFS by PolarSwitch tagging requests with a snapshot tag on subsequent requests to ChunkServers. On receiving a new snapshot tag, ChunkServers will snapshot by copying their LBA-to-block-location mapping, and will modify those blocks in a copy-on-write fashion afterwards. After a ChunkServer reports having taken the snapshot, PolarSwitch stops adding the snapshot tag to requests to that ChunkServer.

The evaluation section shows that PolarFS adds minimal overhead as compared to a local ext4 volume, with latency ~10x lower than Ceph and 2x higher throughput. Just to review, it achieved those results by packing extra-large SSDs (>10TB), Intel Optane, RDMA, and large amounts of RAM, each of which is individually expensive, all into one deployment cluster, and special-casing an infrastructure stack for it. Not cheap, nor (given everything I’ve heard about using SPDK and RDMA) easy to write, deploy, or maintain.

PolarDB Computational Storage

Wei Cao, Yang Liu, Zhushi Cheng, Ning Zheng, Wei Li, Wenjie Wu, Linqiang Ouyang, Peng Wang, Yijing Wang, Ray Kuan, Zhenjun Liu, Feng Zhu, and Tong Zhang. 2020. POLARDB Meets Computational Storage: Efficiently Support Analytical Workloads in Cloud-Native Relational Database. In 18th USENIX Conference on File and Storage Technologies (FAST 20), USENIX Association, Santa Clara, CA, 29–41. [scholar]

This paper is more focused on the computational storage side of integrating SmartSSDs (in the form of ScaleFlux’s product) into a database, and the database they happen to have chosen for this work is a disaggregated one. However, I’ve included it in this listing because it’s the only paper that gets into the topic of tight integration between page servers and compute for pushdown in detail. I’ll be doing a disservice to the actual paper in this summary, and focusing only on the pushdown aspect.

The draw of pushdown in a disaggregated architecture is to minimize the amount of processing done on non-matching data. Pushing table scan filters from compute nodes to storage nodes reduces the number of rows or pages that the storage nodes must send over the network. With computational storage, those filters can be pushed all the way to the SSD itself, removing the need to even send non-matching rows over the PCIe bus. However, it is moving compute work from the compute node to storage, and compute resources are much more limited in storage. Rather than scale up the compute resources of the storage nodes, Alibaba elected to increase the compute of the storage devices themselves by utilizing SSDs with on-board FPGAs.

PolarDB Scan Pushdown Architecture
PolarDBComputationalStorage

The required changes in PolarDB start at the scan operator. PolarDB reads data from files by requesting blocks by their offset within the file. That request has been enhanced to include the schema of the table and the predicate to apply to the block. The ChunkServers split the predicates into those that can be pushed to the FPGA, and those that need to be evaluated on the CPU. In the PolarFS paper, ChunkServers are described as having a one-to-one relationship with an attached 10TB SSD and tracking 64KB-sized blocks. In this paper, ChunkServers stripe data across a number of SmartSSDs with 4MB stripes, and 4KB blocks are snappy-compressed and thus variable length. ChunkServers split the request into one per stripe, and forward them to the corresponding SmartSSDs.

The computational storage device has a corresponding driver in Linux which exposes it as a block device.[1] The ChunkServer sends the driver the scan request. The driver reorders filters to match the hardware’s pipelined table record decoding and translates logical blocks to physical blocks on the NAND flash memory. The driver also splits larger scans into smaller ones to avoid head-of-line blocking causing high latency for concurrent requests. [1]: See NVMe Computational Storage Standardization if you’d like more of a view into how SmartSSD<->Host integration works.

PolarDB was modified to be more accommodating of efficient, simple evaluation of predicates. The encoding format for keys and values was changed to always be memcmp()-orderable, so that the FPGA wouldn’t need to understand different value encoding formats and the comparisons for them. Blocks were also changed from having a footer with metadata to a header with metadata, so that decoding of a block can happen as it’s being read.
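A sketch of the general memcmp()-orderable encoding trick for a signed integer key, to show what such a format looks like (this is the standard technique; PolarDB’s exact format isn’t spelled out in the paper):

#include <stdint.h>

/* Flip the sign bit and store big-endian: byte-wise comparison of the encoded
 * form then matches numeric order, so the FPGA only ever needs memcmp(). */
void encode_i64_memcmp(int64_t v, uint8_t out[8]) {
    uint64_t u = (uint64_t)v ^ (1ULL << 63);   /* negatives sort first */
    for (int i = 0; i < 8; i++)
        out[i] = (uint8_t)(u >> (56 - 8 * i)); /* most significant byte first */
}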

Their evaluation compares no pushdown, CPU-only pushdown, and computational storage (CSD) pushdown on TPC-H. Query latency for uncompressed CPU-based pushdown and CSD-based pushdown shows very similar 2-3x improvements, which is unsurprising, as it reflects that the majority of the gain is from freeing the one compute instance from receiving data, evaluating the filter, and then throwing it away. With compressed data, the CSD-based pushdown is noticeably better, as decompression isn’t free, but can be done efficiently in hardware. The PCIe and network traffic graphs per query show that each layer of pushdown removes another 2-3x of network traffic (CPU-based pushdown) or PCIe traffic (CSD-based pushdown).

PolarDB Serverless ★

Wei Cao, Yingqiang Zhang, Xinjun Yang, Feifei Li, Sheng Wang, Qingda Hu, Xuntao Cheng, Zongzhi Chen, Zhenjun Liu, Jing Fang, Bo Wang, Yuhui Wang, Haiqing Sun, Ze Yang, Zhushi Cheng, Sen Chen, Jian Wu, Wei Hu, Jianwei Zhao, Yusong Gao, Songlu Cai, Yunyang Zhang, and Jiawang Tong. 2021. PolarDB Serverless: A Cloud Native Database for Disaggregated Data Centers. In Proceedings of the 2021 International Conference on Management of Data (SIGMOD/PODS '21), ACM, 2477–2489. [scholar]

The PolarDB Serverless paper is about leveraging a multi-tenant scale-out memory pool, built via RDMA. This makes them also a disaggregated memory database! As a direct consequence, memory and CPU can be scaled independently, and the evaluation shows elastically changing the amount of memory allocated to a PolarDB tenant.

However, implementing a page cache over RDMA isn’t trivial, and a solid portion of the paper is spent talking about the exact details of managing latches on remote memory pages and navigating B-tree traversals. Specifically, B-tree operations which change the structure of the tree required significant care. Recovery also has to deal with the fact that the remote buffer cache holds all the partial execution state from the failed RW node, so the new RW node has to release latches in the shared memory pool and throw away pages which were partially modified. I’ll be eliding all the RDMA-specific details, and just covering the parts that would equally apply to a slower, TCP-based memory disaggregation architecture as well. There’s also a lot packed into this paper, as it covers PolarDB and PolarFS enhancements as well, so be warned.

They offer an architecture diagram for PolarDB as a whole:

PolarDB Architecture
PolarDBArchitecture

However, there’s a few things I think it doesn’t represent well:

PolarDB Serverless extends this to add a remote memory pool, which allows read-only and read-write nodes to share the same buffer pool. Remote memory access is performed via librmem, which exposes the API:

int page_register(PageID page_id,
                  const Address local_addr,
                  Address& remote_addr,
                  Address& pl_addr,
                  bool& exists);
int page_unregister(PageID page_id);
int page_read(const Address local_addr,
              const Address remote_addr);
int page_write(const Address local_addr,
               const Address remote_addr);
int page_invalidate(PageID page_id);

The minimum unit of allocation is a 1GB physically contiguous slab, which is divided into 16KB pages (because PolarDB is MySQL, and MySQL uses 16KB pages). A slab node holds multiple slabs, and database instances allocate slabs across multiple slab nodes to meet their predefined buffer pool capacity when they’re first started. The first allocated slab is nominated as the home node, and is assigned the responsibility of hosting the buffer cache metadata for the database instance. The Page Address Table (PAT) tracks the slab node and physical address of each page. The Page Invalidation Bitmap (PIB) is updated when a RW node has a local modification to a page which hasn’t been written back yet (and is used by RO nodes to know when they’re stale). The Page Reference Directory (PRD) tracks what instances currently hold references to each page described in the PAT. The Page Latch Table (PLT) manages a page latch for each entry in the PAT.

PolarDB Serverless Remote Buffer Pool
PolarDBServerlessRemoteMemory
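A sketch of what a home-node directory entry might hold, just to make the four structures concrete; the layout is my own, not the paper’s:

#include <stdbool.h>
#include <stdint.h>

#define MAX_DB_NODES 64   /* database nodes that may reference a page */

typedef struct {
    uint64_t page_id;
    /* PAT: where the page physically lives in the remote pool */
    uint16_t slab_node;
    uint64_t physical_addr;
    /* PIB: the RW node holds a newer, not-yet-written-back local copy */
    bool     invalidated;
    /* PRD: which database nodes currently hold a reference */
    uint64_t referencing_nodes_bitmap[MAX_DB_NODES / 64];
    /* PLT: the page latch protecting concurrent access */
    uint32_t latch;
} PageDirectoryEntry;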

page_register is a request to the home node to either increment the refcount for the page and return its address, or allocate a new page (evicting an old one if necessary to make space) and return that. (This isn’t reading the page from storage, as there’s no direct Slab Node<->PolarFS communication, just allocating space in the remote buffer pool.) page_unregister decrements the reference count, allowing the page to be freed if needed. Dirty pages can always be immediately evicted, as PolarDB can materialize pages on demand from the ChunkServers. If the buffer pool size is expanded, the home node expands its PAT/PIB/PRD metadata accordingly, and allocates slabs eagerly. If the buffer pool size is shrunk, then extra memory is released by freeing pages, the existing pages are defragmented, and then the now-unused slabs are released. Note that the defragmentation and physically contiguous memory are only needed to permit one-sided RDMA reads/writes, and a non-RDMA implementation could likely be simpler and non-contiguous.

Each instance has a local page cache in RAM, because there’s no L1/L2/L3 cache for remote memory. This local cache is tunable and defaults to \(min(sizeof(RemoteMemory)/8, 128GB)\), which was set by observing the effects on TPC-C and TPC-H benchmarks. Not all pages read from PolarFS are pushed into remote memory: pages read by full table scans are only read into the local page cache, and then are discarded. Modifications to pages are still performed only in the local cache. If the page exists in the remote buffer pool, it must first be marked as invalidated before it can be modified, and before it can be dropped from the local cache it must be written back to the remote buffer pool (the flow of which is shown in the diagram above). Insertions and deletions optimistically traverse the tree without locks, assuming they won’t need to split/merge any pages, and restart into a pessimistic locking traversal if that turns out to be necessary. (Interestingly, this is in contrast to Socrates, which just has RO nodes restart their B-tree traversals whenever they encounter child pages of an older version than the parent page.)
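A sketch of the ordering constraints that places on the RW node, using simplified wrappers around the librmem calls above (signatures abbreviated; the wrapper names are mine):

#include <stdint.h>

/* Assumed thin wrappers over page_invalidate() and page_write(). */
void remote_invalidate(uint64_t page_id);
void remote_write_back(uint64_t page_id, const void *local_copy);

/* A page must be marked invalid in the remote pool before the RW node
 * modifies its local copy, so RO nodes never trust a stale remote version. */
void modify_locally(uint64_t page_id, void *local_copy) {
    remote_invalidate(page_id);
    /* ... apply the change to local_copy only ... */
}

/* The remote pool must be brought up to date before the local copy is
 * dropped, so the latest version is never lost from the shared buffer pool. */
void evict_from_local_cache(uint64_t page_id, const void *local_copy) {
    remote_write_back(page_id, local_copy);
    /* ... now the local copy may be freed ... */
}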

There were a few improvements made to PolarDB which are presented as seemingly unrelated to the disaggregated memory architecture, but which I believe are a direct consequence of it. The snapshot isolation implementation was changed to utilize a centralized timestamp service, which is queried for both the read timestamp and the commit timestamp. All rows have a commit timestamp suffixed to make MVCC visibility filtering easy, and a Commit Timestamp Log was added which records the commit timestamp of each transaction, to allow resolving the commit timestamps of recently committed data. The need for a remote timestamp service and a per-row commit timestamp is so that promoting a Read-Only replica to the Read-Write leader doesn’t require scanning all the data. There’s no need to recover the next valid commit timestamp, as it’s held in a remote service. There’s no need to rebuild metadata about which concurrent transactions shouldn’t see each other’s effects, as MVCC visibility rules are a strict timestamp filter and rows without commit timestamps can be incrementally resolved. (This also results in an MVCC and transaction protocol which looks a lot like TiDB’s.) Similarly, PolarDB Serverless finally justified adding the GetPage@LSN request to PolarFS that every other disaggregated OLTP system already had (see, for example, the Socrates overview).
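A sketch of what that visibility check looks like with commit timestamps stored on rows and a Commit Timestamp Log for not-yet-stamped rows; all names here are illustrative:

#include <stdbool.h>
#include <stdint.h>

#define TS_UNSTAMPED UINT64_MAX   /* commit timestamp not yet written to the row */

/* Assumed lookup into the Commit Timestamp Log; returns TS_UNSTAMPED if the
 * writing transaction hasn't committed yet. */
uint64_t cts_log_lookup(uint64_t txn_id);

bool row_visible(uint64_t row_commit_ts, uint64_t row_txn_id, uint64_t read_ts) {
    if (row_commit_ts == TS_UNSTAMPED)
        row_commit_ts = cts_log_lookup(row_txn_id);   /* resolve lazily */
    return row_commit_ts != TS_UNSTAMPED && row_commit_ts <= read_ts;
}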

There’s a couple optimizations to transaction and query processing that they specifically call out. Read-only nodes don’t acquire latches in the buffer pool unless the RW node says it modified the B-tree structure since the Read-only node’s last access. They also implement a specific optimization for indexes: a prefetching index probe operation. Fetching keys from the index will generate prefetches to load the pointed-to data pages from the page servers, under the assumption that they’ll be immediately requested as part of SQL execution anyway.

In the event of the loss of the RW node, the Cluster Manager will promote an RO node to be the new RW node. This involves collecting the \(min(max LSN per chunk)\) and requesting redo logs to be processed to bring all chunks to a consistent version. All invalidated pages in the remote memory pool are evicted (using the Page Invalidation Bitmap, so it’s not a full scan of GBs of data), along with any pages whose version is newer than the redo’d recovery version. All locks held by the failed RW node are released. All active transactions are recovered from the headers of the undo log. The new RW node then notifies the Cluster Manager that its recovery is complete, and rolls back the active transactions in the background. If a RW node voluntarily gives up its status as the writer to another node, it can flush all modified pages and drop all locks to save the RO node the work of applying redo logs and evicting pages from the buffer pool. In the drastic event where all replicas of the home slab are lost, all slabs are cleared, and all database nodes are restarted so that recovery restores a consistent state.

The evaluation shows the impact of all the above optimizations on recovery time. With no optimizations, unavailability lasted ~85s, and recovery back to the original performance takes 105s. With page materialization on PolarFS, it’s reduced to an unavailability of ~15s and full performance after 35s. With the remote memory buffer pool, it’s an unavailability of ~15s, and full performance after 23s. A voluntary handoff by the RW node leads to 2s of unavailability and full performance after 6s. Otherwise, the graphs show about what one would expect: memory can be scaled elastically, and performance improves/degrades with more/less memory, respectively.

Discussion

They still undersold the RDMA difficulty. Someone who has worked with it previously commented that there’s all sorts of issues about racing reads and writes, and getting group membership and shard movement right is doubly hard. In both cases, an uninformed client can still do one-sided RDMA reads from a server they think is still a part of a replication group and/or has the shard it wants.

PolarDB-X

Wei Cao, Feifei Li, Gui Huang, Jianghang Lou, Jianwei Zhao, Dengcheng He, Mengshi Sun, Yingqiang Zhang, Sheng Wang, Xueqiang Wu, Han Liao, Zilin Chen, Xiaojian Fang, Mo Chen, Chenghui Liang, Yanxin Luo, Huanming Wang, Songlei Wang, Zhanfeng Ma, Xinjun Yang, Xiang Peng, Yubin Ruan, Yuhui Wang, Jie Zhou, Jianying Wang, Qingda Hu, and Junbin Kang. 2022. PolarDB-X: An Elastic Distributed Relational Database for Cloud-Native Applications. In 2022 IEEE 38th International Conference on Data Engineering (ICDE), IEEE, 2859–2872. [scholar]

PolarDB-X is targeting three problems: cross-DC transactions, to extend PolarDB to more than one region; elasticity, by automatically adding read-only replicas and partitioning write responsibilities; and HTAP, by identifying and steering analytical and transactional queries to separate replicas. At a high level, PolarDB-X is the Vitess or Citus of PolarDB. Individual PolarDB instances become partitions in the broader PolarDB-X distributed, shared-nothing database. It is also open source, and available at polardb/polardbx on GitHub. It seems in a very similar vein to the mostly unpublished Amazon Aurora Limitless.

PolarDB-X Architecture
PolarDBXArchitecture

Above PolarDB, PolarDB-X adds a Load Balancer and set of Computation Nodes per PolarDB instance (DN & SN), with one Global Meta Service (GMS) for system metadata. The GMS is the control plane for PolarDB-X, and manages cluster membership, catalog tables, table/index partitioning rules, locations of shards, statistics, and MySQL system tables. The Load Balancer is the user’s entry point to PolarDB-X, which is exposed as a single geo-aware virtual IP address. The Computation Node coordinates read and write queries across the shards of tables stored in different PolarDB instances. For read queries, it decides if the local snapshot is fresh enough to avoid needing to go to a cross-AZ leader. For write queries, it manages the cross-shard transaction, if needed. It includes a cost-based optimizer and query executor, which it uses to break queries into per-shard queries, and apply any cross-shard evaluation needed to produce the final result. For an overview of the Database Node (PolarDB) or Storage Node (PolarFS), see their respective paper overviews above.

PolarDB-X hashes the primary key to assign rows to shards, by default. Not detailed in the paper, but the PolarDB-X Partitioned table docs describe that the supported partitioning strategies are: SINGLE, for unsharded tables; BROADCAST, for replicating the table on each shard; and PARTITION BY HASH, RANGE, LIST (manually assigned partitioning), or COHASH (HASH but multiple columns have the same value). Indexes can be defined as either global or local, where local indexes always index the data within the same shard. Tables with identical partition keys can be declared as a table group, and identical values will always result in the rows being stored on the same shard, thus predictably accelerating equi-joins.

The cross-DC replication is done by having PolarDB ship redo logs across datacenters. The replication is done through, and in conjunction with, a Paxos implementation managing leadership and advancing the Durable LSN as followers reply. Transactions are divided into mini-transactions, and shipped incrementally in batches of redo logs (intermixed with other transactions). When the last mini-transaction of a user’s transaction is marked durable, the transaction has been committed.

To implement cross-shard transactions, PolarDB-X layers another MVCC and transaction protocol on top. They use a Hybrid Logical Clock to implement Snapshot Isolation. HLCs were chosen to not rely on tight physical clock synchronization, and to avoid the centralized clock server of a TiDB/Percolator-like approach. (Note that this does mean they technically sacrifice linearizability.) They include a few optimizations to reduce the number of times they bump the causality counter in HLCs, but otherwise, it’s a standard HLC and 2PC implementation. The public documentation instead describes the use of a Timestamp Oracle, and describes the GMS as serving that functionality to the Compute Nodes.
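For reference, a minimal sketch of a standard Hybrid Logical Clock (the general construction, not PolarDB-X’s exact code): timestamps pair a physical component with a logical counter, so ordering respects causality without tight clock synchronization.

#include <stdint.h>

typedef struct { uint64_t pt; uint32_t lc; } Hlc;

static Hlc clk;

uint64_t physical_now(void);   /* assumed wall-clock source, e.g. microseconds */

Hlc hlc_local_or_send(void) {
    uint64_t pt = physical_now();
    if (pt > clk.pt) { clk.pt = pt; clk.lc = 0; }
    else             { clk.lc++; }
    return clk;
}

Hlc hlc_receive(Hlc remote) {
    uint64_t pt = physical_now();
    if (pt > clk.pt && pt > remote.pt) {
        clk.pt = pt; clk.lc = 0;
    } else if (remote.pt > clk.pt) {
        clk.pt = remote.pt; clk.lc = remote.lc + 1;
    } else if (clk.pt > remote.pt) {
        clk.lc++;
    } else {   /* equal physical components: take the max logical counter + 1 */
        clk.lc = (clk.lc > remote.lc ? clk.lc : remote.lc) + 1;
    }
    return clk;
}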

PolarDB-X OSS Architecture
PolarDBXHTAP

PolarDB-MT is an extension of PolarDB to natively understand multi-tenancy. A tenant is a set of schemas, databases, and tables. Cross-tenant operations are not permitted. A single PolarDB instance supports multiple tenants, and all operations are sent through the assigned RW node’s redo log. The tenant-to-RW-database-node mapping is stored in the GMS, and the RW node maintains a lease for the tenants it holds. Tenants can be transferred by suspending and transferring all active work and flushing dirty pages, then transferring the lease. In the case of a failure, tenants can be split across other RW PolarDB instances, which divide the failed instance’s redo log by tenant and run recovery accordingly. What’s the difference between a shard and a tenant? The paper doesn’t answer at all. The public documentation on tenants describes it as a user-facing feature which is a performance-isolated container for users and databases. It also seems likely that, much like Nile, tenants are used internally to binpack customers onto machines more efficiently.

PolarDB-X also powers an HTAP solution, where row-wise RW database nodes also asynchronously replicate into columnar Read-Only database nodes. (Which is a very TiDB/TiFlash take on HTAP.) A cost-based optimizer in the CN identifies OLAP queries, and dispatches them to the columnar database nodes. Portions of the analytical query are pushed down to the Storage Nodes (aka PolarFS), as an extension of the work described in PolarDB Computational Storage above. The Compute Node is nominated as the Query Coordinator, which breaks the query into fragments that can be distributed and executed on other Compute Nodes for parallel processing. Query execution is timesliced into 500ms jobs so that many queries may make progress concurrently. The threadpool for analytical processing work is placed under a cgroup to limit its resource usage, whereas transactional processing is unconstrained. The details on the analytical engine itself are published in the next paper: PolarDB-IMCI.

The evaluation section doesn’t hold any major surprises. They saw 19% higher sysbench throughput using HLCs rather than a timestamp oracle. Scaling operations complete within 4-5 seconds, without major disruptions. Having columnar data available improved the execution time of queries which benefit heavily from columnstores.

Huawei

TaurusDB ★

Alex Depoutovitch, Chong Chen, Jin Chen, Paul Larson, Shu Lin, Jack Ng, Wenlin Cui, Qiang Liu, Wei Huang, Yong Xiao, and Yongjun He. 2020. Taurus Database: How to be Fast, Available, and Frugal in the Cloud. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (SIGMOD/PODS '20), ACM, 1463–1478. [scholar]

The entire "Background and Related Work" section is a great read. They set up excellent and concise comparisons against the same systems we’ve discussed above. In very short summary: PolarFS (not PolarDB Serverless) uses a filesystem abstraction without smart storage and thus loses efficiency, Aurora uses 6-node quorums for both logs and pages which over-promises on durability and availability respective, and Socrates added too much complexity with its four teir Compute/XLOG/Page Server/XSTORE architecture.

Taurus Architecture
TaurusArchitecture

In Taurus’s Log Store, WAL segments are sent to a fixed-size, append-only, synchronously replicated storage object called a PLog (Part of a Log?). In a deployment, there are hundreds of Log Servers. Three are chosen to form a PLog. All three must ack a write, otherwise a new PLog is allocated. (It’s reconfiguration-based replication!) The database WAL is an ordered collection of PLogs, itself stored in a PLog. Metadata PLogs are chained as a linked list.
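A sketch of that write path as I understand it; the function names are placeholders, not Taurus’s:

#include <stdbool.h>
#include <stddef.h>

typedef struct { int log_servers[3]; } PLog;

/* Assumed helpers: send a record to one Log Server, allocate a fresh PLog on
 * three healthy Log Servers, and append the switch to the metadata PLog. */
bool send_to_log_server(int server, const void *buf, size_t len);
PLog allocate_new_plog(void);
void record_plog_switch(const PLog *next);

void append_wal(PLog *current, const void *buf, size_t len) {
    for (;;) {
        bool all_acked = true;
        for (int i = 0; i < 3; i++)
            all_acked &= send_to_log_server(current->log_servers[i], buf, len);
        if (all_acked)
            return;                        /* durable on all three replicas  */
        *current = allocate_new_plog();    /* reconfigure instead of waiting */
        record_plog_switch(current);
    }
}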

The Page Stores behave roughly the same: they accept logs and serve versioned pages. Page Stores are notified of the oldest LSN which might still be requested, and must be able to answer what the highest LSN they can serve is.

Taurus abstracts most of the logic of dealing with Log Stores and Page Stores into a Storage Abstraction Layer, which manages the mapping of WAL segments to PLogs and slices to Page Stores. The paper describes the read and write flow in detail, but it didn’t feel notably different from any of the previously discussed systems.

Taurus Write Path
TaurusWritePath

For anyone who is against reconfiguration-based replication because of the "unavailability" while reconfiguring to a new set of available replicas, you’ll hate the "comparison with quorum replication". They argue that their probability of write unavailability is effectively zero as all Log Stores or Page Stores from their global pool of nodes would have to be unavailable for a new shard to be un-allocatable. This argument both is and isn’t true.

Both recovery and replication to read-only replicas are discussed in decent detail, but neither felt notably different. I do appreciate the level of detail, though, in illustrating how recovery works, as it was more pleasant to go through here than in some other papers. Replication to read-only replicas has just been about applying log records to cached pages in every system thus far. They do mention separating notifying replicas that there were WAL changes published (and where to find them) from actually serving that data from Log Servers, so that the primary isn’t responsible for the network bandwidth of broadcasting WAL changes. The Page Stores also gossip the data so that Log Servers aren’t being entirely taxed for network bandwidth either.

Page Stores are append-only on disk, with a lock-free hashtable mapping (page, version) to a slot in the log. The hashtable is periodically saved to storage to bound recovery time. Page Stores have their own buffer pool, which is mostly to avoid IO during the lookup of the previous page version when applying a WAL entry. There’s an interesting tidbit that LFU is a better cache replacement policy for second-level caches.
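A sketch of the read side of that (page, version) map: to serve a page at some read LSN, return the newest materialized version that isn’t newer than the request. The data layout here is illustrative, and the real structure is a lock-free hashtable rather than a sorted array.

#include <stdint.h>
#include <stddef.h>

typedef struct {
    uint64_t lsn;    /* version of this materialized page          */
    uint64_t slot;   /* offset of the page in the append-only log  */
} PageVersion;

/* versions[] holds the versions of one page, sorted by lsn ascending.
 * Returns the slot to read, or -1 if nothing old enough exists. */
int64_t slot_for_read(const PageVersion *versions, size_t n, uint64_t read_lsn) {
    int64_t best = -1;
    for (size_t i = 0; i < n && versions[i].lsn <= read_lsn; i++)
        best = (int64_t)versions[i].slot;
    return best;
}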

What’s the major difference between Taurus and others? Reconfiguration-based replication!

Multi-Master ★

Alex Depoutovitch, Chong Chen, Per-Ake Larson, Jack Ng, Shu Lin, Guanzhu Xiong, Paul Lee, Emad Boctor, Samiao Ren, Lengdong Wu, Yuchen Zhang, and Calvin Sun. 2023. Taurus MM: Bringing Multi-Master to the Cloud. Proceedings of the VLDB Endowment 16, 12 (August 2023), 3488–3500. [scholar]

The suggested reading of this paper is, admittedly, mostly an excuse to discuss multi-master designs within disaggregated OLTP. Aurora had multi-master implemented, which they’ve since removed. Socrates was against multi-master. PolarDB mentioned that the global page cache means they could support it, but such work was out of scope for the paper. So TaurusDB is our chance to look at this design.

Taurus Multi-Master Architecture
TaurusMMArchitecture

Multi-master means concurrent modifications, and naively that means the LSN is now a vector clock. The paper introduces a clock type that’s a hybrid between a vector clock and a scalar Lamport clock. Basically, for server 3, clock[3] is a Lamport clock and the rest of the indexes are a vector clock. This has the effect of advancing the server’s clock faster, as it’s effectively a counter of causally related global events rather than local events. At times when causality is already known, like operations serialized by contending on a lock, Taurus uses the scalar clock. Logs and pages are locally recorded with a scalar clock, and sent to the Log Service with a vector clock. Page reads are done with a scalar clock.
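A sketch of that hybrid clock as I read the description; treat it as my interpretation rather than the paper’s definition. Each master’s own entry behaves like a Lamport clock (jumping past everything it has observed), while the other entries are a plain vector clock:

#include <stdint.h>

#define N_MASTERS 8   /* arbitrary choice for the sketch */

typedef struct {
    uint64_t c[N_MASTERS];
    int self;          /* index of this master */
} HybridClock;

void hc_local_event(HybridClock *h) {
    uint64_t max = 0;
    for (int i = 0; i < N_MASTERS; i++)
        if (h->c[i] > max) max = h->c[i];
    h->c[h->self] = max + 1;   /* Lamport-style: advance past all known events */
}

void hc_merge(HybridClock *h, const uint64_t remote[N_MASTERS]) {
    for (int i = 0; i < N_MASTERS; i++)
        if (remote[i] > h->c[i]) h->c[i] = remote[i];   /* vector-clock merge */
    hc_local_event(h);         /* then bump our own scalar entry */
}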

The other side of concurrent modifications is that page locking can no longer be done locally in RAM on one primary replica. So the paper next discusses locking. Locks are held globally in a Global Lock Manager at page granularity with the usual Shared/eXclusive locking scheme. Once a master has a page lock, it can grant equal or lesser row locks. Pages can be unlocked and returned to the GLM if another master requests the page, but the rows will stay locked. (Imagine wanting exclusive locks on different rows in the same page.) The Global Lock Manager would also be responsible for deadlock detection.

Note the introduction of another component: the Global Slice Manager. Sharding pages across servers is a decision that no master is allowed to make locally, so the responsibility of sharding data was moved to a global component.

In comparison against Aurora Multi-Master, it’s noted that Aurora pushed resolving conflicts between masters to the storage layer. In the evaluation, the two designs perform similarly when there’s no data sharing, but the Taurus design performs much better as data sharing increases.

Discussion

MariaDB Xpand actually did something similar to this, but they never wrote about it, and the project was shut down by MariaDB.

Multi-master is also useful for upgrades, as it gives one a way to do a rolling upgrade to a new database binary and incrementally shift transactions over. However, having two databases live at different versions means one also has to get upgrade/downgrade testing done well.

Who needs multi-master? Aurora dropped their own multi-master support, and the rumor was that it wasn’t getting heavily used. Is there actually a desire for this? Are there enough customers topping out their disaggregated OLTP database with excessive writes that it’s worthwhile to make the investment into all the complexity that multi-master brings?


See discussion of this page on Reddit, HN, and lobsters.