Disaggregated OLTP Systems

These notes were first prepared for an informal presentation on the various cloud-native disaggregated OLTP RDBMS designs that have been getting published, and that presentation cherry-picked one paper per notable design decision. For the papers covered then, I’ve included a summary of the discussion we had after each paper. This page is being actively extended to cover all disaggregated OLTP papers, even when two vendors published papers on similar designs. ("Actively" meaning as of 2025-02-09.)

If you’re looking for a quick overview of the disaggregated OLTP space, just read the papers with a ★.

Amazon

Aurora ★

Read these two papers together, and don’t try to stop to understand all the fine details about log consistency across replicas and commit or recovery protocols in the first paper. That material is covered in more detail (and with diagrams!) in the second paper. I’d almost suggest reading the second paper first.

Alexandre Verbitski, Anurag Gupta, Debanjan Saha, Murali Brahmadesam, Kamal Gupta, Raman Mittal, Sailesh Krishnamurthy, Sandor Maurice, Tengiz Kharatishvili, and Xiaofeng Bao. 2017. Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases. In Proceedings of the 2017 ACM International Conference on Management of Data (SIGMOD/PODS'17), ACM, 1041–1052. [scholar]
Alexandre Verbitski, Anurag Gupta, Debanjan Saha, James Corey, Kamal Gupta, Murali Brahmadesam, Raman Mittal, Sailesh Krishnamurthy, Sandor Maurice, Tengiz Kharatishvilli, and Xiaofeng Bao. 2018. Amazon Aurora: On Avoiding Distributed Consensus for I/Os, Commits, and Membership Changes. In Proceedings of the 2018 International Conference on Management of Data (SIGMOD/PODS '18), ACM. [scholar]

As the first of the disaggregated OLTP papers, they introduce the motivation for wanting to build a system like Aurora. Previously, one would just run an unmodified MySQL on top of EBS, and when looking at the amount of data transferred, the same data was being sent in different forms multiple times. The redo log, the binlog, the page write, and the double-write buffer write are all essentially doing the same work of sending a tuple from MySQL to storage.

MySQL on EBS
RDSArchitecture

Thus, Aurora introduces using the write-ahead log as the protocol between the compute and the storage in a disaggregated system. The page writes, double-write buffer, etc. are all removed and made the responsibility of the storage after receiving the write-ahead log. The papers we’re looking at all reference this model with some form of the phrase "the log is the database" as part of their design.

Aurora Architecture
AuroraArchitecture

The major idea they present is that the network then becomes the bottleneck in the system, and that the smart storage is able to meaningfully offload the work of turning WAL updates into page modifications, handling MVCC cleanup, checkpointing, etc.

So I think the easiest way to get started is to zoom in on a single storage node:

Aurora Storage Node
AuroraStorageNode

It involves the following steps:

  1. receive log record and add to an in-memory queue,

  2. persist record on disk and acknowledge,

  3. organize records and identify gaps in the log since some batches may be lost,

  4. gossip with peers to fill in gaps,

  5. coalesce log records into new data pages,

  6. periodically stage log and new pages to S3,

  7. periodically garbage collect old versions, and finally

  8. periodically validate CRC codes on pages.
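Steps 3 and 4 are where most of the subtlety lives. A minimal sketch of the gap detection, assuming each log record carries a backlink to the previous record’s LSN; the structure and field names here are illustrative, not Aurora’s:

#include <stdint.h>
#include <stddef.h>

typedef struct {
    uint64_t lsn;
    uint64_t prev_lsn;   /* backlink written by the primary */
} LogRecord;

/* records[] is sorted by lsn; returns the highest LSN below which the chain
 * is unbroken. Anything above it must be filled in by gossiping with peers. */
uint64_t segment_complete_lsn(const LogRecord *records, size_t n) {
    uint64_t complete = 0;
    for (size_t i = 0; i < n; i++) {
        if (i > 0 && records[i].prev_lsn != records[i - 1].lsn)
            break;                    /* a record is missing in between */
        complete = records[i].lsn;
    }
    return complete;
}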

Storage nodes are used as part of a quorum, and the classic "tolerate loss of 1 AZ + 1 machine" means 6-node quorums with |W|=4 and |R|=3. The quorum means that transient single-node failures (either accidental network blips or intentional node upgrades) are handled seamlessly. However, these aren’t traditional majority quorums. The Primary is an elected sole writer, which transforms majority quorums into something more consistent. Page server quorums are also reconfigured on suspected failure. This is a replication design that doesn’t even fit cleanly into my Data Replication Design Spectrum blog post.
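The quorum arithmetic behind the 6/4/3 choice is worth spelling out; a minimal check of the intersection properties (my framing, not code from the paper):

#include <assert.h>

int main(void) {
    const int N = 6, W = 4, R = 3;   /* two copies in each of three AZs */
    assert(W + R > N);   /* any read quorum overlaps any write quorum   */
    assert(W + W > N);   /* two conflicting writes can't both succeed   */
    /* Losing one AZ (2 nodes) leaves 4, enough to keep writing;
     * losing one AZ plus one more node leaves 3, enough to keep reading. */
    assert(N - 2 >= W);
    assert(N - 3 >= R);
    return 0;
}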

Each log write contains its LSN, and also includes the last PSN sent to the storage group. Not every write is a transaction commit. There’s a whole discussion of Storage Consistency Points in the second paper to dig into the exact relationships between the Volume Complete LSN and the Consistency Point LSN and the Segment Complete LSN and a Protection Group Complete LSN. The overall point to get here is that trying to recover a consistent snapshot from a highly partitioned log is hard.

There’s a recovery flow to follow when the primary fails. A new primary must contact every storage group to find the highest LSN below which all log records are known, and then recover to min(max LSN per group), but again, that’s a summary, because the reality seems more complicated. However, the work of then applying the redo logs to properly recover to that LSN is now parallelized across many storage nodes, leading to a faster recovery.
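A sketch of that recovery-point computation, under my simplified reading that each storage group reports a single gap-free LSN:

#include <stdint.h>
#include <stdio.h>

/* Each storage group reports the highest LSN below which it has every log
 * record; the volume recovers to the minimum of those answers. */
uint64_t volume_recovery_lsn(const uint64_t *group_complete_lsn, int n_groups) {
    uint64_t recover_to = UINT64_MAX;
    for (int i = 0; i < n_groups; i++)
        if (group_complete_lsn[i] < recover_to)
            recover_to = group_complete_lsn[i];
    return recover_to;
}

int main(void) {
    uint64_t groups[] = {1042, 998, 1010};   /* made-up per-group answers */
    printf("recover to LSN %llu\n",
           (unsigned long long)volume_recovery_lsn(groups, 3));
    return 0;
}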

As there’s only one nominated writer for the quorum of nodes, the writer knows which nodes in the quorum have accepted writes up to what version. This means that reads don’t need to be quorum reads, the primary is free to send read requests only to one of the at-least-four nodes that it knows should have the correct data.

Read-only replicas consume the redo log stream from the primary, and apply it to cached pages only. Uncached data comes from storage groups. S3 is used for backups.

Discussion

Is this a trade of decreasing the amount of work on writes at the cost of increasing the amount of work on reads?

Moving storage over the network does add some cost; reconstructing full pages at arbitrary versions isn’t cheap, and while MySQL could apply the WAL entry directly to the buffer-cached page, the storage node might have to fetch the old page from disk. But much of the work is work that MySQL would otherwise be doing: finding old versions of tuples by chaining through the undo log, fuzzy checkpointing, etc. So while fetching pages from disk over a network is slower than fetching them locally, there’s a good argument that it lets MySQL focus more on query execution and transaction processing than on storage management.

Aurora Serverless

Bradley Barnhart, Marc Brooker, Daniil Chinenkov, Tony Hooper, Jihoun Im, Prakash Chandra Jha, Tim Kraska, Ashok Kurakula, Alexey Kuznetsov, Grant McAlister, Arjun Muthukrishnan, Aravinthan Narayanan, Douglas Terry, Bhuvan Urgaonkar, and Jiaming Yan. 2024. Resource Management in Aurora Serverless. Proceedings of the VLDB Endowment 17, 12 (August 2024), 4038–4050. [scholar]

This paper describes the transition from their naive Aurora Serverless v1 (ASv1) to Aurora Serverless v2 (ASv2). It covers both the product dimensions of billing and end-user experiences, and the internal technical parts of how to orchestrate scaling up/down, managing load, and transferring user workloads with minimal disruption. ASv1 relied upon relaunching a database instance in order to change its scale. A multi-tenant proxy frontend was created to allow sessions to be transferred to a rapidly restarted database instance. This session transfer was incomplete (temporary tables couldn’t be transferred), disruptive (due to transient unavailability), and inelastic, as paying the cost of a restart only made sense for large (power of 2) instance size changes. The goal of ASv2 was to be able to scale faster, less disruptively, and to better track a cyclical workload.

Customers buy Aurora Serverless in units of Aurora Capacity Units (ACUs), which is a combination of 2GB RAM + 0.25 vCPU + an undefined amount of networking and block device throughput. Users define a ceiling and floor in ACU of what they wish for their database to scale up or down to, and then Aurora Serverless tries to autoscale to approximate fully elastic, usage-driven pricing.

Resource management in Aurora Serverless is split into fleet-wide, inter-host rebalancing and host-local, in-place scaling.

AuroraServerlessArchitecture

Instance Managers gather resource usage information for database instances on a host, and work within the host’s resource limits to scale instances up or down to meet the resource needs. The Fleet Manager controls database instance to host assignment. Hosts' resources are oversubscribed, and when hosts are under resource pressure (at a critical level for CPU, allocated RAM, network, or disk throughput), the Fleet Manager will assign temporary ACU limits and live migrate database instances to redistribute heat across the cluster and relieve the resource pressure. The scale-up rate is limited by the Instance Manager to give the Fleet Manager time to react. The Fleet Manager will not live migrate from hosts which are deemed not to have the available network bandwidth to sustain an out-migration. New database instances are placed assuming minimum ACU usage. The Fleet Manager also adjusts the size of the fleet according to predicted and actual demand.

The Fleet Manager must choose which instance to move, and which host to move it to. Choosing an instance is a three-step process: remove any ineligible instances, compute a preference score (e.g. don’t move frequently moved instances, prefer instances that have ack’d a heartbeat recently), and compute a numerical score (how much of the host’s resources will be freed up, combined with what fraction of its own resources the instance leaves unused). Instances with equal preference scores are tiebroken by numerical score. Target host selection proceeds similarly: ineligible hosts are removed, a preference score is computed (fault tolerance distribution, no recent migration failures), and then a numerical score (best-fit binpacking score, and most utilized resource percentage). In the evaluation, they show that this three-phase approach does a better job of distributing load across the fleet than a baseline of just best-fit, with less instance movement.
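A minimal sketch of that filter / preference / numerical-score selection; the scoring inputs are placeholders for the signals described above, not the paper’s exact formulas:

#include <stddef.h>

typedef struct {
    int    eligible;     /* phase 1: e.g. not already mid-migration          */
    int    preference;   /* phase 2: e.g. penalize recently moved instances  */
    double numerical;    /* phase 3: e.g. resources freed on the hot host    */
} Candidate;

/* Returns the index of the best candidate, or -1 if nothing is eligible.
 * The same shape applies to instance selection and target host selection. */
int pick_candidate(const Candidate *c, size_t n) {
    int best = -1;
    for (size_t i = 0; i < n; i++) {
        if (!c[i].eligible)
            continue;
        if (best < 0 ||
            c[i].preference > c[best].preference ||
            (c[i].preference == c[best].preference &&
             c[i].numerical > c[best].numerical))
            best = (int)i;
    }
    return best;
}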

Database instances are wrapped in VMs for security reasons, and thus resource elasticity must be done in cooperation with the guest OS of each VM. Every VM is of the same 128 ACU maximum instance size. This relies on Nitro’s SR-IOV support for efficient virtualized IO. Memory elasticity required a number of changes: memory can be offlined to prevent it from being used for page cache and so that Linux doesn’t keep a page table entry around for every page, cold pages are swapped out, and 4KB pages are coalesced into 2MB-sized free pages which can be reclaimed by the hypervisor. Memory scales up based on the desired buffer pool size over the past 30 seconds, and down over the past 60 seconds. CPU scales up based on the P50 over the past 30 seconds, and down based on the P70 over the past 60 seconds. Scaling up is done using the maximum of the two, scaling down uses the minimum.
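One way to read those scale-up/scale-down rules, as a sketch (the signal names and the idea of expressing everything in ACUs are my assumptions):

/* cpu_up_30s / mem_up_30s: CPU P50 and desired buffer pool over the last 30s.
 * cpu_dn_60s / mem_dn_60s: CPU P70 and desired buffer pool over the last 60s.
 * All expressed as ACU targets. Scale up aggressively, scale down cautiously. */
double next_acu_target(double current_acus,
                       double cpu_up_30s, double mem_up_30s,
                       double cpu_dn_60s, double mem_dn_60s) {
    double up = cpu_up_30s > mem_up_30s ? cpu_up_30s : mem_up_30s;   /* max */
    double dn = cpu_dn_60s < mem_dn_60s ? cpu_dn_60s : mem_dn_60s;   /* min */
    if (up > current_acus) return up;   /* react quickly to rising demand     */
    if (dn < current_acus) return dn;   /* shed capacity only when both agree */
    return current_acus;
}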

Microsoft Socrates ★

Panagiotis Antonopoulos, Alex Budovski, Cristian Diaconu, Alejandro Hernandez Saenz, Jack Hu, Hanuma Kodavalla, Donald Kossmann, Sandeep Lingam, Umar Farooq Minhas, Naveen Prakash, Vijendra Purohit, Hugh Qu, Chaitanya Sreenivas Ravella, Krystyna Reisteter, Sheetal Shrotri, Dixin Tang, and Vikram Wakade. 2019. Socrates: The New SQL Server in the Cloud. In Proceedings of the 2019 International Conference on Management of Data (SIGMOD/PODS '19), ACM. [scholar]

The paper spends some time talking about the previous DR architecture, its relevant behavior and features, and its shared-nothing design. There’s also a decent amount of discussion about adapting a pre-existing RDBMS to the new architecture. It’s overall a very realistic discussion of making major architectural changes to a large, pre-existing product, but I’m not going to focus on either, as this is only a disaggregated OLTP overview.

The architecture of Socrates is well illustrated in the paper:

Socrates Architecture
SocratesArchitecture
Socrates XLOG Service
SocratesXLOG

Their major design decisions are:

The primary has a recoverable buffer pool to minimize the impact of failures, by modeling the buffer pool as a table in an in-memory storage engine. A buffer pool on SSD might seem silly, but otherwise a cold start means dumping gigabytes’ worth of page fetches at Page Servers, with terrible performance until the working set is back in cache. Concretely, the extended buffer pool is implemented as an in-memory table in Hekaton.

There is a separate XLOG service which is responsible for the WAL. The primary sends the log to the landing zone (LZ) and to XLOG in parallel. XLOG buffers received WAL segments until the primary informs it the segments are durable in the LZ, at which point they’re forwarded on to the page servers. It also has a local cache, and moves log segments to blob storage over time.

Page servers don’t store all pages. They have a large (and persistent) cache, but some pages live only on XStore. They’re working on offloading bulk loading, index creation, DB reorgs, deep page repair, and table scans to Page Servers as well.

The GetPage@LSN RPC serves the page at a version that’s at least the specified LSN. Page servers thus aren’t required to materialize pages at arbitrary versions, and can keep only the most recent. B-tree traversals from replicas sometimes need to restart if a leaf page has a newer LSN than its parent.
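A sketch of what that contract lets a page server do; the names and the 8KB page size (SQL Server’s) are mine:

#include <stdint.h>

typedef struct {
    uint64_t applied_lsn;          /* redo applied up to this LSN */
    unsigned char data[8192];      /* 8KB SQL Server page         */
} Page;

/* Assumed helper that replays buffered log records for this page. */
void apply_redo_until(Page *p, uint64_t lsn);

/* GetPage@LSN may return any version >= requested_lsn, so the page server
 * only ever rolls a page forward and never reconstructs old versions. */
const Page *get_page_at_lsn(Page *p, uint64_t requested_lsn) {
    if (p->applied_lsn < requested_lsn)
        apply_redo_until(p, requested_lsn);
    return p;
}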

What’s the major difference between Socrates and Aurora? Aurora partitions the WAL across page servers. Socrates has a centralized WAL service.

Discussion

For a 2019 paper, Socrates feels like a very modern object-storage-based database, in the WarpStream or turbopuffer kind of way. This architecture is also the closest to Neon’s.

The extended buffer pool / "Resilient Cache" on the primary sounds like a really complicated mmap() implementation.

Would VM migration keep the cache?

Probably not? This raised an interesting point that trying to binpack SQL Server instances across a fleet of instances seems difficult, especially with them all being tied to a persistent cache. Azure SQL Database is sold in vCPU and DTU models, which seem to be more reservation based, so maybe there isn’t an overly high degree of churn?

Are the caches actually local SSD or are they Azure Managed Disks?

Consensus was that it seemed pretty strongly implied that they were actually SSD.

Alibaba

As broad context, Alibaba is really about spending money on fancy hardware. I had talked about this a bit in Modern Database Hardware, but Alibaba’s papers quickly illustrate that they’re more than happy to solve difficult software problems by spending significant stacks of money on very modern hardware. Notably, Alibaba has RDMA deployed internally, seemingly to the same extent that Microsoft does, except Microsoft seems to keep a fallback-to-TCP option for most of their stack, while Alibaba seems comfortable building services that critically depend on RDMA’s primitives.

PolarFS

Wei Cao, Zhenjun Liu, Peng Wang, Sen Chen, Caifeng Zhu, Song Zheng, Yuhui Wang, and Guoqing Ma. 2018. PolarFS: an ultra-low latency and failure resilient distributed file system for shared storage cloud database. Proceedings of the VLDB Endowment 11, 12 (August 2018), 1849–1862. [scholar]

Alibaba took an unusual first step in building a disaggregated OLTP database. Instead of spending their effort building a separate pageserver and modifying the database to request pages from it and offload recovery to it, they invested effort into just building a sufficiently fast distributed filesystem. A year after the paper was published, Alibaba open-sourced PolarFS as ApsaraDB/PolarDB-FileSystem on GitHub (and PolarDB as ApsaraDB/PolarDB-for-PostgreSQL, with the PolarFS usage included), and so I’ve sprinkled links to it in the summary.

In terms of architectural components: libpfs is the client library that exposes a POSIX-like filesystem API, PolarSwitch is a process run on the same host which redirects I/O requests from applications to ChunkServers, ChunkServers are deployed on storage nodes to serve I/O requests, and PolarCtrl is the control plane. PolarCtrl’s metadata about the system is stored in a MySQL instance. The only necessary modifications to PolarDB were to port the filesystem calls to libpfs.

PolarFS Architecture
PolarFSArchitecture

The libpfs API is given as:

int     pfs_mount(const char *volname, int host_id)
int     pfs_umount(const char *volname)
int     pfs_mount_growfs(const char *volname)

int     pfs_creat(const char *volpath, mode_t mode)
int     pfs_open(const char *volpath, int flags, mode_t mode)
int     pfs_close(int fd)
ssize_t pfs_read(int fd, void *buf, size_t len)
ssize_t pfs_write(int fd, const void *buf, size_t len)
off_t   pfs_lseek(int fd, off_t offset, int whence)
ssize_t pfs_pread(int fd, void *buf, size_t len, off_t offset)
ssize_t pfs_pwrite(int fd, const void *buf, size_t len, off_t offset)
int     pfs_stat(const char *volpath, struct stat *buf)
int     pfs_fstat(int fd, struct stat *buf)
int     pfs_posix_fallocate(int fd, off_t offset, off_t len)
int     pfs_unlink(const char *volpath)
int     pfs_rename(const char *oldvolpath, const char *newvolpath)
int     pfs_truncate(const char *volpath, off_t len)
int     pfs_ftruncate(int fd, off_t len)
int     pfs_access(const char *volpath, int amode)

int     pfs_mkdir(const char *volpath, mode_t mode)
DIR*    pfs_opendir(const char *volpath)
struct dirent *pfs_readdir(DIR *dir)
int     pfs_readdir_r(DIR *dir, struct dirent *entry,
                      struct dirent **result)
int     pfs_closedir(DIR *dir)
int     pfs_rmdir(const char *volpath)
int     pfs_chdir(const char *volpath)
int     pfs_getcwd(char *buf)

This API has a few interesting subtleties, and you can see it in the OSS repo in pfsd_sdk.h. The VFS layer implemented for Postgres is in polar_fd.h, which is a slight superset of the API given in pfsd_sdk.h. I’m assuming the lack of a pfs_fsync() means all pfs_pwrite()s are immediately durable, and though pfsd_fsync() exists in pfsd_sdk.h, it has a comment of /* mock */ over it. Postgres is a known user of sync_file_range(), which I’m assuming is equally no-op’d. Volumes are mounted, and are dynamically growable or shrinkable, but most filesystems generally don’t cope well with being dynamically resized. There is both direct IO and buffered IO support, even though the API doesn’t indicate it.

The given API describes PolarFS’s file system layer which maps directories and files down onto blocks within the mounted volume. The contents of a directory or the blocks associated with a file are written as blocks, with a root block holding the root directory’s metadata. To transactionally update a set of blocks (so that read replicas see a consistent filesystem), there is a journal file which serves as a WAL for file system updates, and libpfs implements disk paxos to coordinate between replicas who is allowed to write into the journal.

The storage layer provides interfaces to manage and access volumes for the file system layer. A volume is divided into 10GB chunks, which are distributed across ChunkServers. The large chunk size was chosen to minimize metadata overhead so that it’s practical to maintain the entire chunk-to-server mapping in memory in PolarCtrl. Each ChunkServer manages ~10TB of chunks, so this still offers a reasonable ratio for practical load balancing across ChunkServers. Within a ChunkServer, each chunk is divided into 64KB blocks which are allocated and mapped on demand. Each chunk thus needs 640KB of metadata to map chunk LBAs to block locations, or 640MB for all 1,000 chunks per server.
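The arithmetic behind those numbers, assuming roughly 4-byte mapping entries (which is what the 640KB figure implies):

10 GB chunk / 64 KB blocks   = 163,840 blocks per chunk
163,840 entries × ~4 B each  ≈ 640 KB of mapping metadata per chunk
10 TB per server / 10 GB     = 1,000 chunks, so 1,000 × 640 KB ≈ 640 MB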

PolarFS Write Path
PolarFSWritePath

PolarSwitch is a daemon that runs alongside any application using libpfs. Libpfs forwards IO requests over a shared memory ring buffer to PolarSwitch, and PolarSwitch then divides the IO requests into per-chunk requests, consults its in-memory mapping of chunk-to-server, and sends out the requests. Completions are reported via another shared ring buffer (similar to io_uring). The reasoning for maintaining this as a separate daemon isn’t given, but I’m assuming it was forced by using RDMA as the network transport, which means that either only one process can use the NIC, or, in the case of vNICs, only a fixed number of processes that’s smaller than the number of instances they wish to run per host.

ChunkServers run on the disaggregated storage servers, with one ChunkServer per SSD on a dedicated CPU core. (Which implies they have SSDs that are at least 10TB in size?) Each chunk contains a WAL which is kept on a 3D XPoint SSD (aka Intel Optane). Replication across ChunkServers is done using ParallelRaft, a Raft variant optimized to permit out-of-order completions. SPDK is used to maximize IOPS per core, and is why each ChunkServer gets a dedicated core so that it may poll indefinitely. Likely due to the large chunk and total data size, ChunkServers are given a reasonably high tolerance for being offline.

PolarCtrl is the control plane deployed on a dedicated set of machines. It manages membership and liveness for ChunkServers, maintaining volume and chunk-to-server mappings, assigning of chunks to ChunkServers, and distributing metadata to PolarSwitch instances.

Raft serializes all operations into a log, and commits them in order only. This causes write requests serialized later in the log to wait for all previous writes to be committed before their own response can be sent out, and it caused throughput to drop by half as write concurrency was raised from 8 to 32. As a result, Raft was altered to allow out-of-order acknowledgements from replicas and out-of-order commit responses back to clients, and to permit holes in the Raft log. They detail the effect that this had on leader election and replica catchup. This novel variant effectively transforms Raft into generalized multi-paxos, and no explanation was given as to why they didn’t just implement that directly rather than adapting Raft into it.
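The reason out-of-order completion is safe for a block store is that two writes only need to be ordered if they touch the same blocks. A hedged sketch of that conflict check (the paper tracks recently written LBAs in a "look behind buffer"; the structures here are illustrative):

#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

typedef struct {
    uint64_t start_lba;
    uint64_t n_blocks;
} Extent;

static bool overlaps(Extent a, Extent b) {
    return a.start_lba < b.start_lba + b.n_blocks &&
           b.start_lba < a.start_lba + a.n_blocks;
}

/* missing[] describes block ranges touched by log entries not yet received.
 * An entry may be applied and acknowledged out of order only if its result
 * cannot depend on any of the missing writes. */
bool can_apply_out_of_order(Extent entry, const Extent *missing, size_t n) {
    for (size_t i = 0; i < n; i++)
        if (overlaps(entry, missing[i]))
            return false;
    return true;
}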

Disk snapshots are supported by PolarFS by PolarSwitch tagging requests with a snapshot tag on subsequent requests to ChunkServers. On receiving a new snapshot tag, ChunkServers will snapshot by copying their LBA-to-block-location mapping, and will modify those blocks in a copy-on-write fashion afterwards. After a ChunkServer reports having taken the snapshot, PolarSwitch stops adding the snapshot tag to requests to that ChunkServer.

The evaluation section shows that PolarFS adds minimal overhead as compared to a local ext4 volume, with latency ~10x lower than Ceph and 2x higher throughput. Just to review, it achieved those results by packing extra-large SSDs (>10TB), Intel Optane, RDMA, and large amounts of RAM, each of which is individually expensive, all into one deployment cluster, and special-casing an infrastructure stack for it. Not cheap, nor (given everything I’ve heard about using SPDK and RDMA) easy to write, deploy, or maintain.

PolarDB Computational Storage

Wei Cao, Yang Liu, Zhushi Cheng, Ning Zheng, Wei Li, Wenjie Wu, Linqiang Ouyang, Peng Wang, Yijing Wang, Ray Kuan, Zhenjun Liu, Feng Zhu, and Tong Zhang. 2020. POLARDB Meets Computational Storage: Efficiently Support Analytical Workloads in Cloud-Native Relational Database. In 18th USENIX Conference on File and Storage Technologies (FAST 20), USENIX Association, Santa Clara, CA, 29–41. [scholar]

This paper is more focused on the computational storage side of integrating SmartSSDs (in the form of ScaleFlux’s product) into a database, and the database they happen to have chosen for this work is a disaggregated one. However, I’ve included it in this listing because it’s the only paper that gets into the topic of tight integration between page servers and compute for pushdown in detail. I’ll be doing a disservice to the actual paper in this summary, and focusing only on the pushdown aspect.

The draw of pushdown in a disaggregated architecture is to minimize the amount of processing done on non-matching data. Pushing table scan filters from compute nodes to storage nodes reduces the number of rows or pages that the storage nodes must send over the network. With computational storage, those filters can be pushed all the way to the SSD itself, removing the need to even send non-matching rows over the PCIe bus. However, it is moving compute work from the compute node to storage, and compute resources are much more limited in storage. Rather than scale up the compute resources of the storage nodes, Alibaba elected to increase the compute of the storage devices themselves by utilizing SSDs with on-board FPGAs.

PolarDB Scan Pushdown Architecture
PolarDBComputationalStorage

The required changes in PolarDB start at the scan operator. PolarDB reads data from files by requesting blocks by their offset within the file. That request has been enhanced to include the schema of the table and the predicate to apply to the block. The ChunkServers split the predicates into those that can be pushed to the FPGA, and those that need to be evaluated on the CPU. In the PolarFS paper, ChunkServers are described as having a one-to-one relationship with an attached 10TB SSD and tracking 64KB-sized blocks. In this paper, ChunkServers stripe data across a number of SmartSSDs with 4MB stripes, and 4KB blocks are snappy-compressed and thus variable length. ChunkServers split the request into one per stripe, and forward them to the corresponding SmartSSDs.

The computational storage device has a corresponding driver in Linux which exposes it as a block device.[1] The ChunkServer sends the driver the scan request. The driver reorders filters to match the hardware’s pipelined table record decoding and translates logical blocks to physical blocks on the NAND flash memory. The driver also splits larger scans into smaller ones to avoid head-of-line blocking causing high latency for concurrent requests. [1]: See NVMe Computational Storage Standardization if you’d like more of a view into how SmartSSD<->Host integration works.

PolarDB was modified to be more accommodating of efficient, simple evaluation of predicates. The encoding format for keys and values was changed to always be memcmp()-orderable, so that the FPGA wouldn’t need to understand different value encoding formats and the comparisons for them. Blocks were also changed from having a footer with metadata to a header with metadata, so that decoding of a block can happen as it’s being read.
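A sketch of the general memcmp()-orderable encoding trick for a signed integer key, to show what such a format looks like (this is the standard technique; PolarDB’s exact format isn’t spelled out in the paper):

#include <stdint.h>

/* Flip the sign bit and store big-endian: byte-wise comparison of the encoded
 * form then matches numeric order, so the FPGA only ever needs memcmp(). */
void encode_i64_memcmp(int64_t v, uint8_t out[8]) {
    uint64_t u = (uint64_t)v ^ (1ULL << 63);   /* negatives sort first */
    for (int i = 0; i < 8; i++)
        out[i] = (uint8_t)(u >> (56 - 8 * i)); /* most significant byte first */
}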

Their evaluation compares no pushdown, CPU-only pushdown, and computational storage (CSD) pushdown on TPC-H. Query latency for uncompressed CPU-based pushdown and CSD-based pushdown shows very similar 2-3x improvements, which is unsurprising, as it reflects that the majority of the gain is from freeing the one compute instance from receiving data, evaluating the filter, and then throwing it away. With compressed data, the CSD-based pushdown is noticeably better, as decompression isn’t free, but can be done efficiently in hardware. The PCIe and network traffic graphs per query show that each layer of pushdown removes another 2-3x of network traffic (CPU-based pushdown) or PCIe traffic (CSD-based pushdown).

PolarDB Serverless ★

Wei Cao, Yingqiang Zhang, Xinjun Yang, Feifei Li, Sheng Wang, Qingda Hu, Xuntao Cheng, Zongzhi Chen, Zhenjun Liu, Jing Fang, Bo Wang, Yuhui Wang, Haiqing Sun, Ze Yang, Zhushi Cheng, Sen Chen, Jian Wu, Wei Hu, Jianwei Zhao, Yusong Gao, Songlu Cai, Yunyang Zhang, and Jiawang Tong. 2021. PolarDB Serverless: A Cloud Native Database for Disaggregated Data Centers. In Proceedings of the 2021 International Conference on Management of Data (SIGMOD/PODS '21), ACM, 2477–2489. [scholar]

The PolarDB Serverless paper is about leveraging a multi-tenant scale-out memory pool, built via RDMA. This makes them also a disaggregated memory database! As a direct consequence, memory and CPU can be scaled independently, and the evaluation shows elastically changing the amount of memory allocated to a PolarDB tenant.

However, implementing a page cache over RDMA isn’t trivial, and a solid portion of the paper is spent talking about the exact details of managing latches on remote memory pages and navigating B-tree traversals. Specifically, B-tree operations which change the structure of the tree required significant care. Recovery also has to deal with the fact that the remote buffer cache holds all the partial execution state from the failed RW node, so the new RW node has to release latches in the shared memory pool and throw away pages which were partially modified. I’ll be eliding all the RDMA-specific details, and just covering the parts that would equally apply to a slower, TCP-based memory disaggregation architecture as well. There’s also a lot packed into this paper, as it covers PolarDB and PolarFS enhancements as well, so be warned.

They offer an architecture diagram for PolarDB as a whole:

PolarDB Architecture
PolarDBArchitecture

However, there’s a few things I think it doesn’t represent well:

PolarDB Serverless extends this to add a remote memory pool, which allows read-only and read-write nodes to share the same buffer pool. Remote memory access is performed via librmem, which exposes the API:

int page_register(PageID page_id,
                  const Address local_addr,
                  Address& remote_addr,
                  Address& pl_addr,
                  bool& exists);
int page_unregister(PageID page_id);
int page_read(const Address local_addr,
              const Address remote_addr);
int page_write(const Address local_addr,
               const Address remote_addr);
int page_invalidate(PageID page_id);

The minimum unit of allocation is a 1GB physically contiguous slab, which is divided into 16KB pages (because PolarDB is MySQL, and MySQL uses 16KB pages). A slab node holds multiple slabs, and database instances allocate slabs across multiple slab nodes to meet their predefined buffer pool capacity when they’re first started. The first allocated slab is nominated as the home node, and is assigned the responsibility of hosting the buffer cache metadata for the database instance. The Page Address Table (PAT) tracks the slab node and physical address of each page. The Page Invalidation Bitmap (PIB) is updated when a RW node has a local modification to a page which hasn’t been written back yet (and is used by RO nodes to know when they’re stale). The Page Reference Directory (PRD) tracks what instances currently hold references to each page described in the PAT. The Page Latch Table (PLT) manages a page latch for each entry in the PAT.

PolarDB Serverless Remote Buffer Pool
PolarDBServerlessRemoteMemory
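A sketch of what a home-node directory entry might hold, just to make the four structures concrete; the layout is my own, not the paper’s:

#include <stdbool.h>
#include <stdint.h>

#define MAX_DB_NODES 64   /* database nodes that may reference a page */

typedef struct {
    uint64_t page_id;
    /* PAT: where the page physically lives in the remote pool */
    uint16_t slab_node;
    uint64_t physical_addr;
    /* PIB: the RW node holds a newer, not-yet-written-back local copy */
    bool     invalidated;
    /* PRD: which database nodes currently hold a reference */
    uint64_t referencing_nodes_bitmap[MAX_DB_NODES / 64];
    /* PLT: the page latch protecting concurrent access */
    uint32_t latch;
} PageDirectoryEntry;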

page_register is a request to the home node to either increment the refcount for the page and return its address, or allocate a new page (evicting an old one if necessary to make space) and return that. (This isn’t reading the page from storage, as there’s no direct Slab Node<->PolarFS communication, just allocating space in the remote buffer pool.) page_unregister decrements the reference count, allowing the page to be freed if needed. Dirty pages can always be immediately evicted, as PolarDB can materialize pages on demand from the ChunkServers. If the buffer pool size is expanded, the home node expands its PAT/PIB/PRD metadata accordingly, and allocates slabs eagerly. If the buffer pool size is shrunk, then extra memory is released by freeing pages, the existing pages are defragmented, and then the now-unused slabs are released. Note that the defragmentation and physically contiguous memory are only needed to permit one-sided RDMA reads/writes, and a non-RDMA implementation could likely be simpler and non-contiguous.

Each instance has a local page cache in RAM, because there’s no L1/L2/L3 cache for remote memory. This local cache is tunable and defaults to \(min(sizeof(RemoteMemory)/8, 128GB)\), which was set by observing the effects on TPC-C and TPC-H benchmarks. Not all pages read from PolarFS are pushed into remote memory: pages read by full table scans are only read into the local page cache, and then are discarded. Modifications to pages are still performed only in the local cache. If the page exists in the remote buffer pool, it must first be marked as invalidated before it can be modified, and before it can be dropped from the local cache it must be written back to the remote buffer pool (the flow of which is shown in the diagram above). Insertions and deletions optimistically traverse the tree without locks, assuming they won’t need to split/merge any pages, and restart into a pessimistic locking traversal if that turns out to be necessary. (Interestingly, this is in contrast to Socrates, which just has RO nodes restart their B-tree traversals whenever they encounter child pages of an older version than the parent page.)
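A sketch of the ordering constraints that places on the RW node, using simplified wrappers around the librmem calls above (signatures abbreviated; the wrapper names are mine):

#include <stdint.h>

/* Assumed thin wrappers over page_invalidate() and page_write(). */
void remote_invalidate(uint64_t page_id);
void remote_write_back(uint64_t page_id, const void *local_copy);

/* A page must be marked invalid in the remote pool before the RW node
 * modifies its local copy, so RO nodes never trust a stale remote version. */
void modify_locally(uint64_t page_id, void *local_copy) {
    remote_invalidate(page_id);
    /* ... apply the change to local_copy only ... */
}

/* The remote pool must be brought up to date before the local copy is
 * dropped, so the latest version is never lost from the shared buffer pool. */
void evict_from_local_cache(uint64_t page_id, const void *local_copy) {
    remote_write_back(page_id, local_copy);
    /* ... now the local copy may be freed ... */
}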

There were a few improvements made to PolarDB which are presented as seemingly unrelated to the disaggregated memory architecture, but which I believe are a direct consequence of it. The snapshot isolation implementation was changed to utilize a centralized timestamp service, which is queried for both the read timestamp and the commit timestamp. All rows have a commit timestamp suffixed to make MVCC visibility filtering easy, and a Commit Timestamp Log was added which records the commit timestamp of each transaction, to allow resolving the commit timestamps of recently committed data. The need for a remote timestamp service and a per-row commit timestamp is so that promoting a Read-Only replica to the Read-Write leader doesn’t require scanning all the data. There’s no need to recover the next valid commit timestamp, as it’s held in a remote service. There’s no need to rebuild metadata about which concurrent transactions shouldn’t see each other’s effects, as MVCC visibility rules are a strict timestamp filter and rows without commit timestamps can be incrementally resolved. (This also results in an MVCC and transaction protocol which looks a lot like TiDB’s.) Similarly, PolarDB Serverless finally justified adding the GetPage@LSN request to PolarFS that every other disaggregated OLTP system already had (see, for example, the Socrates overview).
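A sketch of what that visibility check looks like with commit timestamps stored on rows and a Commit Timestamp Log for not-yet-stamped rows; all names here are illustrative:

#include <stdbool.h>
#include <stdint.h>

#define TS_UNSTAMPED UINT64_MAX   /* commit timestamp not yet written to the row */

/* Assumed lookup into the Commit Timestamp Log; returns TS_UNSTAMPED if the
 * writing transaction hasn't committed yet. */
uint64_t cts_log_lookup(uint64_t txn_id);

bool row_visible(uint64_t row_commit_ts, uint64_t row_txn_id, uint64_t read_ts) {
    if (row_commit_ts == TS_UNSTAMPED)
        row_commit_ts = cts_log_lookup(row_txn_id);   /* resolve lazily */
    return row_commit_ts != TS_UNSTAMPED && row_commit_ts <= read_ts;
}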

There’s a couple optimizations to transaction and query processing that they specifically call out. Read-only nodes don’t acquire latches in the buffer pool unless the RW node says it modified the B-tree structure since the Read-only node’s last access. They also implement a specific optimization for indexes: a prefetching index probe operation. Fetching keys from the index will generate prefetches to load the pointed-to data pages from the page servers, under the assumption that they’ll be immediately requested as part of SQL execution anyway.

In the event of the loss of the RW node, the Cluster Manager will promote an RO node to be the new RW node. This involves collecting the \(min(max LSN per chunk)\) and requesting redo logs to be processed to bring all chunks to a consistent version. All invalidated pages in the remote memory pool are evicted (using the Page Invalidation Bitmap, so it’s not a full scan of GBs of data), along with any pages whose version is newer than the redo’d recovery version. All locks held by the failed RW node are released. All active transactions are recovered from the headers of the undo log. The new RW node then notifies the Cluster Manager that its recovery is complete, and rolls back the active transactions in the background. If a RW node voluntarily gives up its status as the writer to another node, it can flush all modified pages and drop all locks to save the RO node the work of applying redo logs and evicting pages from the buffer pool. In the drastic event where all replicas of the home slab are lost, all slabs are cleared, and all database nodes are restarted so that recovery restores a consistent state.

The evaluation shows the impact of all the above optimizations on recovery time. With no optimizations, unavailability lasted ~85s, and recovery back to the original performance takes 105s. With page materialization on PolarFS, it’s reduced to an unavailability of ~15s and full performance after 35s. With the remote memory buffer pool, it’s an unavailability of ~15s, and full performance after 23s. A voluntary handoff by the RW node leads to 2s of unavailability and full performance after 6s. Otherwise, the graphs show about what one would expect: memory can be scaled elastically, and performance improves/degrades with more/less memory, respectively.

Discussion

They still undersold the RDMA difficulty. Someone who has worked with it previously commented that there’s all sorts of issues about racing reads and writes, and getting group membership and shard movement right is doubly hard. In both cases, an uninformed client can still do one-sided RDMA reads from a server they think is still a part of a replication group and/or has the shard it wants.

PolarDB-X

Wei Cao, Feifei Li, Gui Huang, Jianghang Lou, Jianwei Zhao, Dengcheng He, Mengshi Sun, Yingqiang Zhang, Sheng Wang, Xueqiang Wu, Han Liao, Zilin Chen, Xiaojian Fang, Mo Chen, Chenghui Liang, Yanxin Luo, Huanming Wang, Songlei Wang, Zhanfeng Ma, Xinjun Yang, Xiang Peng, Yubin Ruan, Yuhui Wang, Jie Zhou, Jianying Wang, Qingda Hu, and Junbin Kang. 2022. PolarDB-X: An Elastic Distributed Relational Database for Cloud-Native Applications. In 2022 IEEE 38th International Conference on Data Engineering (ICDE), IEEE, 2859–2872. [scholar]

PolarDB-X is targeting three problems: cross-DC transactions, to extend PolarDB to more than one region; elasticity, by automatically adding read-only replicas and partitioning write responsibilities; and HTAP, by identifying and steering analytical and transactional queries to separate replicas. At a high level, PolarDB-X is the Vitess or Citus of PolarDB. Individual PolarDB instances become partitions in the broader PolarDB-X distributed, shared-nothing database. It is also open source, and available at polardb/polardbx on GitHub. It seems in a very similar vein to the mostly unpublished Amazon Aurora Limitless.

PolarDB-X Architecture
PolarDBXArchitecture

Above PolarDB, PolarDB-X adds a Load Balancer and set of Computation Nodes per PolarDB instance (DN & SN), with one Global Meta Service (GMS) for system metadata. The GMS is the control plane for PolarDB-X, and manages cluster membership, catalog tables, table/index partitioning rules, locations of shards, statistics, and MySQL system tables. The Load Balancer is the user’s entry point to PolarDB-X, which is exposed as a single geo-aware virtual IP address. The Computation Node coordinates read and write queries across the shards of tables stored in different PolarDB instances. For read queries, it decides if the local snapshot is fresh enough to avoid needing to go to a cross-AZ leader. For write queries, it manages the cross-shard transaction, if needed. It includes a cost-based optimizer and query executor, which it uses to break queries into per-shard queries, and apply any cross-shard evaluation needed to produce the final result. For an overview of the Database Node (PolarDB) or Storage Node (PolarFS), see their respective paper overviews above.

PolarDB-X hashes the primary key to assign rows to shards, by default. Not detailed in the paper, but the PolarDB-X Partitioned table docs describe that the supported partitioning strategies are: SINGLE, for unsharded tables; BROADCAST, for replicating the table on each shard; and PARTITION BY HASH, RANGE, LIST (manually assigned partitioning), or COHASH (HASH but multiple columns have the same value). Indexes can be defined as either global or local, where local indexes always index the data within the same shard. Tables with identical partition keys can be declared as a table group, and identical values will always result in the rows being stored on the same shard, thus predictably accelerating equi-joins.

The cross-DC replication is done by having PolarDB ship redo logs across datacenters. The replication is done through, and in conjunction with, a Paxos implementation managing leadership and advancing the Durable LSN as followers reply. Transactions are divided into mini-transactions, and shipped incrementally in batches of redo logs (intermixed with other transactions). When the last mini-transaction of a user’s transaction is marked durable, the transaction has been committed.

To implement cross-shard transactions, PolarDB-X layers another MVCC and transaction protocol on top. They use a Hybrid Logical Clock to implement Snapshot Isolation. HLCs were chosen to not rely on tight physical clock synchronization, and to avoid the centralized clock server of a TiDB/Percolator-like approach. (Note that this does mean they technically sacrifice linearizability.) They include a few optimizations to reduce the number of times they bump the causality counter in HLCs, but otherwise, it’s a standard HLC and 2PC implementation. The public documentation instead describes the use of a Timestamp Oracle, and describes the GMS as serving that functionality to the Compute Nodes.
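For reference, a minimal sketch of a standard Hybrid Logical Clock (the general construction, not PolarDB-X’s exact code): timestamps pair a physical component with a logical counter, so ordering respects causality without tight clock synchronization.

#include <stdint.h>

typedef struct { uint64_t pt; uint32_t lc; } Hlc;

static Hlc clk;

uint64_t physical_now(void);   /* assumed wall-clock source, e.g. microseconds */

Hlc hlc_local_or_send(void) {
    uint64_t pt = physical_now();
    if (pt > clk.pt) { clk.pt = pt; clk.lc = 0; }
    else             { clk.lc++; }
    return clk;
}

Hlc hlc_receive(Hlc remote) {
    uint64_t pt = physical_now();
    if (pt > clk.pt && pt > remote.pt) {
        clk.pt = pt; clk.lc = 0;
    } else if (remote.pt > clk.pt) {
        clk.pt = remote.pt; clk.lc = remote.lc + 1;
    } else if (clk.pt > remote.pt) {
        clk.lc++;
    } else {   /* equal physical components: take the max logical counter + 1 */
        clk.lc = (clk.lc > remote.lc ? clk.lc : remote.lc) + 1;
    }
    return clk;
}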

PolarDB-X OSS Architecture
PolarDBXHTAP

PolarDB-MT is an extension of PolarDB to natively understand multi-tenancy. A tenant is a set of schemas, databases, and tables. Cross-tenant operations are not permitted. A single PolarDB instance supports multiple tenants, and all operations are sent through the assigned RW node’s redo log. The tenant-to-RW-database-node mapping is stored in the GMS, and the RW node maintains a lease for the tenants it holds. Tenants can be transferred by suspending and transferring all active work and flushing dirty pages, then transferring the lease. In the case of a failure, tenants can be split across other RW PolarDB instances, which divide the failed instance’s redo log by tenant and run recovery accordingly. What’s the difference between a shard and a tenant? The paper doesn’t answer at all. The public documentation on tenants describes it as a user-facing feature which is a performance-isolated container for users and databases. It also seems likely that, much like Nile, tenants are used internally to binpack customers onto machines more efficiently.

PolarDB-X also powers an HTAP solution, where row-wise RW database nodes also asynchronously replicate into columnar Read-Only database nodes. (Which is a very TiDB/TiFlash take on HTAP.) A cost-based optimizer in the CN identifies OLAP queries, and dispatches them to the columnar database nodes. Portions of the analytical query are pushed down to the Storage Nodes (aka PolarFS), as an extension of the work described in PolarDB Computational Storage above. The Compute Node is nominated as the Query Coordinator, which breaks the query into fragments that can be distributed and executed on other Compute Nodes for parallel processing. Query execution is timesliced into 500ms jobs so that many queries may make progress concurrently. The threadpool for analytical processing work is placed under a cgroup to limit its resource usage, whereas transactional processing is unconstrained. The details on the analytical engine itself are published in the next paper: PolarDB-IMCI.

The evaluation section doesn’t hold any major surprises. They saw 19% higher sysbench throughput using HLCs rather than a timestamp oracle. Scaling operations complete within 4-5 seconds, without major disruptions. Having columnar data available improved the execution time of queries which benefit heavily from columnstores.

Huawei

TaurusDB ★

Alex Depoutovitch, Chong Chen, Jin Chen, Paul Larson, Shu Lin, Jack Ng, Wenlin Cui, Qiang Liu, Wei Huang, Yong Xiao, and Yongjun He. 2020. Taurus Database: How to be Fast, Available, and Frugal in the Cloud. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (SIGMOD/PODS '20), ACM, 1463–1478. [scholar]

The entire "Background and Related Work" section is a great read. They set up excellent and concise comparisons against the same systems we’ve discussed above. In very short summary: PolarFS (not PolarDB Serverless) uses a filesystem abstraction without smart storage and thus loses efficiency, Aurora uses 6-node quorums for both logs and pages which over-promises on durability and availability respective, and Socrates added too much complexity with its four teir Compute/XLOG/Page Server/XSTORE architecture.

Taurus Architecture
TaurusArchitecture

In Taurus’s Log Store, WAL segments are sent to a fixed-size, append-only, synchronously replicated storage object called a PLog (Part of a Log?). In a deployment, there are hundreds of Log Servers. Three are chosen to form a PLog. All three must ack a write, otherwise a new PLog is allocated. (It’s reconfiguration-based replication!) The database WAL is an ordered collection of PLogs, itself stored in a PLog. Metadata PLogs are chained as a linked list.
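A sketch of that write path as I understand it; the function names are placeholders, not Taurus’s:

#include <stdbool.h>
#include <stddef.h>

typedef struct { int log_servers[3]; } PLog;

/* Assumed helpers: send a record to one Log Server, allocate a fresh PLog on
 * three healthy Log Servers, and append the switch to the metadata PLog. */
bool send_to_log_server(int server, const void *buf, size_t len);
PLog allocate_new_plog(void);
void record_plog_switch(const PLog *next);

void append_wal(PLog *current, const void *buf, size_t len) {
    for (;;) {
        bool all_acked = true;
        for (int i = 0; i < 3; i++)
            all_acked &= send_to_log_server(current->log_servers[i], buf, len);
        if (all_acked)
            return;                        /* durable on all three replicas  */
        *current = allocate_new_plog();    /* reconfigure instead of waiting */
        record_plog_switch(current);
    }
}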

The Page Stores behave roughly the same: they accept logs and serve versioned pages. Page Stores are notified of the oldest LSN which might still be requested, and must be able to answer what the highest LSN they can serve is.

Taurus abstracts most of the logic of dealing with Log Stores and Page Stores into a Storage Abstraction Layer, which manages the mapping of WAL segments to PLogs and slices to Page Stores. The paper describes the read and write flow in detail, but it didn’t feel notably different from any of the previously discussed systems.

Taurus Write Path
TaurusWritePath

For anyone who is against reconfiguration-based replication because of the "unavailability" while reconfiguring to a new set of available replicas, you’ll hate the "comparison with quorum replication". They argue that their probability of write unavailability is effectively zero as all Log Stores or Page Stores from their global pool of nodes would have to be unavailable for a new shard to be un-allocatable. This argument both is and isn’t true.

Both recovery and replication to read-only replicas are discussed in decent detail, but neither felt notably different. I do appreciate the level of detail, though, in illustrating how recovery works, as it was more pleasant to go through here than in some other papers. Replication to read-only replicas has just been about applying log records to cached pages in every system thus far. They do mention separating notifying replicas that there were WAL changes published (and where to find them) from actually serving that data from Log Servers, so that the primary isn’t responsible for the network bandwidth of broadcasting WAL changes. The Page Stores also gossip the data so that Log Servers aren’t being entirely taxed for network bandwidth either.

Page Stores are append-only on disk, with a lock-free hashtable mapping (page, version) to a slot in the log. The hashtable is periodically saved to storage to bound recovery time. Page Stores have their own buffer pool, which is mostly to avoid IO during the lookup of the previous page version when applying a WAL entry. There’s an interesting tidbit that LFU is a better cache replacement policy for second-level caches.
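A sketch of the read side of that (page, version) map: to serve a page at some read LSN, return the newest materialized version that isn’t newer than the request. The data layout here is illustrative, and the real structure is a lock-free hashtable rather than a sorted array.

#include <stdint.h>
#include <stddef.h>

typedef struct {
    uint64_t lsn;    /* version of this materialized page          */
    uint64_t slot;   /* offset of the page in the append-only log  */
} PageVersion;

/* versions[] holds the versions of one page, sorted by lsn ascending.
 * Returns the slot to read, or -1 if nothing old enough exists. */
int64_t slot_for_read(const PageVersion *versions, size_t n, uint64_t read_lsn) {
    int64_t best = -1;
    for (size_t i = 0; i < n && versions[i].lsn <= read_lsn; i++)
        best = (int64_t)versions[i].slot;
    return best;
}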

What’s the major difference between Taurus and others? Reconfiguration-based replication!

Multi-Master ★

Alex Depoutovitch, Chong Chen, Per-Ake Larson, Jack Ng, Shu Lin, Guanzhu Xiong, Paul Lee, Emad Boctor, Samiao Ren, Lengdong Wu, Yuchen Zhang, and Calvin Sun. 2023. Taurus MM: Bringing Multi-Master to the Cloud. Proceedings of the VLDB Endowment 16, 12 (August 2023), 3488–3500. [scholar]

The suggested reading of this paper is, admittedly, mostly an excuse to discuss multi-master designs within disaggregated OLTP. Aurora had multi-master implemented, which they’ve since removed. Socrates was against multi-master. PolarDB mentioned that the global page cache means they could support it, but such work was out of scope for the paper. So TaurusDB is our chance to look at this design.

Taurus Multi-Master Architecture
TaurusMMArchitecture

Multi-master means concurrent modifications, and naively that means the LSN is now a vector clock. The paper introduces a clock type that’s a hybrid between a vector clock and a scalar Lamport clock. Basically, for server 3, clock[3] is a Lamport clock and the rest of the indexes are a vector clock. This has the effect of advancing the server’s clock faster, as it’s effectively a counter of causally related global events rather than local events. At times when causality is already known, like operations serialized by contending on a lock, Taurus uses the scalar clock. Logs and pages are locally recorded with a scalar clock, and sent to the Log Service with a vector clock. Page reads are done with a scalar clock.
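A sketch of that hybrid clock as I read the description; treat it as my interpretation rather than the paper’s definition. Each master’s own entry behaves like a Lamport clock (jumping past everything it has observed), while the other entries are a plain vector clock:

#include <stdint.h>

#define N_MASTERS 8   /* arbitrary choice for the sketch */

typedef struct {
    uint64_t c[N_MASTERS];
    int self;          /* index of this master */
} HybridClock;

void hc_local_event(HybridClock *h) {
    uint64_t max = 0;
    for (int i = 0; i < N_MASTERS; i++)
        if (h->c[i] > max) max = h->c[i];
    h->c[h->self] = max + 1;   /* Lamport-style: advance past all known events */
}

void hc_merge(HybridClock *h, const uint64_t remote[N_MASTERS]) {
    for (int i = 0; i < N_MASTERS; i++)
        if (remote[i] > h->c[i]) h->c[i] = remote[i];   /* vector-clock merge */
    hc_local_event(h);         /* then bump our own scalar entry */
}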

The other side of concurrent modifications is that page locking can no longer be done locally in RAM on one primary replica. So the paper next discusses locking. Locks are held globally in a Global Lock Manager at page granularity with the usual Shared/eXclusive locking scheme. Once a master has a page lock, it can grant equal or lesser row locks. Pages can be unlocked and returned to the GLM if another master requests the page, but the rows will stay locked. (Imagine wanting exclusive locks on different rows in the same page.) The Global Lock Manager would also be responsible for deadlock detection.

Note the introduction of another component: the Global Slice Manager. Sharding pages across servers is a decision that no master is allowed to make locally, so the responsibility of sharding data was moved to a global component.

In comparison against Aurora Multi-Master, it’s noted that Aurora pushed resolving conflicts between masters to the storage layer. In the evaluation, the two designs perform similarly when there’s no data sharing, but the Taurus design performs much better as data sharing increases.

Discussion

MariaDB Xpand actually did something similar to this, but they never wrote about it, and the project was shut down by MariaDB.

Multi-master is also useful for upgrades, as it gives one a way to do a rolling upgrade to a new database binary and incrementally shift transactions over. However, having two databases live at different versions means one also has to get upgrade/downgrade testing done well.

Who needs multi-master? Aurora dropped their own multi-master support, and the rumor was that it wasn’t getting heavily used. Is there actually a desire for this? Are there enough customers topping out their disaggregated OLTP database with excessive writes that it’s worthwhile to make the investment into all the complexity that multi-master brings?


See discussion of this page on Reddit, HN, and lobsters.