Modern Hardware for Future Databases

We’re in an exciting era for databases, with advancements arriving on every major resource front (networking, storage, and compute), each of which has the potential to shape what an optimal database architecture looks like. All combined, I’m hopeful that we’ll see some interesting architectural shifts in databases over the next decade, but I’m uncertain whether the necessary hardware will be accessible.

Networking

From a recent talk by Stonebraker at HPTS 2024, some benchmarking with VoltDB saw that ~60% of their server-side cycles went to the TCP/IP stack. VoltDB is already a database architecture whose goal was to remove as much non-query-processing work from serving requests as possible, so this is an extreme case. However, it still makes a valid point: the computational overhead of TCP is not small, and it will become ever more noticeable as network bandwidth increases. This isn’t a new observation though, and there’s an escalating series of proposed solutions.

One proposed solution is to replace TCP with another protocol that runs over UDP instead, with QUIC as the frequently chosen example. However, this is misguided[1]: "It is a grossly inaccurate simplification, but at its simplest level, QUIC is simply TCP encapsulated and encrypted in a User Datagram Protocol (UDP) payload." The CPU overhead of TCP and QUIC is also remarkably similar. Materializing notable improvements would require diverging further from TCP and specializing for specific environments, and there are papers like Homa showing some improvements in datacenter environments. But even with a better protocol, the larger optimization potential lies in reducing the overhead of the kernel networking stack. [1]: If you’re reading this wondering why QUIC is catching a stray here, it’s because I’ve gotten into several debates over time where TCP or TLS gets blamed for some issue and moving to QUIC comes up as the suggested solution. There are some problems QUIC can help with, and there are problems it doesn’t improve or possibly makes worse. It’s good to understand that steady-state latency and bandwidth are in the latter category.

One way to reduce the amount of work the kernel has to do is to move the computationally intensive but simple parts to the hardware. This has been happening incrementally over time, with enhancements to offload both segmentation and checksumming to the NIC. A more recent enhancement, kTLS, allows offloading TLS packet encryption to the NIC as well. Attempts at offloading all of TCP to the hardware, in the form of a TCP Offload Engine (TOE), have been systematically rejected by Linux maintainers. So these have been nice enhancements, but significant parts of the TCP stack still remain the kernel’s responsibility.
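
As a concrete sketch of what this offload path looks like from the application’s side, kTLS hands per-connection keys to the kernel, which can then delegate record encryption to a capable NIC. This is a minimal sketch, assuming the TLS handshake was already completed by a userspace library and the key material extracted from it; error handling is mostly omitted.

```c
#include <linux/tls.h>
#include <netinet/tcp.h>
#include <string.h>
#include <sys/socket.h>

#ifndef SOL_TLS
#define SOL_TLS 282
#endif

/* Sketch: switch an established TCP connection's transmit path to kTLS.
 * key/iv/salt/rec_seq come from the TLS handshake, assumed to have been
 * performed in userspace (e.g. by a TLS library). */
static int enable_ktls_tx(int sock, const unsigned char *key,
                          const unsigned char *iv, const unsigned char *salt,
                          const unsigned char *rec_seq)
{
    /* Attach the "tls" upper-layer protocol to the socket. */
    if (setsockopt(sock, SOL_TCP, TCP_ULP, "tls", sizeof("tls")) < 0)
        return -1;

    struct tls12_crypto_info_aes_gcm_128 ci = {0};
    ci.info.version = TLS_1_2_VERSION;
    ci.info.cipher_type = TLS_CIPHER_AES_GCM_128;
    memcpy(ci.key, key, TLS_CIPHER_AES_GCM_128_KEY_SIZE);
    memcpy(ci.iv, iv, TLS_CIPHER_AES_GCM_128_IV_SIZE);
    memcpy(ci.salt, salt, TLS_CIPHER_AES_GCM_128_SALT_SIZE);
    memcpy(ci.rec_seq, rec_seq, TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE);

    /* From here on, plain write()/sendfile() calls on this socket are
     * encrypted by the kernel, or by the NIC if it advertises TLS offload. */
    return setsockopt(sock, SOL_TLS, TLS_TX, &ci, sizeof(ci));
}
```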

Thus another solution is to remove the kernel as the middleman between the NIC and the application. Frameworks such as the Data Plane Development Kit (DPDK) permit userspace to poll the network card for packets, removing the overhead of interrupts, and keeping all the processing in userspace means no transitions into and out of the kernel. DPDK has struggled with adoption, though, as it requires exclusive control of a NIC: one thus needs two NICs per host, one for DPDK and one for the OS and every other process. Marc Richards put together a nice Linux Kernel vs DPDK benchmark, which ends with DPDK offering a 50% increase in throughput, followed by an enumeration of the slew of drawbacks one accepts to gain that 50%. It seems to be a tradeoff most databases aren’t interested in, and even ScyllaDB has mostly dropped its investment in it.
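
To make the model concrete, here is a minimal sketch (not a complete DPDK program) of the receive path: a core spins on the NIC’s RX queue instead of waiting for interrupts, and packets are processed and freed entirely in userspace. EAL initialization, mempool creation, and port/queue setup are assumed to have already happened.

```c
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

/* Sketch of DPDK's poll-mode receive loop. Assumes rte_eal_init(),
 * rte_eth_dev_configure(), rte_eth_rx_queue_setup(), and
 * rte_eth_dev_start() have already run for `port_id`. */
static void rx_loop(uint16_t port_id)
{
    struct rte_mbuf *bufs[BURST_SIZE];

    for (;;) {
        /* Poll the NIC directly: no interrupt, no syscall, no kernel. */
        uint16_t n = rte_eth_rx_burst(port_id, /*queue=*/0, bufs, BURST_SIZE);
        for (uint16_t i = 0; i < n; i++) {
            /* ... parse and process the packet payload in userspace ... */
            rte_pktmbuf_free(bufs[i]);
        }
    }
}
```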

Newer hardware presents an interesting new option: removing the CPU from the networking path. RDMA (Remote Direct Memory Access) offers verbs, a limited set of operations (essentially read, write, and 8-byte CAS) that can be performed entirely within the NIC, with no CPU interaction. Cutting out the CPU means close to 1us of latency for a remote read, versus the >100us latency of TCP. As part of RDMA, responsibility for packet loss and flow control is also pushed down entirely to the NIC[2]. Cutting out the CPU also means large volumes of data can be transferred without the CPU becoming the bottleneck. [2]: Why is it acceptable to push loss detection and flow control into the hardware for RDMA, but Linux maintainers keep rejecting it for TCP? Because it’s a different, and much more limited, API, which reduces the NIC<->host complexity. TCP offload is a dumb idea whose time has come is a fun read in this area. (From 2003!)
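
For a sense of what programming against verbs looks like, below is a minimal sketch of posting a one-sided RDMA read using libibverbs (libfabric, recommended further down, wraps the same concepts). Queue pair setup, memory registration, and the out-of-band exchange of the remote address and rkey are assumed to have happened already.

```c
#include <stdint.h>
#include <infiniband/verbs.h>

/* Sketch: post a one-sided RDMA read. The local CPU only posts the work
 * request; the remote NIC serves the read with no remote CPU involvement. */
static int post_rdma_read(struct ibv_qp *qp, void *local_buf, uint32_t lkey,
                          uint64_t remote_addr, uint32_t rkey, uint32_t len)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = len,
        .lkey   = lkey,
    };
    struct ibv_send_wr wr = {
        .wr_id      = 1,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_RDMA_READ,
        .send_flags = IBV_SEND_SIGNALED,  /* ask for a completion to poll for */
    };
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    struct ibv_send_wr *bad_wr = NULL;
    /* Completion shows up later on the QP's completion queue (ibv_poll_cq). */
    return ibv_post_send(qp, &wr, &bad_wr);
}
```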

Having RDMA as a low-latency, high-throughput networking primitive changes how one can design databases. The End of a Myth: Distributed Transactions Can Scale shows that RDMA’s low latency lets the classic 2PL+2PC scale to large clusters. Is Scalable OLTP in the Cloud a Solved Problem? pitches the idea of a shared writeable page cache across nodes, because low latency makes tighter coupling of components feasible. RDMA isn’t just for OLTP databases either; BigQuery uses an RDMA shuffle-based join because of the high throughput. Changing the basic numbers on latency and CPU utilization at a given throughput changes which design is best, or unblocks new designs that previously weren’t considered feasible.[3] [3]: For using RDMA, I’d strongly suggest libfabric, as it abstracts over all the different RDMA vendors and libraries. The RDMAmojo blog has years’ worth of RDMA-specific content and is one of the best places to learn about all aspects of RDMA.

Lastly, there’s a class of even newer hardware that completes the trend by placing even more computing power in the NIC itself, in the form of SmartNICs or Data Processing Units (DPUs). They permit arbitrary processing to be pushed down to the NIC, and potentially invoked in response to requests from other NICs. These are rather recent, and I’d suggest looking at DPDPU: Data Processing with DPUs for an overview, DDS: DPU-Optimized Disaggregated Storage for how to integrate them into a database, and Azure Accelerated Networking: SmartNICs in the Public Cloud for details about deploying them. In general, I expect SmartNICs to extend RDMA from simple reads and writes to CPU-bypass general RPCs (for requests that are computationally cheap to answer).

Storage

There are advances in storage devices that aim to improve total cost of ownership in specialized use cases. Manufacturers cleverly noted that the track width needed for reading a magnetized HDD platter is narrower than the track width that writing produces, and so tracks can be overlapped like shingles, leaving only the minimal readable width. Thus we gained Shingled Magnetic Recording (SMR) HDDs, which introduced the concept of storage being split into zones that only support being appended to or erased. SMR HDDs are targeted at use cases like object storage, where access is infrequent but large volumes of data must be stored.

Similar ideas have been applied to SSDs, and Zoned Namespace (ZNS) SSDs also exist. Exposing zones within an SSD means that the drive doesn’t need to offer a Flash Translation Layer (FTL) or a complex garbage collection process. Similar to SMR, this reduces the cost of a ZNS SSD as compared to a "regular" SSD, but there’s an additional focus on application-driven[4] garbage collection being more efficient, thus decreasing total write amplification and increasing drive lifetime. Consider LSMs on SSDs, which already operate via incremental appending and large erase blocks: removing the FTL between the LSM and the SSD opens up opportunities for optimization. More recently, Google and Meta have collaborated on a proposal for Flexible Data Placement (FDP), which acts more as a hint for grouping writes with related lifetimes than strictly enforcing the partitioning as ZNS does. The goal is an easier upgrade path, where an SSD could ignore the FDP part of a write request and still be semantically correct, just with worse performance or write amplification. [4]: See libzbc for SMR HDD usage, and xNVMe for ZNS SSD usage.
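
As a small illustration of the zoned model, the sketch below uses Linux’s generic zoned block device ioctls (rather than the libzbc/xNVMe libraries mentioned in the footnote) to list a device’s zones and their write pointers. /dev/nvme0n1 is a placeholder path; a real application would append at each zone’s write pointer and reset whole zones during its own garbage collection.

```c
#include <fcntl.h>
#include <linux/blkzoned.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
    /* Placeholder device path; must be a zoned (ZNS/SMR) block device. */
    int fd = open("/dev/nvme0n1", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    unsigned int nr = 8;
    struct blk_zone_report *rep =
        calloc(1, sizeof(*rep) + nr * sizeof(struct blk_zone));
    rep->sector = 0;        /* start reporting from the first zone */
    rep->nr_zones = nr;

    if (ioctl(fd, BLKREPORTZONE, rep) < 0) { perror("BLKREPORTZONE"); return 1; }

    for (unsigned int i = 0; i < rep->nr_zones; i++)
        printf("zone %u: start=%llu len=%llu write-pointer=%llu\n", i,
               (unsigned long long)rep->zones[i].start,
               (unsigned long long)rep->zones[i].len,
               (unsigned long long)rep->zones[i].wp);

    /* Resetting a zone (the "erase" half of append-or-erase):
     *   struct blk_zone_range r = { .sector = rep->zones[0].start,
     *                               .nr_sectors = rep->zones[0].len };
     *   ioctl(fd, BLKRESETZONE, &r);
     */
    free(rep);
    close(fd);
    return 0;
}
```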

Other improvements target not cost efficiency[5], but expanding the set of features that storage devices support. Focusing on NVMe in particular: a Copy command was added to remove the waste of reading data back to the host only to write it out again. Fused Compare and Write commands allow a CAS operation to be pushed down to the drive, allowing for crazy designs like pushing Optimistic Lock Coupling into the drive itself. NVMe inherited Data Integrity Field (DIF) / Data Integrity Extensions (DIX) support from SCSI, which allows pushing page checksums down into the drive. (Notably used by Oracle.) There are projects like KV-SSD which change the entire data model from storing blocks by index to storing objects by key, heading towards replacing software storage engines entirely. SSD manufacturers continue to make drives capable of more operations over time. [5]: If you were expecting a discussion of persistent memory: Intel killed off Optane, so that’s a dead end for now. It seems that a few companies, like Kioxia and Everspin, are continuing on, but I haven’t heard anything about their usage.

As the penultimate step in SSD capabilities, SmartSSDs are coming into existence, allowing arbitrary compute to be placed inside the SSD. Query processing on SmartSSDs: Opportunities and challenges surveys their application to query processing tasks. Pushing filters to storage is always advantageous; I’ve regularly linked previous work like PushdownDB leveraging S3 Select[6] as a great example on the analytics side. With SmartSSDs we get papers like POLARDB Meets Computational Storage. Even without specialized integration, there are arguments to be made that transparent, in-drive compression alone can close the gap between B+ trees and LSMs in write amplification. Leveraging SmartSSDs is still a young field of research, but there’s incredible potential for impact. [6]: AWS un-launched S3 Select as of July 25th, 2024, presumably in favor of S3 Object Lambda.

Compute

Transaction Processing

In a recent VLDB, two powerhouses of database research put forth a position paper, Cloud-Native Database Systems and Unikernels: Reimagining OS Abstractions for Modern Hardware, arguing that unikernels allow a database to specialize the OS for its exact needs. The early work on VMCache highlights the struggle of efficient database buffer management in particular, where one either accepts the complexity of pointer swizzling, or hooks into the kernel and invokes mmap()-related syscalls frequently. Neither option is appealing, and unikernels instead offer direct access to virtual memory primitives. The effort required to develop unikernels is dropping as the area gets more attention: Akira Kurogane got MongoDB running as a unikernel via Unikraft with little effort, and subsequent posts showed a modest performance improvement without any MongoDB-internal changes. There’s a long-running joke that databases want to become the OS, since the hunger for performance improvements demands ever more control over networking, filesystems, disk I/O, memory, etc., and unikernel databases offer exactly that as a tangible possibility.
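
To make the mmap()-related horn of that dilemma concrete, here is a rough sketch, loosely in the spirit of VMCache’s virtual-memory-assisted approach, with all concurrency control and error handling omitted: reserve one huge virtual region for the whole database, fault pages in with pread(), and evict with madvise(). Every fix and evict is a syscall, which is exactly the overhead a unikernel could remove by exposing page-table primitives directly.

```c
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#define PAGE_SIZE 4096ULL

static uint8_t *pool;   /* one big virtual region covering every page id */

void pool_init(uint64_t max_pages)
{
    /* Reserve virtual address space only; no physical memory committed yet. */
    pool = mmap(NULL, max_pages * PAGE_SIZE, PROT_READ | PROT_WRITE,
                MAP_ANONYMOUS | MAP_PRIVATE | MAP_NORESERVE, -1, 0);
}

void *page_fix(int fd, uint64_t page_id)
{
    uint8_t *p = pool + page_id * PAGE_SIZE;
    /* Read the on-disk page into its fixed virtual address (syscall #1). */
    pread(fd, p, PAGE_SIZE, page_id * PAGE_SIZE);
    return p;
}

void page_evict(uint64_t page_id)
{
    /* Drop the physical memory backing the page (syscall #2); the virtual
     * address stays valid and can be faulted back in later. */
    madvise(pool + page_id * PAGE_SIZE, PAGE_SIZE, MADV_DONTNEED);
}
```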

For data confidentiality beyond just TLS or disk encryption, secure enclaves allow execution of verifiably untampered code such that the data it operates on is protected from even a compromised operating system. Whereas a Trusted Platform Module (TPM) allowed keys to be held securely within a machine, secure enclaves extend that protection to arbitrary code and data. This permits building databases that are tremendously more resilient to malicious compromise, albeit with several constraints on their design. Microsoft has published on integrating secure enclaves into Hekaton, and has released the work as part of SQL Server Always Encrypted. Alibaba has also published about their efforts in building enclave-native storage engines for enterprise customers worried about data confidentiality. Databases have a history of being able to sell security improvements through the vehicle of regulatory compliance, and secure enclaves are a meaningful improvement in data confidentiality.

After Spanner’s introduction of TrueTime, clock synchronization has become of notable interest for transaction ordering in geo-distributed databases. Each of the major cloud providers has an NTP offering that is tied to atomic clocks or GPS satellites (AWS, Azure, GCP). This is of great utility to any similar design, like CockroachDB or Yugabyte, for which clock synchronization is vital for correctness and conservatively wide error margins degrade performance. AWS’s recent Aurora Limitless also uses a TrueTime-like design. This is the only mention of cloud-specific not-quite-hardware, because it is a case of major cloud vendors exposing expensive hardware (atomic clocks) that users otherwise wouldn’t have considered buying for themselves.
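
The performance sensitivity to error margins comes from commit-wait-style logic. The sketch below is a generic illustration of the TrueTime idea, not any vendor’s API: get_clock_error_ns() is a hypothetical stand-in for whatever bound the clock-synchronization service reports, and the tighter that bound, the shorter the wait before a commit timestamp can be made externally visible.

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical: returns the current clock synchronization error bound, e.g.
 * as reported by a cloud provider's atomic/GPS-backed time service. */
extern uint64_t get_clock_error_ns(void);

static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_REALTIME, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/* TrueTime-style commit wait: block until the commit timestamp is guaranteed
 * to be in the past on every node, i.e. now() - error > commit_ts. */
void commit_wait(uint64_t commit_ts_ns)
{
    while (now_ns() <= commit_ts_ns + get_clock_error_ns())
        ;   /* in practice: sleep for the remaining interval, don't spin */
}
```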

Hardware transactional memory has had a rather ill-fated history. Sun’s Rock processor featured hardware transactional memory right up until Sun was bought and Rock was shut down. Intel made two attempts at releasing it, and had to disable it both times[7]. There was some interesting work on the subject of applying hardware transactional memory to in-memory databases, but other than finding some old CPUs for experimentation, we all must wait until a CPU manufacturer says they’re planning to make another attempt at it. [7]: The first time due to a bug, and the second time due to a side-channel attack breaking KASLR. There was also a speculative execution timing attack discovered via misunderstanding the intention of a CTF challenge.
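
For reference, here is roughly what the lock-elision pattern looked like with Intel’s RTM intrinsics. This is a generic sketch, not taken from any of the cited work: the lock()/unlock()/lock_is_held() helpers are placeholders, it needs a TSX-capable CPU and -mrtm to compile, and per the history above the feature is disabled on most CPUs you will actually find.

```c
#include <immintrin.h>
#include <stdbool.h>

/* Placeholder fallback lock; a real database would use its own latch. */
extern void lock(void);
extern void unlock(void);
extern bool lock_is_held(void);

long counter;

void increment(void)
{
    if (_xbegin() == _XBEGIN_STARTED) {
        /* Reading the fallback lock puts it in the transaction's read set,
         * so a concurrent lock-based writer aborts us instead of racing. */
        if (lock_is_held())
            _xabort(0xff);
        counter++;          /* executed atomically by the hardware */
        _xend();
    } else {
        lock();             /* abort path: fall back to the real lock */
        counter++;
        unlock();
    }
}
```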

Query Processing

Companies are consistently being founded to leverage specialized hardware for accelerating query processing, aiming for better performance and cost efficiency than their CPU-only competitors. GPU-powered databases, like Voltron, HEAVY.ai, and Brytlyt, are the first step in this direction. I wouldn’t be overly surprised if Intel or AMD integrated graphics gained OpenCL support[8] sometime in the future, which would open the door to all databases being able to assume some amount of GPU capability on a much wider set of hardware configurations. [8]: OpenGL Compute Shaders are the most generic and portable form of using GPUs for arbitrary compute, and those are supported by integrated graphics chipsets already. I can’t find any database-related papers looking into using them though.

There are also opportunities for using even more power-efficient hardware. The newest Neural Processing Units/Tensor Processing Units have already been shown to be adaptable to query processing in work like TCUDB: Accelerating Database with Tensor Processors. A few companies have attempted to utilize FPGAs: Swarm64 tried (and failed?) at this market, and AWS did their own effort as Redshift AQUA. Going as far as ASICs seems to not be worth it for even the largest companies, as even Oracle stopped their SPARC development in 2017. I’m not overly optimistic about anything from FPGAs through ASICs, as memory bandwidth will become the primary bottleneck at some point anyway, but ADMS is the conference[9] to follow for papers in this overall area. [9]: Okay, technically ADMS is a workshop attached to VLDB, but I don’t know the word which generalizes over conferences, journals, and workshops.

Cloud Availability

To finally address the depressing elephant in the room, none of these hardware advancements matter if they’re not accessible. For today’s systems, that means in the cloud, and the cloud doesn’t offer the forefront of hardware advancements to its customers.

For networking, the situation isn’t fantastic. DPDK is the most advanced networking technology that’s somewhat easily accessible, as most clouds allow some instance types to have more than one NIC. AWS offers pseudo-RDMA in the form of Scalable Reliable Datagrams (SRD), which has been benchmarked at about halfway between TCP and RDMA. Real RDMA is only available on the High Performance Computing instances within Azure, GCP, and OCI. Only Alibaba offers RDMA on general-purpose compute instances[10]. SmartNICs are not publicly available anywhere. Some of this is for good reason: Microsoft has published papers that deploying RDMA is hard. In fact, it’s really hard. Even their papers about actually succeeding in using RDMA emphasize that it’s really hard. We’re nearing a full decade after Microsoft started using RDMA internally and it’s still not available in their cloud. I have no guesses as to if or when it will be. [10]: Though there’s likely a bit of a latency hit, similar to how SRD is worse. Alibaba deployed RDMA via iWARP, which should be a bit slower, but I haven’t seen any benchmarks.

For storage, the situation isn’t really any better. The few times that SMR HDDs did reach consumers, it was as drives that still presented themselves as supporting a block storage API, and consumers hated it. ZNS SSDs seem similarly locked behind enterprise-only purchasing agreements. One might think that Intel discontinuing Optane-branded persistent memory and SSDs would mean they’re not accessible on the cloud, but Alibaba still offers persistent-memory-optimized instances. The wonderful folk at Spare Cores actually provided me with nvme id-ctrl output from each cloud vendor, and none of the NVMe devices they pulled present themselves as supporting nearly any of the optional features: copy, fused compare and write, data integrity extensions, nor multi-block atomic writes[11]. Alibaba is also the only cloud vendor which has invested in SmartSSDs, with their collaboration with ScaleFlux on PolarDB. This still means SmartSSDs are not accessible to the general public, but even the paper acknowledges it’s "the first real-world deployment of cloud-native databases with computational storage drives ever reported in the open literature". [11]: Even though AWS supports torn write prevention and GCP used to have similar docs.

On the compute side, the state finally gets a bit better. The cloud fully permits unikernels, and TPMs are widely accessible, but only AWS and Azure support secure enclaves as far as I can tell. NTP servers backed by atomic or GPS clocks are made available, but only AWS makes an effort to promise error bounds, and without promised error bounds it isn’t possible to critically rely on the clock. (Hardware transactional memory isn’t available, but it’s hard to blame the clouds for that one.) The explosion of AI means there’s good money behind making more efficient compute available. GPUs are available in all clouds. AWS[12], Azure, IBM, and Alibaba offer FPGA instances. (GCP and OCI don’t.) The unfortunate reality is also that faster compute only matters when compute is the bottleneck. Both GPUs and FPGAs suffer from having limited memory, so one cannot keep the database in their local memory. Instead, one relies on streaming data in and out of them, which means being limited by PCIe speeds. All of this would encourage thoughtful motherboard layout and bus design in an on-premise appliance, but that’s not feasible in the cloud. [12]: One would ideally like peer-to-peer DMA support to be able to read from disk straight into the FPGA, and at least AWS’s F1 instances cannot.

Thus we end with my bleak view on the next generation of databases: no one[13] can build databases that critically depend on new hardware advancements until that hardware is made available, but no cloud vendor wants to deploy hardware that can’t be immediately used. The next generation of databases is being held hostage by the cyclic dependency that it doesn’t yet exist. [13]: Except for the cloud vendors themselves. Most notably, Microsoft and Google already have RDMA internally and leverage it extensively in their database offerings, while not permitting public use of it. There’s been a post outline trapped in my drafts for a long time now titled "The Competitive Advantage of RDMA for Cloud Vendors".

Alibaba is shockingly great though. They’re consistently at the forefront of making hardware advances available across the board. Correspondingly, I’m surprised I don’t see Alibaba used more frequently for benchmarking in academia and industry.


See discussion of this page on Reddit, HN, and lobsters.