Decomposing Transactional Systems

Every transactional system does four things:

  • It executes transactions.

  • It orders transactions.

  • It validates transactions.

  • It persists transactions.

Executing a transaction means evaluating the body of the transaction to produce the intended reads and writes. There is still notable variety across systems as to how the body of a transaction is executed. Writes might be applied to storage during this phase, or they might be buffered locally and submitted as a batch at the end. A transaction might be executed more than once for different purposes.

Ordering a transaction means assigning the transaction some notion of a time at which it occurred. This could be a version, a timestamp, a log sequence number, or a more complex description of transaction IDs it happened before or after. MVCC databases may assign two versions: an initial read version, and a final commit version. In this case, we’re mainly focused on the specific point at which the commit version is chosen — the time at which the database claims all reads and writes occurred atomically.

Validating a transaction means enforcing concurrency control or, more rarely, domain-specific semantics. If a transaction is going to be rejected for a system-defined reason, such as having serializability conflicts with existing transactions, it happens here. When validation happens after ordering, it checks whether the assigned order is valid. When validation happens before ordering, it provides a range of acceptable commit versions, and the ordering step chooses one of them.

Persisting a transaction means making it durable, generally to disk. Sometimes writes are incrementally made durable during transaction execution, but the key point in persistence is when all writes and the commit record marking the transaction as committed are durable. Often this step captures how the system performs replication and persists the outcome of its atomic commitment protocol. (And sometimes the lines between those two aren’t very clear.)

All four of these things must be done before the system may acknowledge a transaction’s result to a client. However, these steps can be done in any order. They can be done concurrently. Different systems achieve different tradeoffs by reordering these steps.
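
To pin the vocabulary down, here’s the framing as a trivial Python sketch (the names are mine, not from any particular system):

    from enum import Enum, auto

    class Step(Enum):
        EXECUTE = auto()   # evaluate the body into its reads and writes
        ORDER = auto()     # assign a commit version / timestamp / log position
        VALIDATE = auto()  # concurrency control: is the (proposed) order valid?
        PERSIST = auto()   # make the writes and the commit record durable

    def can_acknowledge(completed: set) -> bool:
        # Steps may run in any order, and even concurrently, but the client's
        # commit acknowledgement has to wait for all four.
        return completed == set(Step)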

Examples

A classic optimistic concurrency control database executes a transaction while recording its read and write sets. Once execution finishes, a commit version is allocated, and concurrent transactions are checked for conflicts. If no conflicts are found, the transaction is made durable and acknowledged as committed.

Classic optimistic concurrency control databases break down as:

Diagram
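
As a concrete sketch of that flow, here’s a toy, single-node version in Python, with in-memory stand-ins for storage and the write-ahead log. It isn’t any particular system’s implementation, just the execute, order, validate, persist sequence in miniature:

    import threading

    class ToyOCCDatabase:
        def __init__(self):
            self.data = {}           # key -> (value, version that wrote it)
            self.committed = []      # (commit_version, keys written) per transaction
            self.next_version = 1
            self.mutex = threading.Lock()
            self.log = []            # in-memory stand-in for a durable WAL

        def commit(self, read_version, read_set, write_set):
            # Execute already happened at the client: it read at read_version and
            # buffered write_set ({key: value}) locally.
            with self.mutex:
                # Order: allocate the commit version.
                commit_version = self.next_version
                self.next_version += 1
                # Validate: abort if anything we read was overwritten by a
                # transaction that committed after our read_version.
                for version, written_keys in self.committed:
                    if version > read_version and written_keys & read_set:
                        return None  # conflict -> abort
                # Persist: append the writes and a commit record ("fsync" elided).
                self.log.append(("commit", commit_version, dict(write_set)))
                for key, value in write_set.items():
                    self.data[key] = (value, commit_version)
                self.committed.append((commit_version, set(write_set)))
                return commit_version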

A classic pessimistic concurrency control database executes a transaction, acquiring locks as it runs to exclude conflicting transactions. When the transaction finishes, it acquires a commit version and is persisted to disk. Only then does it release its locks.

Classic pessimistic concurrency control databases break down as:[1] [1]: Don’t try to infer anything from these diagrams other than how the steps relate to each other. These diagrams have no relationship to the performance or efficiency of different designs. Thinner diagrams or more concurrency between steps aren’t necessarily better.

Diagram
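
And a matching toy sketch of the pessimistic flow, where validation takes the form of locks acquired during execution. There’s no deadlock detection, and txn_body is a caller-supplied function that receives read and write callbacks:

    import threading

    class ToyTwoPhaseLockingDatabase:
        def __init__(self):
            self.data = {}
            self.locks = {}              # key -> threading.Lock
            self.next_version = 1
            self.version_mutex = threading.Lock()
            self.log = []                # in-memory stand-in for a durable WAL

        def run(self, txn_body):
            held = {}                    # locks this transaction currently holds
            writes = {}

            def acquire(key):            # validate: exclude conflicting transactions
                if key not in held:      # (no deadlock detection in this toy)
                    lock = self.locks.setdefault(key, threading.Lock())
                    lock.acquire()
                    held[key] = lock

            def read(key):
                acquire(key)
                return self.data.get(key)

            def write(key, value):
                acquire(key)
                writes[key] = value

            try:
                txn_body(read, write)                      # execute, locking as it goes
                with self.version_mutex:                   # order: pick a commit version
                    commit_version = self.next_version
                    self.next_version += 1
                self.log.append(("commit", commit_version, dict(writes)))  # persist
                self.data.update(writes)
                return commit_version
            finally:
                for lock in held.values():                 # then release the locks
                    lock.release()

    # Example: db.run(lambda read, write: write("x", (read("x") or 0) + 1))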

Now, let’s burn through some real-world examples.

FoundationDB

FoundationDB[2] is a distributed, transactional database, which unbundles the standard database architecture into a set of database microservices which may be scaled independently. The paper describes its transaction processing flow as: [2]: Jingyu Zhou, Meng Xu, Alexander Shraer, Bala Namasivayam, Alex Miller, Evan Tschannen, Steve Atherton, Andrew J. Beamon, Rusty Sears, John Leach, Dave Rosenthal, Xin Dong, Will Wilson, Ben Collins, David Scherer, Alec Grieser, Young Liu, Alvin Moore, Bhaskar Muppana, Xiaoge Su, and Vishesh Yadav. 2021. FoundationDB: A Distributed Unbundled Transactional Key Value Store. In Proceedings of the 2021 International Conference on Management of Data (SIGMOD '21), Association for Computing Machinery, New York, NY, USA, 2653–2666. [scholar]

A client transaction starts by contacting one of the Proxies to obtain a read version (i.e., a timestamp). The Proxy then asks the Sequencer for a read version that is guaranteed to be no less than any previously issued transaction commit version, and this read version is sent back to the client. Then the client may issue multiple reads to StorageServers and obtain values at that specific read version. Client writes are buffered locally without contacting the cluster. At commit time, the client sends the transaction data, including the read and write sets (i.e., key ranges), to one of the Proxies and waits for a commit or abort response from the Proxy. If the transaction cannot commit, the client may choose to restart the transaction from the beginning again.

A Proxy commits a client transaction in three steps. First, the Proxy contacts the Sequencer to obtain a commit version that is larger than any existing read versions or commit versions. The Sequencer chooses the commit version by advancing it at a rate of one million versions per second. Then, the Proxy sends the transaction information to range-partitioned Resolvers, which implement FDB’s optimistic concurrency control by checking for read-write conflicts. If all Resolvers return with no conflict, the transaction can proceed to the final commit stage. Otherwise, the Proxy marks the transaction as aborted. Finally, committed transactions are sent to a set of LogServers for persistence. A transaction is considered committed after all designated LogServers have replied to the Proxy, which reports the committed version to the Sequencer (to ensure that later transactions' read versions are after this commit) and then replies to the client. At the same time, StorageServers continuously pull mutation logs from LogServers and apply committed updates to disks.

So, breaking down this part of the paper, we see:

  1. A client executes a transaction.

  2. The proxy acquires a commit version.

  3. The transaction is conflict checked.

  4. The transaction is made durable.

All as sequential steps.

Diagram

A familiar diagram! FoundationDB operates as a classic optimistic concurrency control database. With this in mind, if you’re trying to understand one piece of the system (e.g. distributed conflict checking in the resolvers), you can replace the other pieces temporarily with the minimal equivalents from the most simplistic optimistic database implementation (e.g. replace the log servers with appending to a single WAL on one disk).
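
To make the Proxy’s side of that commit path concrete, here’s a sketch in Python-flavored pseudocode. The sequencer, resolver, and log server interfaces are paraphrased stand-ins for the roles in the quote, not FoundationDB’s actual APIs, and the real system batches transactions rather than committing them one at a time:

    # A sketch of the Proxy's side of the commit path described above; sequencer,
    # resolvers, and log_servers stand in for RPC stubs to those roles.
    def proxy_commit(txn, sequencer, resolvers, log_servers):
        # Order: a commit version larger than any read or commit version so far.
        commit_version = sequencer.get_commit_version()

        # Validate: range-partitioned Resolvers check for read-write conflicts.
        verdicts = [r.resolve(txn.read_set, txn.write_set, commit_version)
                    for r in resolvers]
        if not all(v == "no_conflict" for v in verdicts):
            return ("aborted", None)

        # Persist: committed once every designated LogServer has acknowledged.
        for log in log_servers:
            log.push(commit_version, txn.write_set)
        for log in log_servers:
            log.wait_for_ack(commit_version)

        # Report the committed version to the Sequencer so later read versions
        # come after it, then acknowledge to the client.
        sequencer.report_committed(commit_version)
        return ("committed", commit_version)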

Spanner

Spanner[3] is Google’s flagship database, used extensively within Google, and it inspired a generation of NewSQL databases that followed in its architectural footsteps[4] of partitioned Paxos and distributed transactions. For Spanner’s description of its transaction protocol, we look at §4.2.1 Read-Write Transactions: [3]: James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, JJ Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Dale Woodford, Yasushi Saito, Christopher Taylor, Michal Szymaniak, and Ruth Wang. 2012. Spanner: Google’s Globally-Distributed Database. In OSDI. [scholar] [4]: And comparing specifically how CockroachDB, TiDB, and YugabyteDB all differ from Spanner and each other is an insightful adventure.

Like Bigtable, writes that occur in a transaction are buffered at the client until commit. As a result, reads in a transaction do not see the effects of the transaction’s writes. This design works well in Spanner because a read returns the timestamps of any data read, and uncommitted writes have not yet been assigned timestamps.

Reads within read-write transactions use wound-wait to avoid deadlocks. The client issues reads to the leader replica of the appropriate group, which acquires read locks and then reads the most recent data. While a client transaction remains open, it sends keepalive messages to prevent participant leaders from timing out its transaction. When a client has completed all reads and buffered all writes, it begins two-phase commit. The client chooses a coordinator group and sends a commit message to each participant’s leader with the identity of the coordinator and any buffered writes. Having the client drive two-phase commit avoids sending data twice across wide-area links.

A non-coordinator-participant leader first acquires write locks. It then chooses a prepare timestamp that must be larger than any timestamps it has assigned to previous transactions (to preserve monotonicity), and logs a prepare record through Paxos. Each participant then notifies the coordinator of its prepare timestamp.

The coordinator leader also first acquires write locks, but skips the prepare phase. It chooses a timestamp for the entire transaction after hearing from all other participant leaders. The commit timestamp s must be greater or equal to all prepare timestamps (to satisfy the constraints discussed in Section 4.1.3), greater than TT.now().latest at the time the coordinator received its commit message, and greater than any timestamps the leader has assigned to previous transactions (again, to preserve monotonicity). The coordinator leader then logs a commit record through Paxos (or an abort if it timed out while waiting on the other participants).

Before allowing any coordinator replica to apply the commit record, the coordinator leader waits until TT.after(s), so as to obey the commit-wait rule described in Section 4.1.2. Because the coordinator leader chose s based on TT.now().latest, and now waits until that timestamp is guaranteed to be in the past, the expected wait is at least 2 * epsilon. This wait is typically overlapped with Paxos communication. After commit wait, the coordinator sends the commit timestamp to the client and all other participant leaders. Each participant leader logs the transaction’s outcome through Paxos. All participants apply at the same timestamp and then release locks.

Spanner is a bit more complicated, partly because the lock-related operations involved in transaction validation are stretched across the whole text. It also tries to trick you by discussing details out of execution order, so read closely for "then" to get hints about the ordering of the steps.

  1. The execute and validate steps seem to be intertwined, as read locks are acquired while the transaction executes.

  2. Writes are buffered until the client is ready to commit.

  3. Two-phase commit is started to check if the transaction can commit on all participants.

  4. After the coordinator has heard the prepare timestamps (each a lower bound on the commit version) from all of its participants during the two-phase commit’s prepare phase, it decides the final commit version.

  5. The transaction is then made durable.

  6. Finally, read and write locks are released.

Drawing this out, Spanner looks like:

Diagram

Oh hey, it still looks exactly like a classic pessimistic concurrency control database. So despite the significantly more complicated explanation of how transactions are executed, it’s reasonable to approach the paper from the viewpoint of "How does this end up being equivalent to SERIALIZABLE MySQL?", and then think through how the two systems differ piece by piece.
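
As a sketch of just the coordinator’s ordering rule and commit wait, treating timestamps as integer ticks and TrueTime as an object whose now() returns an uncertainty interval; every name here is hypothetical scaffolding around the rules in the quote:

    import time

    def coordinator_commit(tt, paxos, prepare_timestamps, arrival_latest, last_assigned):
        # Order: s must be >= every participant's prepare timestamp, > TT.now().latest
        # as of when the commit message arrived, and > any timestamp this leader
        # has previously assigned.
        s = max(max(prepare_timestamps), arrival_latest + 1, last_assigned + 1)

        # Persist: log the commit record through Paxos.
        paxos.log_commit(s)

        # Commit wait: don't announce the commit until s is guaranteed to be in
        # the past for every observer, i.e. until TT.now().earliest > s.
        while not tt.after(s):
            time.sleep(0.0001)

        # Now reply to the client and the other participant leaders, which apply
        # at s and then release their locks.
        return s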

TAPIR

TAPIR[5] is a strictly serializable database advertising itself as an improvement on Spanner that can commit transactions with better latency and throughput through the use of its novel replication protocol. The core of TAPIR is described in §5.2.1: [5]: Irene Zhang, Naveen Kr. Sharma, Adriana Szekeres, Arvind Krishnamurthy, and Dan R. K. Ports. 2015. Building consistent transactions with inconsistent replication. In Proceedings of the 25th Symposium on Operating Systems Principles (SOSP '15), Association for Computing Machinery, New York, NY, USA, 263–278. [scholar]

We begin with TAPIR’s protocol for executing transactions.

  1. For Write(key, object), the client buffers key and object in the write set until commit and returns immediately.

  2. For Read(key), if key is in the transaction’s write set, the client returns object from the write set. If the transaction has already read key, it returns a cached copy. Otherwise, the client sends Read(key) to the replica.

  3. On receiving Read, the replica returns object and version, where object is the latest version of key and version is the timestamp of the transaction that wrote that version.

  4. On response, the client puts (key, version) into the transaction’s read set and returns object to the application.

Once the application calls Commit or Abort, the execution phase finishes. To commit, the TAPIR client coordinates across all participants — the shards that are responsible for the keys in the read or write set — to find a single timestamp, consistent with the strict serial order of transactions, to assign the transaction’s reads and writes, as follows:

  1. The TAPIR client selects a proposed timestamp. Proposed timestamps must be unique, so clients use a tuple of their local time and their client id.

  2. The TAPIR client invokes Prepare(txn, timestamp) as an IR consensus operation, where timestamp is the proposed timestamp and txn includes the transaction id (txn.id) and the transaction read (txn.read_set) and write sets (txn.write_set). The client invokes Prepare on all participants through IR as a consensus operation.

  3. Each TAPIR replica that receives Prepare (invoked by IR through ExecConsensus) first checks its transaction log for txn.id. If found, it returns PREPARE-OK if the transaction committed or ABORT if the transaction aborted.

  4. Otherwise, the replica checks if txn.id is already in its prepared list. If found, it returns PREPARE-OK.

  5. Otherwise, the replica runs TAPIR’s OCC validation checks, which check for conflicts with the transaction’s read and write sets at timestamp, shown in Figure 8.

  6. Once the TAPIR client receives results from all shards, the client sends Commit(txn, timestamp) if all shards replied PREPARE-OK or Abort(txn, timestamp) if any shards replied ABORT or ABSTAIN. If any shards replied RETRY, then the client retries with a new proposed timestamp (up to a set limit of retries).

  7. On receiving a Commit, the TAPIR replica: (1) commits the transaction to its transaction log, (2) updates its versioned store with w, (3) removes the transaction from its prepared list (if it is there), and (4) responds to the client.

  8. On receiving an Abort, the TAPIR replica: (1) logs the abort, (2) removes the transaction from its prepared list (if it is there), and (3) responds to the client.

Which initially feels like a lot of description to work through, but it breaks down into separable pieces pretty well:

  1. The transaction is executed.

  2. The client picks a proposed commit timestamp.

  3. Each replica then concurrently runs an OCC check and persists the data to the prepare log.

Thus, our diagram looks like:

Diagram

This also highlights the key aspect of TAPIR: its blending of the concurrency control validation and commit outcome persistence protocols.
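
Here’s a sketch of the replica-side Prepare handler from steps 3 through 5 above; occ_check and the replica’s fields are assumed stand-ins, since the actual validation rules live in the paper’s Figure 8, which isn’t quoted here:

    def handle_prepare(replica, txn, timestamp):
        # Step 3: if this transaction's outcome is already in the log, repeat it.
        outcome = replica.txn_log.get(txn.id)
        if outcome == "committed":
            return "PREPARE-OK"
        if outcome == "aborted":
            return "ABORT"

        # Step 4: already prepared, so answer consistently.
        if txn.id in replica.prepared:
            return "PREPARE-OK"

        # Step 5: OCC validation of the read/write sets at the proposed timestamp.
        result = replica.occ_check(txn.read_set, txn.write_set, timestamp)
        if result == "PREPARE-OK":
            # The prepare is also recorded (persisted) as part of accepting the
            # IR consensus operation, which is what blends validate and persist.
            replica.prepared[txn.id] = timestamp
        return result                # PREPARE-OK, ABORT, ABSTAIN, or RETRY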

Tangentially, TAPIR was the inspiration behind this way of decomposing databases, as it included a nice diagram which I occasionally fell back to when reading papers I struggled to make sense of:

TAPIR diagram

And this taxonomy just adds transaction execution, and looks at how those layers are arranged across the dimension of time as well.

Calvin

Calvin[6] is the iconic system for deterministic databases, and subsequent papers improving on various aspects of its design all share the same overall characteristics. In §3 System Architecture, Calvin’s architecture is introduced as: [6]: Alexander Thomson, Thaddeus Diamond, Shu-Chun Weng, Kun Ren, Philip Shao, and Daniel J. Abadi. 2012. Calvin: fast distributed transactions for partitioned database systems. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data (SIGMOD/PODS '12), ACM. [scholar]

The essence of Calvin lies in separating the system into three separate layers of processing:

  • The sequencing layer (or “sequencer”) intercepts transactional inputs and places them into a global transactional input sequence—this sequence will be the order of transactions to which all replicas will ensure serial equivalence during their execution. The sequencer therefore also handles the replication and logging of this input sequence.

  • The scheduling layer (or “scheduler”) orchestrates transaction execution using a deterministic locking scheme to guarantee equivalence to the serial order specified by the sequencing layer while allowing transactions to be executed concurrently by a pool of transaction execution threads. (Although they are shown below the scheduler components in Figure 1, these execution threads conceptually belong to the scheduling layer.)

  • The storage layer handles all physical data layout. Calvin transactions access data using a simple CRUD interface; any storage engine supporting a similar interface can be plugged into Calvin fairly easily.

This means Calvin breaks down as:

  1. Sequence the transaction into a global log.

  2. Make the log durable.

  3. Take locks to know when one can safely execute in the serial order despite concurrency.

  4. Execute the transaction.

  5. Drop all locks acquired.

Diagram

Calvin is the most well-known example of a database which does not execute transactions before committing them. It gains some significant advantages from this, in that its commit process is completely immune to contention in the workload, and some disadvantages, in that long-running transactions will stall the execution of any transactions committed after them.
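
A simplified sketch of the scheduler’s deterministic locking, which is what lets concurrent execution stay equivalent to the already-durable log order; txn.id and txn.read_write_set() are assumed helpers, every access is treated as exclusive for simplicity, and the paper’s actual scheme is more involved:

    from collections import deque

    class DeterministicScheduler:
        def __init__(self, log):
            # Enqueue lock requests strictly in log order, so any execution that
            # respects these queues is equivalent to running the log serially.
            self.lock_queues = {}                    # key -> deque of txn ids
            for txn in log:                          # log order = serial order
                for key in txn.read_write_set():     # Calvin needs the full set up front
                    self.lock_queues.setdefault(key, deque()).append(txn.id)

        def can_execute(self, txn):
            # A transaction may run (on any worker thread) once it is at the head
            # of every queue it appears in: every conflicting transaction earlier
            # in the serial order has already finished.
            return all(self.lock_queues[key][0] == txn.id
                       for key in txn.read_write_set())

        def release(self, txn):
            # Drop all locks acquired, unblocking later transactions.
            for key in txn.read_write_set():
                self.lock_queues[key].popleft()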

CURP

Exploiting Commutativity For Practical Fast Replication[7] defines a Consistent Unordered Replication Protocol (CURP), which allows clients to replicate requests that have not yet been ordered, as long as they are commutative. Stitching together a few paragraphs from §3.2 Normal operation: [7]: Seo Jin Park and John Ousterhout. 2019. Exploiting Commutativity For Practical Fast Replication. In 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 19), USENIX Association, Boston, MA, 47–64. [scholar]

Client interaction with masters is generally the same as it would be without CURP. Clients send update RPC requests to masters. If a client cannot receive a response, it retries the update RPC. If the master crashes, the client may retry the RPC with a different server.

For 1 RTT updates, masters return to clients before replication to backups. To ensure durability, clients directly record their requests to witnesses concurrently while waiting for responses from masters. Once all f witnesses have accepted the requests, clients are assured that the requests will survive master crashes, so clients complete the operations with the results returned from masters.

A witness accepts a new record RPC from a client only if the new operation is commutative with all operations that are currently saved in the witness. If the new request doesn’t commute with one of the existing requests, the witness must reject the record RPC since the witness has no way to order the two noncommutative operations consistent with the execution order in masters. For example, if a witness already accepted “x←1”, it cannot accept “x←5”.

Each of f witnesses operates independently; witnesses need not agree on either ordering or durability of operations. In an asynchronous network, record RPCs may arrive at witnesses in different order, which can cause witnesses to accept and reject different sets of operations. However, this does not endanger consistency. First, as mentioned in §3.2.1, a client can proceed without waiting for sync to backups only if all f witnesses accepted its record RPCs. Second, requests in each witness are required to be commutative independently, and only one witness is selected and used during recovery (described in §3.3).

The role of masters in CURP is similar to their role in traditional primary-backup replications. Masters in CURP receive, serialize, and execute all update RPC requests from clients. If an executed operation updates the system state, the master synchronizes (syncs) its current state with backups by replicating the updated value or the log of ordered operations.

Thus, we can decompose CURP into its pieces:

  1. Clients read from the master, and send their writes to both the master and all of the witnesses of the replication group.

  2. Each witness concurrently checks the request for conflicts (non-commutative operations) and records it locally.

  3. The transactions are ordered only after the client has already received its reply.

Thus, we have:

Diagram

Which also very nicely shows how CURP is rather unique: ordering transactions is the last thing that it does, and ordering transactions last is how it derives all of its advantages.
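
The witness’s accept-or-reject rule is simple enough to sketch directly. Here an operation is a dict of keys to values, and commutativity is approximated as “touches disjoint keys”, which matches the paper’s x←1 / x←5 example:

    class Witness:
        def __init__(self):
            self.saved = []            # operations recorded but not yet ordered

        def record(self, op):
            # Accept only if the new operation commutes with everything already
            # saved; otherwise this witness cannot know how the master ordered
            # the two, and must reject.
            for existing in self.saved:
                if not commutes(op, existing):
                    return "REJECTED"  # client falls back to a synced, ordered path
            self.saved.append(op)
            return "ACCEPTED"

    def commutes(a, b):
        # Coarse approximation: updates to disjoint keys commute.
        return a.keys().isdisjoint(b.keys())

    # w = Witness()
    # w.record({"x": 1})   # "ACCEPTED"
    # w.record({"x": 5})   # "REJECTED": does not commute with the first write
    # w.record({"y": 2})   # "ACCEPTED": disjoint key, so it commutes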

TicToc

TicToc[8] introduces itself as a new transaction protocol that assigns read and write timestamps to data items and uses them to lazily compute a valid commit timestamp for each transaction. Doing so removes the need for centralized timestamp allocation, and commits transactions that would be aborted by conventional timestamp ordering schemes. [8]: Xiangyao Yu, Andrew Pavlo, Daniel Sanchez, and Srinivas Devadas. 2016. TicToc: Time Traveling Optimistic Concurrency Control. In Proceedings of the 2016 International Conference on Management of Data (SIGMOD/PODS'16), ACM. [scholar]

Sounds cool. Stitching together some pieces of text from §3.2 Protocol Specification so that they read well in order, the transaction protocol is described as:

In the read phase, the DBMS maintains a separate read set and write set of tuples for each transaction. During this phase, accessed tuples are copied to the read set and modified tuples are written to the write set, which is only visible to the current transaction. Each entry in the read or write set is encoded as {tuple, data, wts, rts}, where tuple is a pointer to the tuple in the database, data is the data value of the tuple, and wts and rts are the timestamps copied from the tuple when it was accessed by the transaction. For a read set entry, TicToc maintains the invariant that the version is valid from wts to rts in timestamp order.

The first step of the validation phase is to lock all the tuples in the transaction’s write set in their primary key order to prevent other transactions from updating the rows concurrently. Using this fixed locking order guarantees that there are no deadlocks with other transactions committing at the same time.

The second step in the validation phase is to compute the transaction’s commit timestamp from the timestamps stored within each tuple entry in its read/write sets. For a tuple in the read set but not in the write set, the commit timestamp should be no less than its wts since the tuple would have a different version before this timestamp. For a tuple in the transaction’s write set, however, the commit timestamp needs to be no less than its current rts + 1 since the previous version was valid till rts.

In the last step, the algorithm validates the tuples in the transaction’s read set. If the transaction’s commit_ts is less than or equal to the rts of the read set entry, then the invariant wts ≤ commit_ts ≤ rts holds. This means that the tuple version read by the transaction is valid at commit_ts, and thus no further action is required. If the entry’s rts is less than commit_ts, however, it is not clear whether the local value is still valid or not at commit_ts. It is possible that another transaction has modified the tuple at a logical time between the local rts and commit_ts, which means the transaction has to abort. Otherwise, if no other transaction has modified the tuple, rts can be extended to be greater than or equal to commit_ts, making the version valid at commit_ts.

Finally, if all of the tuples that the transaction accessed pass validation, then the transaction enters the write phase. In this phase the transaction’s write set is written to the database.

And so breaking that down, we see:

  1. The transaction is executed, with writes buffered until commit.

  2. While the transaction executes, timestamps are recorded which narrow the possible range of commit versions.

  3. Once validation begins, a final commit timestamp is chosen and checked for conflicts.

  4. If all other steps are successful, the transaction is finally persisted.

Thus, the diagram looks something like:

Diagram

TicToc does dynamic timestamp assignment. Instead of choosing a timestamp before execution, or proposing one right before commit, it narrows the range of possible commit timestamps as it executes.
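
The timestamp arithmetic of the validation phase is compact enough to sketch; entries carry the wts/rts fields from the {tuple, data, wts, rts} encoding in the quote, and try_extend_rts / is_write_locked are assumed stand-ins for the paper’s precise re-check of the live tuple:

    def validate(read_set, write_set):
        # (Step 1, locking the write set in primary-key order, is omitted here.)
        written = {e.tuple for e in write_set}

        # Step 2: derive the commit timestamp from the recorded wts/rts bounds.
        commit_ts = 0
        for entry in read_set:
            if entry.tuple not in written:
                commit_ts = max(commit_ts, entry.wts)      # read-only: >= wts
        for entry in write_set:
            commit_ts = max(commit_ts, entry.rts + 1)      # written: >= rts + 1

        # Step 3: every read must still be valid at commit_ts.
        for entry in read_set:
            if commit_ts <= entry.rts:
                continue                # wts <= commit_ts <= rts already holds
            if not try_extend_rts(entry.tuple, entry.wts, commit_ts):
                return None             # someone modified it in between: abort
        return commit_ts                # enter the write phase at commit_ts

    def try_extend_rts(tup, expected_wts, commit_ts):
        # Re-check the live tuple: if it still carries the version we read (same
        # wts) and no conflicting writer holds it locked, extend its rts so the
        # version is valid at commit_ts. A stand-in for the paper's exact checks.
        if tup.wts != expected_wts or tup.is_write_locked():
            return False
        tup.rts = max(tup.rts, commit_ts)
        return True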

Homework

With this in mind, here’s a completely arbitrary sampling[9] of some further systems, off the top of my head, which all do transaction processing in rather different ways, that you can use for practice: [9]: Feel free to send me your favorite wacky transaction processing papers for inclusion too!

The diagrams in this post came from my Concurrent Operation Diagram Generator tool. Feel free to use it to generate your own! There’s an interactive web version of it on the page, and you can see the source for this blog post for extra examples on how to express the diagram in text.

Some database creators have already done the homework for you and written their own blog post about how their system decomposes. Go take a look!

Composing Transactional Systems

A fun part of such a decomposition is that it can be inverted to raise interesting questions. Draw any made-up diagram of a possible ordering or interleaving of execute, order, validate, and persist. Now answer the question: how would I need to design a database such that it would decompose to this diagram?

From all the possible orderings and sets of concurrently executing steps, there are at least 75 different types of databases that can exist. We’ve covered only a few of those, but each of them derives some interesting property from its different ordering of the steps of transaction processing. Looking for a novel transactional system to build and publish about? Find an ordering which hasn’t been well explored in the literature before, design a system that executes transactions in that fashion, and then figure out what it’s uniquely good and bad at.
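
If you want to check that count, the list below is exactly the set of ordered partitions of the four steps, which is quick to enumerate; a small sketch that prints 75 (the ordered Bell number for four elements):

    STEPS = ["Execute", "Order", "Validate", "Persist"]

    def ordered_partitions(items):
        # Every way to group `items` into a sequence of non-empty, unordered
        # blocks, where the blocks themselves are ordered relative to each other.
        if not items:
            yield []
            return
        first, rest = items[0], items[1:]
        for partial in ordered_partitions(rest):
            for i in range(len(partial)):          # join an existing block...
                yield partial[:i] + [partial[i] | {first}] + partial[i + 1:]
            for i in range(len(partial) + 1):      # ...or start a new one anywhere
                yield partial[:i] + [{first}] + partial[i:]

    print(len(list(ordered_partitions(STEPS))))    # 75
    # e.g. [{'Execute'}, {'Order', 'Validate'}, {'Persist'}] reads as
    # Execute -> {Order, Validate} -> Persist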

A Big List of Every Possible Ordering
Execute -> Order -> Persist -> Validate
Execute -> Order -> Validate -> Persist
Execute -> Persist -> Order -> Validate
Execute -> Persist -> Validate -> Order
Execute -> Validate -> Order -> Persist
Execute -> Validate -> Persist -> Order
Order -> Execute -> Persist -> Validate
Order -> Execute -> Validate -> Persist
Order -> Persist -> Execute -> Validate
Order -> Persist -> Validate -> Execute
Order -> Validate -> Execute -> Persist
Order -> Validate -> Persist -> Execute
Persist -> Execute -> Order -> Validate
Persist -> Execute -> Validate -> Order
Persist -> Order -> Execute -> Validate
Persist -> Order -> Validate -> Execute
Persist -> Validate -> Execute -> Order
Persist -> Validate -> Order -> Execute
Validate -> Execute -> Order -> Persist
Validate -> Execute -> Persist -> Order
Validate -> Order -> Execute -> Persist
Validate -> Order -> Persist -> Execute
Validate -> Persist -> Execute -> Order
Validate -> Persist -> Order -> Execute

{Execute, Order} -> Persist -> Validate
Execute -> {Persist, Order} -> Validate
Execute -> Order -> {Persist, Validate}
{Execute, Order} -> Validate -> Persist
Execute -> {Order, Validate} -> Persist
{Persist, Execute} -> Order -> Validate
Execute -> Persist -> {Order, Validate}
{Persist, Execute} -> Validate -> Order
Execute -> {Persist, Validate} -> Order
{Execute, Validate} -> Order -> Persist
Execute -> Validate -> {Persist, Order}
{Execute, Validate} -> Persist -> Order
Order -> {Persist, Execute} -> Validate
Order -> Execute -> {Persist, Validate}
Order -> {Execute, Validate} -> Persist
{Persist, Order} -> Execute -> Validate
Order -> Persist -> {Execute, Validate}
{Persist, Order} -> Validate -> Execute
Order -> {Persist, Validate} -> Execute
{Order, Validate} -> Execute -> Persist
Order -> Validate -> {Persist, Execute}
{Order, Validate} -> Persist -> Execute
Persist -> {Execute, Order} -> Validate
Persist -> Execute -> {Order, Validate}
Persist -> {Execute, Validate} -> Order
Persist -> Order -> {Execute, Validate}
Persist -> {Order, Validate} -> Execute
{Persist, Validate} -> Execute -> Order
Persist -> Validate -> {Execute, Order}
{Persist, Validate} -> Order -> Execute
Validate -> {Execute, Order} -> Persist
Validate -> Execute -> {Persist, Order}
Validate -> {Persist, Execute} -> Order
Validate -> Order -> {Persist, Execute}
Validate -> {Persist, Order} -> Execute
Validate -> Persist -> {Execute, Order}

{Execute, Order} -> {Persist, Validate}
{Execute, Validate} -> {Persist, Order}
{Persist, Execute} -> {Order, Validate}
{Persist, Order} -> {Execute, Validate}
{Order, Validate} -> {Persist, Execute}
{Persist, Validate} -> {Execute, Order}

Execute -> {Order, Persist, Validate}
{Order, Persist, Validate} -> Execute
Order -> {Execute, Persist, Validate}
{Execute, Persist, Validate} -> Order
Persist -> {Execute, Order, Validate}
{Execute, Order, Validate} -> Persist
Validate -> {Execute, Order, Persist}
{Execute, Order, Persist} -> Validate

{Execute, Order, Persist, Validate}

See discussion of this page on Reddit, HN, and lobsters.