Join Processing (2018)

Archival Note

The original contest site is still available! If it’s unavailable in the future, it can be found on an Archive.org mirror instead.

The provided code for this contest is available at transactionalblog/sigmod-contest-2018GitHub. The exact provided code is preserved as commit e825ec2b. The main branch contains changes make to fix build issues, improve the build system, update instructions, etc. Links to code in the copied text below have been changed to point to the GitHub repo.

This contest was organized by Technische Universität München’s chair for Database Systems. The winner of this contest was Jianqiao Zhu, Zuyu Zhang, and Dylan Bacon from University of Wisconsin-Madison. The leaderboard has the posters and submission from the five finalists.


Task Details

The task is to evaluate batches of join queries on a set of pre-defined relations. Each join query specifies a set of relations, (equality) join predicates, and selections (aggregations). The challenge is to execute the queries as fast as possible without (much) prior indexing.

Input to your program will be provided on the standard input, and the output must appear on the standard output.

Testing Protocol

Our test harness will first feed the set of relations to your program’s standard input. That means, your program will receive multiple lines (separated by the new line character '\n') where each one contains a string which represents the file name of the given relation. The relation files are already in a binary format and thus do not require parsing. Our quick-start package already contains sample code that mmaps() a relations into main memory. The binary format of a relation consists of a header and a data section. The header contains the number of tuples and the number of columns. The data section follows the header and stores all tuples using a column storage. Hence, all of the values of a column are stored sequentially, followed by the values of the next column, and so on. The overall binary format is as follows (T0C0 stands for tuple 0 of column 0; pipe symbol '|' is not part of the binary format):

uint64_t numTuples|uint64_t numColumns|uint64_t T0C0|uint64_t T1C0|..|uint64_t TnC0|uint64_t T0C1|..|uint64_t TnC1|..|uint64_t TnCm

After sending the set of relations, our test harness will send a line containing the string "Done".

Next, our test harness will wait for 1s until it starts sending queries. This gives you time to prepare for the workload, e.g., sampling of the relations. The test harness sends the workload in batches: A workload batch contains a set of join queries (each line represents a query). A join query consists of three consecutive parts (separated by the pipe symbol '|'):

Example: "0 2 4|0.1=1.2&1.0=2.1&0.1>3000|0.0 1.1"

Translated to SQL:

SELECT SUM("0".c0), SUM("1".c1)
FROM r0 "0", r2 "1", r4 "2"
WHERE 0.c1=1.c2 and 1.c0=2.c1 and 0.c1>3000

The end of a batch is indicated by a line containing the character 'F'. Our test harness will then wait for the results to be written to your program’s standard output. For each join query, your program is required to output a line containing the check sums of the individual projections separated by spaces (e.g., "42 4711"). If there is no qualifying tuple, each check sum should return "NULL" like in SQL. Once the results have been received, we will start delivering the next workload batch.

For your check sums, you do not have to worry about numeric overflows as long as you are using 64 bit unsigned integers.

Your solution will be evaluated for correctness and execution time. Execution time measurement starts immediately after the 1s waiting period. You are free to fully utilize the waiting period for any kind of pre-processing.

Task Hints and Constraints

Update (03/14/18): We have created a new (larger) public dataset for local testing. You can download it here.

Quick Start Package

We provide a quick start package in C++. It is in the format required for submission. It includes a query parser and a relation loader. This project is only meant to give you a quick start into the project and to dig right into the fun (coding) part. It is not required to use the provided code. You can create a submittable submission.tar.gz file using the included package.sh script.

For testing, we provide CSV versions (.tbl files) of each relation + SQL files to load the relations into a DBMS. We also provide a file (small.work.sql) that contains SQL versions of all queries in small.work.

Update (03/04/18): We have added a C++ reference solution to the quickstart package. It implements a basic query execution model featuring full materialization. It does not implement any query optimization. It only uses standard STL containers (like unordered_map) for the join processing. Its query processing capabilities are limited to the demands of this contest. DISCLAIMER: Although we have tested the package intensively, we cannot guarantee that it is free of bugs. Thus, we test your submissions against the results computed by a real DBMS.