uint64_t numTuples|uint64_t numColumns|uint64_t T0C0|uint64_t T1C0|..|uint64_t TnC0|uint64_t T0C1|..|uint64_t TnC1|..|uint64_t TnCm
Join Processing (2018)
Archival Note
The original contest site is still available! If it’s unavailable in the future, it can be found on an Archive.org mirror instead.
The provided code for this contest is available at transactionalblog/sigmod-contest-2018. The exact provided code is preserved as commit e825ec2b. The main
branch contains changes make to fix build issues, improve the build system, update instructions, etc. Links to code in the copied text below have been changed to point to the GitHub repo.
This contest was organized by Technische Universität München’s chair for Database Systems. The winner of this contest was Jianqiao Zhu, Zuyu Zhang, and Dylan Bacon from University of Wisconsin-Madison. The leaderboard has the posters and submission from the five finalists.
Task Details
The task is to evaluate batches of join queries on a set of pre-defined relations. Each join query specifies a set of relations, (equality) join predicates, and selections (aggregations). The challenge is to execute the queries as fast as possible without (much) prior indexing.
Input to your program will be provided on the standard input, and the output must appear on the standard output.
Testing Protocol
Our test harness will first feed the set of relations to your program’s standard input. That means, your program will receive multiple lines (separated by the new line character '\n') where each one contains a string which represents the file name of the given relation. The relation files are already in a binary format and thus do not require parsing. Our quick-start package already contains sample code that mmaps() a relations into main memory. The binary format of a relation consists of a header and a data section. The header contains the number of tuples and the number of columns. The data section follows the header and stores all tuples using a column storage. Hence, all of the values of a column are stored sequentially, followed by the values of the next column, and so on. The overall binary format is as follows (T0C0 stands for tuple 0 of column 0; pipe symbol '|' is not part of the binary format):
After sending the set of relations, our test harness will send a line containing the string "Done".
Next, our test harness will wait for 1s until it starts sending queries. This gives you time to prepare for the workload, e.g., sampling of the relations. The test harness sends the workload in batches: A workload batch contains a set of join queries (each line represents a query). A join query consists of three consecutive parts (separated by the pipe symbol '|'):
-
Relations: A list of relations that will be joined. We will pass the ids of the relation here separated by spaces (' '). The relation ids are implicitly mapped to the relations by the order the relations were passed in the first phase. For instance, id 0 refers to the first relation.
-
Predicates: Each predicate is separated by a '&'. We have two types of predicates: filter predicates and join predicates. Filter predicates are of the form: filter column + comparison type (greater '>' less '<' equal '=') + integer constant. Join predicates specify on which columns the relations should be joined. A join predicate is composed out of two relation-column pairs connected with an equality ('=') operator. Here, a relation is identified by its offset in the list of relations to be joined (i.e., we implicitly bind the first relation of a join query to the identifier 0, the second one to 1, etc.).
-
Projections: A list of columns that are needed to compute the final check sum that we use to verify that the join was done properly. Similar to the join predicates, columns are denoted as relation-column pairs. Each selection is delimited by a space character (' ').
Example: "0 2 4|0.1=1.2&1.0=2.1&0.1>3000|0.0 1.1"
Translated to SQL:
SELECT SUM("0".c0), SUM("1".c1) FROM r0 "0", r2 "1", r4 "2" WHERE 0.c1=1.c2 and 1.c0=2.c1 and 0.c1>3000
The end of a batch is indicated by a line containing the character 'F'. Our test harness will then wait for the results to be written to your program’s standard output. For each join query, your program is required to output a line containing the check sums of the individual projections separated by spaces (e.g., "42 4711"). If there is no qualifying tuple, each check sum should return "NULL" like in SQL. Once the results have been received, we will start delivering the next workload batch.
For your check sums, you do not have to worry about numeric overflows as long as you are using 64 bit unsigned integers.
Your solution will be evaluated for correctness and execution time. Execution time measurement starts immediately after the 1s waiting period. You are free to fully utilize the waiting period for any kind of pre-processing.
Task Hints and Constraints
-
All join graphs are connected. No cross products!
-
Cyclic queries and self joins are possible
-
The maximum number of relations per query is 4
Update (03/14/18): We have created a new (larger) public dataset for local testing. You can download it here.
Quick Start Package
We provide a quick start package in C++. It is in the format required for submission. It includes a query parser and a relation loader. This project is only meant to give you a quick start into the project and to dig right into the fun (coding) part. It is not required to use the provided code. You can create a submittable submission.tar.gz file using the included package.sh script.
For testing, we provide CSV versions (.tbl files) of each relation + SQL files to load the relations into a DBMS. We also provide a file (small.work.sql) that contains SQL versions of all queries in small.work.
Update (03/04/18): We have added a C++ reference solution to the quickstart package. It implements a basic query execution model featuring full materialization. It does not implement any query optimization. It only uses standard STL containers (like unordered_map) for the join processing. Its query processing capabilities are limited to the demands of this contest. DISCLAIMER: Although we have tested the package intensively, we cannot guarantee that it is free of bugs. Thus, we test your submissions against the results computed by a real DBMS.