Distributed Query Engine (2010)

Archival Note

The original contest site is still available! If it’s unavailable in the future, it can be found on an Archive.org mirror instead.

The provided code for this contest is available at transactionalblog/sigmod-contest-2010 on GitHub. The exact provided code is preserved as commit 7d3bbde8. The main branch contains changes made to fix build issues, improve the build system, update instructions, etc. Links to code in the copied text below have been changed to point to the GitHub repo.

This contest was organized by Pierre Senellart of Télécom ParisTech, along with 2009’s winner Clément Genzmer. The authors followed up with an overview report on the contest and approaches taken by the finalist teams.


Task Details

Given a parsed SQL query, you have to return the right results as fast as possible. The data is stored on disk; the indexes are all in memory. The SQL queries always have the following form:

SELECT alias_name.field_name, ...
FROM table_name AS alias_name, ...
WHERE condition1 AND ... AND conditionN
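
To make the query shape concrete, here is a minimal sketch that assembles a query string of that form from its three lists. The names `render_query` and `append_list` are hypothetical and are not part of the contest's client.h; conditions are treated as opaque strings, so the sketch assumes nothing about which kinds of conditions are allowed.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not from client.h): appends `prefix`, then the `n`
   items separated by `sep`, to the NUL-terminated string in `out`.
   Returns 0 on success, -1 if `out` (capacity `cap`) is too small. */
static int append_list(char *out, size_t cap, const char *prefix,
                       const char *const *items, size_t n, const char *sep)
{
    for (size_t i = 0; i < n; i++) {
        size_t used = strlen(out);
        int w = snprintf(out + used, cap - used, "%s%s",
                         i == 0 ? prefix : sep, items[i]);
        if (w < 0 || (size_t)w >= cap - used)
            return -1; /* encoding error or output truncated */
    }
    return 0;
}

/* Hypothetical: renders
     SELECT f1, f2, ...
     FROM t1 AS a1, ...
     WHERE c1 AND ... AND cN
   into `out`. The WHERE line is omitted when `nconds` is 0. */
int render_query(char *out, size_t cap,
                 const char *const *fields, size_t nfields,
                 const char *const *tables, size_t ntables,
                 const char *const *conds, size_t nconds)
{
    assert(out != NULL);
    if (cap == 0 || nfields == 0 || ntables == 0)
        return -1;
    out[0] = '\0';
    if (append_list(out, cap, "SELECT ", fields, nfields, ", ") != 0)
        return -1;
    if (append_list(out, cap, "\nFROM ", tables, ntables, ", ") != 0)
        return -1;
    if (append_list(out, cap, "\nWHERE ", conds, nconds, " AND ") != 0)
        return -1;
    return 0;
}
```

Each entry in `fields` is expected to be written as `alias_name.field_name` and each entry in `tables` as `table_name AS alias_name`. The contest hands you an already-parsed query, so a real client would consume such a structure rather than build the text; the sketch only shows how the three clauses fit together.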

A condition may be either:

The data is distributed on multiple nodes, and can be replicated. The distribution of data is horizontal: a given row of a table is never fragmented. The implementation of the indexes is provided and cannot be changed. Up to 50 queries are sent at the same time by 50 different threads, but only the total amount of time is measured. You do not have to take care of the partitioning, replication or creation of the indexes: these are done before the beginning of the benchmark of your client.
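
As an illustration of horizontal distribution with replication, the sketch below maps a whole row to one or more nodes. The placement rule (key modulo node count, then consecutive nodes for the replicas) and the name `nodes_for_row` are assumptions made for this example; the contest text does not say how rows are actually assigned, only that a row is never split and may exist on several nodes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical placement rule (an assumption, not the contest's scheme):
   the entire row identified by `row_key` is stored on `replicas`
   consecutive nodes, starting at row_key % nodes. Writes the node ids to
   `out` (capacity `out_cap`) and returns how many were written. */
size_t nodes_for_row(uint64_t row_key, size_t nodes, size_t replicas,
                     size_t *out, size_t out_cap)
{
    assert(nodes > 0 && out != NULL);
    if (replicas > nodes)
        replicas = nodes;   /* at most one copy per node */
    if (replicas > out_cap)
        replicas = out_cap; /* never overrun the caller's buffer */
    size_t first = (size_t)(row_key % nodes);
    for (size_t i = 0; i < replicas; i++)
        out[i] = (first + i) % nodes; /* wrap around the node list */
    return replicas;
}
```

Because the function returns node ids for a whole row, no field of a row ever lands on a different node from the others, which is the property the task description guarantees. A client that knows which nodes hold a row's replicas can choose any one of them to serve a lookup.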

Before the actual start of the benchmarks, you are given a predefined number of seconds to run some preprocessing on the data. You are also given a set of queries which is representative of the benchmark to help you run the preprocessing.

There are 7 methods to implement. They are fully described in the client.h file. The following diagrams show the way they are called. [1]

[1]: It’s on my TODO list to rename the code and diagrams away from master/slave as the terminology.

Initialization Phase
[Diagram: phase1]
Connection Phase
[Diagram: phase2]
Query Phase
[Diagram: phase3]
Closing Phase
[Diagram: phase4]

For more details, see the README file inside the GitHub repo.