Streaming N-Gram Filter (2017)

Archival Note

The original contest site is no longer accessible. There is an mirror available, from which the text on the rest of this page was copied.

The provided code for this contest was not saved by, however, a replacement is available at transactionalblog/sigmod-contest-2017GitHub. The repo is a fork of mkaufmann/sig17starterpackGitHub, the "Unofficial C++ Starter Pack for the ACM Sigmod Programming Contest 2017". (Thanks, Moritz Kaufmann!)

This contest was organized by the MaDgIK Lab of the National and Kapodistrian University of Athens and the "Athena" Research Center. The winner of this contest was Jan Böttcher, Timo Kersten, Moritz Kaufmann, and Andreas Kipf from TU Munich.

One of the finalist posters for gStreamPKU for submission Caesar11/SIGMOD-Programming-Contest-2017GitHub

Task Details

N-grams of words are often used in natural language processing and information extraction tasks. An N-gram is a contiguous sequence of N words. So for example, if we have the phrase "the book is on the table" and we want to extract all N-grams with N=3 then the N-grams would be:

In this contest, the task is to search documents and return strings from a given set, as quickly as possible. Each string represents an N-gram. We will provide an initial set of strings which you may process and index. Once this is done, we will begin issuing a workload consisting of a series of queries (documents) and N-gram updates (insertions or deletions), arbitrarily interleaved. For each N-gram insertion or deletion, the list of N-grams of interest is updated accordingly. For each new query (document) arriving, the task is to return as fast as possible the N-grams of the currently up-to-date list that are found in the document. These should be presented in order of their first appearance in the document. If one N-gram is a prefix of another and the larger one is in the document, then the shorter one is presented first.

Input to your program will be provided on the standard input, and the output must appear on the standard output.

Testing Protocol

Our test harness will first feed the initial set of N-grams to your program’s standard input. Your program will receive multiple lines where each one contains a string which represents an N-gram. The initial set ends with a line containing the character 'S'.

After sending the initial set of strings, our test harness will monitor your program’s standard output for a line containing the character 'R' (case insensitive, followed by the new line character '\n'). Your program uses this line to signal that it is done ingesting the initial set of N-grams, has performed any processing and/or indexing on it and is now ready to receive the workload.

The test harness delivers the workload in batches. Each batch consists of a sequence of operations provided one per line followed by a line containing the single character 'F' that signals the end of the batch. Each operation is represented by one character ('Q', 'A' or 'D') that defines the operation type, followed by a space and either an N-gram or a document.

The three operation types are as follows:

After the end of every batch, our test harness will wait for output from your program before providing the next batch. You need to provide as many lines of output as the number of the query ('Q') operations in the batch - each line containing the ordered N-grams that have matched separated by '|'. Your program is free to process operations in a batch concurrently. However, the query results in the output must reflect the order of the queries within the batch.

After the last batch, the standard input stream of your program will be closed by our test harness.

Your solution will be evaluated for correctness and execution time. Execution time measurement does not start until your program signals (with 'R') that it is finished ingesting the initial set of strings. Thus, you are free to pre-process or index the N-grams as you see fit without penalty, as long as your program runs within the overall testing time limit. Concurrent request execution within each batch is allowed and encouraged, as long as the results mimic a sequential execution of the operations within the batch. In particular, the result for each query must reflect all additions and deletions that precede it in the workload sequence, and must not reflect any additions and deletions that follow it.