Approximate K-nearest-neighbor Graph Construction (2023)

Archival Note

The original contest site is still available! If it’s unavailable in the future, it can be found on an Archive.org mirror instead.

The provided code for this contest is available at transactionalblog/sigmod-contest-2023GitHub. The exact provided code is preserved as commit d8ee7712. The main branch contains changes make to fix build issues, improve the build system, update instructions, etc. Links to code in the copied text below have been changed to point to the GitHub repo.

This contest was organized by the DB Group at Rutgers University and the Chu Data Lab at Georgia Tech. The winner of this contest was Jiayi Wang from Tsinghua University. The leaderboard has the posters and submission from the five finalists.


Task Details

The task is to build an approximate K-NN Graph for a set of vectors. i.e., for each vector, find its approximate k nearest neighbors in a limited time. For this year’s task, k is set to be 100.

A sample dataset will be provided. It contains millions of high-dimensional vectors. We will refer to it as D.

The data in Dataset D is in a binary format that starts with a 4-byte integer, num_vectors(uint32_t), followed by num_vectors x 100 (num_dimension) x sizeof(float32) bytes of data stored one vector after another.

We provide both the reading function(ReadBin) to load the dataset and the writing function(SaveKNNG) to generate the output file in the io.h file.

Your goal is to design an efficient algorithm for approximate K-NN Graph construction. For each vector, your output must contain k consecutive ids of the nearest neighbors determined by your algorithm. These neighbor lists are stored one by one and stored in a binary file.

During evaluation, we will replace the sample dataset D with a hidden test dataset. The hidden test dataset is randomly drew from the same dataset where D was sampled from. We will evaluate your algorithms using the hidden test dataset.

output.bin format: The evaluation process expects "output.bin" to be a binary file containing |S| x 100 x id(uint32_t). |S| is the number of random sample vectors in dataset S, 100 is the number of nearest neighbors and id is the index of 100-nearest neighbors in the given dataset S.

Please format "output.bin" accordingly. You can check out our provided baseline solution on how to produce a valid "output.bin".

Datasets

All of our datasets (released and evaluation dataset) are sampled from the same billion-scale vector dataset, which consists of Bing queries encoded by Turing AGI v5 that trains Transformers to capture similarity of intent in web search queries.

#

Name

Description

Size

1

dummy-data.bin

dummy data for packing submission in reprozip

10^4

2

contest-data-release-1m.bin

medium scale released data

10^6

3

contest-data-release-10m.bin

large scale released data

10^7

4

secret-1m.bin

medium scale data, used for evaluation before March 10

10^6

5

secret-10m.bin

large scale data, used for evaluation after March 10

10^7