Blocking System for Entity Resolution (2022)

Archival Note

The original contest site is still available! If it’s unavailable in the future, it can be found on an Archive.org mirror instead.

The provided code for this contest is available at transactionalblog/sigmod-contest-2022GitHub. The exact provided code is preserved as commit b5da3bbd. The main branch contains changes make to fix build issues, improve the build system, update instructions, etc. Links to code in the copied text below have been changed to point to the GitHub repo.

This contest was organized by the Chu Data Lab at the Georgia Tech and the DBGroup at the University of Modena and Reggio Emilia. The winner of this contest was Alexander Brinkmann and Ralph Peeters from University of Mannheim. The leaderboard has the posters and submission from the five finalists.


Task Description

The task is to perform blocking for Entity Resolution, i.e., quickly filter out non-matches (tuple pairs that are unlikely to represent the same real-world entity) in a limited time to generate a small candidate set that contains a limited number of tuple pairs for matching.

Participants are asked to solve the task on two product datasets. Each dataset is made of a list of instances (rows) and a list of properties describing them (columns). We will refer to each of these datasets as \(D_i\).

For each dataset \(D_i\), participants will be provided with the following resources:

Note that matching pairs in \(Y_i\) are transitively closed (i.e., if A matches with B and B matches with C, then A matches with C). For a matching pair id1 and id2 with id1 < id2, \(Y_i\) only includes (id1, id2) and doesn’t include (id2, id1).

Your goal is to write a program that generates, for each \(X_i\) dataset, a candidate set of tuple pairs for matching \(X_i\) with \(X_i\). The output must be stored in a CSV file containing the ids of tuple pairs in the candidate set. The CSV file must have two columns: "left_instance_id" and "right_instance_id" and the output file must be named "output.csv". The separator must be the comma. Note that we do not consider the trivial equi-joins (tuple pairs with left_instance_id = right_instance_id) as true matches. For a pair id1 and id2 (assume id1 < id2), please only include (id1, id2) and don’t include (id2, id1) in your "output.csv".

Solutions will be evaluated over the complete dataset \(D_i\). Note that the instances in \(D_i\) (except the sample \(X_i\)) will not be provided to participants.

Both \(X_i\) and \(Y_i\) are in CSV format.

Example of dataset \(X_i\)

instance_id

attr_name_1

attr_name_2

…​

attr_name_k

00001

value_1

null

…​

value_k

00002

null

value_2

…​

value_k

…​

…​

…​

…​

…​

Example of dataset \(Y_i\)

left_instance_id

right_instance_id

label

00001

00002

1

00001

00003

0

…​

…​

…​

More details about the datasets can be found in the dedicated Datasets section.

Example of output.csv

left_instance_id

right_instance_id

00001

00002

00001

00004

…​

…​

Output.csv format: The evaluation process expects "output.csv" to have 3000000 tuple pairs. The first 1000000 tuple pairs are for dataset X1 and the remaining pairs are for datasets X2. Please format "output.csv" accordingly. You can check out our provided baseline solution on how to produce a valid "output.csv".

Datasets

#

Name

Description

Number of rows

Blocking Requirements

Download Sample

1

Notebook

Notebook specifications

About 1000000

Candidate Set Size = 1000000

2

Altosight

Product specifications
Kindly provided by Altosight

About 1000000

Candidate Set Size = 2000000

Evaluation Process

Evaluation Metrics: For each dataset Di, we will compute resulting recall score as follows:

\[Recall = \frac{Number\ of\ true\ matches\ retained\ in\ the\ candidate\ set}{Total\ number\ of\ true\ matches\ in\ ground\ truth}\]

Note that the trivial equi-joins (tuple pairs with left_instance_id = right_instance_id) are not considered as true matches. Submitted solutions will be ranked on average Recall over all datasets. Ties will be broken with running time.