↤ Approximate K-nearest-neighbor Graph Construction (2023) ↑ SIGMOD Programming Contest Archive ↑ Entity Resolution (2021) ↦

Blocking System for Entity Resolution (2022)

Archival Note

The original contest site is still available! If it’s unavailable in the future, it can be found on an Archive.org mirror instead.

The provided code for this contest is available at transactionalblog/sigmod-contest-2022. The exact provided code is preserved as commit b5da3bbd. The main branch contains changes make to fix build issues, improve the build system, update instructions, etc. Links to code in the copied text below have been changed to point to the GitHub repo.

This contest was organized by the Chu Data Lab at the Georgia Tech and the DBGroup at the University of Modena and Reggio Emilia. The winner of this contest was Alexander Brinkmann and Ralph Peeters from University of Mannheim. The leaderboard has the posters and submission from the five finalists.

Task Description

The task is to perform blocking for Entity Resolution, i.e., quickly filter out non-matches (tuple pairs that are unlikely to represent the same real-world entity) in a limited time to generate a small candidate set that contains a limited number of tuple pairs for matching.

Participants are asked to solve the task on two product datasets. Each dataset is made of a list of instances (rows) and a list of properties describing them (columns). We will refer to each of these datasets as \(D_i\).

For each dataset \(D_i\), participants will be provided with the following resources:

\(X_i\): a subset of the instances in \(D_i\)
\(Y_i\): matching pairs in \(X_i\) x \(X_i\). (The pairs not in \(Y_i\) are non-matching pairs.)
Blocking Requirements: the size of the generated candidate set (i.e., the number of tuple pairs in the candidate set)

Note that matching pairs in \(Y_i\) are transitively closed (i.e., if A matches with B and B matches with C, then A matches with C). For a matching pair id1 and id2 with id1 < id2, \(Y_i\) only includes (id1, id2) and doesn’t include (id2, id1).

Your goal is to write a program that generates, for each \(X_i\) dataset, a candidate set of tuple pairs for matching \(X_i\) with \(X_i\). The output must be stored in a CSV file containing the ids of tuple pairs in the candidate set. The CSV file must have two columns: "left_instance_id" and "right_instance_id" and the output file must be named "output.csv". The separator must be the comma. Note that we do not consider the trivial equi-joins (tuple pairs with left_instance_id = right_instance_id) as true matches. For a pair id1 and id2 (assume id1 < id2), please only include (id1, id2) and don’t include (id2, id1) in your "output.csv".

Solutions will be evaluated over the complete dataset \(D_i\). Note that the instances in \(D_i\) (except the sample \(X_i\)) will not be provided to participants.

Both \(X_i\) and \(Y_i\) are in CSV format.

Example of dataset \(X_i\)
instance_id	attr_name_1	attr_name_2	…	attr_name_k
00001	value_1	null	…	value_k
00002	null	value_2	…	value_k
…	…	…	…	…

Example of dataset \(Y_i\)
left_instance_id	right_instance_id	label
00001	00002	1
00001	00003	0
…	…	…

More details about the datasets can be found in the dedicated Datasets section.

Example of output.csv
left_instance_id	right_instance_id
00001	00002
00001	00004
…	…

Output.csv format: The evaluation process expects "output.csv" to have 3000000 tuple pairs. The first 1000000 tuple pairs are for dataset X1 and the remaining pairs are for datasets X2. Please format "output.csv" accordingly. You can check out our provided baseline solution on how to produce a valid "output.csv".

Datasets

#	Name	Description	Number of rows	Blocking Requirements	Download Sample
1	Notebook	Notebook specifications	About 1000000	Candidate Set Size = 1000000	Dataset X1 Dataset Y1
2	Altosight	Product specifications Kindly provided by Altosight	About 1000000	Candidate Set Size = 2000000	Dataset X2 Dataset Y2

Name

Description

Number of rows

Blocking Requirements

Download Sample

Notebook

Notebook specifications

About 1000000

Candidate Set Size = 1000000

Dataset X1
Dataset Y1

Altosight

Product specifications
Kindly provided by Altosight

About 1000000

Candidate Set Size = 2000000

Dataset X2
Dataset Y2

Evaluation Process

Evaluation Metrics: For each dataset Di, we will compute resulting recall score as follows:

\[Recall = \frac{Number\ of\ true\ matches\ retained\ in\ the\ candidate\ set}{Total\ number\ of\ true\ matches\ in\ ground\ truth}\]

Note that the trivial equi-joins (tuple pairs with left_instance_id = right_instance_id) are not considered as true matches. Submitted solutions will be ranked on average Recall over all datasets. Ties will be broken with running time.

↤ Approximate K-nearest-neighbor Graph Construction (2023) ↑ SIGMOD Programming Contest Archive ↑ Entity Resolution (2021) ↦