Blocking System for Entity Resolution (2022)
Archival Note
The original contest site is still available! If it’s unavailable in the future, it can be found on an Archive.org mirror instead.
The provided code for this contest is available at transactionalblog/sigmod-contest-2022. The exact provided code is preserved as commit b5da3bbd. The main
branch contains changes make to fix build issues, improve the build system, update instructions, etc. Links to code in the copied text below have been changed to point to the GitHub repo.
This contest was organized by the Chu Data Lab at the Georgia Tech and the DBGroup at the University of Modena and Reggio Emilia. The winner of this contest was Alexander Brinkmann and Ralph Peeters from University of Mannheim. The leaderboard has the posters and submission from the five finalists.
Task Description
The task is to perform blocking for Entity Resolution, i.e., quickly filter out non-matches (tuple pairs that are unlikely to represent the same real-world entity) in a limited time to generate a small candidate set that contains a limited number of tuple pairs for matching.
Participants are asked to solve the task on two product datasets. Each dataset is made of a list of instances (rows) and a list of properties describing them (columns). We will refer to each of these datasets as \(D_i\).
For each dataset \(D_i\), participants will be provided with the following resources:
-
\(X_i\): a subset of the instances in \(D_i\)
-
\(Y_i\): matching pairs in \(X_i\) x \(X_i\). (The pairs not in \(Y_i\) are non-matching pairs.)
-
Blocking Requirements: the size of the generated candidate set (i.e., the number of tuple pairs in the candidate set)
Note that matching pairs in \(Y_i\) are transitively closed (i.e., if A matches with B and B matches with C, then A matches with C). For a matching pair id1 and id2 with id1 < id2, \(Y_i\) only includes (id1, id2) and doesn’t include (id2, id1).
Your goal is to write a program that generates, for each \(X_i\) dataset, a candidate set of tuple pairs for matching \(X_i\) with \(X_i\). The output must be stored in a CSV file containing the ids of tuple pairs in the candidate set. The CSV file must have two columns: "left_instance_id" and "right_instance_id" and the output file must be named "output.csv". The separator must be the comma. Note that we do not consider the trivial equi-joins (tuple pairs with left_instance_id = right_instance_id) as true matches. For a pair id1 and id2 (assume id1 < id2), please only include (id1, id2) and don’t include (id2, id1) in your "output.csv".
Solutions will be evaluated over the complete dataset \(D_i\). Note that the instances in \(D_i\) (except the sample \(X_i\)) will not be provided to participants.
Both \(X_i\) and \(Y_i\) are in CSV format.
instance_id |
attr_name_1 |
attr_name_2 |
… |
attr_name_k |
---|---|---|---|---|
00001 |
value_1 |
null |
… |
value_k |
00002 |
null |
value_2 |
… |
value_k |
… |
… |
… |
… |
… |
left_instance_id |
right_instance_id |
label |
---|---|---|
00001 |
00002 |
1 |
00001 |
00003 |
0 |
… |
… |
… |
More details about the datasets can be found in the dedicated Datasets section.
left_instance_id |
right_instance_id |
---|---|
00001 |
00002 |
00001 |
00004 |
… |
… |
Output.csv format: The evaluation process expects "output.csv" to have 3000000 tuple pairs. The first 1000000 tuple pairs are for dataset X1 and the remaining pairs are for datasets X2. Please format "output.csv" accordingly. You can check out our provided baseline solution on how to produce a valid "output.csv".
Datasets
# |
Name |
Description |
Number of rows |
Blocking Requirements |
Download Sample |
---|---|---|---|---|---|
1 |
Notebook |
Notebook specifications |
About 1000000 |
Candidate Set Size = 1000000 |
|
2 |
Altosight |
Product specifications |
About 1000000 |
Candidate Set Size = 2000000 |
Evaluation Process
Evaluation Metrics: For each dataset Di, we will compute resulting recall score as follows:
Note that the trivial equi-joins (tuple pairs with left_instance_id
= right_instance_id
) are not considered as true matches. Submitted solutions will be ranked on average Recall over all datasets. Ties will be broken with running time.