↤ Blocking System for Entity Resolution (2022) ↑ SIGMOD Programming Contest Archive ↑ Entity Resolution (2020) ↦

Entity Resolution (2021)

Archival Note

The original contest site is still available! If it’s unavailable in the future, it can be found on an Archive.org mirror instead.

No starter code was provided for this programming contest. The provided datasets have been saved into transactionalblog/sigmod-contest-2021.

This contest was organized by the Database Research Group of the Roma Tre University. The winners of this contest were Weibao Fu, Peiqi Yin, and Lan Lu from Southern University of Science and Technology. The leaderboard has the posters and submissions from the five finalists.

Task Details

The task consists of identifying which instances, described by properties (i.e., attributes), represent the same real-world entity.

Participants are asked to solve the task among several datasets of different types (e.g., products, people, etc.) that will be released progressively. Each dataset is made of a list of instances (rows) and a list of properties describing them (columns); we will refer to each of these datasets as \(D_i\).

For each dataset \(D_i\), participants will be provided with the following resources:

\(X_i\) : a subset of the instances in \(D_i\)
\(Y_i\) : matching/non-matching labels for pairs in \(X_i \times X_i\)
\(D_i\) metadata : e.g., how many instances it contains and what its main characteristics are

Note that the \(Y_i\) datasets are transitively closed (i.e., if A matches with B and B matches with C, then A matches with C).
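Transitive closure can be illustrated with a small union-find sketch. This is not part of the contest material: the function name `transitive_closure` and the sample identifiers are hypothetical, and the code only shows how a set of matching pairs can be expanded so that it satisfies the property described above.

```python
from collections import defaultdict
from itertools import combinations


def transitive_closure(pairs):
    """Return every matching pair implied by transitivity, as sorted tuples."""
    parent = {}

    def find(x):
        # Locate the representative of x's cluster, shortening paths as we go.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the clusters of every matching pair.
    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    # Group instances by cluster representative.
    clusters = defaultdict(list)
    for x in list(parent):
        clusters[find(x)].append(x)

    # Every pair inside a cluster is a match.
    closed = set()
    for members in clusters.values():
        closed.update(combinations(sorted(members), 2))
    return closed


# Hypothetical example: A-B and B-C imply A-C.
closure = transitive_closure([("A", "B"), ("B", "C"), ("D", "E")])
print(sorted(closure))
```

Applying the same expansion to a system's predicted matches keeps the predictions consistent with the way the labels are constructed.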

Solutions will be evaluated over \(Z_i = D_i \setminus X_i\). Note that the instances in \(Z_i\) will not be provided to participants. More details are available in the Evaluation Process section.

Both \(X_i\) and \(Y_i\) are in CSV format.

Example of dataset \(X_i\)
instance_id	attr_name_1	attr_name_2	…	attr_name_k
00001	value_1	null	…	value_k
00002	null	value_2	…	value_k
…	…	…	…	…

Example of dataset \(Y_i\)
left_instance_id	right_instance_id	label
00001	00002	1
00001	00003	0
…	…	…
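As a rough sketch of how a \(Y_i\)-style file could be read, the snippet below parses the two concrete rows from the example and separates matching from non-matching pairs. It assumes comma-separated input with the header shown above; `load_labels` and `SAMPLE_Y` are hypothetical names, and identifiers are kept as strings so that leading zeros are not lost.

```python
import csv
import io

# Hypothetical in-memory sample mirroring the example rows above.
SAMPLE_Y = """left_instance_id,right_instance_id,label
00001,00002,1
00001,00003,0
"""


def load_labels(handle):
    """Split labeled pairs into matching and non-matching sets."""
    matches, non_matches = set(), set()
    for row in csv.DictReader(handle):
        pair = (row["left_instance_id"], row["right_instance_id"])
        if row["label"] == "1":
            matches.add(pair)
        else:
            non_matches.add(pair)
    return matches, non_matches


matches, non_matches = load_labels(io.StringIO(SAMPLE_Y))
print(matches, non_matches)
```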

More details about the datasets can be found in the dedicated Datasets section.

Your goal is to find, for each \(X_i\) dataset, all pairs of instances that match (i.e., refer to the same real-world entity). The output must be stored in a CSV file containing only the matching instance pairs found by your system. The CSV file must have two columns: "left_instance_id" and "right_instance_id" and the file must be named "output.csv". The separator must be the comma.

Example of output.csv
left_instance_id	right_instance_id
00001	00002
00001	00004
…	…
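A minimal sketch of producing a file in this format with Python's standard `csv` module follows. The helper name `write_output` is hypothetical; the column names, file name, and comma separator come from the requirements above.

```python
import csv


def write_output(pairs, path="output.csv"):
    """Write matching pairs as a two-column, comma-separated CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)  # csv.writer uses a comma delimiter by default
        writer.writerow(["left_instance_id", "right_instance_id"])
        for left, right in sorted(pairs):
            writer.writerow([left, right])
```

Calling `write_output({("00001", "00002"), ("00001", "00004")})` would create `output.csv` in the current directory. Passing identifiers as strings preserves any leading zeros.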

Datasets

#	Name	Description	Metadata	Download
1	NotebookToy	Sample notebook specifications	128 instances, 16 attributes, 40 entities	Dataset X1, Dataset Y1
2	Notebook	Notebook specifications	538 instances, 14 attributes, 100 entities	Dataset X2, Dataset Y2
3	NotebookLarge	Notebook specifications	605 instances, 14 attributes, 158 entities	Dataset X3, Dataset Y3
4	Altosight	Product specifications (kindly provided by Altosight)	1356 instances, 5 attributes, 193 entities	Dataset X4, Dataset Y4

You can also download these datasets together with Snowman.

Snowman helps you to compare and evaluate your data matching solutions. You can upload experiment results from your data matching solution and then easily compare them with a gold standard, compare two experiment runs with each other, or calculate binary metrics like precision or recall. Snowman is developed as part of a bachelor’s project at the Hasso Plattner Institute, Potsdam, in collaboration with SAP SE.

You can download the latest release, which already includes the datasets provided for the contest.

Evaluation Process

For each dataset \(D_i\), we will compute the resulting F-measure with respect to \(Z_i \times Z_i\). Submitted solutions will be ranked by average F-measure over all datasets.
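The sketch below shows one way to compute such a score locally on labeled data. It assumes that "F-measure" means the usual F1 score (the harmonic mean of precision and recall) and that pairs are unordered; the contest text states neither explicitly. The function names and sample pairs are hypothetical.

```python
def normalize(pairs):
    """Treat (a, b) and (b, a) as the same unordered pair."""
    return {tuple(sorted(p)) for p in pairs}


def f_measure(predicted, gold):
    """F1 score of predicted matching pairs against gold matching pairs."""
    predicted, gold = normalize(predicted), normalize(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def average_f_measure(results):
    """Mean F1 over a list of (predicted, gold) tuples, one per dataset."""
    scores = [f_measure(predicted, gold) for predicted, gold in results]
    return sum(scores) / len(scores)


# Hypothetical example: two of three predictions are correct,
# and two of three gold pairs are found.
gold = [("00001", "00002"), ("00001", "00004"), ("00002", "00004")]
predicted = [("00002", "00001"), ("00001", "00004"), ("00003", "00005")]
score = f_measure(predicted, gold)
print(round(score, 3))
```

Here precision and recall are both 2/3, so the F1 score is also 2/3.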
