Entity Resolution (2021)

Archival Note

The original contest site is still available! If it’s unavailable in the future, it can be found on an Archive.org mirror instead.

No starter code was provided for this programming contest. The provided datasets have been saved to the transactionalblog/sigmod-contest-2021 repository on GitHub.

This contest was organized by the Database Research Group of Roma Tre University. The winners of this contest were Weibao Fu, Peiqi Yin, and Lan Lu from Southern University of Science and Technology. The leaderboard has the posters and submissions from the five finalists.


Task Details

The task consists of identifying which instances, described by properties (i.e., attributes), represent the same real-world entity.

Participants are asked to solve the task on several datasets of different types (e.g., products, people) that will be released progressively. Each dataset consists of a list of instances (rows) and a list of properties describing them (columns); we will refer to each of these datasets as \(D_i\).

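For each dataset \(D_i\), participants will be provided with the following resources:

\(X_i\), a subset of the instances in \(D_i\)

\(Y_i\), a set of labelled pairs of instances from \(X_i\)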

Note that the \(Y_i\) datasets are transitively closed (i.e., if A matches with B and B matches with C, then A matches with C).
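As an illustration of transitive closure over matching pairs, here is a minimal sketch using a union-find structure. The function name `transitive_closure` and the sample identifiers are hypothetical and not part of the contest material; the sketch only shows how a set of matches can be expanded so that every pair implied by transitivity is present.

```python
from itertools import combinations


def transitive_closure(pairs):
    """Return every pair implied by transitivity, as sorted (left, right) tuples."""
    parent = {}

    def find(x):
        # Register unseen ids as their own root, then walk up to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    # Merge the clusters of every matching pair.
    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    # Group instance ids by the root of their cluster.
    clusters = {}
    for x in list(parent):
        clusters.setdefault(find(x), []).append(x)

    # Emit all pairs inside each cluster.
    closed = set()
    for members in clusters.values():
        closed.update(combinations(sorted(members), 2))
    return closed


# Hypothetical input: A matches B, and B matches C.
matches = [("00001", "00002"), ("00002", "00005")]
closure = transitive_closure(matches)
```

With the hypothetical input above, the closure also contains the implied pair between 00001 and 00005. Instance ids are kept as strings so that leading zeros are preserved.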

Solutions will be evaluated over \(Z_i = D_i \setminus X_i\). Note that the instances in \(Z_i\) will not be provided to participants. More details are available in the Evaluation Process section.

Both \(X_i\) and \(Y_i\) are in CSV format.

Example of dataset \(X_i\)

| instance_id | attr_name_1 | attr_name_2 | … | attr_name_k |
|---|---|---|---|---|
| 00001 | value_1 | null | … | value_k |
| 00002 | null | value_2 | … | value_k |
| … | … | … | … | … |

Example of dataset \(Y_i\)

| left_instance_id | right_instance_id | label |
|---|---|---|
| 00001 | 00002 | 1 |
| 00001 | 00003 | 0 |
| … | … | … |

More details about the datasets can be found in the dedicated Datasets section.
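For readers who want to load a \(Y_i\) file, here is a minimal sketch using Python's standard `csv` module. It assumes that label 1 marks a matching pair and label 0 a non-matching pair, which is inferred from the example rather than stated explicitly; the function name `read_labelled_pairs` and the in-memory sample are hypothetical.

```python
import csv
import io


def read_labelled_pairs(f):
    """Split the rows of a Y_i-style CSV into matching and non-matching pairs."""
    matches, non_matches = set(), set()
    for row in csv.DictReader(f):
        # Keep ids as strings so leading zeros such as "00001" survive.
        pair = (row["left_instance_id"], row["right_instance_id"])
        # Assumption: label "1" means match, anything else means non-match.
        if row["label"].strip() == "1":
            matches.add(pair)
        else:
            non_matches.add(pair)
    return matches, non_matches


# In-memory stand-in for a Y_i file, mirroring the example rows.
sample = io.StringIO(
    "left_instance_id,right_instance_id,label\n"
    "00001,00002,1\n"
    "00001,00003,0\n"
)
matches, non_matches = read_labelled_pairs(sample)
```

The same function works on a real file opened with `open(path, newline="", encoding="utf-8")`.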

Your goal is to find, for each \(X_i\) dataset, all pairs of instances that match (i.e., refer to the same real-world entity). The output must be stored in a CSV file containing only the matching instance pairs found by your system. The CSV file must be named "output.csv", use a comma as the separator, and have two columns: "left_instance_id" and "right_instance_id".

Example of output.csv

| left_instance_id | right_instance_id |
|---|---|
| 00001 | 00002 |
| 00001 | 00004 |
| … | … |
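A file in the required format could be produced along these lines. This is a sketch using Python's standard `csv` module; the helper name `write_output` is hypothetical, and deduplicating and sorting the pairs is a convenience of the sketch, not a stated requirement.

```python
import csv


def write_output(pairs, path="output.csv"):
    """Write matching pairs as a comma-separated file with the required header."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)  # csv.writer uses a comma delimiter by default
        writer.writerow(["left_instance_id", "right_instance_id"])
        # Drop duplicate pairs and sort for a stable, reproducible file.
        for left, right in sorted(set(pairs)):
            writer.writerow([left, right])
```

Calling `write_output([("00001", "00002"), ("00001", "00004")])` writes the header followed by the two pairs to "output.csv" in the current directory. Passing ids as strings keeps their leading zeros intact.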

Datasets

| # | Name | Description | Metadata |
|---|---|---|---|
| 1 | NotebookToy | Sample notebook specifications | 128 instances, 16 attributes, 40 entities |
| 2 | Notebook | Notebook specifications | 538 instances, 14 attributes, 100 entities |
| 3 | NotebookLarge | Notebook specifications | 605 instances, 14 attributes, 158 entities |
| 4 | Altosight | Product specifications (kindly provided by Altosight) | 1356 instances, 5 attributes, 193 entities |

You can also download these datasets together with Snowman.

Snowman helps you compare and evaluate your data matching solutions. You can upload experiment results from your data matching solution and then easily compare them with a gold standard, compare two experiment runs with each other, or calculate binary metrics such as precision or recall. Snowman is developed as part of a bachelor’s project at the Hasso Plattner Institute, Potsdam, in collaboration with SAP SE.

You can download the latest release, which already includes the datasets provided for the contest.

For each dataset \(D_i\), we will compute the resulting F-measure with respect to \(Z_i \times Z_i\). Submitted solutions will be ranked by average F-measure over all datasets.
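The scoring described above can be approximated locally with a short sketch. The text does not say which F-measure variant is used, so this assumes the common F1 score (the harmonic mean of precision and recall) computed over unordered pairs. The function names and the sample pairs are hypothetical, and this is not the organizers' evaluation code.

```python
def normalize(pairs):
    """Treat pairs as unordered and ignore self-pairs."""
    return {tuple(sorted(p)) for p in pairs if p[0] != p[1]}


def f_measure(predicted, gold):
    """F1 of predicted matching pairs against gold matching pairs (assumed variant)."""
    predicted, gold = normalize(predicted), normalize(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def average_f_measure(results):
    """Average F1 over (predicted, gold) pairs, one entry per dataset."""
    scores = [f_measure(predicted, gold) for predicted, gold in results]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical example: one of two predictions is right, one of two gold pairs is found.
predicted = [("00001", "00002"), ("00001", "00004")]
gold = [("00002", "00001"), ("00002", "00003")]
score = f_measure(predicted, gold)
```

In this example precision and recall are both 0.5, so the score is 0.5. Since the hidden \(Z_i\) instances are not available to participants, a local estimate like this can only be computed on the labelled pairs from \(Y_i\).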