128 instances
16 attributes
40 entities
Entity Resolution (2021)
Archival Note
The original contest site is still available! If it’s unavailable in the future, it can be found on an Archive.org mirror instead.
There was no provided starter code for this programming contest. The provided datasets have been saved into transactionalblog/sigmod-contest-2021.
This contest was organized by the Database Research Group of the Roma Tre University. The winners of this contest were Weibao Fu, Peiqi Yin, and Lan Lu from Southern University of Science and Technology. The leaderboard has the posters and submissions from the five finalists.
Task Details
The task consists of identifying which instances, described by properties (i.e., attributes), represent the same real-world entity.
Participants are asked to solve the task among several datasets of different types (e.g., products, people, etc.) that will be released progressively. Each dataset is made of a list of instances (rows) and a list of properties describing them (columns); we will refer to each of these datasets as \(D_i\).
For each dataset \(D_i\), participants will be provided with the following resources:
- \(X_i\): a subset of the instances in \(D_i\)
- \(Y_i\): matching/non-matching labels for pairs in \(X_i \times X_i\)
- \(D_i\) metadata: e.g., how many instances it contains and what its main characteristics are
Note that Y datasets are transitively closed (i.e., if A matches with B and B matches with C, then A matches with C).
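The transitive-closure property can be illustrated with a small union-find sketch. This is not part of the contest material; the function name `transitive_closure` and the representation of pairs as tuples of instance IDs are assumptions made for illustration.

```python
from itertools import combinations


def transitive_closure(pairs):
    """Return every matching pair implied by `pairs` under transitivity.

    `pairs` is an iterable of (id_a, id_b) tuples (a hypothetical
    representation of the matching rows in a Y dataset).
    """
    parent = {}

    def find(x):
        # Locate the cluster representative of x, shortening paths as we go.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the clusters of every matching pair.
    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    # Group instances by cluster representative.
    clusters = {}
    for x in list(parent):
        clusters.setdefault(find(x), []).append(x)

    # Emit all pairs inside each cluster, each pair in sorted order.
    closed = set()
    for members in clusters.values():
        closed.update(combinations(sorted(members), 2))
    return closed


closure = transitive_closure([("A", "B"), ("B", "C")])
```

Here `closure` contains `("A", "C")` in addition to the two input pairs, which is the situation the note above describes.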
Solutions will be evaluated over \(Z_i = D_i \setminus X_i\). Note that the instances in \(Z_i\) will not be provided to participants. More details are available in the Evaluation Process section.
Both \(X_i\) and \(Y_i\) are in CSV format.
| instance_id | attr_name_1 | attr_name_2 | … | attr_name_k |
|---|---|---|---|---|
| 00001 | value_1 | null | … | value_k |
| 00002 | null | value_2 | … | value_k |
| … | … | … | … | … |
| left_instance_id | right_instance_id | label |
|---|---|---|
| 00001 | 00002 | 1 |
| 00001 | 00003 | 0 |
| … | … | … |
More details about the datasets can be found in the dedicated Datasets section.
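As a rough sketch of how the label file could be consumed, the snippet below reads rows in the \(Y_i\) layout and keeps only the matching pairs. It assumes a comma-separated file with a header row using the column names shown above; the function name `load_matching_pairs` and the inline sample data are illustrative only.

```python
import csv
import io


def load_matching_pairs(lines):
    """Collect (left, right) ID pairs whose label is 1.

    `lines` is any iterable of CSV text lines, such as an open file.
    IDs are kept as strings so leading zeros are preserved.
    """
    reader = csv.DictReader(lines)
    return {
        (row["left_instance_id"], row["right_instance_id"])
        for row in reader
        if row["label"] == "1"
    }


# Inline sample mirroring the table layout; a real run would pass an open file.
sample = io.StringIO(
    "left_instance_id,right_instance_id,label\n"
    "00001,00002,1\n"
    "00001,00003,0\n"
)
matches = load_matching_pairs(sample)
```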
Your goal is to find, for each \(X_i\) dataset, all pairs of instances that match (i.e., refer to the same real-world entity). The output must be stored in a CSV file containing only the matching instance pairs found by your system. The CSV file must have two columns: "left_instance_id" and "right_instance_id" and the file must be named "output.csv". The separator must be the comma.
| left_instance_id | right_instance_id |
|---|---|
| 00001 | 00002 |
| 00001 | 00004 |
| … | … |
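A minimal sketch of producing a file in this format with Python's standard `csv` module follows. The helper name `write_output` and the choice to sort the pairs are assumptions for illustration; the column names, file name, and comma separator come from the task description.

```python
import csv


def write_output(pairs, path="output.csv"):
    """Write matching pairs as a two-column, comma-separated CSV file."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)  # csv.writer uses a comma delimiter by default
        writer.writerow(["left_instance_id", "right_instance_id"])
        # Sorting is not required by the task; it only makes the file deterministic.
        for left, right in sorted(pairs):
            writer.writerow([left, right])
```

Passing IDs as strings (for example `"00001"`) keeps their leading zeros intact in the written file.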
Datasets
| # | Name | Description | Metadata | Download |
|---|---|---|---|---|
| 1 | NotebookToy | Sample notebook specifications | | |
| 2 | Notebook | Notebook specifications | 538 instances | |
| 3 | NotebookLarge | Notebook specifications | 605 instances | |
| 4 | Altosight | Product specifications | 1356 instances | |
You can also download these datasets together with Snowman.
Snowman helps you compare and evaluate your data matching solutions. You can upload experiment results from your data matching solution and then easily compare them with a gold standard, compare two experiment runs with each other, or calculate binary metrics like precision or recall. Snowman is developed as part of a bachelor’s project at the Hasso Plattner Institute, Potsdam, in collaboration with SAP SE.
You can download the latest release, which already includes the datasets provided for the contest.
For each dataset \(D_i\), we will compute the resulting F-measure with respect to \(Z_i \times Z_i\). Submitted solutions will be ranked by average F-measure over all datasets.
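The scoring can be approximated locally with a pairwise F-measure. The text above does not say which F-measure variant is used, so the sketch below assumes the balanced F1 score (harmonic mean of precision and recall) and treats each pair as unordered; both choices, and the function name `pairwise_f_measure`, are assumptions rather than the organizers' evaluation code.

```python
def pairwise_f_measure(predicted, gold):
    """Balanced F-measure (F1) between predicted and gold matching pairs.

    Pairs are compared as unordered sets, so ("a", "b") equals ("b", "a").
    This is an assumption; the official evaluator may differ.
    """
    pred = {frozenset(p) for p in predicted}
    true = {frozenset(p) for p in gold}
    true_positives = len(pred & true)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(true) if true else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# One of two predictions is correct and one of two gold pairs is found,
# so precision and recall are both 0.5.
score = pairwise_f_measure([("a", "b"), ("a", "c")], [("b", "a"), ("c", "d")])
```

Averaging this score over the datasets would mirror the ranking rule described above.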