trettenmeier/markt-pilot
Source: Hugging Face. Updated 2024-04-29, indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/trettenmeier/markt-pilot
Resource description:
---
license: cc-by-sa-4.0
task_categories:
- text-classification
language:
- de
- en
tags:
- entity resolution
- product matching
pretty_name: Markt-Pilot Dataset for Product Matching
size_categories:
- 100K<n<1M
---
This dataset accompanies the paper "Introducing a novel dataset for product matching: A new challenge for matching systems", which was accepted at the 3rd International Conference on Computers and Automation (CompAuto 2023) and will be published in IEEE Xplore.
The structure of the dataset is as follows: each data point consists of a pair of products and a binary label that indicates whether the two products refer to the same real-world entity.
The dataset comes in four subsets that differ in size and class distribution:
| Dataset | Data points | Negative | Positive | Imbalance Ratio |
|---|---:|---:|---:|---:|
| Full | 960,532 | 665,831 | 294,701 | 2.3 |
| L | 243,954 | 199,749 | 44,205 | 4.5 |
| M | 66,556 | 59,925 | 6,631 | 9.0 |
| S | 18,973 | 17,978 | 995 | 18.1 |
The test set consists of 5,000 manually checked data points and is shared across all four subsets.
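The card does not define the imbalance ratio, but the figures in the table are consistent with reading it as the number of negative pairs per positive pair, rounded to one decimal. The following minimal sketch checks that reading against the table; the names `SUBSETS`, `imbalance_ratio`, and `summarize` are illustrative and not part of the dataset or any library.

```python
# Counts copied from the subset table above: (negative, positive).
SUBSETS = {
    "Full": (665_831, 294_701),
    "L": (199_749, 44_205),
    "M": (59_925, 6_631),
    "S": (17_978, 995),
}


def imbalance_ratio(negative: int, positive: int) -> float:
    """Negatives per positive, rounded to one decimal.

    Assumption: this is how the card derives its "Imbalance Ratio"
    column; the card itself does not state the formula.
    """
    return round(negative / positive, 1)


def summarize(subsets: dict) -> dict:
    """Return total size, imbalance ratio, and positive share per subset."""
    summary = {}
    for name, (negative, positive) in subsets.items():
        total = negative + positive
        summary[name] = {
            "total": total,
            "ratio": imbalance_ratio(negative, positive),
            "positive_share": positive / total,
        }
    return summary


if __name__ == "__main__":
    for name, row in summarize(SUBSETS).items():
        print(
            f"{name:<4} total={row['total']:>7,} "
            f"ratio={row['ratio']:>4} "
            f"positive share={row['positive_share']:.1%}"
        )
```

Under this reading, the smaller the subset, the rarer the positive class, so a classifier trained on S sees far fewer matching pairs per non-matching pair than one trained on Full.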
Provided by:
trettenmeier
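Summary of the original information
Dataset overview
Basic information
- License: cc-by-sa-4.0
- Task category: text-classification
- Languages: de, en
- Tags: entity resolution, product matching
- Name: Markt-Pilot Dataset for Product Matching
- Size category: 100K<n<1M
Dataset structure
- Each data point contains a pair of products and a binary label indicating whether the two products refer to the same real-world entity.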
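Subset information
- Full set (Full):
  - Data points: 960,532
  - Negative samples: 665,831
  - Positive samples: 294,701
  - Imbalance ratio: 2.3
- L subset:
  - Data points: 243,954
  - Negative samples: 199,749
  - Positive samples: 44,205
  - Imbalance ratio: 4.5
- M subset:
  - Data points: 66,556
  - Negative samples: 59,925
  - Positive samples: 6,631
  - Imbalance ratio: 9.0
- S subset:
  - Data points: 18,973
  - Negative samples: 17,978
  - Positive samples: 995
  - Imbalance ratio: 18.1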
Test set
- The test set contains 5,000 manually checked data points and is shared across all four subsets.



