
trettenmeier/markt-pilot

Source: Hugging Face. Updated: 2024-04-29. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/trettenmeier/markt-pilot
Resource description:
---
license: cc-by-sa-4.0
task_categories:
- text-classification
language:
- de
- en
tags:
- entity resolution
- product matching
pretty_name: Markt-Pilot Dataset for Product Matching
size_categories:
- 100K<n<1M
---

This dataset has an accompanying paper, "Introducing a novel dataset for product matching: A new challenge for matching systems", which was accepted at the 3rd International Conference on Computers and Automation (CompAuto 2023) and will be published in IEEE Xplore.

The structure of the dataset is as follows: each data point consists of a pair of products and a binary label that indicates whether the two products refer to the same real-world entity. The dataset comes in four subsets that differ in size and class distribution:

| Dataset | Data points | Negative | Positive | Imbalance Ratio |
|---|---:|---:|---:|---:|
| Full | 960,532 | 665,831 | 294,701 | 2.3 |
| L | 243,954 | 199,749 | 44,205 | 4.5 |
| M | 66,556 | 59,925 | 6,631 | 9.0 |
| S | 18,973 | 17,978 | 995 | 18.1 |

The test set consists of 5,000 manually checked data points and is shared across all four subsets.
Provided by:
trettenmeier
Summary of original information

Dataset overview

Basic information

  • License: cc-by-sa-4.0
  • Task category: text-classification
  • Languages: de, en
  • Tags: entity resolution, product matching
  • Name: Markt-Pilot Dataset for Product Matching
  • Size category: 100K<n<1M

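Dataset structure

  • Each data point consists of a pair of products and a binary label indicating whether the two products refer to the same real-world entity.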

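Subset information

  • Full set (Full):
    • Data points: 960,532
    • Negative samples: 665,831
    • Positive samples: 294,701
    • Imbalance ratio: 2.3
  • L subset:
    • Data points: 243,954
    • Negative samples: 199,749
    • Positive samples: 44,205
    • Imbalance ratio: 4.5
  • M subset:
    • Data points: 66,556
    • Negative samples: 59,925
    • Positive samples: 6,631
    • Imbalance ratio: 9.0
  • S subset:
    • Data points: 18,973
    • Negative samples: 17,978
    • Positive samples: 995
    • Imbalance ratio: 18.1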
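
The source does not define the imbalance ratio. A plausible reading, assumed here, is the number of negative pairs divided by the number of positive pairs, rounded to one decimal place. The minimal sketch below checks that assumption against the listed counts and also confirms that negatives plus positives add up to each subset's total. The names `SUBSETS` and `imbalance_ratio` are illustrative and not part of the dataset.

```python
# Counts copied from the subset listing above.
SUBSETS = {
    "Full": {"total": 960_532, "negative": 665_831, "positive": 294_701},
    "L": {"total": 243_954, "negative": 199_749, "positive": 44_205},
    "M": {"total": 66_556, "negative": 59_925, "positive": 6_631},
    "S": {"total": 18_973, "negative": 17_978, "positive": 995},
}


def imbalance_ratio(negative: int, positive: int) -> float:
    """Assumed definition: negatives per positive, rounded to one decimal."""
    return round(negative / positive, 1)


ratios = {
    name: imbalance_ratio(counts["negative"], counts["positive"])
    for name, counts in SUBSETS.items()
}

for name, counts in SUBSETS.items():
    # Sanity check: the two classes should account for every data point.
    assert counts["negative"] + counts["positive"] == counts["total"]
    print(f"{name}: {ratios[name]} negatives per positive")
```

Under this reading, the smaller subsets are progressively more skewed toward negative pairs, which is worth keeping in mind when choosing evaluation metrics.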

Test set

  • The test set contains 5,000 manually checked data points and is shared across all four subsets.