mnemlaghi/widdd

Name: mnemlaghi/widdd
Creator: mnemlaghi
Published: 2024-04-29 14:11:04
License: No description available

Source: Hugging Face. Updated 2024-04-29; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/mnemlaghi/widdd

Resource description:

---
annotations_creators:
- machine-generated
language_creators:
- machine-generated
language:
- en
license:
- apache-2.0
multilinguality:
- monolingual
pretty_name: Wikidisamb Dataset with Descriptions
size_categories:
- 100K<n<1M
source_datasets: []
task_categories:
- text-retrieval
- token-classification
task_ids:
- entity-linking-retrieval
---

# Dataset Card for "Widdd"

## Dataset Description

WiDDD stands for WIkiData Disambig with Descriptions. The original dataset comes from the [Cetoli et al.](https://arxiv.org/pdf/1810.09164.pdf) paper and is aimed at solving Named Entity Disambiguation. This dataset tries to extract relevant information from entity descriptions only, instead of working with graphs. To do so, we mapped every Wikidata ID (correct ID and wrong ID) in the original paper to its Wikidata description. If no description is found, the row is discarded for the 1.+ versions.

### Supported Tasks and Leaderboards

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Languages

English

## Dataset Structure

We show detailed information for up to 5 configurations of the dataset.

### Data Instances

#### plain_text

- **Size of downloaded dataset files:** 46.64 MB

An example of 'train' looks as follows.

```
{
    'example_id': 11,
    'string': 'pausanias',
    'text': ' mention the spear, which he would indeed have touched with excitement. But it was being shown in the time of Pausanias in the second century AD. Achilles and ',
    'correct_id': 'Q192931',
    'wrong_id': 'Q941521',
    'correct_description': 'ancient Greek geographer, travel writer and mythographer',
    'wrong_description': 'Wikimedia disambiguation page'
}
```

### Data Fields

The data fields are the same among all splits.

#### plain_text

- `example_id`: an `int32` feature.
- `string`: a `string` feature.
- `text`: a `string` feature.
- `correct_id`: a `string` feature.
- `wrong_id`: a `string` feature.
- `correct_description`: a `string` feature.
- `wrong_description`: a `string` feature.

### Data Splits

| name       | train | validation | test |
|------------|------:|-----------:|-----:|
| plain_text | 96523 |       9609 | 9584 |

## Dataset Creation

### Curation Rationale

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

#### Who are the source language producers?

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Annotations

#### Annotation process

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

#### Who are the annotators?

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Personal and Sensitive Information

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Discussion of Biases

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Other Known Limitations

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Additional Information

### Dataset Curators

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Licensing Information

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Citation Information

### Contributions


Provider:

mnemlaghi

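Summary of original information

Dataset overview

Dataset name

Name: Wikidisamb Dataset with Descriptions
Alias: WiDDD

Dataset description

Purpose: address named entity disambiguation by extracting relevant information from entity descriptions rather than from graph structure.
Source: derived from the dataset in the Cetoli et al. paper; every Wikidata ID in the original data (correct and wrong) is mapped to its Wikidata description.

Dataset characteristics

Language: English
License: Apache-2.0
Multilinguality: monolingual
Size: 100K<n<1M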

Dataset structure

Data instances:
- Each example contains the fields: example_id, string, text, correct_id, wrong_id, correct_description, wrong_description
- Size of downloaded dataset files: 46.64 MB
Data fields:
- example_id: int32
- string: string
- text: string
- correct_id: string
- wrong_id: string
- correct_description: string
- wrong_description: string
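The field list above can be turned into a small record check. The following is a minimal sketch, not part of the dataset's own tooling: the names `WIDDD_FIELDS` and `validate_record` are hypothetical, and a plain Python `int` with a range check stands in for the `int32` type named in the card.

```python
from typing import Any

# Field names and types as listed in the dataset card. The card declares
# example_id as int32; a plain Python int plus a range check stands in for it.
WIDDD_FIELDS: dict[str, type] = {
    "example_id": int,
    "string": str,
    "text": str,
    "correct_id": str,
    "wrong_id": str,
    "correct_description": str,
    "wrong_description": str,
}

INT32_MIN, INT32_MAX = -(2**31), 2**31 - 1


def validate_record(record: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the record matches the schema."""
    problems: list[str] = []
    for name, expected in WIDDD_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected):
            actual = type(record[name]).__name__
            problems.append(f"{name}: expected {expected.__name__}, got {actual}")
    example_id = record.get("example_id")
    if isinstance(example_id, int) and not INT32_MIN <= example_id <= INT32_MAX:
        problems.append("example_id: outside the int32 range")
    return problems


# The 'train' example shown in the dataset card.
sample = {
    "example_id": 11,
    "string": "pausanias",
    "text": " mention the spear, which he would indeed have touched with excitement. "
            "But it was being shown in the time of Pausanias in the second century AD. "
            "Achilles and ",
    "correct_id": "Q192931",
    "wrong_id": "Q941521",
    "correct_description": "ancient Greek geographer, travel writer and mythographer",
    "wrong_description": "Wikimedia disambiguation page",
}

print(validate_record(sample))  # prints []
```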
Data splits:
- Train: 96523
- Validation: 9609
- Test: 9584
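As a quick arithmetic check on the split sizes listed above, the sketch below sums them and prints each split's share of the total. The dictionary name `SPLIT_SIZES` is chosen here for illustration; the numbers are the ones from the listing.

```python
# Split sizes as listed on this page.
SPLIT_SIZES = {"train": 96523, "validation": 9609, "test": 9584}

total = sum(SPLIT_SIZES.values())
shares = {name: size / total for name, size in SPLIT_SIZES.items()}

print(f"total rows: {total}")  # total rows: 115716
for name, size in SPLIT_SIZES.items():
    # Share of each split, formatted as a percentage with one decimal.
    print(f"{name:<10} {size:>6} {shares[name]:.1%}")
```

The total of 115716 rows falls inside the declared size category 100K<n<1M, and the split is roughly 83% train with about 8% each for validation and test.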

Task categories

Text retrieval
Token classification

Task IDs

Entity linking retrieval

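Collected summary

Dataset introduction

Construction method

WiDDD takes a different route from graph-based approaches to entity disambiguation: it relies on entity descriptions alone. It is derived from the dataset introduced in the Cetoli et al. paper and targets named entity disambiguation. During construction, every Wikidata identifier in the original data, both the correct one and the wrong one, was mapped to its Wikidata description text. Rows for which a description could not be found are discarded in the 1.+ versions, so every remaining row carries both descriptions. The result is a monolingual English dataset of just over 100,000 records (115,716 across the listed splits).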
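The mapping-and-filtering step described above can be sketched as follows. This is an illustration under stated assumptions, not the author's pipeline: the source does not say how descriptions were retrieved, so `descriptions` is a hypothetical in-memory lookup from Wikidata ID to description, `attach_descriptions` is a made-up name, and dropping a row when either description is missing is one reading of "if not found, the row is discarded".

```python
from typing import Iterable


def attach_descriptions(
    rows: Iterable[dict],
    descriptions: dict[str, str],
) -> list[dict]:
    """Add both descriptions to each row; drop rows where either one is missing.

    `descriptions` is a hypothetical lookup table (Wikidata ID -> description).
    """
    kept = []
    for row in rows:
        correct = descriptions.get(row["correct_id"])
        wrong = descriptions.get(row["wrong_id"])
        if not correct or not wrong:
            continue  # no description found: the row is discarded
        kept.append(
            {**row, "correct_description": correct, "wrong_description": wrong}
        )
    return kept


# Toy input: the second row uses a made-up ID with no description available.
rows = [
    {"example_id": 11, "string": "pausanias",
     "correct_id": "Q192931", "wrong_id": "Q941521"},
    {"example_id": 12, "string": "example",
     "correct_id": "Q192931", "wrong_id": "Q_HYPOTHETICAL"},
]
descriptions = {
    "Q192931": "ancient Greek geographer, travel writer and mythographer",
    "Q941521": "Wikimedia disambiguation page",
}

enriched = attach_descriptions(rows, descriptions)
print(len(enriched))  # prints 1
```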

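Usage

WiDDD can be used for entity-linking retrieval or token classification. A typical workflow loads the data splits and uses the mention string, the context text, and the two description fields to train a model that tells the correct entity ID from the wrong one. The fields (example_id, string, text, correct_id, wrong_id, and the two descriptions) are the same in every split, which keeps training and evaluation code uniform. The dataset is available from Hugging Face, and its flat record format is straightforward to integrate into existing natural language processing pipelines.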
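One way to set up the training data described above is to expand each record into a positive and a negative (context, description) pair. The sketch below assumes that framing; `to_pairs` and `load_widdd_split` are hypothetical helper names. The loader uses `datasets.load_dataset` from the Hugging Face `datasets` library with the repository ID given on this page; it needs network access and is defined but not called here, and a configuration name may also be required depending on how the repository is laid out.

```python
def load_widdd_split(split: str = "train"):
    """Load one split from the Hugging Face Hub (requires `pip install datasets`)."""
    from datasets import load_dataset  # imported lazily so the rest runs offline

    return load_dataset("mnemlaghi/widdd", split=split)


def to_pairs(record: dict) -> list[dict]:
    """Expand one WiDDD record into two labelled examples (1 = correct, 0 = wrong)."""
    base = {"mention": record["string"], "context": record["text"].strip()}
    return [
        {**base, "candidate_id": record["correct_id"],
         "description": record["correct_description"], "label": 1},
        {**base, "candidate_id": record["wrong_id"],
         "description": record["wrong_description"], "label": 0},
    ]


# The 'train' example shown in the dataset card.
sample = {
    "example_id": 11,
    "string": "pausanias",
    "text": " mention the spear, which he would indeed have touched with excitement. "
            "But it was being shown in the time of Pausanias in the second century AD. "
            "Achilles and ",
    "correct_id": "Q192931",
    "wrong_id": "Q941521",
    "correct_description": "ancient Greek geographer, travel writer and mythographer",
    "wrong_description": "Wikimedia disambiguation page",
}

for pair in to_pairs(sample):
    print(pair["label"], pair["candidate_id"], pair["description"])
```

Each pair can then be fed to a binary text-pair classifier; the model choice is left open here.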

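Background and challenges

Background overview

Named entity disambiguation is a core task in natural language processing: it links an entity mention in text to a unique identifier in a knowledge base. WiDDD (WIkiData Disambig with Descriptions) is derived from the Wikidata disambiguation dataset introduced in the Cetoli et al. paper (arXiv:1810.09164, 2018). Rather than relying on graph structure, it uses entity descriptions: each row pairs a correct and a wrong Wikidata ID with their descriptions, giving a text-only disambiguation benchmark. This simplifies data handling and supports work on description-based entity linking for information retrieval and semantic understanding.

Current challenges

The challenges fall into two groups. On the task side, named entity disambiguation is inherently hard: when entity descriptions are ambiguous or semantically overlapping, a model struggles to separate similar entities and needs deeper semantic understanding. On the construction side, the dataset depends on mapping IDs to Wikidata descriptions; where a description is missing, the row is discarded, which reduces completeness and coverage, and description quality is uneven. In addition, although a text-only approach simplifies the structure, it may miss implicit relations between entities that a graph would capture, which makes the semantic side of disambiguation harder.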

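Common scenarios

Typical use case

Named entity disambiguation is a key step in understanding the meaning of text. By supplying entity descriptions, WiDDD provides a natural setting for text-based entity linking: a model is trained to use the context and the candidate descriptions to tell apart different entities that share a name, which in turn can improve knowledge graph construction and information retrieval.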
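To make the "context plus description" idea concrete, here is a deliberately naive baseline: score each candidate by the Jaccard overlap between the context tokens and the description tokens, and pick the higher one. This is a sketch for illustration only, not a method from the source; `tokens`, `jaccard`, and `pick_candidate` are hypothetical names.

```python
import re
from typing import Optional


def tokens(text: str) -> set[str]:
    """Lowercase the text and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Size of the intersection divided by size of the union (0.0 if both empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pick_candidate(record: dict) -> Optional[str]:
    """Return the ID whose description overlaps most with the context, or None on a tie."""
    context = tokens(record["text"])
    correct_score = jaccard(context, tokens(record["correct_description"]))
    wrong_score = jaccard(context, tokens(record["wrong_description"]))
    if correct_score == wrong_score:
        return None
    return record["correct_id"] if correct_score > wrong_score else record["wrong_id"]


# The 'train' example shown in the dataset card.
sample = {
    "string": "pausanias",
    "text": " mention the spear, which he would indeed have touched with excitement. "
            "But it was being shown in the time of Pausanias in the second century AD. "
            "Achilles and ",
    "correct_id": "Q192931",
    "wrong_id": "Q941521",
    "correct_description": "ancient Greek geographer, travel writer and mythographer",
    "wrong_description": "Wikimedia disambiguation page",
}

print(pick_candidate(sample))  # prints Q192931
```

On this example the baseline lands on the correct ID, but only because the word "and" appears in both the context and the correct description. That fragility is why a learned model over context and descriptions is the more realistic use of the data.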

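Academic problems addressed

The dataset targets a central question in named entity disambiguation: how to separate ambiguous entities using text descriptions alone, without graph-structured information. It reduces the dependence on a full knowledge graph that graph-based methods have and supports text-only disambiguation research. It offers a testbed for checking how far descriptive text carries semantic understanding, and an experimental basis for lightweight, scalable disambiguation models at the intersection of natural language understanding and knowledge representation learning.

Practical applications

In practice, models trained on data like WiDDD could support search engines, question-answering systems, and content recommendation engines. More accurate entity resolution helps such systems interpret the intent of a user query and return more relevant results. In digital libraries and archive management, the same capability can help automate the annotation and organization of documents.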
