Exr0n/wiki-entity-similarity

Name: Exr0n/wiki-entity-similarity
Creator: Exr0n
Published: 2022-08-19 18:51:04
License: MIT

Source: Hugging Face. Updated 2022-08-19. Indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/Exr0n/wiki-entity-similarity


Description:

---
annotations_creators:
- found
language:
- en
language_creators:
- found
license:
- mit
multilinguality:
- monolingual
pretty_name: 'Wiki Entity Similarity'
size_categories:
- 10M<n<100M
source_datasets:
- original
tags:
- named entities
- similarity
- paraphrasing
- synonyms
- wikipedia
task_categories: []
task_ids: []
---

# Wiki Entity Similarity

Usage:

```py
from datasets import load_dataset

# Corpus config: one row per (article, link text) alias pair.
corpus = load_dataset('Exr0n/wiki-entity-similarity', '2018thresh20corpus', split='train')
assert corpus[0] == {'article': 'A1000 road', 'link_text': 'A1000', 'is_same': 1}

# Pairs config: positive and negative examples for classifier training.
pairs = load_dataset('Exr0n/wiki-entity-similarity', '2018thresh20pairs', split='train')
assert pairs[0] == {'article': 'Rhinobatos', 'link_text': 'Ehinobatos beurleni', 'is_same': 1}
assert len(pairs) == 4_793_180
```

## Corpus (`name=*corpus`)

The corpora in this dataset are generated by aggregating the link text that refers to various articles in context. For instance, if wiki article A refers to article B as C, then C is added to the list of aliases for article B, and the pair (B, C) is included in the dataset.

Following DPR (https://arxiv.org/pdf/2004.04906.pdf), we use the English Wikipedia dump from Dec. 20, 2018 as the source documents for link collection.

The dataset includes three quality levels, distinguished by the minimum number of inbound links required to include an article in the dataset. This filtering is motivated by the heuristic "better articles have more citations."

| Min. Inbound Links | Number of Articles | Number of Distinct Links |
|--------------------|--------------------|--------------------------|
| 5                  | 1,080,073          | 5,787,081                |
| 10                 | 605,775            | 4,407,409                |
| 20                 | 324,949            | 3,195,545                |

## Training Pairs (`name=*pairs`)

This dataset also includes training pair datasets (with both positive and negative examples) intended for training classifiers. The train/dev/test split is 75/15/10 % of each corpus.

### Training Data Generation

The training pairs in this dataset are generated by taking each example from the corpus as a positive example, and creating a new negative example from the article title of the positive example and a random link text from a different article.

The articles featured in each split are disjoint from the other splits, and each split has the same number of positive (semantically the same) and negative (semantically different) examples.

For more details on the dataset motivation, see [the paper](https://arxiv.org/abs/2202.13581). If you use this dataset in your work, please cite it using the ArXiv reference.

Generation scripts can be found [in the GitHub repo](https://github.com/Exr0nProjects/wiki-entity-similarity).

Provider:

Exr0n

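Summary of Original Information

Wiki Entity Similarity Dataset Overview

Basic Information

Name: Wiki Entity Similarity
Language: English (en)
License: MIT
Multilinguality: monolingual
Size: 10M<n<100M
Source: original dataset
Tags: named entities, similarity, paraphrasing, synonyms, wikipedia

Dataset Structure

Corpus (`name=*corpus`)

Generation: Built by aggregating the link text that points to each article. If Wikipedia article A refers to article B as C, then C is added to the list of aliases for article B, and the pair (B, C) is included in the dataset.
Data source: The English Wikipedia dump from December 20, 2018 is used as the source documents for link collection.
Quality levels: Three levels, distinguished by the minimum number of inbound links an article needs in order to be included.

Min. inbound links	Number of articles	Number of distinct links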
5	1,080,073	5,787,081
10	605,775	4,407,409
20	324,949	3,195,545
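
The aggregation and threshold filter described above can be sketched as follows. This is a minimal illustration, not the authors' generation script: the function name `build_corpus`, the `(source, target, link_text)` input shape, and the toy links are all invented for the example, and counting every link occurrence as one inbound link is an assumption.

```python
from collections import defaultdict


def build_corpus(links, min_inbound_links):
    """Aggregate link texts per target article and filter by inbound links.

    `links` is an iterable of (source_article, target_article, link_text)
    tuples. This input shape is hypothetical, chosen for illustration.
    """
    inbound = defaultdict(int)   # target article -> number of inbound links
    aliases = defaultdict(set)   # target article -> distinct link texts
    for _source, target, text in links:
        inbound[target] += 1
        aliases[target].add(text)

    rows = []
    for article in sorted(aliases):
        # Keep only articles that meet the quality threshold.
        if inbound[article] >= min_inbound_links:
            for text in sorted(aliases[article]):
                rows.append({"article": article, "link_text": text, "is_same": 1})
    return rows


# Toy links, made up for this example (not taken from the dataset).
links = [
    ("London", "A1000 road", "A1000"),
    ("Barnet", "A1000 road", "A1000"),
    ("Hatfield", "A1000 road", "the Great North Road"),
    ("Ray", "Rhinobatos", "guitarfish"),
]
rows = build_corpus(links, min_inbound_links=2)
```

With a threshold of 2, only "A1000 road" (three inbound links) survives, and its two distinct link texts each become one row.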

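Training Pairs (`name=*pairs`)

Purpose: Positive and negative example pairs for training classifiers.
Data split: Train/dev/test at 75/15/10% of each corpus.
Generation: Each corpus example serves as a positive example; a negative example pairs the same article title with a random link text from a different article.
Properties: The articles in each split are disjoint from those in the other splits, and each split has equal numbers of positive and negative examples.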
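
A rough sketch of how such pairs could be produced is shown below. It is not the authors' script. The helper names `split_articles` and `make_pairs` are hypothetical, splitting by article (rather than by row) is an assumption about how disjointness is achieved, and drawing negatives from within the same split is likewise assumed.

```python
import random


def split_articles(articles, seed=0):
    """Shuffle distinct articles and split them 75/15/10 with no overlap."""
    ordered = sorted(set(articles))
    random.Random(seed).shuffle(ordered)
    n_train = len(ordered) * 75 // 100
    n_dev = len(ordered) * 15 // 100
    return {
        "train": ordered[:n_train],
        "dev": ordered[n_train:n_train + n_dev],
        "test": ordered[n_train + n_dev:],
    }


def make_pairs(corpus_rows, articles_in_split, seed=0):
    """Emit one positive and one negative example per corpus row in the split."""
    rng = random.Random(seed)
    keep = set(articles_in_split)
    positives = [row for row in corpus_rows if row["article"] in keep]
    if len({row["article"] for row in positives}) < 2:
        raise ValueError("need at least two articles to sample negatives")

    pairs = []
    for row in positives:
        pairs.append({"article": row["article"], "link_text": row["link_text"], "is_same": 1})
        # Rejection-sample a link text that belongs to a different article.
        other = rng.choice(positives)
        while other["article"] == row["article"]:
            other = rng.choice(positives)
        pairs.append({"article": row["article"], "link_text": other["link_text"], "is_same": 0})
    return pairs


# Synthetic corpus rows, invented for this example.
corpus_rows = [
    {"article": f"Article {i}", "link_text": f"alias {i}{suffix}", "is_same": 1}
    for i in range(20)
    for suffix in "ab"
]
splits = split_articles(row["article"] for row in corpus_rows)
train_pairs = make_pairs(corpus_rows, splits["train"])
```

Because every positive row yields exactly one negative, each split is balanced by construction. The sketch does not check whether a sampled link text happens to be a valid alias of the article as well, which a production script might want to guard against.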

Usage Example

from datasets import load_dataset

corpus = load_dataset('Exr0n/wiki-entity-similarity', '2018thresh20corpus', split='train')
pairs = load_dataset('Exr0n/wiki-entity-similarity', '2018thresh20pairs', split='train')
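
Once loaded, each row is a dict with `article`, `link_text`, and `is_same` keys. The sketch below shows one way to inspect such rows: it counts labels and groups positive link texts by article. The helper name `summarize_pairs` is hypothetical, and the sample uses in-memory rows so that it runs without downloading anything; the two negative rows are invented for illustration.

```python
from collections import Counter, defaultdict


def summarize_pairs(rows):
    """Count labels and collect positive link texts per article in one pass."""
    label_counts = Counter()
    aliases = defaultdict(list)
    for row in rows:
        label_counts[row["is_same"]] += 1
        if row["is_same"] == 1:
            aliases[row["article"]].append(row["link_text"])
    return label_counts, dict(aliases)


# In-memory sample rows with the same keys as the dataset rows.
sample_rows = [
    {"article": "A1000 road", "link_text": "A1000", "is_same": 1},
    {"article": "A1000 road", "link_text": "Ehinobatos beurleni", "is_same": 0},  # invented negative
    {"article": "Rhinobatos", "link_text": "Ehinobatos beurleni", "is_same": 1},
    {"article": "Rhinobatos", "link_text": "A1000", "is_same": 0},  # invented negative
]
label_counts, aliases = summarize_pairs(sample_rows)
```

Since iterating a loaded split yields row dicts, the same helper should accept `pairs` directly, though a full pass over millions of rows will take a while.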
