midas/duc2001

Name: midas/duc2001
Creator: midas
Published: 2022-01-23 06:13:06
License: no description available

Hugging Face; updated 2022-01-23; indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/midas/duc2001

Resource overview:

## Dataset Summary

A dataset for benchmarking keyphrase extraction and generation techniques on English news articles. For more details about the dataset, please refer to the original paper: [https://dl.acm.org/doi/10.5555/1620163.1620205](https://dl.acm.org/doi/10.5555/1620163.1620205)

Original source of the data - []()

## Dataset Structure

### Data Fields

- **id**: unique identifier of the document.
- **document**: whitespace-separated list of words in the document.
- **doc_bio_tags**: BIO tags for each word in the document. B marks the beginning of a keyphrase, I marks a word inside a keyphrase, and O marks a word outside any keyphrase.
- **extractive_keyphrases**: list of all the present keyphrases.
- **abstractive_keyphrases**: list of all the absent keyphrases.

### Data Splits

|Split| #datapoints |
|--|--|
| Test | 308 |

## Usage

### Full Dataset

```python
from datasets import load_dataset

# get entire dataset
dataset = load_dataset("midas/duc2001", "raw")

# sample from the test split
print("Sample from test data split")
test_sample = dataset["test"][0]
print("Fields in the sample: ", list(test_sample.keys()))
print("Tokenized Document: ", test_sample["document"])
print("Document BIO Tags: ", test_sample["doc_bio_tags"])
print("Extractive/present Keyphrases: ", test_sample["extractive_keyphrases"])
print("Abstractive/absent Keyphrases: ", test_sample["abstractive_keyphrases"])
print("\n-----------\n")
```

**Output**

```bash
Sample from test data split
Fields in the sample:  ['id', 'document', 'doc_bio_tags', 'extractive_keyphrases', 'abstractive_keyphrases', 'other_metadata']
Tokenized Document:  ['Here', ',', 'at', 'a', 'glance', ',', 'are', 'developments', 'today', 'involving', 'the', 'crash', 'of', 'Pan', 'American', 'World', 'Airways', 'Flight', '103', 'Wednesday', 'night', 'in', 'Lockerbie', ',', 'Scotland', ',', 'that', 'killed', 'all', '259', 'people', 'aboard', 'and', 'more', 'than', '20', 'people', 'on', 'the', 'ground', ':']
Document BIO Tags:  ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B', 'O', 'B', 'I', 'I', 'I', 'I', 'I', 'O', 'O', 'O', 'B', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
Extractive/present Keyphrases:  ['pan american world airways flight 103', 'crash', 'lockerbie']
Abstractive/absent Keyphrases:  ['terrorist threats', 'widespread wreckage', 'radical palestinian faction', 'terrorist bombing', 'bomb threat', 'sabotage']

-----------

```

### Keyphrase Extraction

```python
from datasets import load_dataset

# get the dataset only for keyphrase extraction
dataset = load_dataset("midas/duc2001", "extraction")

print("Samples for Keyphrase Extraction")

# sample from the test split
print("Sample from test data split")
test_sample = dataset["test"][0]
print("Fields in the sample: ", list(test_sample.keys()))
print("Tokenized Document: ", test_sample["document"])
print("Document BIO Tags: ", test_sample["doc_bio_tags"])
print("\n-----------\n")
```

### Keyphrase Generation

```python
from datasets import load_dataset

# get the dataset only for keyphrase generation
dataset = load_dataset("midas/duc2001", "generation")

print("Samples for Keyphrase Generation")

# sample from the test split
print("Sample from test data split")
test_sample = dataset["test"][0]
print("Fields in the sample: ", list(test_sample.keys()))
print("Tokenized Document: ", test_sample["document"])
print("Extractive/present Keyphrases: ", test_sample["extractive_keyphrases"])
print("Abstractive/absent Keyphrases: ", test_sample["abstractive_keyphrases"])
print("\n-----------\n")
```

## Citation Information

```
@inproceedings{10.5555/1620163.1620205,
  author = {Wan, Xiaojun and Xiao, Jianguo},
  title = {Single Document Keyphrase Extraction Using Neighborhood Knowledge},
  year = {2008},
  isbn = {9781577353683},
  publisher = {AAAI Press},
  booktitle = {Proceedings of the 23rd National Conference on Artificial Intelligence - Volume 2},
  pages = {855–860},
  numpages = {6},
  location = {Chicago, Illinois},
  series = {AAAI'08}
}
```

## Contributions

Thanks to [@debanjanbhucs](https://github.com/debanjanbhucs), [@dibyaaaaax](https://github.com/dibyaaaaax) and [@ad6398](https://github.com/ad6398) for adding this dataset.

Provider:

midas

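Summary of original information

Dataset overview

This dataset is used to evaluate keyphrase extraction and generation techniques on English news articles.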

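Dataset structure

Data fields

id: unique identifier of the document.
document: whitespace-separated list of the words in the document.
doc_bio_tags: BIO tag for each word in the document. B marks the beginning of a keyphrase, I marks a word inside a keyphrase, and O marks a word that is not part of any keyphrase.
extractive_keyphrases: list of all present keyphrases (those that appear in the document).
abstractive_keyphrases: list of all absent keyphrases (those that do not appear in the document).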
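
To make the relationship between `document` and `doc_bio_tags` concrete, here is a minimal sketch that turns a tag sequence back into keyphrase strings. The helper name `bio_to_keyphrases` is hypothetical (it is not part of the dataset or of the `datasets` library), and the tokens and tags are a slice of the sample document shown in the dataset card.

```python
from typing import List


def bio_to_keyphrases(tokens: List[str], tags: List[str]) -> List[str]:
    """Collect the token spans marked by B/I tags, in document order.

    Hypothetical helper for illustration; not shipped with the dataset.
    """
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")

    phrases: List[str] = []
    current: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            # A new keyphrase starts; flush any phrase still in progress.
            if current:
                phrases.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:
            # Continue the keyphrase that is currently open.
            current.append(token)
        else:
            # "O" (or a stray "I" with no open phrase) closes the span.
            if current:
                phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases


# Slice of the sample document from the dataset card.
tokens = ["the", "crash", "of", "Pan", "American", "World", "Airways",
          "Flight", "103", "Wednesday", "night", "in", "Lockerbie", ","]
tags = ["O", "B", "O", "B", "I", "I", "I", "I", "I", "O", "O", "O", "B", "O"]

print(bio_to_keyphrases(tokens, tags))
# ['crash', 'Pan American World Airways Flight 103', 'Lockerbie']
```

The recovered phrases keep the document's original casing, whereas the sample's `extractive_keyphrases` are lowercased, so a comparison between the two would need to normalize case first.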

Data splits

|Split| #datapoints |
|--|--|
| Test | 308 |

Usage examples

Full dataset

```python
from datasets import load_dataset

# Get the entire dataset
dataset = load_dataset("midas/duc2001", "raw")

# Sample from the test split
print("Sample from test data split")
test_sample = dataset["test"][0]
print("Fields in the sample: ", list(test_sample.keys()))
print("Tokenized Document: ", test_sample["document"])
print("Document BIO Tags: ", test_sample["doc_bio_tags"])
print("Extractive/present Keyphrases: ", test_sample["extractive_keyphrases"])
print("Abstractive/absent Keyphrases: ", test_sample["abstractive_keyphrases"])
```

Keyphrase extraction

```python
from datasets import load_dataset

# Get the dataset only for keyphrase extraction
dataset = load_dataset("midas/duc2001", "extraction")

print("Samples for Keyphrase Extraction")
test_sample = dataset["test"][0]
print("Fields in the sample: ", list(test_sample.keys()))
print("Tokenized Document: ", test_sample["document"])
print("Document BIO Tags: ", test_sample["doc_bio_tags"])
```

Keyphrase generation

```python
from datasets import load_dataset

# Get the dataset only for keyphrase generation
dataset = load_dataset("midas/duc2001", "generation")

print("Samples for Keyphrase Generation")
test_sample = dataset["test"][0]
print("Fields in the sample: ", list(test_sample.keys()))
print("Tokenized Document: ", test_sample["document"])
print("Extractive/present Keyphrases: ", test_sample["extractive_keyphrases"])
print("Abstractive/absent Keyphrases: ", test_sample["abstractive_keyphrases"])
```

Citation information

```
@inproceedings{10.5555/1620163.1620205,
  author = {Wan, Xiaojun and Xiao, Jianguo},
  title = {Single Document Keyphrase Extraction Using Neighborhood Knowledge},
  year = {2008},
  isbn = {9781577353683},
  publisher = {AAAI Press},
  booktitle = {Proceedings of the 23rd National Conference on Artificial Intelligence - Volume 2},
  pages = {855–860},
  numpages = {6},
  location = {Chicago, Illinois},
  series = {AAAI'08}
}
```

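Collected summary

Dataset introduction

Construction

The midas/duc2001 dataset is drawn from English news articles and serves as a benchmark for keyphrase extraction and generation techniques. It was built by manual annotation: every word of each article carries a BIO (Beginning, Inside, Outside) tag that separates keyphrase words from all other words. Each record holds the document's unique identifier, its tokenized word list, the BIO tags, the list of present keyphrases, and the list of absent keyphrases, forming a structured dataset for later research.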
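
The source does not describe how the BIO tags were derived from the annotated keyphrases, so the following is only a sketch of one plausible procedure: match each keyphrase against the token list, ignoring case, and mark the matched span. The function name `tag_keyphrases` and the longest-phrase-first rule are assumptions made for illustration.

```python
from typing import List


def tag_keyphrases(tokens: List[str], keyphrases: List[str]) -> List[str]:
    """Label each token B, I, or O by case-insensitive phrase matching.

    Hypothetical illustration; the dataset's real tagging procedure
    is not documented in the source.
    """
    lowered = [t.lower() for t in tokens]
    tags = ["O"] * len(tokens)
    # Assumption: longer phrases are matched first so that a short
    # phrase cannot split a longer one.
    for phrase in sorted(keyphrases, key=lambda p: -len(p.split())):
        words = phrase.lower().split()
        n = len(words)
        if n == 0:
            continue
        for start in range(len(tokens) - n + 1):
            span_free = all(t == "O" for t in tags[start:start + n])
            if span_free and lowered[start:start + n] == words:
                tags[start] = "B"
                for i in range(start + 1, start + n):
                    tags[i] = "I"
    return tags


tokens = ["the", "crash", "of", "Pan", "American", "World",
          "Airways", "Flight", "103"]
keyphrases = ["pan american world airways flight 103", "crash", "lockerbie"]

print(tag_keyphrases(tokens, keyphrases))
# ['O', 'B', 'O', 'B', 'I', 'I', 'I', 'I', 'I']
```

On this slice of the sample document the output agrees with the corresponding tags in the dataset card, but that agreement on one example does not show the real pipeline works this way.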

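Features

The dataset is annotated at word level and covers two kinds of keyphrases: those that appear explicitly in the article (extractive, or present, keyphrases) and those that are relevant to the article but do not appear in it (abstractive, or absent, keyphrases). This dual annotation makes the dataset usable as a reference for both keyphrase extraction and keyphrase generation. The test split provides 308 data points for evaluating and comparing models.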
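
The present/absent distinction can be illustrated with a short sketch that checks whether each keyphrase occurs as a contiguous token sequence in the document. The function name `split_present_absent` is hypothetical, and the matching rule (lowercased exact match, no stemming) is an assumption; the dataset's own criterion is not stated in the source.

```python
from typing import List, Tuple


def split_present_absent(
    tokens: List[str], keyphrases: List[str]
) -> Tuple[List[str], List[str]]:
    """Partition keyphrases by whether they occur verbatim in the tokens.

    Assumption: lowercased exact token match, without stemming.
    """
    lowered = [t.lower() for t in tokens]

    def occurs(phrase: str) -> bool:
        words = phrase.lower().split()
        n = len(words)
        # Slide a window of n tokens over the document.
        return n > 0 and any(
            lowered[i:i + n] == words for i in range(len(lowered) - n + 1)
        )

    present = [p for p in keyphrases if occurs(p)]
    absent = [p for p in keyphrases if not occurs(p)]
    return present, absent


tokens = ["crash", "of", "Pan", "American", "World", "Airways",
          "Flight", "103", "in", "Lockerbie"]
keyphrases = ["crash", "lockerbie", "terrorist bombing", "sabotage"]

present, absent = split_present_absent(tokens, keyphrases)
print(present)  # ['crash', 'lockerbie']
print(absent)   # ['terrorist bombing', 'sabotage']
```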

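Usage

Users can load the full midas/duc2001 dataset or only the configuration for keyphrase extraction or keyphrase generation, depending on the task. All three are loaded through the Hugging Face `datasets` library. Once loaded, each record exposes the tokenized document, the BIO tags, the extractive keyphrases, and the abstractive keyphrases, which can then be used to evaluate and test models. Note that the dataset ships only a test split.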
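
Since the dataset is mainly used for evaluation, a scoring function is a natural companion. The sketch below computes exact-match precision, recall, and F1 between predicted and gold keyphrases for one document. The source does not prescribe a metric, so both the function name `keyphrase_prf` and the exact-match-after-lowercasing rule are assumptions; published evaluations often add stemming or a top-k cutoff.

```python
from typing import Iterable, Tuple


def keyphrase_prf(
    predicted: Iterable[str], gold: Iterable[str]
) -> Tuple[float, float, float]:
    """Exact-match precision, recall, and F1 for one document.

    Assumption: phrases are compared after lowercasing and stripping
    whitespace; duplicates are counted once.
    """
    pred_set = {p.lower().strip() for p in predicted}
    gold_set = {g.lower().strip() for g in gold}
    matched = len(pred_set & gold_set)
    precision = matched / len(pred_set) if pred_set else 0.0
    recall = matched / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if matched else 0.0
    return precision, recall, f1


# Gold phrases come from the sample in the dataset card;
# the predictions are made up for illustration.
gold = ["pan american world airways flight 103", "crash", "lockerbie"]
predicted = ["crash", "Lockerbie", "Scotland", "flight 103"]

precision, recall, f1 = keyphrase_prf(predicted, gold)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
# P=0.500 R=0.667 F1=0.571
```

Averaging these per-document scores over the 308 test documents would give a macro-averaged figure.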

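Background and challenges

Background

The midas/duc2001 dataset was built to evaluate keyphrase extraction and generation on English news articles; details are given in the original paper. Its documents originate from the 2001 DUC (Document Understanding Conference) evaluation, and the keyphrase dataset was introduced by Xiaojun Wan and Jianguo Xiao to support research on text summarization and keyphrase extraction. Each record contains the document's unique identifier, the document text, BIO tags, present keyphrases, and absent keyphrases, which has made the dataset a commonly used resource for keyphrase extraction and generation in natural language processing.

Current challenges

Research on the dataset faces two groups of challenges. On the modeling side, the difficulty is to improve the accuracy and coverage of keyphrase extraction, especially in borderline cases where it is unclear whether a word belongs to a keyphrase. On the construction side, the difficulties are keeping the annotations consistent and accurate, and balancing extractive against abstractive keyphrases. Both place higher demands on algorithm design, model training, and evaluation criteria.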

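Common scenarios

Typical use

In natural language processing, the midas/duc2001 dataset is widely used to evaluate how well keyphrase extraction and generation methods perform. Its keyphrase annotations on news articles make it a standard benchmark for both tasks.

Derived work

Research based on the midas/duc2001 dataset has produced a range of keyphrase extraction and generation algorithms, which are applied in text mining, information retrieval, and natural language understanding. The dataset has also contributed to the development of evaluation standards for these tasks.

Recent research on the dataset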