taln-ls2n/taln-archives

Name: taln-ls2n/taln-archives
Creator: taln-ls2n
Published: 2022-09-23 07:58:07
License: not specified in this listing (the dataset card below states CC-BY-4.0)

Source: Hugging Face. Updated 2022-09-23; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/taln-ls2n/taln-archives

Resource summary:

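TALN-Archives is a benchmark dataset for keyphrase extraction and generation models. It contains the abstracts of 1207 scientific papers in French, with keyphrases annotated by the authors in an uncontrolled setting. English translations are provided for part of the collection (456 fully translated and 719 partially translated documents), which supports cross-lingual and multilingual keyphrase generation experiments. Reference keyphrases are categorized under the PRMU scheme, and the dataset card describes the text pre-processing and stemming that were applied.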

Provided by:

taln-ls2n

Summary of original information

TALN-Archives Benchmark Dataset for Keyphrase Generation

Overview

Languages: French (fr), English (en)
License: CC-BY-4.0
Multilinguality: Multilingual
Task Categories: Text-mining, Text-generation
Task IDs: Keyphrase-generation, Keyphrase-extraction
Size: 1K<n<10K
Pretty Name: TALN-Archives

Dataset Details

Content

Composition: 1207 abstracts of scientific papers in French.
Annotations: Keyphrases annotated by authors in an uncontrolled setting.
Translations: English translations are available for 456 fully translated and 719 partially translated documents.

Structure

Data Fields:
- id: Unique identifier of the document.
- title: Title of the document.
- abstract: Abstract of the document.
- keyphrases: List of reference keyphrases.
- prmu: List of Present-Reordered-Mixed-Unseen categories for reference keyphrases.
- translation: Translations of title, abstract, and keyphrases in English if available.
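
As a minimal sketch of how these fields fit together, the snippet below pairs each reference keyphrase with its PRMU category. The sample record is invented for illustration, and the single-letter labels (P, R, M, U) are an assumption about how the prmu field is encoded. The load_test_split helper assumes the Hugging Face datasets library is installed and that the repository id resolves without extra arguments; depending on the library version, a configuration name or other options may be needed.

```python
from typing import Dict, List, Tuple


def load_test_split():
    """Load the test split from the Hugging Face Hub (needs network access).

    Assumption: the `datasets` library is installed and the default
    configuration of the repository is sufficient.
    """
    from datasets import load_dataset  # imported lazily; optional dependency

    return load_dataset("taln-ls2n/taln-archives", split="test")


def pair_keyphrases_with_prmu(record: Dict) -> List[Tuple[str, str]]:
    """Return (keyphrase, category) pairs for one document."""
    keyphrases = record["keyphrases"]
    categories = record["prmu"]
    if len(keyphrases) != len(categories):
        raise ValueError("keyphrases and prmu must have the same length")
    return list(zip(keyphrases, categories))


# Hypothetical record that mirrors the documented fields (not real data).
sample = {
    "id": "doc-0",
    "title": "Un titre d'exemple",
    "abstract": "Un résumé d'exemple sur l'extraction de mots-clés.",
    "keyphrases": ["extraction de mots-clés", "résumé automatique"],
    "prmu": ["P", "M"],
}

for phrase, category in pair_keyphrases_with_prmu(sample):
    print(f"{category}\t{phrase}")
```

The length check is a defensive choice: it assumes the two lists are aligned index by index, which the field descriptions suggest but do not state outright.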

Statistics

Test Split:

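| Split | # documents | # words | # keyphrases | % Present | % Reordered | % Mixed | % Unseen |
|-------|-------------|---------|--------------|-----------|-------------|---------|----------|
| Test  | 1207        | 138.3   | 4.12         | 53.83     | 12.32       | 21.69   | 12.16    |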
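
The four percentage columns sum to 100, which is consistent with a distribution of PRMU labels over reference keyphrases. The sketch below shows one way such a distribution could be computed from per-document prmu lists. It pools all keyphrases before computing shares; whether the published figures were pooled or averaged per document is not stated, so that is an assumption, as are the single-letter labels and the toy input.

```python
from collections import Counter
from typing import Dict, Iterable, List

LABELS = "PRMU"  # assumed encoding: Present, Reordered, Mixed, Unseen


def prmu_distribution(prmu_lists: Iterable[List[str]]) -> Dict[str, float]:
    """Percentage of keyphrases in each PRMU category, pooled over documents."""
    counts = Counter(label for labels in prmu_lists for label in labels)
    total = sum(counts[label] for label in LABELS)
    if total == 0:
        return {label: 0.0 for label in LABELS}
    return {label: 100.0 * counts[label] / total for label in LABELS}


def mean_keyphrases_per_document(prmu_lists: List[List[str]]) -> float:
    """Average number of reference keyphrases per document."""
    if not prmu_lists:
        return 0.0
    return sum(len(labels) for labels in prmu_lists) / len(prmu_lists)


# Toy input: two hypothetical documents, not taken from the dataset.
toy = [["P", "P", "M"], ["R", "U", "P", "M", "P"]]
print(prmu_distribution(toy))
print(mean_keyphrases_per_document(toy))
```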

Processing

Text Pre-processing: Tokenization with spaCy (fr_core_news_sm model), with a rule that keeps hyphenated words as single tokens.
Stemming: Applied with the French Snowball stemmer from NLTK before matching reference keyphrases against the source text.
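
The matching step can be sketched as follows. This is not the dataset's own pipeline: a regular expression stands in for the spaCy tokenizer (it only reproduces the rule that keeps hyphenated words together), and stemming is a pluggable function that defaults to the identity so the example runs with the standard library alone. The category definitions used here (contiguous match for Present, all words found for Reordered, some words found for Mixed, none for Unseen) follow the usual reading of the PRMU scheme; the function names and the example sentence are hypothetical.

```python
import re
from typing import Callable, List, Sequence

# Stand-in tokenizer: word characters, with in-word hyphens kept in one token.
TOKEN_RE = re.compile(r"\w+(?:-\w+)*")


def tokenize(text: str) -> List[str]:
    """Lowercase and split text, keeping hyphenated words as single tokens."""
    return TOKEN_RE.findall(text.lower())


def contains(needle: Sequence[str], haystack: Sequence[str]) -> bool:
    """True if needle occurs as a contiguous subsequence of haystack."""
    n = len(needle)
    if n == 0:
        return False
    return any(
        list(haystack[i:i + n]) == list(needle)
        for i in range(len(haystack) - n + 1)
    )


def prmu_category(
    keyphrase: str,
    text: str,
    stem: Callable[[str], str] = lambda word: word,
) -> str:
    """Classify a keyphrase as P, R, M or U with respect to a source text.

    To use a real stemmer, pass for example
    nltk.stem.snowball.SnowballStemmer("french").stem as `stem`.
    """
    doc = [stem(token) for token in tokenize(text)]
    phrase = [stem(token) for token in tokenize(keyphrase)]
    if contains(phrase, doc):
        return "P"  # appears contiguously in the text
    vocabulary = set(doc)
    found = [token for token in phrase if token in vocabulary]
    if phrase and len(found) == len(phrase):
        return "R"  # every word appears, but not contiguously
    if found:
        return "M"  # only some words appear
    return "U"  # no word appears


# Hypothetical source text for illustration only.
text = "Nous présentons une méthode d'extraction de mots-clés pour le résumé de textes."
for kp in ["extraction de mots-clés", "méthode de résumé",
           "résumé automatique", "traduction neuronale"]:
    print(prmu_category(kp, text), kp)
```

Stemming is applied to both sides before matching so that inflected forms of the same word can count as a match; with the identity default, only exact lowercase forms match.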

References

Boudin, 2013: TALN Archives: a digital archive of French research articles in Natural Language Processing.
Boudin and Gallina, 2021: Redefining Absent Keyphrases and their Effect on Retrieval Effectiveness.
