taln-ls2n/taln-archives
Source: Hugging Face | Updated: 2022-09-23 | Indexed: 2024-03-04
Download link:
https://hf-mirror.com/datasets/taln-ls2n/taln-archives
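Summary:
TALN-Archives is a dataset for benchmarking keyphrase extraction and generation models. It contains the abstracts of 1207 scientific papers in French, with keyphrases annotated by the authors in an uncontrolled setting. English translations are available for some documents (456 fully translated and 719 partially translated), which supports cross-lingual and multilingual keyphrase generation experiments. Reference keyphrases are categorized under the PRMU scheme, and details of the text pre-processing and stemming are provided.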
Provider:
taln-ls2n
Original information summary
TALN-Archives Benchmark Dataset for Keyphrase Generation
Overview
- Languages: French (fr), English (en)
- License: CC-BY-4.0
- Multilinguality: Multilingual
- Task Categories: Text-mining, Text-generation
- Task IDs: Keyphrase-generation, Keyphrase-extraction
- Size: 1K<n<10K
- Pretty Name: TALN-Archives
Dataset Details
Content
- Composition: 1207 abstracts of scientific papers in French.
- Annotations: Keyphrases annotated by authors in an uncontrolled setting.
- Translations: English translations available for 456 fully- and 719 partially-translated documents.
Structure
- Data Fields:
  - id: Unique identifier of the document.
  - title: Title of the document.
  - abstract: Abstract of the document.
  - keyphrases: List of reference keyphrases.
  - prmu: List of Present-Reordered-Mixed-Unseen categories for reference keyphrases.
  - translation: Translations of title, abstract, and keyphrases in English if available.
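The fields listed above can be pictured as a simple record type. The following is a minimal sketch, not the dataset's own schema: the `Record` class, the `keyphrases_with_categories` helper, the sample values, the single-letter category labels, and the shape of `translation` (an optional dictionary) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Record:
    """Hypothetical in-memory view of one document (field names from the card)."""
    id: str
    title: str
    abstract: str
    keyphrases: List[str]
    prmu: List[str]  # assumed: one category label per reference keyphrase
    translation: Optional[Dict[str, object]] = None  # assumed shape


def keyphrases_with_categories(rec: Record) -> List[Tuple[str, str]]:
    """Pair each reference keyphrase with its PRMU category."""
    if len(rec.keyphrases) != len(rec.prmu):
        raise ValueError("keyphrases and prmu must have the same length")
    return list(zip(rec.keyphrases, rec.prmu))


# Made-up sample, not taken from the dataset.
sample = Record(
    id="hypothetical-001",
    title="Analyse syntaxique du français",
    abstract="Nous présentons un analyseur syntaxique pour le français.",
    keyphrases=["analyse syntaxique", "grammaire"],
    prmu=["P", "U"],
)
pairs = keyphrases_with_categories(sample)
```

Pairing the two lists this way relies on `keyphrases` and `prmu` being aligned by position, which the field descriptions suggest but do not state outright.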
Statistics
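- Test Split:

| Split | # documents | # words | # keyphrases | % Present | % Reordered | % Mixed | % Unseen |
|-------|-------------|---------|--------------|-----------|-------------|---------|----------|
| Test  | 1207        | 138.3   | 4.12         | 53.83     | 12.32       | 21.69   | 12.16    |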
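Figures of this kind can be recomputed from the per-document `prmu` lists. The sketch below assumes that the categories are stored as the single letters `P`, `R`, `M`, `U` and that the percentages are taken over all reference keyphrases pooled together; neither point is stated in the card, and the function names and toy data are hypothetical.

```python
from collections import Counter
from typing import Dict, List

CATEGORIES = ("P", "R", "M", "U")  # assumed label encoding


def prmu_distribution(prmu_lists: List[List[str]]) -> Dict[str, float]:
    """Percentage of reference keyphrases in each category, pooled over documents."""
    counts = Counter(label for labels in prmu_lists for label in labels)
    total = sum(counts[c] for c in CATEGORIES)
    if total == 0:
        return {c: 0.0 for c in CATEGORIES}
    return {c: 100.0 * counts[c] / total for c in CATEGORIES}


def mean_keyphrases(prmu_lists: List[List[str]]) -> float:
    """Average number of reference keyphrases per document."""
    return sum(len(labels) for labels in prmu_lists) / len(prmu_lists)


# Toy input: three made-up documents, eight keyphrases in total.
toy = [["P", "P", "U"], ["R", "M", "P"], ["P", "M"]]
dist = prmu_distribution(toy)
avg = mean_keyphrases(toy)
```

On the toy input, four of the eight labels are `P`, so `dist["P"]` is 50.0; the four percentages always sum to 100, as they do in the table row above.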
Processing
- Text Pre-processing: Tokenization using spacy (fr_core_news_sm model) with a rule to keep words with hyphens as one token.
- Stemming: Applied using Snowball stemmer for French in nltk before matching reference keyphrases against the source text.
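The matching step described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the dataset's own pipeline: a regular expression stands in for the spacy tokenizer (it keeps hyphenated words as one token), stemming is a pluggable function that defaults to the identity, and the category rules follow the usual reading of the PRMU scheme (Present: contiguous match; Reordered: all words found but not contiguously; Mixed: some words found; Unseen: none found). All function names are hypothetical.

```python
import re
from typing import Callable, List

# Stand-in tokenizer: word characters, with internal hyphens kept in the token.
TOKEN_RE = re.compile(r"\w+(?:-\w+)*")


def tokenize(text: str) -> List[str]:
    """Lower-cased tokens; hyphenated words stay whole."""
    return [t.lower() for t in TOKEN_RE.findall(text)]


def contains_sequence(haystack: List[str], needle: List[str]) -> bool:
    """True if needle occurs as a contiguous sub-list of haystack."""
    n = len(needle)
    return n > 0 and any(
        haystack[i:i + n] == needle for i in range(len(haystack) - n + 1)
    )


def prmu_category(
    keyphrase: str, source: str, stem: Callable[[str], str] = lambda w: w
) -> str:
    """Return 'P', 'R', 'M' or 'U' for a keyphrase against the source text."""
    kp = [stem(t) for t in tokenize(keyphrase)]
    src = [stem(t) for t in tokenize(source)]
    if not kp:
        raise ValueError("empty keyphrase")
    if contains_sequence(src, kp):
        return "P"
    found = [t in set(src) for t in kp]
    if all(found):
        return "R"
    return "M" if any(found) else "U"


# Made-up source text for illustration.
source = ("Nous présentons un analyseur morpho-syntaxique "
          "pour la traduction automatique du français.")
labels = [prmu_category(k, source) for k in (
    "traduction automatique", "automatique traduction",
    "traduction neuronale", "apprentissage profond")]
```

To get closer to the processing the card describes, the `stem` argument could be given the `stem` method of `nltk.stem.snowball.SnowballStemmer("french")`, so that inflected forms match before the categories are assigned.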
References
- Boudin, 2013: TALN Archives : a digital archive of French research articles in Natural Language Processing.
- Boudin and Gallina, 2021: Redefining Absent Keyphrases and their Effect on Retrieval Effectiveness.



