taln-ls2n/taln-archives

Source: Hugging Face. Updated: 2022-09-23. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/taln-ls2n/taln-archives
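Summary:
TALN-Archives is a benchmark dataset for keyphrase extraction and generation models. It contains the abstracts of 1207 scientific papers in French, with keyphrases annotated by the authors in an uncontrolled setting. English translations are provided for some documents (456 fully translated and 719 partially translated), which supports cross-lingual and multilingual keyphrase generation experiments. The dataset also categorizes reference keyphrases using the PRMU scheme and documents its text pre-processing and stemming steps.
Provider: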
taln-ls2n
Summary of original information

TALN-Archives Benchmark Dataset for Keyphrase Generation

Overview

  • Languages: French (fr), English (en)
  • License: CC-BY-4.0
  • Multilinguality: Multilingual
  • Task Categories: Text-mining, Text-generation
  • Task IDs: Keyphrase-generation, Keyphrase-extraction
  • Size: 1K<n<10K
  • Pretty Name: TALN-Archives

Dataset Details

Content

  • Composition: 1207 abstracts of scientific papers in French.
  • Annotations: Keyphrases annotated by authors in an uncontrolled setting.
  • Translations: English translations available for 456 fully- and 719 partially-translated documents.

Structure

  • Data Fields:
    • id: Unique identifier of the document.
    • title: Title of the document.
    • abstract: Abstract of the document.
    • keyphrases: List of reference keyphrases.
    • prmu: List of Present-Reordered-Mixed-Unseen categories for reference keyphrases.
    • translation: Translations of title, abstract, and keyphrases in English if available.
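The fields listed above can be sketched as a typed record. This is a minimal illustration, not the dataset's own loader: the name `TalnRecord`, the helper `keyphrases_by_category`, and the sample record are all hypothetical, the single-letter labels `P`, `R`, `M`, `U` are an assumption about how the `prmu` field is encoded, and the structure of `translation` is left open because the card does not specify it.

```python
from typing import Any, Dict, List, TypedDict


class TalnRecord(TypedDict):
    """Hypothetical typed view of one document, mirroring the listed fields."""
    id: str
    title: str
    abstract: str
    keyphrases: List[str]
    prmu: List[str]      # assumed to hold one label per keyphrase
    translation: Any     # structure not specified in the card


def keyphrases_by_category(record: TalnRecord) -> Dict[str, List[str]]:
    """Group reference keyphrases by their PRMU label."""
    if len(record["keyphrases"]) != len(record["prmu"]):
        raise ValueError("keyphrases and prmu must have the same length")
    grouped: Dict[str, List[str]] = {}
    for phrase, label in zip(record["keyphrases"], record["prmu"]):
        grouped.setdefault(label, []).append(phrase)
    return grouped


# Made-up record for illustration only; it is not taken from the dataset.
sample: TalnRecord = {
    "id": "example-0001",
    "title": "Extraction de termes-clés",
    "abstract": "Nous présentons une méthode pour l'extraction de termes-clés.",
    "keyphrases": ["extraction de termes-clés", "apprentissage supervisé"],
    "prmu": ["P", "U"],
    "translation": None,
}

print(keyphrases_by_category(sample))
```

Pairing `keyphrases` with `prmu` by position relies on the two lists being parallel, which the field descriptions suggest but do not state outright.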

Statistics

  • Test Split:
    Split | # documents | # words | # keyphrases | % Present | % Reordered | % Mixed | % Unseen
    Test  | 1207        | 138.3   | 4.12         | 53.83     | 12.32       | 21.69   | 12.16
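Figures like those in the statistics row could be derived from the `prmu` field along the following lines. This is a sketch under assumptions: the function name `prmu_statistics` is hypothetical, the labels are assumed to be `P`, `R`, `M`, `U`, and the percentages are computed over all keyphrases pooled together, which may or may not be how the card's numbers were obtained. The toy input is made up.

```python
from collections import Counter
from typing import Dict, List


def prmu_statistics(records: List[Dict[str, List[str]]]) -> Dict[str, object]:
    """Compute document count, mean keyphrases per document, and PRMU shares."""
    if not records:
        raise ValueError("at least one record is required")
    # Pool the labels of every document into a single counter.
    counts = Counter(label for rec in records for label in rec["prmu"])
    total = sum(counts.values())
    percent = {
        label: (100.0 * counts[label] / total if total else 0.0)
        for label in "PRMU"
    }
    return {
        "documents": len(records),
        "avg_keyphrases": total / len(records),
        "percent": percent,
    }


# Toy input for illustration only; not taken from the dataset.
toy = [{"prmu": ["P", "P", "R"]}, {"prmu": ["M"]}]
print(prmu_statistics(toy))
```

The four percentages in the row sum to 100 (53.83 + 12.32 + 21.69 + 12.16), which is consistent with every reference keyphrase receiving exactly one category.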

Processing

  • Text Pre-processing: Tokenization with spaCy (fr_core_news_sm model), plus a special rule that keeps hyphenated words as a single token.
  • Stemming: The Snowball stemmer for French (as implemented in NLTK) is applied before reference keyphrases are matched against the source text.
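The matching step described above can be sketched as follows. This is an illustration under stated assumptions, not the dataset's actual pipeline: a regular expression stands in for the spaCy tokenizer (it only imitates the rule that keeps hyphenated words whole), the stemmer is a pluggable function that defaults to the identity, and the category definitions follow the usual reading of the PRMU scheme (Present: the keyphrase occurs as a contiguous sequence; Reordered: all of its words occur, but not contiguously; Mixed: only some of its words occur; Unseen: none occur). The names `tokenize` and `prmu_label` are hypothetical.

```python
import re
from typing import Callable, List

# Stand-in tokenizer: word characters, with hyphenated words kept as one token.
TOKEN_RE = re.compile(r"\w+(?:-\w+)*")


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into tokens, keeping hyphens inside words."""
    return [token.lower() for token in TOKEN_RE.findall(text)]


def _contains_sequence(haystack: List[str], needle: List[str]) -> bool:
    """Return True if needle occurs as a contiguous run inside haystack."""
    n = len(needle)
    return any(haystack[i:i + n] == needle for i in range(len(haystack) - n + 1))


def prmu_label(keyphrase: str, source: str,
               stem: Callable[[str], str] = lambda word: word) -> str:
    """Assign a P/R/M/U label to a keyphrase relative to a source text."""
    kp = [stem(token) for token in tokenize(keyphrase)]
    src = [stem(token) for token in tokenize(source)]
    if not kp:
        raise ValueError("keyphrase contains no tokens")
    if _contains_sequence(src, kp):
        return "P"  # Present: contiguous match
    vocabulary = set(src)
    found = [token in vocabulary for token in kp]
    if all(found):
        return "R"  # Reordered: all words occur, but not contiguously
    if any(found):
        return "M"  # Mixed: only some words occur
    return "U"      # Unseen: no word occurs


# Made-up source text for illustration only.
text = "Nous présentons une méthode pour l'extraction automatique de termes-clés."
for phrase in ["extraction automatique", "automatique extraction",
               "extraction supervisée", "réseaux neuronaux"]:
    print(phrase, "->", prmu_label(phrase, text))
```

To approximate the stemming step, a French Snowball stemmer can be passed in, for example `stem=SnowballStemmer("french").stem` with `SnowballStemmer` imported from `nltk.stem.snowball`. Results may still differ from the dataset's labels because the tokenizer here is not spaCy's.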

References

  • Boudin, 2013: TALN Archives: a digital archive of French research articles in Natural Language Processing.
  • Boudin and Gallina, 2021: Redefining Absent Keyphrases and their Effect on Retrieval Effectiveness.