
kanhatakeyama/nature-family-CC-papers

Source: Hugging Face. Updated 2024-02-01; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/kanhatakeyama/nature-family-CC-papers
Description:
# Text dataset of Nature family journals

## About

- A scientific document dataset was constructed using open-access papers published by Springer Nature (https://www.springernature.com/). We focused on several journals under the Creative Commons License, including Nature Communications, npj Computational Materials, Nature Computational Science, Communications Chemistry, Communications Materials, and Scientific Reports. From these sources, we collected approximately 65,000 papers published between the 2010s and 2023, containing keywords such as chemistry, synthesis, molecule, polymer, material, and device.
- License
  - Articles are distributed under the Creative Commons family license
    - e.g., CC 4.0, CC BY-ND 4.0, ...
  - Please check raw_data/ref_list.json for each license of the papers
- Files
  - raw_data folder
    - ref_list.json
      - Raw text data of articles
      - About 65k papers
      - Each record has 'License', 'abstract', 'author', 'bib', 'doi', 'info', 'main', 'other', 'title', and 'ref_id' information.
    - context_list.json
      - List of introduction part of the articles
    - formatted_questions.json
      - List of automatically generated questions for some introduction texts
  - database folder
    - target: texts that are related to the evaluation data
      - instruct_eng: test questions and answers
      - abst_eng: abstract
      - dconcn_eng: conclusion
      - intro_eng: introduction
      - intro_esp_ger_ita: automatically translated introduction
        - NOTE: Texts with CC BY-ND license are excluded because distribution of translated ones is not allowed
    - irrelevant 1,2: texts that are not related to the evaluation data
  - smallDB folder
    - Formatted datasets used for our paper
      - qa.json: test questions and answer keywords
      - context_ig_paraphrase_plus_oa: context text, their style-changed versions, and irrelevant texts included in "irrelevant 1,2"

---
license: cc
language:
- en
---
Provided by:
kanhatakeyama
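Summary of the original information

Text dataset of Nature family journals

About

  • The dataset was built from open-access papers published by Springer Nature, focusing on several journals under Creative Commons licenses: Nature Communications, npj Computational Materials, Nature Computational Science, Communications Chemistry, Communications Materials, and Scientific Reports. It contains approximately 65,000 papers published between the 2010s and 2023 that include keywords such as chemistry, synthesis, molecule, polymer, material, and device.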

License

  • Articles are distributed under Creative Commons family licenses, e.g., CC 4.0, CC BY-ND 4.0, and others. The specific license of each paper is recorded in raw_data/ref_list.json.
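
Because licenses vary per paper, it can help to tally them and set aside NoDerivatives records before redistributing modified text. The sketch below is illustrative only: it assumes the records from raw_data/ref_list.json have already been parsed into a list of dictionaries and that the 'License' field is a string such as "CC BY-ND 4.0" or a Creative Commons URL. The exact format of that field is not stated in the description, and the helper names and sample records are hypothetical.

```python
import re
from collections import Counter


def license_counts(records):
    """Count how many records carry each 'License' value."""
    return Counter(str(rec.get("License", "unknown")) for rec in records)


def is_no_derivatives(license_text):
    """Return True if the license string contains an 'ND' token.

    Assumption: the license is written either as a short name
    ("CC BY-ND 4.0") or as a URL (".../licenses/by-nd/4.0/").
    """
    tokens = re.findall(r"[a-z]+", str(license_text).lower())
    return "nd" in tokens


def drop_no_derivatives(records):
    """Keep only records whose license does not include NoDerivatives."""
    return [rec for rec in records if not is_no_derivatives(rec.get("License", ""))]


if __name__ == "__main__":
    # Hypothetical sample records, not taken from the dataset.
    sample = [
        {"ref_id": 0, "License": "CC BY 4.0"},
        {"ref_id": 1, "License": "CC BY-ND 4.0"},
    ]
    print(license_counts(sample))
    print([rec["ref_id"] for rec in drop_no_derivatives(sample)])
```

If the real 'License' values follow a different convention, `is_no_derivatives` would need to be adapted accordingly.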

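File structure

  • raw_data folder

    • ref_list.json: raw text data for about 65,000 papers; each record has License, abstract, author, bib, doi, info, main, other, title, and ref_id fields.
    • context_list.json: list of the introduction sections of the articles.
    • formatted_questions.json: list of automatically generated questions for some of the introduction texts.
  • database folder

    • target: texts related to the evaluation data
      • instruct_eng: test questions and answers
      • abst_eng: abstracts
      • dconcn_eng: conclusions
      • intro_eng: introductions
      • intro_esp_ger_ita: automatically translated introductions (note: texts under the CC BY-ND license are excluded because distributing translated versions is not allowed)
    • irrelevant 1,2: texts not related to the evaluation data
  • smallDB folder

    • Formatted datasets used for the paper
      • qa.json: test questions and answer keywords
      • context_ig_paraphrase_plus_oa: context texts, their style-changed versions, and the irrelevant texts included in "irrelevant 1,2"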
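
As a rough illustration of working with the record layout described above, the following sketch loads a local copy of ref_list.json and reports which of the ten documented fields a record lacks. The field names come from the description; everything else is an assumption, in particular that the file's top level is either a JSON array of records or an object mapping keys to records. The function names are hypothetical.

```python
import json

# Field names as listed in the dataset description.
EXPECTED_FIELDS = (
    "License", "abstract", "author", "bib", "doi",
    "info", "main", "other", "title", "ref_id",
)


def load_records(path):
    """Load records from a local JSON file.

    Assumption: the top level is a list of record dicts, or a dict
    whose values are record dicts. The real layout may differ.
    """
    with open(path, encoding="utf-8") as fh:
        data = json.load(fh)
    if isinstance(data, dict):
        return list(data.values())
    return list(data)


def missing_fields(record):
    """Return the documented field names that are absent from a record."""
    return [name for name in EXPECTED_FIELDS if name not in record]


def summarize(records):
    """Return (total records, number of records missing at least one field)."""
    incomplete = sum(1 for rec in records if missing_fields(rec))
    return len(records), incomplete
```

A typical use would be `summarize(load_records("raw_data/ref_list.json"))` on a downloaded copy, which gives a quick check that the file matches the documented schema before further processing.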