kanhatakeyama/nature-family-CC-papers
Source: Hugging Face. Updated 2024-02-01; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/kanhatakeyama/nature-family-CC-papers
Description:
# Text dataset of Nature family journals
## About
- This scientific document dataset was constructed from open-access papers published by Springer Nature (https://www.springernature.com/). We focused on several journals published under Creative Commons licenses, including Nature Communications, npj Computational Materials, Nature Computational Science, Communications Chemistry, Communications Materials, and Scientific Reports. From these sources, we collected approximately 65,000 papers published from the 2010s through 2023 that contain keywords such as chemistry, synthesis, molecule, polymer, material, and device.
- License
  - Articles are distributed under Creative Commons family licenses
    - e.g., CC 4.0, CC BY-ND 4.0, ...
  - Please check raw_data/ref_list.json for the license of each paper
- Files
  - raw_data folder
    - ref_list.json
      - Raw text data of the articles
      - About 65k papers
      - Each record has 'License', 'abstract', 'author', 'bib', 'doi', 'info', 'main', 'other', 'title', and 'ref_id' information.
    - context_list.json
      - List of the introduction sections of the articles
    - formatted_questions.json
      - List of automatically generated questions for some of the introduction texts
  - database folder
    - target: texts that are related to the evaluation data
      - instruct_eng: test questions and answers
      - abst_eng: abstract
      - dconcn_eng: conclusion
      - intro_eng: introduction
      - intro_esp_ger_ita: automatically translated introduction
        - NOTE: Texts with a CC BY-ND license are excluded because distribution of translated versions is not allowed
    - irrelevant 1,2: texts that are not related to the evaluation data
  - smallDB folder
    - Formatted datasets used for our paper
    - qa.json: test questions and answer keywords
    - context_ig_paraphrase_plus_oa: context texts, their style-changed versions, and the irrelevant texts included in "irrelevant 1,2"
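
Because each record in raw_data/ref_list.json carries its own 'License' field, and CC BY-ND texts were excluded from the translated subset, a per-record license check is a natural first step before reusing the data. The following is a minimal sketch, not the authors' code. It rests on several assumptions: the file is taken to hold either a list of record dicts or a dict of records keyed by an ID, the 'License' value is taken to be a string in which NoDerivatives licenses contain the token "ND", and the helper names (`iter_records`, `license_counts`, `allows_derivatives`, `load_records`) are hypothetical.

```python
import json
import re
from collections import Counter


def iter_records(data):
    """Yield record dicts from a list of records or a dict keyed by ID.

    The on-disk layout of ref_list.json is an assumption, so both
    shapes are accepted.
    """
    if isinstance(data, dict):
        yield from data.values()
    else:
        yield from data


def license_counts(records):
    """Count how many records carry each value of the 'License' field."""
    return Counter(rec.get("License", "unknown") for rec in records)


def allows_derivatives(rec):
    """Return True only when a license is present and is not NoDerivatives.

    Assumes ND licenses contain the standalone token 'ND'
    (e.g. 'CC BY-ND 4.0'). Records without a license are treated
    conservatively as not allowing derivatives.
    """
    license_text = rec.get("License")
    if not license_text:
        return False
    return re.search(r"\bnd\b", str(license_text).lower()) is None


def load_records(path):
    """Read a JSON file and return its records as a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return list(iter_records(json.load(f)))


# Usage sketch (the local path is hypothetical):
# records = load_records("raw_data/ref_list.json")
# print(license_counts(records).most_common())
# translatable = [r for r in records if allows_derivatives(r)]
```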
---
license: cc
language:
- en
---
Provider:
kanhatakeyama
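Summary of the original information
Text dataset of Nature family journals
About
- The dataset was built from open-access papers published by Springer Nature, focusing on several journals that use Creative Commons licenses, including Nature Communications, npj Computational Materials, Nature Computational Science, Communications Chemistry, Communications Materials, and Scientific Reports. It contains approximately 65,000 papers published between the 2010s and 2023 that include keywords such as chemistry, synthesis, molecule, polymer, material, and device.
License
- Articles are distributed under Creative Commons family licenses, e.g., CC 4.0 and CC BY-ND 4.0. The license of each paper is recorded in raw_data/ref_list.json.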
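File structure
- raw_data folder
  - ref_list.json: raw text data of about 65,000 papers; each record contains License, abstract, author, bib, doi, info, main, other, title, and ref_id.
  - context_list.json: list of the introduction sections of the articles.
  - formatted_questions.json: automatically generated questions for some of the introduction texts.
- database folder
  - target: texts related to the evaluation data
    - instruct_eng: test questions and answers
    - abst_eng: abstract
    - dconcn_eng: conclusion
    - intro_eng: introduction
    - intro_esp_ger_ita: automatically translated introduction (note: texts with a CC BY-ND license are excluded because distributing translated versions is not allowed)
  - irrelevant 1,2: texts unrelated to the evaluation data
- smallDB folder
  - Formatted datasets used for the paper
  - qa.json: test questions and answer keywords
  - context_ig_paraphrase_plus_oa: context texts, their style-changed versions, and the irrelevant texts included in "irrelevant 1,2"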



