paniniDot/sci_lay

Name: paniniDot/sci_lay
Creator: paniniDot
Published: 2023-09-05 16:39:49
License: No description available

Source: Hugging Face. Updated 2023-09-05; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/paniniDot/sci_lay


Resource description:

---
license: cc-by-4.0
task_categories:
- summarization
tags:
- medical
pretty_name: Sci Lay - Biomedic Articles Lay Summarization Dataset
size_categories:
- 10K<n<100K
- 1K<n<10K
source_datasets:
- original
dataset_info:
- config_name: all
  features:
  - name: doi
    dtype: string
  - name: pmcid
    dtype: string
  - name: title
    dtype: string
  - name: plain_text
    dtype: string
  - name: technical_text
    dtype: string
  - name: full_text
    dtype: string
  - name: journal
    dtype: string
  - name: topics
    sequence: string
  - name: keywords
    sequence: string
  splits:
  - name: train
    num_examples: 35026
    num_bytes: 1579515071
  - name: validation
    num_examples: 4380
    num_bytes: 197196187
  - name: test
    num_examples: 4384
    num_bytes: 198833964
- config_name: NC
  features:
  - name: doi
    dtype: string
  - name: pmcid
    dtype: string
  - name: title
    dtype: string
  - name: plain_text
    dtype: string
  - name: technical_text
    dtype: string
  - name: full_text
    dtype: string
  - name: journal
    dtype: string
  - name: topics
    sequence: string
  - name: keywords
    sequence: string
  splits:
  - name: train
    num_examples: 5549
    num_bytes: 286453072
  - name: validation
    num_examples: 694
    num_bytes: 35652636
  - name: test
    num_examples: 694
    num_bytes: 35869803
- config_name: A
  features:
  - name: doi
    dtype: string
  - name: pmcid
    dtype: string
  - name: title
    dtype: string
  - name: plain_text
    dtype: string
  - name: technical_text
    dtype: string
  - name: full_text
    dtype: string
  - name: journal
    dtype: string
  - name: topics
    sequence: string
  - name: keywords
    sequence: string
  splits:
  - name: train
    num_examples: 3909
    num_bytes: 128936951
  - name: validation
    num_examples: 489
    num_bytes: 1303884
  - name: test
    num_examples: 489
    num_bytes: 1303884
- config_name: PLGEN
  features:
  - name: doi
    dtype: string
  - name: pmcid
    dtype: string
  - name: title
    dtype: string
  - name: plain_text
    dtype: string
  - name: technical_text
    dtype: string
  - name: full_text
    dtype: string
  - name: journal
    dtype: string
  - name: topics
    sequence: string
  - name: keywords
    sequence: string
  splits:
  - name: train
    num_examples: 3087
    num_bytes: 9651536
  - name: validation
    num_examples: 386
    num_bytes: 1195717
  - name: test
    num_examples: 386
    num_bytes: 1204735
- config_name: PLPAT
  features:
  - name: doi
    dtype: string
  - name: pmcid
    dtype: string
  - name: title
    dtype: string
  - name: plain_text
    dtype: string
  - name: technical_text
    dtype: string
  - name: full_text
    dtype: string
  - name: journal
    dtype: string
  - name: topics
    sequence: string
  - name: keywords
    sequence: string
  splits:
  - name: train
    num_examples: 2920
    num_bytes: 9311936
  - name: validation
    num_examples: 365
    num_bytes: 1161792
  - name: test
    num_examples: 365
    num_bytes: 1148729
- config_name: PLCB
  features:
  - name: doi
    dtype: string
  - name: pmcid
    dtype: string
  - name: title
    dtype: string
  - name: plain_text
    dtype: string
  - name: technical_text
    dtype: string
  - name: full_text
    dtype: string
  - name: journal
    dtype: string
  - name: topics
    sequence: string
  - name: keywords
    sequence: string
  splits:
  - name: train
    num_examples: 2589
    num_bytes: 149165851
  - name: validation
    num_examples: 324
    num_bytes: 1009541
  - name: test
    num_examples: 324
    num_bytes: 1013732
- config_name: PLNTD
  features:
  - name: doi
    dtype: string
  - name: pmcid
    dtype: string
  - name: title
    dtype: string
  - name: plain_text
    dtype: string
  - name: technical_text
    dtype: string
  - name: full_text
    dtype: string
  - name: journal
    dtype: string
  - name: topics
    sequence: string
  - name: keywords
    sequence: string
  splits:
  - name: train
    num_examples: 2289
    num_bytes: 7958581
  - name: validation
    num_examples: 286
    num_bytes: 990392
  - name: test
    num_examples: 287
    num_bytes: 996549
- config_name: B
  features:
  - name: doi
    dtype: string
  - name: pmcid
    dtype: string
  - name: title
    dtype: string
  - name: plain_text
    dtype: string
  - name: technical_text
    dtype: string
  - name: full_text
    dtype: string
  - name: journal
    dtype: string
  - name: topics
    sequence: string
  - name: keywords
    sequence: string
  splits:
  - name: train
    num_examples: 1617
    num_bytes: 57956055
  - name: validation
    num_examples: 202
    num_bytes: 547314
  - name: test
    num_examples: 203
    num_bytes: 537459
- config_name: I
  features:
  - name: doi
    dtype: string
  - name: pmcid
    dtype: string
  - name: title
    dtype: string
  - name: plain_text
    dtype: string
  - name: technical_text
    dtype: string
  - name: full_text
    dtype: string
  - name: journal
    dtype: string
  - name: topics
    sequence: string
  - name: keywords
    sequence: string
  splits:
  - name: train
    num_examples: 1181
    num_bytes: 37682107
  - name: validation
    num_examples: 148
    num_bytes: 393826
  - name: test
    num_examples: 148
    num_bytes: 390039
- config_name: PLB
  features:
  - name: doi
    dtype: string
  - name: pmcid
    dtype: string
  - name: title
    dtype: string
  - name: plain_text
    dtype: string
  - name: technical_text
    dtype: string
  - name: full_text
    dtype: string
  - name: journal
    dtype: string
  - name: topics
    sequence: string
  - name: keywords
    sequence: string
  splits:
  - name: train
    num_examples: 896
    num_bytes: 54106804
  - name: validation
    num_examples: 112
    num_bytes: 350955
  - name: test
    num_examples: 113
    num_bytes: 352922
- config_name: CB
  features:
  - name: doi
    dtype: string
  - name: pmcid
    dtype: string
  - name: title
    dtype: string
  - name: plain_text
    dtype: string
  - name: technical_text
    dtype: string
  - name: full_text
    dtype: string
  - name: journal
    dtype: string
  - name: topics
    sequence: string
  - name: keywords
    sequence: string
  splits:
  - name: train
    num_examples: 867
    num_bytes: 43533134
  - name: validation
    num_examples: 108
    num_bytes: 5664682
  - name: test
    num_examples: 109
    num_bytes: 172812
- config_name: SD
  features:
  - name: doi
    dtype: string
  - name: pmcid
    dtype: string
  - name: title
    dtype: string
  - name: plain_text
    dtype: string
  - name: technical_text
    dtype: string
  - name: full_text
    dtype: string
  - name: journal
    dtype: string
  - name: topics
    sequence: string
  - name: keywords
    sequence: string
  splits:
  - name: train
    num_examples: 725
    num_bytes: 23671697
  - name: validation
    num_examples: 91
    num_bytes: 3033467
  - name: test
    num_examples: 91
    num_bytes: 2972947
- config_name: MBIO
  features:
  - name: doi
    dtype: string
  - name: pmcid
    dtype: string
  - name: title
    dtype: string
  - name: plain_text
    dtype: string
  - name: technical_text
    dtype: string
  - name: full_text
    dtype: string
  - name: journal
    dtype: string
  - name: topics
    sequence: string
  - name: keywords
    sequence: string
  splits:
  - name: train
    num_examples: 607
    num_bytes: 1602641
  - name: validation
    num_examples: 76
    num_bytes: 203737
  - name: test
    num_examples: 76
    num_bytes: 200707
- config_name: C
  features:
  - name: doi
    dtype: string
  - name: pmcid
    dtype: string
  - name: title
    dtype: string
  - name: plain_text
    dtype: string
  - name: technical_text
    dtype: string
  - name: full_text
    dtype: string
  - name: journal
    dtype: string
  - name: topics
    sequence: string
  - name: keywords
    sequence: string
  splits:
  - name: train
    num_examples: 6782
    num_bytes: 242721690
  - name: validation
    num_examples: 848
    num_bytes: 30735056
  - name: test
    num_examples: 848
    num_bytes: 31018214
- config_name: OTHER
  features:
  - name: doi
    dtype: string
  - name: pmcid
    dtype: string
  - name: title
    dtype: string
  - name: plain_text
    dtype: string
  - name: technical_text
    dtype: string
  - name: full_text
    dtype: string
  - name: journal
    dtype: string
  - name: topics
    sequence: string
  - name: keywords
    sequence: string
  splits:
  - name: train
    num_examples: 2008
    num_bytes: 89866504
  - name: validation
    num_examples: 251
    num_bytes: 11316433
  - name: test
    num_examples: 251
    num_bytes: 11564599
config_names:
- all
- NC
- A
- PLGEN
- PLPAT
- PLCB
- PLNTD
- B
- I
- PLB
- CB
- SD
- MBIO
- C
- OTHER
---

# Dataset Card for Sci Lay

## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [Sci Lay](https://github.com/paniniDot/summarization-model)
- **Repository:**
- **Paper:**
- **Leaderboard:**
- **Point of Contact:** [Mattia Panni](mailto:mattia.panni@studio.unibo.it)

### Dataset Summary

SCILAY comprises 43,790 instances, each representing a scientific article in the biomedical domain.

Each instance in the dataset includes the following components:

- plain_text: A plain-language summary of the scientific article. It is written in simple, accessible language and is intended to be understandable by a wide audience.
- technical_text: The abstract of the scientific article. It provides a detailed, technical description of the research conducted in the article.
- full_text: The complete text of the scientific article.

In addition to the textual content, each instance is associated with the following metadata:

- Keywords: Keywords that capture the main topics and themes addressed in the article.
- Journal: The journal in which the article is published, providing context about the source of the research.
- DOI (Digital Object Identifier): A unique identifier for the article, facilitating easy referencing.

The main objective of the SCILAY dataset is to support the development and evaluation of text summarization models that can effectively simplify complex scientific language while retaining the essential information.
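
As an illustration of how the components described above might feed a lay summarization model, the following is a minimal sketch that turns one record into a (source, target) pair, with `plain_text` as the target. The helper names `make_summarization_pair` and `compression_ratio` and the toy record are hypothetical and are not part of the dataset or of any library; only the field names come from the card.

```python
from typing import Dict, Optional, Tuple


def make_summarization_pair(
    record: Dict[str, object], source_field: str = "technical_text"
) -> Optional[Tuple[str, str]]:
    """Return (source, target) for lay summarization, or None if a text is empty.

    The target is always `plain_text`; the source is `technical_text` by
    default, and `full_text` can be passed instead for long-document setups.
    """
    source = str(record.get(source_field) or "").strip()
    target = str(record.get("plain_text") or "").strip()
    if not source or not target:
        return None
    return source, target


def compression_ratio(source: str, target: str) -> float:
    """Whitespace-token length of the target divided by that of the source."""
    return len(target.split()) / len(source.split())


# Hypothetical toy record that only mimics the dataset's field names.
toy_record = {
    "plain_text": "A gene affects litter size in goats.",
    "technical_text": (
        "Copy number variations in the PPP3CA gene are associated "
        "with first-birth litter size in goats."
    ),
    "full_text": "...",
}

pair = make_summarization_pair(toy_record)
if pair is not None:
    source, target = pair
    print(f"target/source length ratio: {compression_ratio(source, target):.2f}")
```

Returning `None` for incomplete records is one possible design choice; it lets a caller drop such records before training rather than fail on them.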

Each article is published by a scientific journal. There are fourteen such journal classifications (fifteen configurations when `all` is counted):

- NC: Nature Communications
- A: Animals : an Open Access Journal from MDPI
- PLGEN: PLoS Genetics
- PLPAT: PLoS Pathogens
- PLCB: PLoS Computational Biology
- PLNTD: PLoS Neglected Tropical Diseases
- B: Biology
- I: Insects
- PLB: PLoS Biology
- CB: Communications Biology
- SD: Scientific Data
- MBIO: mBio
- C: Cancers
- OTHER: additional journals that, taken individually, would not have contributed enough instances

The current defaults are version 1.0.0 (cased raw strings) and 'all' journals:

```python
from datasets import load_dataset

ds = load_dataset("paniniDot/sci_lay")  # default is 'all' journals
ds = load_dataset("paniniDot/sci_lay", "all")  # the same as above
ds = load_dataset("paniniDot/sci_lay", "NC")  # only 'NC' journal (Nature Communications)
ds = load_dataset("paniniDot/sci_lay", journals=["NC", "A"])  # a custom selection of journals
```

### Supported Tasks and Leaderboards

[More Information Needed]

### Languages

English

## Dataset Structure

### Data Instances

Each instance contains `doi`, `pmcid`, `plain_text`, `technical_text`, `full_text`, `journal`, `topics`, and `keywords`. Each field was extracted by scraping articles in XML and HTML format.

```
{
    'doi': '10.3390/ani12040445',
    'pmcid': 'PMC8868321',
    'plain_text': 'PPP3CA is one of the candidate genes for goat reproduction, but no studies have been carried out yet. Therefore, the purpose of this study was to determine the associations between copy number variations in the goat PPP3CA gene and litter size and semen quality in goats, including Shaanbei white cashmere goats (SBWC) (n = 353) and Guizhou Heima (GZHM) goats (n = 64). Based on the association analysis, the results showed that only CNV1 (copy number variation 1) and CNV2 (copy number variation 2) were distinctly related to the first-birth litter size in female goats (p = 7.6802 × 10−11; p = 5.0895 × 10−9), and they were also significantly associated with the semen quality of SBWC goats (p < 0.05). These findings prove that the PPP3CA gene plays an important role in reproduction traits in goats.',
    'technical_text': 'Copy number variations (CNVs) have many forms of variation structure, and they play an important role in the research of variety diversity, biological evolution and disease correlation. Since CNVs have a greater impact on gene regulation and expression, more studies are being finalized on CNVs in important livestock and poultry species. The protein phosphatase 3 catalytic subunit alpha (PPP3CA) is a key candidate gene involved in the goat fecundity trait, and has important effects on precocious puberty, estrogen signal transduction pathways and oocyte meiosis. Additionally, PPP3CA also has a dephosphorylation effect in the process of spermatogonial stem cell meiosis and spermatogenesis. So far, there is no research on the relationship between the copy number variations of the PPP3CA gene and reproduction traits. Therefore, the purpose of this study was to determine the association between copy number variations in the goat PPP3CA gene and litter size and semen quality in Shaanbei white cashmere goats (SBWC) (n = 353) and Guizhou Heima goats (n = 64). Based on the association analysis, the results showed that only CNV1 and CNV2 within the PPP3CA gene were distinctly related to the first-birth litter size in female goats (p = 7.6802 × 10−11; p = 5.0895 × 10−9, respectively) and they were also significantly associated with the semen quality of SBWC goats (p < 0.05). In addition, individuals with Loss genotypes demonstrated better phenotypic performance compared to those with other types. Therefore, CNV1 and CNV2 of the PPP3CA gene are potentially useful for breeding, as they are linked to important goat reproduction traits.',
    'full_text': '...',
    'journal': 'Animals : an Open Access Journal from MDPI',
    'topics': ['Article'],
    'keywords': ['goat', 'PPP3CA', 'copy number variation (CNV)', 'litter size', 'semen quality']
}
```

### Data Fields

- `doi`: The Digital Object Identifier, a unique alphanumeric string assigned to a digital document such as a research paper, article, or dataset. Not all instances have it.
- `pmcid`: A unique identifier in the [PubMed Central library](https://www.ncbi.nlm.nih.gov/pmc/) database. Not all instances have it.
- `plain_text`: The summary of the article in plain English.
- `technical_text`: The abstract of the article.
- `full_text`: The complete article.
- `journal`: The journal that published the article.
- `topics`: A list of the types under which the article is classified (e.g. Research Article, Review, etc.). Not all instances have it.
- `keywords`: A list of the keywords of the article. Not all instances have it.

### Data Splits

|       | train | validation | test |
|-------|-------|------------|------|
| all   | 35026 | 4380       | 4384 |
| NC    | 5549  | 694        | 694  |
| A     | 3909  | 489        | 489  |
| PLGEN | 3087  | 386        | 386  |
| PLPAT | 2920  | 365        | 365  |
| PLCB  | 2589  | 324        | 324  |
| PLNTD | 2289  | 286        | 287  |
| B     | 1617  | 202        | 203  |
| I     | 1181  | 148        | 148  |
| PLB   | 896   | 112        | 113  |
| CB    | 867   | 108        | 109  |
| SD    | 725   | 91         | 91   |
| MBIO  | 607   | 76         | 76   |
| C     | 6782  | 848        | 848  |
| OTHER | 2008  | 251        | 251  |

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

[More Information Needed]

### Citation Information

[More Information Needed]
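
The figures in the Data Splits section can be cross-checked with a few lines of arithmetic. The sketch below copies the split sizes from that table into a plain dictionary (the names `SPLITS` and `journal_totals` are illustrative only) and checks that the per-journal configurations add up to the `all` configuration and to the 43,790 instances stated in the Dataset Summary. It does not download anything.

```python
from typing import Dict, Tuple

# Split sizes copied from the Data Splits table: (train, validation, test).
SPLITS: Dict[str, Tuple[int, int, int]] = {
    "all": (35026, 4380, 4384),
    "NC": (5549, 694, 694),
    "A": (3909, 489, 489),
    "PLGEN": (3087, 386, 386),
    "PLPAT": (2920, 365, 365),
    "PLCB": (2589, 324, 324),
    "PLNTD": (2289, 286, 287),
    "B": (1617, 202, 203),
    "I": (1181, 148, 148),
    "PLB": (896, 112, 113),
    "CB": (867, 108, 109),
    "SD": (725, 91, 91),
    "MBIO": (607, 76, 76),
    "C": (6782, 848, 848),
    "OTHER": (2008, 251, 251),
}


def journal_totals(splits: Dict[str, Tuple[int, int, int]]) -> Tuple[int, ...]:
    """Sum each split column over every configuration except 'all'."""
    per_journal = [sizes for name, sizes in splits.items() if name != "all"]
    return tuple(sum(column) for column in zip(*per_journal))


totals = journal_totals(SPLITS)
print(totals == SPLITS["all"])  # → True
print(sum(SPLITS["all"]))       # → 43790
```

The three `all` splits are therefore a partition of the per-journal data, in roughly an 80/10/10 proportion.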

Provider:

paniniDot

