
s2orc

ModelScope community · Updated 2025-11-12 · Indexed 2025-01-11
Download link:
https://modelscope.cn/datasets/sentence-transformers/s2orc
Resource description:
# Dataset Card for S2ORC

This dataset contains titles, abstracts, and citations from scientific papers in the [Semantic Scholar Open Research Corpus (S2ORC)](https://github.com/allenai/s2orc). It can be, and has been, used to train embedding models, and it works out of the box to train or finetune [Sentence Transformer](https://sbert.net/) models.

In our experiments, title-abstract pairs result in the highest performance, followed by title-citation pairs and then abstract-citation pairs.

## Dataset Subsets

### `title-abstract-pair` subset

* Columns: "title", "abstract"
* Column types: `str`, `str`
* Examples:
    ```python
    {
      "title": "Syntheses, Structures and Properties of Two Transition Metal-Flexible Ligand Coordination Polymers",
      "abstract": "Two coordination polymers based on 3,5-bis(4-carboxyphenylmethyloxy) benzoic acid (H3L), [M(HL)]·2H2O M = Mn(1), Co(2), have been synthesized under hydrothermal conditions. Their structures have been determined by single-crystal X-ray diffraction and further characterized by elemental analysis, IR spectra and TGA. The two complexes possess 3D framework with diamond channels resulting from the trans-configuration of the flexible ligand and three coordination modes, 3(η2, η1), 2(η1, η1), η1, of carboxyl groups in the ligand. The framework can be represented with Schlafli symbol of (48·66)(47·66). The wall of the channel consists of left- or right-handed helical polymeric chains. UV–visible–NIR and photoluminescence spectra, magnetic properties of 1 and 2 have also been discussed.",
    }
    ```
* Collection strategy: Reading the S2ORC titles-abstract dataset from [embedding-training-data](https://huggingface.co/datasets/sentence-transformers/embedding-training-data).
* Deduplicated: No

### `title-citation-pair` subset

* Columns: "title", "citation"
* Column types: `str`, `str`
* Examples:
    ```python
    {
      "title": "An apparent neuroleptic malignant syndrome without extrapyramidal symptoms upon initiation of clozapine therapy: report of a case and results of a clozapine rechallenge.",
      "citation": "Antipsychotic Rechallenge After Neuroleptic Malignant Syndrome with Catatonic Features"
    }
    ```
* Collection strategy: Reading the S2ORC titles-citation dataset from [embedding-training-data](https://huggingface.co/datasets/sentence-transformers/embedding-training-data) and considering each title together with the first citation as a sample.
* Deduplicated: No

### `abstract-citation-pair` subset

* Columns: "abstract", "citation"
* Column types: `str`, `str`
* Examples:
    ```python
    {
      "abstract": "The androgen receptor (AR) is a ligand-regulated transcription factor that stimulates cell growth and differentiation in androgen-responsive tissues. The AR N terminus contains two activation functions (AF-1a and AF-1b) that are necessary for maximal transcriptional enhancement by the receptor; however, the mechanisms and components regulating AR transcriptional activation are not fully understood. We sought to identify novel factors that interact with the AR N terminus from an androgen-stimulated human prostate cancer cell library using a yeast two-hybrid approach designed to identify proteins that interact with transcriptional activation domains. A 157-amino acid protein termed ART-27 was cloned and shown to interact predominantly with the AR153–336, containing AF-1a and a part of AF-1b, localize to the nucleus and increase the transcriptional activity of AR when overexpressed in cultured mammalian cells. ART-27 also enhanced the transcriptional activation by AR153–336 fused to the LexA DNA-binding domain but not other AR N-terminal subdomains, suggesting that ART-27 exerts its effect via an interaction with a defined region of the AR N terminus. ART-27 interacts with AR in nuclear extracts from LNCaP cells in a ligand-independent manner. Interestingly, velocity gradient sedimentation of HeLa nuclear extracts suggests that native ART-27 is part of a multiprotein complex. ART-27 is expressed in a variety of human tissues, including sites of androgen action such as prostate and skeletal muscle, and is conserved throughout evolution. Thus, ART-27 is a novel cofactor that interacts with the AR N terminus and plays a role in facilitating receptor-induced transcriptional activation.",
      "citation": "Androgen-insensitivity syndromes in 46,XY fetuses result in various degrees of impairment in genital virilization.1 These syndromes are caused by mutations in the androgen receptor gene that result in decreased binding of androgen to the receptor.2–9 As a consequence, the transcriptional activity of the androgen–androgen-receptor complex is reduced, and therefore, genital virilization is reduced. The androgen receptor, like other steroid hormone receptors, has two major transactivation domains10 — activation function 1 (AF-1) in the N-terminal region11–13 and activation function 2 (AF-2) in the C-terminal ligand-binding domain14 — that interact with the target genes directly as well as indirectly by . . .",
    }
    ```
* Collection strategy: Reading the S2ORC abstract-citation dataset from [embedding-training-data](https://huggingface.co/datasets/sentence-transformers/embedding-training-data) and considering each citation together with the first abstract as a sample.
* Deduplicated: No
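
Because none of the three subsets is deduplicated, a training pipeline may want to drop repeated pairs before use. The following is a minimal sketch, assuming each row is a plain `dict` with the two columns of its subset. `dedupe_pairs` and `_normalize` are hypothetical helpers written for this illustration, not part of any library, and the matching rule (case-insensitive, whitespace-collapsed) is an assumption rather than anything the dataset card prescribes.

```python
from typing import Iterable, Iterator


def _normalize(text: str) -> str:
    # Collapse whitespace runs and ignore case so trivially different copies match.
    return " ".join(text.split()).casefold()


def dedupe_pairs(rows: Iterable[dict], columns: tuple[str, str]) -> Iterator[dict]:
    """Yield only the first row seen for each normalized pair of column values."""
    seen: set[tuple[str, str]] = set()
    for row in rows:
        key = (_normalize(row[columns[0]]), _normalize(row[columns[1]]))
        if key in seen:
            continue  # an equivalent pair was already yielded
        seen.add(key)
        yield row


# Toy rows shaped like the `title-citation-pair` subset (invented for illustration).
rows = [
    {"title": "A Study of X", "citation": "Follow-up on X"},
    {"title": "a study  of x", "citation": "Follow-up on X"},  # same pair, different casing/spacing
    {"title": "A Study of X", "citation": "Another paper"},
]
unique = list(dedupe_pairs(rows, ("title", "citation")))
print(len(unique))
```

The generator keeps memory proportional to the number of distinct pairs, which may still be large for a corpus of this size; a hash of the key could be stored instead if that becomes a concern.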

Provider:
maas
Created:
2025-01-06
Background overview
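The S2ORC dataset contains the titles, abstracts, and citations of scientific papers and is suited to training embedding models, particularly Sentence Transformer models. It is split into three subsets (title-abstract pairs, title-citation pairs, and abstract-citation pairs), each with a clearly defined data structure and examples.
The above content was collected and summarized by 遇见数据集.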
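
Since each subset has exactly two string columns, rows can be checked against that structure before training. This is a minimal sketch: the subset and column names come from the dataset card, while `SUBSET_COLUMNS` and `check_row` are hypothetical names introduced here for illustration.

```python
# Subset name -> expected columns, as listed in the dataset card.
SUBSET_COLUMNS = {
    "title-abstract-pair": ("title", "abstract"),
    "title-citation-pair": ("title", "citation"),
    "abstract-citation-pair": ("abstract", "citation"),
}


def check_row(subset: str, row: dict) -> list[str]:
    """Return a list of problems found in one row; an empty list means the row fits."""
    expected = SUBSET_COLUMNS[subset]
    problems = []
    for column in expected:
        if column not in row:
            problems.append(f"missing column: {column}")
        elif not isinstance(row[column], str):
            problems.append(f"{column} is {type(row[column]).__name__}, expected str")
    # Report columns the card does not list for this subset.
    extra = sorted(set(row) - set(expected))
    if extra:
        problems.append(f"unexpected columns: {', '.join(extra)}")
    return problems


print(check_row("title-citation-pair", {"title": "T", "citation": "C"}))
print(check_row("title-abstract-pair", {"title": "T"}))
```

Returning a list of messages rather than raising on the first problem makes it easier to log every issue in a malformed row at once.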