
CarlyChin/scientific_papers

Hugging Face, updated 2026-04-21, indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/CarlyChin/scientific_papers
Resource description:
---
annotations_creators:
- found
language:
- en
language_creators:
- found
license:
- unknown
multilinguality:
- monolingual
pretty_name: ScientificPapers
size_categories:
- 100K<n<1M
source_datasets:
- original
task_categories:
- summarization
task_ids: []
paperswithcode_id: null
tags:
- abstractive-summarization
dataset_info:
- config_name: arxiv
  features:
  - name: article
    dtype: string
  - name: abstract
    dtype: string
  - name: section_names
    dtype: string
  splits:
  - name: train
    num_bytes: 7148341992
    num_examples: 203037
  - name: validation
    num_bytes: 217125524
    num_examples: 6436
  - name: test
    num_bytes: 217514961
    num_examples: 6440
  download_size: 4504646347
  dataset_size: 7582982477
- config_name: pubmed
  features:
  - name: article
    dtype: string
  - name: abstract
    dtype: string
  - name: section_names
    dtype: string
  splits:
  - name: train
    num_bytes: 2252027383
    num_examples: 119924
  - name: validation
    num_bytes: 127403398
    num_examples: 6633
  - name: test
    num_bytes: 127184448
    num_examples: 6658
  download_size: 4504646347
  dataset_size: 2506615229
---

# Dataset Card for "scientific_papers"

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:**
- **Repository:** https://github.com/armancohan/long-summarization
- **Paper:** [A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents](https://arxiv.org/abs/1804.05685)
- **Point of Contact:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Size of downloaded dataset files:** 9.01 GB
- **Size of the generated dataset:** 10.09 GB
- **Total amount of disk used:** 19.10 GB

### Dataset Summary

The scientific papers dataset contains two sets of long, structured documents, obtained from the ArXiv and PubMed OpenAccess repositories.

Both "arxiv" and "pubmed" have three features:

- article: the body of the document, with paragraphs separated by "\n".
- abstract: the abstract of the document, with paragraphs separated by "\n".
- section_names: the titles of the sections, separated by "\n".

### Supported Tasks and Leaderboards

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Languages

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Dataset Structure

### Data Instances

#### arxiv

- **Size of downloaded dataset files:** 4.50 GB
- **Size of the generated dataset:** 7.58 GB
- **Total amount of disk used:** 12.09 GB

An example of 'train' looks as follows.
```
This example was too long and was cropped:

{
    "abstract": "\" we have studied the leptonic decay @xmath0 , via the decay channel @xmath1 , using a sample of tagged @xmath2 decays collected...",
    "article": "\"the leptonic decays of a charged pseudoscalar meson @xmath7 are processes of the type @xmath8 , where @xmath9 , @xmath10 , or @...",
    "section_names": "[sec:introduction]introduction\n[sec:detector]data and the cleo- detector\n[sec:analysys]analysis method\n[sec:conclusion]summary"
}
```

#### pubmed

- **Size of downloaded dataset files:** 4.50 GB
- **Size of the generated dataset:** 2.51 GB
- **Total amount of disk used:** 7.01 GB

An example of 'validation' looks as follows.

```
This example was too long and was cropped:

{
    "abstract": "\" background and aim : there is lack of substantial indian data on venous thromboembolism ( vte ) . \\n the aim of this study was...",
    "article": "\"approximately , one - third of patients with symptomatic vte manifests pe , whereas two - thirds manifest dvt alone .\\nboth dvt...",
    "section_names": "\"Introduction\\nSubjects and Methods\\nResults\\nDemographics and characteristics of venous thromboembolism patients\\nRisk factors ..."
}
```

### Data Fields

The data fields are the same among all splits.

#### arxiv

- `article`: a `string` feature.
- `abstract`: a `string` feature.
- `section_names`: a `string` feature.

#### pubmed

- `article`: a `string` feature.
- `abstract`: a `string` feature.
- `section_names`: a `string` feature.
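Because all three fields are plain strings, a consumer usually has to split them before working with paragraphs or section titles. The following is a minimal sketch of that step, assuming the separator is a real newline character as the arxiv example suggests. The helper names `split_field` and `parse_record` and the shortened `sample` record are hypothetical and are not part of the dataset or of any library.

```python
from typing import Dict, List

# Field names taken from the Data Fields section of the card.
FIELDS = ("article", "abstract", "section_names")


def split_field(value: str) -> List[str]:
    """Split a newline-delimited field into stripped, non-empty items."""
    return [item.strip() for item in value.split("\n") if item.strip()]


def parse_record(record: Dict[str, str]) -> Dict[str, List[str]]:
    """Return the three string fields of one record as lists of strings."""
    return {name: split_field(record[name]) for name in FIELDS}


# Hypothetical record, shortened for illustration; real records are far longer.
sample = {
    "article": "first paragraph of the body .\nsecond paragraph of the body .",
    "abstract": "a one - paragraph abstract .",
    "section_names": "Introduction\nSubjects and Methods\nResults",
}

parsed = parse_record(sample)
print(parsed["section_names"])
print(len(parsed["article"]), "article paragraphs")
```

If a record turns out to contain the two-character sequence backslash plus `n` instead of a newline, as the escaped pubmed example might suggest, the separator passed to `split` would need to change accordingly.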
### Data Splits

| name   |  train | validation | test |
|--------|-------:|-----------:|-----:|
| arxiv  | 203037 |       6436 | 6440 |
| pubmed | 119924 |       6633 | 6658 |

## Dataset Creation

### Curation Rationale

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

#### Who are the source language producers?

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Annotations

#### Annotation process

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

#### Who are the annotators?

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Personal and Sensitive Information

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Discussion of Biases

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Other Known Limitations

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Additional Information

### Dataset Curators

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Licensing Information

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Citation Information

```
@article{Cohan_2018,
   title={A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents},
   url={http://dx.doi.org/10.18653/v1/n18-2097},
   DOI={10.18653/v1/n18-2097},
   journal={Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)},
   publisher={Association for Computational Linguistics},
   author={Cohan, Arman and Dernoncourt, Franck and Kim, Doo Soon and Bui, Trung and Kim, Seokhwan and Chang, Walter and Goharian, Nazli},
   year={2018}
}
```

### Contributions

Thanks to [@thomwolf](https://github.com/thomwolf), [@jplu](https://github.com/jplu), [@lewtun](https://github.com/lewtun), [@patrickvonplaten](https://github.com/patrickvonplaten) for adding this dataset.
Provided by:
CarlyChin