
reichenbach/arxiv_ppr_embeds

Hugging Face, updated 2023-08-06, indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/reichenbach/arxiv_ppr_embeds
Resource description:
---
annotations_creators:
- found
language:
- en
language_creators:
- found
license:
- unknown
multilinguality:
- monolingual
pretty_name: ScientificPapers
size_categories:
- 100K<n<1M
source_datasets:
- scientific_papers
task_categories:
- summarization
task_ids: []
paperswithcode_id: null
tags:
- abstractive-summarization
dataset_info:
  features:
  - name: article
    dtype: string
  - name: abstract
    dtype: string
  - name: embeddings
    sequence: float64
  splits:
  - name: train
    num_bytes: 8367611540
    num_examples: 203037
  - name: validation
    num_bytes: 256178362
    num_examples: 6440
  - name: test
    num_bytes: 255771184
    num_examples: 6436
  download_size: 4718720913
  dataset_size: 8879561086
---

# Dataset Card for "scientific_papers"

This dataset is derived from https://huggingface.co/datasets/scientific_papers, with embeddings added via the RAG model (https://huggingface.co/docs/transformers/model_doc/rag), using the base model trained on Natural Questions. It was created for Retrieval-Augmented Generation examples and experiments.

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:**
- **Repository:** https://github.com/armancohan/long-summarization
- **Paper:** [A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents](https://arxiv.org/abs/1804.05685)
- **Point of Contact:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Dataset Summary

The scientific papers dataset contains one set of long, structured documents, obtained from the arXiv repository.

### Supported Tasks and Leaderboards

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Languages

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Dataset Structure

### Data Instances

#### arxiv

- **Size of downloaded dataset files:** 4.50 GB
- **Size of the generated dataset:** 7.58 GB
- **Total amount of disk used:** 12.09 GB

An example from the 'train' split looks as follows.

```
This example was too long and was cropped:

{
    "abstract": "\" we have studied the leptonic decay @xmath0 , via the decay channel @xmath1 , using a sample of tagged @xmath2 decays collected...",
    "article": "\"the leptonic decays of a charged pseudoscalar meson @xmath7 are processes of the type @xmath8 , where @xmath9 , @xmath10 , or @...",
    "section_names": "[sec:introduction]introduction\n[sec:detector]data and the cleo- detector\n[sec:analysys]analysis method\n[sec:conclusion]summary"
}
```

### Data Fields

The data fields are the same among all splits.

#### arxiv

- `article`: a `string` feature.
- `abstract`: a `string` feature.
- `section_names`: a `string` feature.
- `embeddings`: a 768-dimensional `float` vector.

### Data Splits

| name  |  train | validation | test |
|-------|-------:|-----------:|-----:|
| arxiv | 203037 |       6436 | 6440 |

## Dataset Creation

### Curation Rationale

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

#### Who are the source language producers?

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Annotations

#### Annotation process

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

#### Who are the annotators?

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Personal and Sensitive Information

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Discussion of Biases

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Other Known Limitations

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Additional Information

### Dataset Curators

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Licensing Information

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Citation Information

```
@article{Cohan_2018,
  title={A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents},
  url={http://dx.doi.org/10.18653/v1/n18-2097},
  DOI={10.18653/v1/n18-2097},
  journal={Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)},
  publisher={Association for Computational Linguistics},
  author={Cohan, Arman and Dernoncourt, Franck and Kim, Doo Soon and Bui, Trung and Kim, Seokhwan and Chang, Walter and Goharian, Nazli},
  year={2018}
}
```

### Contributions

Thanks to [@thomwolf](https://github.com/thomwolf), [@jplu](https://github.com/jplu), [@lewtun](https://github.com/lewtun), [@patrickvonplaten](https://github.com/patrickvonplaten) for adding this dataset.
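
Since the card states that the embeddings were added for Retrieval-Augmented Generation experiments, the following is a minimal sketch of how records shaped like this dataset's rows (`article`, `abstract`, `embeddings`) could be ranked against a query vector by inner product. The toy records, the 3-dimensional vectors, and the helper names `inner_product` and `top_k` are illustrative assumptions, not part of the dataset or of any library; real rows carry 768-dimensional vectors, and the query vector is assumed to come from an encoder compatible with the one used to build the embeddings.

```python
from typing import List, Sequence, Tuple, TypedDict


class PaperRecord(TypedDict):
    """Row layout assumed from the card's feature list."""
    article: str
    abstract: str
    embeddings: List[float]


def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))


def top_k(query: Sequence[float], records: Sequence[PaperRecord],
          k: int = 2) -> List[Tuple[int, float]]:
    """Return (row index, score) pairs for the k highest-scoring records."""
    scored = [(inner_product(query, r["embeddings"]), i)
              for i, r in enumerate(records)]
    # Highest score first; ties broken by the lower row index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(i, score) for score, i in scored[:k]]


# Toy stand-ins for dataset rows (hypothetical text, 3-dim vectors for brevity).
toy_records: List[PaperRecord] = [
    {"article": "text A", "abstract": "summary A", "embeddings": [0.9, 0.1, 0.0]},
    {"article": "text B", "abstract": "summary B", "embeddings": [0.1, 0.8, 0.1]},
    {"article": "text C", "abstract": "summary C", "embeddings": [0.0, 0.2, 0.9]},
]

query_vector = [1.0, 0.0, 0.2]  # assumed output of a compatible query encoder
ranking = top_k(query_vector, toy_records, k=2)
for row_index, score in ranking:
    print(row_index, toy_records[row_index]["abstract"], round(score, 3))
```

With the real data, the toy list could be replaced by rows obtained through `datasets.load_dataset("reichenbach/arxiv_ppr_embeds", split="validation")`. A brute-force scan like this is only practical for small subsets; a vector index would normally be used for the full training split.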
Provider:
reichenbach
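Summary of original information

Dataset overview

Dataset name

  • Name: ScientificPapers
  • Alias: scientific_papers

Dataset attributes

  • Language: English (en)
  • Multilinguality: monolingual
  • License: unknown
  • Size category: 100K<n<1M
  • Task category: summarization
  • Tags: abstractive-summarization

Dataset structure

  • Features:
    • article: string
    • abstract: string
    • embeddings: sequence of float64
  • Splits:
    • Train: 203037 examples, 8367611540 bytes
    • Validation: 6440 examples, 256178362 bytes
    • Test: 6436 examples, 255771184 bytes
  • Download size: 4718720913 bytes
  • Dataset size: 8879561086 bytes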
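
As a quick sanity check on the figures listed above, the short sketch below sums the per-split byte counts and example counts and converts the two size totals to GiB. The variable and function names are illustrative; the numbers are copied from the metadata in this listing.

```python
# Split metadata as listed in the dataset's front matter.
SPLITS = {
    "train": {"num_examples": 203037, "num_bytes": 8367611540},
    "validation": {"num_examples": 6440, "num_bytes": 256178362},
    "test": {"num_examples": 6436, "num_bytes": 255771184},
}
DOWNLOAD_SIZE = 4718720913  # bytes
DATASET_SIZE = 8879561086   # bytes


def to_gib(num_bytes: int) -> float:
    """Convert a byte count to gibibytes (2**30 bytes)."""
    return num_bytes / 2**30


total_examples = sum(s["num_examples"] for s in SPLITS.values())
total_bytes = sum(s["num_bytes"] for s in SPLITS.values())

print("total examples:", total_examples)
print("split bytes match dataset_size:", total_bytes == DATASET_SIZE)
print("download size (GiB):", round(to_gib(DOWNLOAD_SIZE), 2))
print("dataset size (GiB):", round(to_gib(DATASET_SIZE), 2))
```

The per-split byte counts add up exactly to the stated dataset size, so the three splits account for the whole generated dataset.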

Dataset source

  • Source dataset: scientific_papers
  • Purpose: Retrieval-Augmented Generation examples and experiments

Citation information

@article{Cohan_2018, title={A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents}, url={http://dx.doi.org/10.18653/v1/n18-2097}, DOI={10.18653/v1/n18-2097}, journal={Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)}, publisher={Association for Computational Linguistics}, author={Cohan, Arman and Dernoncourt, Franck and Kim, Doo Soon and Bui, Trung and Kim, Seokhwan and Chang, Walter and Goharian, Nazli}, year={2018} }
