
MLP-SEMO/IT_datasets

Hugging Face, updated 2024-05-30, indexed 2024-06-12
Download link:
https://hf-mirror.com/datasets/MLP-SEMO/IT_datasets
Resource overview:
---
dataset_info:
- config_name: BigPatent
  features:
  - name: context
    dtype: string
  - name: output
    dtype: string
  - name: instruction
    dtype: string
  - name: context_sentence
    sequence: string
  splits:
  - name: train
    num_bytes: 3056976240
    num_examples: 50000
  download_size: 1272481947
  dataset_size: 3056976240
- config_name: BookSum
  features:
  - name: output
    dtype: string
  - name: context
    dtype: string
  - name: instruction
    dtype: string
  - name: context_sentence
    sequence: string
  splits:
  - name: train
    num_bytes: 476317871
    num_examples: 9600
  - name: validation
    num_bytes: 63750022
    num_examples: 1484
  - name: test
    num_bytes: 71934433
    num_examples: 1431
  download_size: 363438025
  dataset_size: 612002326
- config_name: BoolQ
  features:
  - name: instruction
    dtype: string
  - name: context
    dtype: string
  - name: output
    dtype: string
  - name: context_sentence
    sequence: string
  splits:
  - name: train
    num_bytes: 12853661
    num_examples: 9427
  - name: validation
    num_bytes: 4410696
    num_examples: 3270
  - name: test
    num_bytes: 4386660
    num_examples: 3245
  download_size: 12607245
  dataset_size: 21651017
- config_name: CNN-DM
  features:
  - name: context
    dtype: string
  - name: output
    dtype: string
  - name: instruction
    dtype: string
  - name: context_sentence
    sequence: string
  splits:
  - name: train
    num_bytes: 863427270
    num_examples: 100000
  - name: validation
    num_bytes: 112409140
    num_examples: 13368
  - name: test
    num_bytes: 97428338
    num_examples: 11490
  download_size: 652210853
  dataset_size: 1073264748
- config_name: CosmosQA
  features:
  - name: context
    dtype: string
  - name: instruction
    dtype: string
  - name: output
    dtype: string
  - name: context_sentence
    sequence: string
  splits:
  - name: train
    num_bytes: 26971402
    num_examples: 25262
  - name: test
    num_bytes: 8004035
    num_examples: 6963
  - name: validation
    num_bytes: 3421792
    num_examples: 2985
  download_size: 15773346
  dataset_size: 38397229
- config_name: DROP
  features:
  - name: context
    dtype: string
  - name: instruction
    dtype: string
  - name: output
    dtype: string
  - name: context_sentence
    sequence: string
  splits:
  - name: train
    num_bytes: 26295805
    num_examples: 10000
  - name: validation
    num_bytes: 21971393
    num_examples: 9535
  download_size: 14620500
  dataset_size: 48267198
- config_name: GovReport
  features:
  - name: context
    dtype: string
  - name: output
    dtype: string
  - name: instruction
    dtype: string
  - name: context_sentence
    sequence: string
  splits:
  - name: train
    num_bytes: 1868279205
    num_examples: 17517
  - name: validation
    num_bytes: 108895887
    num_examples: 973
  - name: test
    num_bytes: 100349892
    num_examples: 973
  download_size: 1001067529
  dataset_size: 2077524984
- config_name: HotpotQA
  features:
  - name: instruction
    dtype: string
  - name: output
    dtype: string
  - name: context
    sequence: string
  - name: context_sentence
    sequence: string
  splits:
  - name: train
    num_bytes: 1048709644
    num_examples: 90447
  - name: validation
    num_bytes: 86820188
    num_examples: 7405
  download_size: 671874576
  dataset_size: 1135529832
- config_name: LongAlpaca
  features:
  - name: context
    dtype: string
  - name: instruction
    dtype: string
  - name: output
    dtype: string
  - name: context_sentence
    sequence: string
  splits:
  - name: train
    num_bytes: 853542732
    num_examples: 8937
  download_size: 436405773
  dataset_size: 853542732
- config_name: MultiNews
  features:
  - name: context
    dtype: string
  - name: output
    dtype: string
  - name: instruction
    dtype: string
  - name: context_sentence
    sequence: string
  splits:
  - name: train
    num_bytes: 1063566489
    num_examples: 44972
  download_size: 618228866
  dataset_size: 1063566489
- config_name: MultiRC
  features:
  - name: context
    dtype: string
  - name: instruction
    dtype: string
  - name: output
    dtype: string
  - name: context_sentence
    sequence: string
  splits:
  - name: train
    num_bytes: 39225163
    num_examples: 12025
  download_size: 1607874
  dataset_size: 39225163
- config_name: NarrativeQA
  features:
  - name: context
    dtype: string
  - name: instruction
    dtype: string
  - name: output
    dtype: string
  - name: context_sentence
    sequence: string
  splits:
  - name: train
    num_bytes: 21589984139
    num_examples: 32747
  download_size: 10012303798
  dataset_size: 21589984139
- config_name: QMsum
  features:
  - name: context
    dtype: string
  - name: output
    dtype: string
  - name: instruction
    dtype: string
  - name: context_sentence
    sequence: string
  splits:
  - name: train
    num_bytes: 130848491
    num_examples: 1257
  download_size: 43933066
  dataset_size: 130848491
- config_name: Qasper
  features:
  - name: context
    dtype: string
  - name: instruction
    dtype: string
  - name: output
    dtype: string
  - name: context_sentence
    sequence: string
  splits:
  - name: train
    num_bytes: 132355093
    num_examples: 2567
  download_size: 41697628
  dataset_size: 132355093
- config_name: Quality
  features:
  - name: instruction
    dtype: string
  - name: output
    dtype: string
  - name: context
    dtype: string
  - name: context_sentence
    sequence: string
  splits:
  - name: train
    num_bytes: 125840855
    num_examples: 2523
  download_size: 18496863
  dataset_size: 125840855
- config_name: ReCoRD
  features:
  - name: context
    dtype: string
  - name: instruction
    dtype: string
  - name: output
    dtype: string
  - name: context_sentence
    sequence: string
  splits:
  - name: train
    num_bytes: 238582023
    num_examples: 100730
  - name: validation
    num_bytes: 23362645
    num_examples: 10000
  - name: test
    num_bytes: 23217267
    num_examples: 10000
  download_size: 118874876
  dataset_size: 285161935
- config_name: SQuAD
  features:
  - name: context
    dtype: string
  - name: instruction
    dtype: string
  - name: output
    dtype: string
  - name: context_sentence
    sequence: string
  splits:
  - name: train
    num_bytes: 148929842
    num_examples: 87599
  - name: validation
    num_bytes: 18980126
    num_examples: 10570
  download_size: 26422546
  dataset_size: 167909968
- config_name: XSum
  features:
  - name: context
    dtype: string
  - name: output
    dtype: string
  - name: instruction
    dtype: string
  - name: context_sentence
    sequence: string
  splits:
  - name: train
    num_bytes: 467899254
    num_examples: 100000
  - name: validation
    num_bytes: 52521877
    num_examples: 11332
  - name: test
    num_bytes: 53464225
    num_examples: 11334
  download_size: 358953960
  dataset_size: 573885356
configs:
- config_name: BigPatent
  data_files:
  - split: train
    path: BigPatent/train-*
- config_name: BookSum
  data_files:
  - split: train
    path: BookSum/train-*
  - split: validation
    path: BookSum/validation-*
  - split: test
    path: BookSum/test-*
- config_name: BoolQ
  data_files:
  - split: train
    path: BoolQ/train-*
  - split: validation
    path: BoolQ/validation-*
  - split: test
    path: BoolQ/test-*
- config_name: CNN-DM
  data_files:
  - split: train
    path: CNN-DM/train-*
  - split: validation
    path: CNN-DM/validation-*
  - split: test
    path: CNN-DM/test-*
- config_name: CosmosQA
  data_files:
  - split: train
    path: CosmosQA/train-*
  - split: test
    path: CosmosQA/test-*
  - split: validation
    path: CosmosQA/validation-*
- config_name: DROP
  data_files:
  - split: train
    path: DROP/train-*
  - split: validation
    path: DROP/validation-*
- config_name: GovReport
  data_files:
  - split: train
    path: GovReport/train-*
  - split: validation
    path: GovReport/validation-*
  - split: test
    path: GovReport/test-*
- config_name: HotpotQA
  data_files:
  - split: train
    path: HotpotQA/train-*
  - split: validation
    path: HotpotQA/validation-*
- config_name: LongAlpaca
  data_files:
  - split: train
    path: LongAlpaca/train-*
- config_name: MultiNews
  data_files:
  - split: train
    path: MultiNews/train-*
- config_name: MultiRC
  data_files:
  - split: train
    path: MultiRC/train-*
- config_name: NarrativeQA
  data_files:
  - split: train
    path: NarrativeQA/train-*
- config_name: QMsum
  data_files:
  - split: train
    path: QMsum/train-*
- config_name: Qasper
  data_files:
  - split: train
    path: Qasper/train-*
- config_name: Quality
  data_files:
  - split: train
    path: Quality/train-*
- config_name: ReCoRD
  data_files:
  - split: train
    path: ReCoRD/train-*
  - split: validation
    path: ReCoRD/validation-*
  - split: test
    path: ReCoRD/test-*
- config_name: SQuAD
  data_files:
  - split: train
    path: SQuAD/train-*
  - split: validation
    path: SQuAD/validation-*
- config_name: XSum
  data_files:
  - split: train
    path: XSum/train-*
  - split: validation
    path: XSum/validation-*
  - split: test
    path: XSum/test-*
---
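
The `configs` section maps each configuration name to its split files, so one configuration can be loaded by name. The following is a minimal sketch, assuming the Hugging Face `datasets` library is installed and the repository is reachable. The helper `load_it_config` and the `SPLITS` table are illustrative names, not part of the dataset; the table is copied from the `configs` section.

```python
# Splits available per configuration, copied from the README's `configs` section.
SPLITS = {
    "BigPatent": ["train"],
    "BookSum": ["train", "validation", "test"],
    "BoolQ": ["train", "validation", "test"],
    "CNN-DM": ["train", "validation", "test"],
    "CosmosQA": ["train", "test", "validation"],
    "DROP": ["train", "validation"],
    "GovReport": ["train", "validation", "test"],
    "HotpotQA": ["train", "validation"],
    "LongAlpaca": ["train"],
    "MultiNews": ["train"],
    "MultiRC": ["train"],
    "NarrativeQA": ["train"],
    "QMsum": ["train"],
    "Qasper": ["train"],
    "Quality": ["train"],
    "ReCoRD": ["train", "validation", "test"],
    "SQuAD": ["train", "validation"],
    "XSum": ["train", "validation", "test"],
}


def load_it_config(config_name, split="train", streaming=False):
    """Load one configuration of MLP-SEMO/IT_datasets (hypothetical helper)."""
    if config_name not in SPLITS:
        raise ValueError(f"unknown configuration: {config_name!r}")
    if split not in SPLITS[config_name]:
        raise ValueError(
            f"{config_name} has no {split!r} split; available: {SPLITS[config_name]}"
        )
    # Imported lazily so the checks above work even without the library installed.
    from datasets import load_dataset

    return load_dataset(
        "MLP-SEMO/IT_datasets", config_name, split=split, streaming=streaming
    )
```

For example, `load_it_config("BoolQ", "validation")` would request the BoolQ validation split. To go through the mirror in the download link, setting the `HF_ENDPOINT` environment variable to `https://hf-mirror.com` before importing `datasets` is the usual approach, though this has not been verified against this repository.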

The README describes 18 dataset configurations. Each configuration lists four features (context, output, instruction, and context_sentence), its available splits (always train, and for some configurations validation and test) with the number of examples and bytes per split, and its download size and dataset size. In every configuration context_sentence is a sequence of strings; context is a string except in HotpotQA, where it is a sequence of strings.
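
The shared schema can be written down as a typed record. This is a minimal sketch based only on the feature types listed in the README; the names `ITRecord` and `record_problems` are hypothetical, and the sample record in the usage note is invented for illustration.

```python
from typing import List, TypedDict, Union


class ITRecord(TypedDict):
    """One example, following the feature list in the README (hypothetical type)."""

    instruction: str
    context: Union[str, List[str]]  # a list of strings only in HotpotQA
    output: str
    context_sentence: List[str]


def _is_str_list(value) -> bool:
    """True if value is a list whose items are all strings."""
    return isinstance(value, list) and all(isinstance(item, str) for item in value)


def record_problems(record: dict) -> List[str]:
    """Return human-readable schema problems; an empty list means the record fits."""
    problems = []
    for name in ("instruction", "output"):
        if not isinstance(record.get(name), str):
            problems.append(f"{name} should be a string")
    context = record.get("context")
    if not (isinstance(context, str) or _is_str_list(context)):
        problems.append("context should be a string or a list of strings")
    if not _is_str_list(record.get("context_sentence")):
        problems.append("context_sentence should be a list of strings")
    return problems
```

Calling `record_problems({"instruction": "Summarize.", "context": "A. B.", "output": "A.", "context_sentence": ["A.", "B."]})` returns an empty list, since every field has the expected type.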
Provided by:
MLP-SEMO
Summary of original information

Dataset overview

BigPatent

  • Features:
    • context: string
    • output: string
    • instruction: string
    • context_sentence: sequence of strings
  • Splits:
    • train: 50000 examples, 3056976240 bytes
  • Download size: 1272481947 bytes
  • Dataset size: 3056976240 bytes

BookSum

  • Features:
    • output: string
    • context: string
    • instruction: string
    • context_sentence: sequence of strings
  • Splits:
    • train: 9600 examples, 476317871 bytes
    • validation: 1484 examples, 63750022 bytes
    • test: 1431 examples, 71934433 bytes
  • Download size: 363438025 bytes
  • Dataset size: 612002326 bytes
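
The BookSum figures illustrate how the size fields relate: the per-split byte counts add up to the dataset size, while the download size is smaller. A short sketch of that arithmetic follows. The variable and function names are illustrative, and reading the gap as on-disk compression of the stored files is an assumption, not something the README states.

```python
# Byte counts copied from the BookSum entry above.
BOOKSUM_SPLIT_BYTES = {
    "train": 476317871,
    "validation": 63750022,
    "test": 71934433,
}
BOOKSUM_DATASET_SIZE = 612002326
BOOKSUM_DOWNLOAD_SIZE = 363438025


def splits_match_total(split_bytes, dataset_size):
    """Check that the per-split byte counts sum to the reported dataset size."""
    return sum(split_bytes.values()) == dataset_size


splits_consistent = splits_match_total(BOOKSUM_SPLIT_BYTES, BOOKSUM_DATASET_SIZE)
# Fraction of the dataset size that actually has to be downloaded.
download_ratio = BOOKSUM_DOWNLOAD_SIZE / BOOKSUM_DATASET_SIZE

print(f"splits sum to dataset size: {splits_consistent}")
print(f"download / dataset size: {download_ratio:.1%}")
```

The same check can be run for any other configuration by substituting its numbers.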

BoolQ

  • Features:
    • instruction: string
    • context: string
    • output: string
    • context_sentence: sequence of strings
  • Splits:
    • train: 9427 examples, 12853661 bytes
    • validation: 3270 examples, 4410696 bytes
    • test: 3245 examples, 4386660 bytes
  • Download size: 12607245 bytes
  • Dataset size: 21651017 bytes

CNN-DM

  • Features:
    • context: string
    • output: string
    • instruction: string
    • context_sentence: sequence of strings
  • Splits:
    • train: 100000 examples, 863427270 bytes
    • validation: 13368 examples, 112409140 bytes
    • test: 11490 examples, 97428338 bytes
  • Download size: 652210853 bytes
  • Dataset size: 1073264748 bytes

CosmosQA

  • Features:
    • context: string
    • instruction: string
    • output: string
    • context_sentence: sequence of strings
  • Splits:
    • train: 25262 examples, 26971402 bytes
    • validation: 2985 examples, 3421792 bytes
    • test: 6963 examples, 8004035 bytes
  • Download size: 15773346 bytes
  • Dataset size: 38397229 bytes

DROP

  • Features:
    • context: string
    • instruction: string
    • output: string
    • context_sentence: sequence of strings
  • Splits:
    • train: 10000 examples, 26295805 bytes
    • validation: 9535 examples, 21971393 bytes
  • Download size: 14620500 bytes
  • Dataset size: 48267198 bytes

GovReport

  • Features:
    • context: string
    • output: string
    • instruction: string
    • context_sentence: sequence of strings
  • Splits:
    • train: 17517 examples, 1868279205 bytes
    • validation: 973 examples, 108895887 bytes
    • test: 973 examples, 100349892 bytes
  • Download size: 1001067529 bytes
  • Dataset size: 2077524984 bytes

HotpotQA

  • Features:
    • instruction: string
    • output: string
    • context: sequence of strings
    • context_sentence: sequence of strings
  • Splits:
    • train: 90447 examples, 1048709644 bytes
    • validation: 7405 examples, 86820188 bytes
  • Download size: 671874576 bytes
  • Dataset size: 1135529832 bytes
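
HotpotQA is the one configuration whose `context` is a sequence of strings rather than a single string, so code that mixes configurations may want to normalize it. A minimal sketch follows; the helper name `context_as_text` and the blank-line separator are assumptions made for illustration, not a convention defined by the dataset.

```python
from typing import List, Union


def context_as_text(context: Union[str, List[str]], sep: str = "\n\n") -> str:
    """Return context as one string, joining list-valued contexts with `sep`."""
    if isinstance(context, str):
        # Every configuration other than HotpotQA already stores a plain string.
        return context
    return sep.join(context)
```

With this helper, a HotpotQA record and a SQuAD record can be passed through the same downstream code path.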

LongAlpaca

  • Features:
    • context: string
    • instruction: string
    • output: string
    • context_sentence: sequence of strings
  • Splits:
    • train: 8937 examples, 853542732 bytes
  • Download size: 436405773 bytes
  • Dataset size: 853542732 bytes

MultiNews

  • Features:
    • context: string
    • output: string
    • instruction: string
    • context_sentence: sequence of strings
  • Splits:
    • train: 44972 examples, 1063566489 bytes
  • Download size: 618228866 bytes
  • Dataset size: 1063566489 bytes

MultiRC

  • Features:
    • context: string
    • instruction: string
    • output: string
    • context_sentence: sequence of strings
  • Splits:
    • train: 12025 examples, 39225163 bytes
  • Download size: 1607874 bytes
  • Dataset size: 39225163 bytes

NarrativeQA

  • Features:
    • context: string
    • instruction: string
    • output: string
    • context_sentence: sequence of strings
  • Splits:
    • train: 32747 examples, 21589984139 bytes
  • Download size: 10012303798 bytes
  • Dataset size: 21589984139 bytes
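
NarrativeQA is by far the largest configuration, which is easier to see once the raw byte counts are converted to binary units. The sketch below does that conversion; `format_bytes` is a hypothetical helper, and the use of binary units (1 GiB = 1024³ bytes) is a choice made here, not one stated by the README.

```python
def format_bytes(num_bytes: int) -> str:
    """Format a byte count with binary units (KiB, MiB, GiB, TiB)."""
    units = ["B", "KiB", "MiB", "GiB", "TiB"]
    value = float(num_bytes)
    for unit in units:
        # Stop at the first unit that keeps the value below 1024,
        # or at the largest unit available.
        if value < 1024 or unit == units[-1]:
            return f"{value:.2f} {unit}"
        value /= 1024


# Figures copied from the NarrativeQA entry above.
NARRATIVEQA_DATASET_SIZE = 21589984139
NARRATIVEQA_DOWNLOAD_SIZE = 10012303798
NARRATIVEQA_TRAIN_EXAMPLES = 32747

print(format_bytes(NARRATIVEQA_DATASET_SIZE))   # 20.11 GiB
print(format_bytes(NARRATIVEQA_DOWNLOAD_SIZE))  # 9.32 GiB
# Average size of one training example.
print(format_bytes(NARRATIVEQA_DATASET_SIZE // NARRATIVEQA_TRAIN_EXAMPLES))
```

At this size, passing `streaming=True` to `datasets.load_dataset` may be preferable to a full download, depending on available disk space.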

QMsum

  • Features:
    • context: string
    • output: string
    • instruction: string
    • context_sentence: sequence of strings
  • Splits:
    • train: 1257 examples, 130848491 bytes
  • Download size: 43933066 bytes
  • Dataset size: 130848491 bytes

Qasper

  • Features:
    • context: string
    • instruction: string
    • output: string
    • context_sentence: sequence of strings
  • Splits:
    • train: 2567 examples, 132355093 bytes
  • Download size: 41697628 bytes
  • Dataset size: 132355093 bytes

Quality

  • Features:
    • instruction: string
    • output: string
    • context: string
    • context_sentence: sequence of strings
  • Splits:
    • train: 2523 examples, 125840855 bytes
  • Download size: 18496863 bytes
  • Dataset size: 125840855 bytes

ReCoRD

  • Features:
    • context: string
    • instruction: string
    • output: string
    • context_sentence: sequence of strings
  • Splits:
    • train: 100730 examples, 238582023 bytes
    • validation: 10000 examples, 23362645 bytes
    • test: 10000 examples, 23217267 bytes
  • Download size: 118874876 bytes
  • Dataset size: 285161935 bytes

SQuAD

  • Features:
    • context: string
    • instruction: string
    • output: string
    • context_sentence: sequence of strings
  • Splits:
    • train: 87599 examples, 148929842 bytes
    • validation: 10570 examples, 18980126 bytes
  • Download size: 26422546 bytes
  • Dataset size: 167909968 bytes

XSum

  • Features:
    • context: string
    • output: string
    • instruction: string
    • context_sentence: sequence of strings
  • Splits:
    • train: 100000 examples, 467899254 bytes
    • validation: 11332 examples, 52521877 bytes
    • test: 11334 examples, 53464225 bytes
  • Download size: 358953960 bytes
  • Dataset size: 573885356 bytes
The content above was collected and summarized by 遇见数据集.