
MLP-SEMO/IT_data_old

Source: Hugging Face. Updated: 2024-05-23. Indexed: 2024-06-12.
Download link:
https://hf-mirror.com/datasets/MLP-SEMO/IT_data_old
Resource description:
---
dataset_info:
- config_name: BigPatent
  features:
  - name: context
    dtype: string
  - name: output
    dtype: string
  - name: instruction
    dtype: string
  - name: instruction_sentence
    sequence: string
  - name: context_sentence
    sequence: string
  splits:
  - name: train
    num_bytes: 3170511042
    num_examples: 50000
  download_size: 1285289620
  dataset_size: 3170511042
- config_name: BookSum
  features:
  - name: output
    dtype: string
  - name: context
    dtype: string
  - name: instruction
    dtype: string
  - name: instruction_sentence
    sequence: string
  - name: context_sentence
    sequence: string
  splits:
  - name: train
    num_bytes: 478923658
    num_examples: 9600
  - name: validation
    num_bytes: 63995551
    num_examples: 1484
  - name: test
    num_bytes: 72245551
    num_examples: 1431
  download_size: 364099731
  dataset_size: 615164760
- config_name: BoolQ
  features:
  - name: instruction
    dtype: string
  - name: context
    dtype: string
  - name: output
    dtype: string
  - name: instruction_sentence
    sequence: string
  - name: context_sentence
    sequence: string
  splits:
  - name: train
    num_bytes: 14345597
    num_examples: 9427
  - name: validation
    num_bytes: 4924902
    num_examples: 3270
  - name: test
    num_bytes: 4896812
    num_examples: 3245
  download_size: 13149213
  dataset_size: 24167311
- config_name: CNN-DM
  features:
  - name: context
    dtype: string
  - name: output
    dtype: string
  - name: instruction
    dtype: string
  - name: instruction_sentence
    sequence: string
  - name: context_sentence
    sequence: string
  splits:
  - name: train
    num_bytes: 174648422.8
    num_examples: 20000
  - name: validation
    num_bytes: 113740739
    num_examples: 13368
  - name: test
    num_bytes: 98568284
    num_examples: 11490
  download_size: 229883492
  dataset_size: 386957445.8
- config_name: CosmosQA
  features:
  - name: context
    dtype: string
  - name: instruction
    dtype: string
  - name: output
    dtype: string
  - name: instruction_sentence
    sequence: string
  - name: context_sentence
    sequence: string
  splits:
  - name: train
    num_bytes: 36305800
    num_examples: 25262
  - name: test
    num_bytes: 10832999
    num_examples: 6963
  - name: validation
    num_bytes: 4634409
    num_examples: 2985
  download_size: 19897939
  dataset_size: 51773208
- config_name: DROP
  features:
  - name: context
    dtype: string
  - name: instruction
    dtype: string
  - name: output
    dtype: string
  - name: instruction_sentence
    sequence: string
  - name: context_sentence
    sequence: string
  splits:
  - name: train
    num_bytes: 27847360
    num_examples: 10000
  - name: validation
    num_bytes: 23480132
    num_examples: 9535
  download_size: 15222763
  dataset_size: 51327492
- config_name: GovReport
  features:
  - name: context
    dtype: string
  - name: output
    dtype: string
  - name: instruction
    dtype: string
  - name: instruction_sentence
    sequence: string
  - name: context_sentence
    sequence: string
  splits:
  - name: train
    num_bytes: 1868462340
    num_examples: 17517
  - name: validation
    num_bytes: 108907327
    num_examples: 973
  - name: test
    num_bytes: 100365631
    num_examples: 973
  download_size: 1000490212
  dataset_size: 2077735298
- config_name: HotpotQA
  features:
  - name: instruction
    dtype: string
  - name: output
    dtype: string
  - name: context
    sequence: string
  - name: instruction_sentence
    sequence: string
  - name: context_sentence
    sequence: string
  splits:
  - name: train
    num_bytes: 1067020043
    num_examples: 90447
  - name: validation
    num_bytes: 88218929
    num_examples: 7405
  download_size: 678798579
  dataset_size: 1155238972
- config_name: LongAlpaca
  features:
  - name: context
    dtype: string
  - name: instruction
    dtype: string
  - name: output
    dtype: string
  - name: instruction_sentence
    sequence: string
  - name: context_sentence
    sequence: string
  splits:
  - name: train
    num_bytes: 863106151
    num_examples: 8937
  download_size: 437700336
  dataset_size: 863106151
- config_name: MultiNews
  features:
  - name: context
    dtype: string
  - name: output
    dtype: string
  - name: instruction
    dtype: string
  - name: instruction_sentence
    sequence: string
  - name: context_sentence
    sequence: string
  splits:
  - name: train
    num_bytes: 1068760374
    num_examples: 44972
  download_size: 618956763
  dataset_size: 1068760374
- config_name: MultiRC
  features:
  - name: context
    dtype: string
  - name: instruction
    dtype: string
  - name: output
    dtype: string
  - name: instruction_sentence
    sequence: string
  - name: context_sentence
    sequence: string
  splits:
  - name: train
    num_bytes: 40076422
    num_examples: 12025
  download_size: 1832158
  dataset_size: 40076422
- config_name: NarrativeQA
  features:
  - name: context
    dtype: string
  - name: instruction
    dtype: string
  - name: output
    dtype: string
  - name: instruction_sentence
    sequence: string
  - name: context_sentence
    sequence: string
  splits:
  - name: train
    num_bytes: 22628688513
    num_examples: 32747
  download_size: 10248838935
  dataset_size: 22628688513
- config_name: QMsum
  features:
  - name: context
    dtype: string
  - name: output
    dtype: string
  - name: instruction
    dtype: string
  - name: instruction_sentence
    sequence: string
  - name: context_sentence
    sequence: string
  splits:
  - name: train
    num_bytes: 131114039
    num_examples: 1257
  download_size: 43975950
  dataset_size: 131114039
- config_name: Qasper
  features:
  - name: context
    dtype: string
  - name: instruction
    dtype: string
  - name: output
    dtype: string
  - name: instruction_sentence
    sequence: string
  - name: context_sentence
    sequence: string
  splits:
  - name: train
    num_bytes: 133289850
    num_examples: 2567
  download_size: 41804240
  dataset_size: 133289850
- config_name: Quality
  features:
  - name: instruction
    dtype: string
  - name: output
    dtype: string
  - name: context
    dtype: string
  - name: instruction_sentence
    sequence: string
  - name: context_sentence
    sequence: string
  splits:
  - name: train
    num_bytes: 127865467
    num_examples: 2523
  download_size: 19667917
  dataset_size: 127865467
- config_name: ReCoRD
  features:
  - name: context
    dtype: string
  - name: instruction
    dtype: string
  - name: output
    dtype: string
  - name: instruction_sentence
    sequence: string
  - name: context_sentence
    sequence: string
  splits:
  - name: train
    num_bytes: 52232819.4182468
    num_examples: 20000
  - name: validation
    num_bytes: 25851230
    num_examples: 10000
  - name: test
    num_bytes: 25710390
    num_examples: 10000
  download_size: 53870259
  dataset_size: 103794439.4182468
- config_name: SQuAD
  features:
  - name: context
    dtype: string
  - name: instruction
    dtype: string
  - name: output
    dtype: string
  - name: instruction_sentence
    sequence: string
  - name: context_sentence
    sequence: string
  splits:
  - name: train
    num_bytes: 161568764
    num_examples: 87599
  - name: validation
    num_bytes: 20509388
    num_examples: 10570
  download_size: 30105194
  dataset_size: 182078152
configs:
- config_name: BigPatent
  data_files:
  - split: train
    path: BigPatent/train-*
- config_name: BookSum
  data_files:
  - split: train
    path: BookSum/train-*
  - split: validation
    path: BookSum/validation-*
  - split: test
    path: BookSum/test-*
- config_name: BoolQ
  data_files:
  - split: train
    path: BoolQ/train-*
  - split: validation
    path: BoolQ/validation-*
  - split: test
    path: BoolQ/test-*
- config_name: CNN-DM
  data_files:
  - split: train
    path: CNN-DM/train-*
  - split: validation
    path: CNN-DM/validation-*
  - split: test
    path: CNN-DM/test-*
- config_name: CosmosQA
  data_files:
  - split: train
    path: CosmosQA/train-*
  - split: test
    path: CosmosQA/test-*
  - split: validation
    path: CosmosQA/validation-*
- config_name: DROP
  data_files:
  - split: train
    path: DROP/train-*
  - split: validation
    path: DROP/validation-*
- config_name: GovReport
  data_files:
  - split: train
    path: GovReport/train-*
  - split: validation
    path: GovReport/validation-*
  - split: test
    path: GovReport/test-*
- config_name: HotpotQA
  data_files:
  - split: train
    path: HotpotQA/train-*
  - split: validation
    path: HotpotQA/validation-*
- config_name: LongAlpaca
  data_files:
  - split: train
    path: LongAlpaca/train-*
- config_name: MultiNews
  data_files:
  - split: train
    path: MultiNews/train-*
- config_name: MultiRC
  data_files:
  - split: train
    path: MultiRC/train-*
- config_name: NarrativeQA
  data_files:
  - split: train
    path: NarrativeQA/train-*
- config_name: QMsum
  data_files:
  - split: train
    path: QMsum/train-*
- config_name: Qasper
  data_files:
  - split: train
    path: Qasper/train-*
- config_name: Quality
  data_files:
  - split: train
    path: Quality/train-*
- config_name: ReCoRD
  data_files:
  - split: train
    path: ReCoRD/train-*
  - split: validation
    path: ReCoRD/validation-*
  - split: test
    path: ReCoRD/test-*
- config_name: SQuAD
  data_files:
  - split: train
    path: SQuAD/train-*
  - split: validation
    path: SQuAD/validation-*
- config_name: TriviaQA
  data_files:
  - split: train
    path: TriviaQA/train-*
  - split: validation
    path: TriviaQA/validation-*
  - split: test
    path: TriviaQA/test-*
- config_name: XSum
  data_files:
  - split: train
    path: XSum/train-*
  - split: validation
    path: XSum/validation-*
  - split: test
    path: XSum/test-*
---

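This dataset comprises 19 sub-datasets, each exposed as a named configuration with its own splits. The 17 configurations that carry size metadata share five features: context, output, instruction, instruction_sentence, and context_sentence. The first three are strings, except that HotpotQA stores context as a sequence of strings; the two *_sentence features are sequences of strings. Every configuration has a train split, and only some add validation or test splits. For each split the metadata records the number of examples and the size in bytes. TriviaQA and XSum are listed with data files only and have no size metadata.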
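As a minimal sketch of how one configuration might be loaded, the snippet below uses `load_dataset` from the Hugging Face `datasets` library with the repository ID shown in the title. The `DECLARED_SPLITS` table and the helper names `check_split` and `load_config` are illustrative assumptions, not part of the dataset. The table covers only a few configurations, copied by hand from the metadata above.

```python
# Splits declared for a few configurations, copied by hand from the card metadata.
# This table is illustrative and deliberately incomplete.
DECLARED_SPLITS = {
    "BigPatent": ("train",),
    "BoolQ": ("train", "validation", "test"),
    "DROP": ("train", "validation"),
    "NarrativeQA": ("train",),
}


def check_split(config: str, split: str) -> None:
    """Raise early if the table above does not list this config/split pair."""
    if config not in DECLARED_SPLITS:
        raise KeyError(f"config not in the local table: {config}")
    if split not in DECLARED_SPLITS[config]:
        raise ValueError(
            f"{config} has no {split!r} split; declared: {DECLARED_SPLITS[config]}"
        )


def load_config(config: str, split: str = "train", streaming: bool = False):
    """Load one configuration from the Hub (needs network access and `datasets`)."""
    # Imported lazily so the helpers above work without the library installed.
    from datasets import load_dataset

    check_split(config, split)
    return load_dataset(
        "MLP-SEMO/IT_data_old", config, split=split, streaming=streaming
    )
```

For a large configuration such as NarrativeQA (about 10 GB to download according to the metadata), passing `streaming=True` avoids fetching every file up front, for example `load_config("NarrativeQA", streaming=True)`.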
Provider:
MLP-SEMO
Summary of original information

Dataset overview

BigPatent

  • Features:
    • context: string
    • output: string
    • instruction: string
    • instruction_sentence: sequence of strings
    • context_sentence: sequence of strings
  • Splits:
    • train: 50000 examples, 3170511042 bytes
  • Download size: 1285289620 bytes
  • Dataset size: 3170511042 bytes
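The feature list above can be expressed as a small schema check. This is a sketch only: the function name `schema_problems` and the sample record are made up for illustration, and the check covers the common case where `context` is a string (HotpotQA differs).

```python
from typing import Any, Dict, List

# Feature names taken from the card; the grouping by type follows the metadata.
STRING_FIELDS = ("context", "output", "instruction")
SEQUENCE_FIELDS = ("instruction_sentence", "context_sentence")


def schema_problems(record: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable schema violations (empty if none)."""
    problems = []
    for name in STRING_FIELDS:
        if not isinstance(record.get(name), str):
            problems.append(f"{name}: expected str")
    for name in SEQUENCE_FIELDS:
        value = record.get(name)
        if not (isinstance(value, list) and all(isinstance(s, str) for s in value)):
            problems.append(f"{name}: expected list of str")
    return problems


# Made-up record, shaped like the schema above.
sample = {
    "context": "A patent description.",
    "output": "A short summary.",
    "instruction": "Summarize the text.",
    "instruction_sentence": ["Summarize the text."],
    "context_sentence": ["A patent description."],
}
print(schema_problems(sample))
```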

BookSum

  • Features:
    • output: string
    • context: string
    • instruction: string
    • instruction_sentence: sequence of strings
    • context_sentence: sequence of strings
  • Splits:
    • train: 9600 examples, 478923658 bytes
    • validation: 1484 examples, 63995551 bytes
    • test: 1431 examples, 72245551 bytes
  • Download size: 364099731 bytes
  • Dataset size: 615164760 bytes

BoolQ

  • Features:
    • instruction: string
    • context: string
    • output: string
    • instruction_sentence: sequence of strings
    • context_sentence: sequence of strings
  • Splits:
    • train: 9427 examples, 14345597 bytes
    • validation: 3270 examples, 4924902 bytes
    • test: 3245 examples, 4896812 bytes
  • Download size: 13149213 bytes
  • Dataset size: 24167311 bytes

CNN-DM

  • Features:
    • context: string
    • output: string
    • instruction: string
    • instruction_sentence: sequence of strings
    • context_sentence: sequence of strings
  • Splits:
    • train: 20000 examples, 174648422.8 bytes
    • validation: 13368 examples, 113740739 bytes
    • test: 11490 examples, 98568284 bytes
  • Download size: 229883492 bytes
  • Dataset size: 386957445.8 bytes

CosmosQA

  • Features:
    • context: string
    • instruction: string
    • output: string
    • instruction_sentence: sequence of strings
    • context_sentence: sequence of strings
  • Splits:
    • train: 25262 examples, 36305800 bytes
    • validation: 2985 examples, 4634409 bytes
    • test: 6963 examples, 10832999 bytes
  • Download size: 19897939 bytes
  • Dataset size: 51773208 bytes

DROP

  • Features:
    • context: string
    • instruction: string
    • output: string
    • instruction_sentence: sequence of strings
    • context_sentence: sequence of strings
  • Splits:
    • train: 10000 examples, 27847360 bytes
    • validation: 9535 examples, 23480132 bytes
  • Download size: 15222763 bytes
  • Dataset size: 51327492 bytes

GovReport

  • Features:
    • context: string
    • output: string
    • instruction: string
    • instruction_sentence: sequence of strings
    • context_sentence: sequence of strings
  • Splits:
    • train: 17517 examples, 1868462340 bytes
    • validation: 973 examples, 108907327 bytes
    • test: 973 examples, 100365631 bytes
  • Download size: 1000490212 bytes
  • Dataset size: 2077735298 bytes

HotpotQA

  • Features:
    • instruction: string
    • output: string
    • context: sequence of strings
    • instruction_sentence: sequence of strings
    • context_sentence: sequence of strings
  • Splits:
    • train: 90447 examples, 1067020043 bytes
    • validation: 7405 examples, 88218929 bytes
  • Download size: 678798579 bytes
  • Dataset size: 1155238972 bytes
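HotpotQA is the only configuration listed here whose `context` is a sequence of strings rather than a single string. Code that handles several configurations may want to normalize the two shapes. The helper below is a hypothetical sketch: the name `context_as_text` and the blank-line separator are assumptions, not something the dataset prescribes.

```python
from typing import List, Union


def context_as_text(context: Union[str, List[str]], separator: str = "\n\n") -> str:
    """Return the context as one string, joining list-valued contexts."""
    if isinstance(context, str):
        return context
    return separator.join(context)


# A string context passes through unchanged; a list is joined.
print(context_as_text("One passage."))
print(context_as_text(["First passage.", "Second passage."], separator=" | "))
```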

LongAlpaca

  • Features:
    • context: string
    • instruction: string
    • output: string
    • instruction_sentence: sequence of strings
    • context_sentence: sequence of strings
  • Splits:
    • train: 8937 examples, 863106151 bytes
  • Download size: 437700336 bytes
  • Dataset size: 863106151 bytes

MultiNews

  • Features:
    • context: string
    • output: string
    • instruction: string
    • instruction_sentence: sequence of strings
    • context_sentence: sequence of strings
  • Splits:
    • train: 44972 examples, 1068760374 bytes
  • Download size: 618956763 bytes
  • Dataset size: 1068760374 bytes

MultiRC

  • Features:
    • context: string
    • instruction: string
    • output: string
    • instruction_sentence: sequence of strings
    • context_sentence: sequence of strings
  • Splits:
    • train: 12025 examples, 40076422 bytes
  • Download size: 1832158 bytes
  • Dataset size: 40076422 bytes

NarrativeQA

  • Features:
    • context: string
    • instruction: string
    • output: string
    • instruction_sentence: sequence of strings
    • context_sentence: sequence of strings
  • Splits:
    • train: 32747 examples, 22628688513 bytes
  • Download size: 10248838935 bytes
  • Dataset size: 22628688513 bytes

QMsum

  • Features:
    • context: string
    • output: string
    • instruction: string
    • instruction_sentence: sequence of strings
    • context_sentence: sequence of strings
  • Splits:
    • train: 1257 examples, 131114039 bytes
  • Download size: 43975950 bytes
  • Dataset size: 131114039 bytes

Qasper

  • Features:
    • context: string
    • instruction: string
    • output: string
    • instruction_sentence: sequence of strings
    • context_sentence: sequence of strings
  • Splits:
    • train: 2567 examples, 133289850 bytes
  • Download size: 41804240 bytes
  • Dataset size: 133289850 bytes

Quality

  • Features:
    • instruction: string
    • output: string
    • context: string
    • instruction_sentence: sequence of strings
    • context_sentence: sequence of strings
  • Splits:
    • train: 2523 examples, 127865467 bytes
  • Download size: 19667917 bytes
  • Dataset size: 127865467 bytes

ReCoRD

  • Features:
    • context: string
    • instruction: string
    • output: string
    • instruction_sentence: sequence of strings
    • context_sentence: sequence of strings
  • Splits:
    • train: 20000 examples, 52232819.4182468 bytes
    • validation: 10000 examples, 25851230 bytes
    • test: 10000 examples, 25710390 bytes
  • Download size: 53870259 bytes
  • Dataset size: 103794439.4182468 bytes

SQuAD

  • Features:
    • context: string
    • instruction: string
    • output: string
    • instruction_sentence: sequence of strings
    • context_sentence: sequence of strings
  • Splits:
    • train: 87599 examples, 161568764 bytes
    • validation: 10570 examples, 20509388 bytes
  • Download size: 30105194 bytes
  • Dataset size: 182078152 bytes
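The size figures in the overview can be cross-checked with simple arithmetic: in each configuration the dataset size equals the sum of its per-split byte counts. The sketch below verifies this for BookSum and also computes the ratio of download size to dataset size for BookSum and MultiRC. The dictionary layout and function names are illustrative assumptions. Reading the ratio as a compression ratio assumes the usual Hugging Face convention that the download size is the size of the stored files and the dataset size is the size of the loaded data.

```python
# Figures copied from the overview above.
BOOKSUM = {
    "splits": {"train": 478923658, "validation": 63995551, "test": 72245551},
    "download_size": 364099731,
    "dataset_size": 615164760,
}
MULTIRC = {
    "splits": {"train": 40076422},
    "download_size": 1832158,
    "dataset_size": 40076422,
}


def splits_total(meta: dict) -> int:
    """Sum the byte counts of all splits in one configuration."""
    return sum(meta["splits"].values())


def download_ratio(meta: dict) -> float:
    """Download size as a fraction of the dataset size."""
    return meta["download_size"] / meta["dataset_size"]


for name, meta in (("BookSum", BOOKSUM), ("MultiRC", MULTIRC)):
    consistent = splits_total(meta) == meta["dataset_size"]
    print(f"{name}: splits sum matches dataset size: {consistent}, "
          f"download/dataset ratio: {download_ratio(meta):.3f}")
```

The ratio varies widely between configurations: BookSum downloads at roughly 59% of its dataset size, while MultiRC downloads at under 5%.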