MLP-SEMO/IT_data_old

Name: MLP-SEMO/IT_data_old
Creator: MLP-SEMO
Published: 2024-05-23 15:21:28
License: No description available

Hugging Face, updated 2024-05-23, indexed 2024-06-12

Download link:

https://hf-mirror.com/datasets/MLP-SEMO/IT_data_old

Resource description:

---
dataset_info:
- config_name: BigPatent
  features:
  - name: context
    dtype: string
  - name: output
    dtype: string
  - name: instruction
    dtype: string
  - name: instruction_sentence
    sequence: string
  - name: context_sentence
    sequence: string
  splits:
  - name: train
    num_bytes: 3170511042
    num_examples: 50000
  download_size: 1285289620
  dataset_size: 3170511042
- config_name: BookSum
  features:
  - name: output
    dtype: string
  - name: context
    dtype: string
  - name: instruction
    dtype: string
  - name: instruction_sentence
    sequence: string
  - name: context_sentence
    sequence: string
  splits:
  - name: train
    num_bytes: 478923658
    num_examples: 9600
  - name: validation
    num_bytes: 63995551
    num_examples: 1484
  - name: test
    num_bytes: 72245551
    num_examples: 1431
  download_size: 364099731
  dataset_size: 615164760
- config_name: BoolQ
  features:
  - name: instruction
    dtype: string
  - name: context
    dtype: string
  - name: output
    dtype: string
  - name: instruction_sentence
    sequence: string
  - name: context_sentence
    sequence: string
  splits:
  - name: train
    num_bytes: 14345597
    num_examples: 9427
  - name: validation
    num_bytes: 4924902
    num_examples: 3270
  - name: test
    num_bytes: 4896812
    num_examples: 3245
  download_size: 13149213
  dataset_size: 24167311
- config_name: CNN-DM
  features:
  - name: context
    dtype: string
  - name: output
    dtype: string
  - name: instruction
    dtype: string
  - name: instruction_sentence
    sequence: string
  - name: context_sentence
    sequence: string
  splits:
  - name: train
    num_bytes: 174648422.8
    num_examples: 20000
  - name: validation
    num_bytes: 113740739
    num_examples: 13368
  - name: test
    num_bytes: 98568284
    num_examples: 11490
  download_size: 229883492
  dataset_size: 386957445.8
- config_name: CosmosQA
  features:
  - name: context
    dtype: string
  - name: instruction
    dtype: string
  - name: output
    dtype: string
  - name: instruction_sentence
    sequence: string
  - name: context_sentence
    sequence: string
  splits:
  - name: train
    num_bytes: 36305800
    num_examples: 25262
  - name: test
    num_bytes: 10832999
    num_examples: 6963
  - name: validation
    num_bytes: 4634409
    num_examples: 2985
  download_size: 19897939
  dataset_size: 51773208
- config_name: DROP
  features:
  - name: context
    dtype: string
  - name: instruction
    dtype: string
  - name: output
    dtype: string
  - name: instruction_sentence
    sequence: string
  - name: context_sentence
    sequence: string
  splits:
  - name: train
    num_bytes: 27847360
    num_examples: 10000
  - name: validation
    num_bytes: 23480132
    num_examples: 9535
  download_size: 15222763
  dataset_size: 51327492
- config_name: GovReport
  features:
  - name: context
    dtype: string
  - name: output
    dtype: string
  - name: instruction
    dtype: string
  - name: instruction_sentence
    sequence: string
  - name: context_sentence
    sequence: string
  splits:
  - name: train
    num_bytes: 1868462340
    num_examples: 17517
  - name: validation
    num_bytes: 108907327
    num_examples: 973
  - name: test
    num_bytes: 100365631
    num_examples: 973
  download_size: 1000490212
  dataset_size: 2077735298
- config_name: HotpotQA
  features:
  - name: instruction
    dtype: string
  - name: output
    dtype: string
  - name: context
    sequence: string
  - name: instruction_sentence
    sequence: string
  - name: context_sentence
    sequence: string
  splits:
  - name: train
    num_bytes: 1067020043
    num_examples: 90447
  - name: validation
    num_bytes: 88218929
    num_examples: 7405
  download_size: 678798579
  dataset_size: 1155238972
- config_name: LongAlpaca
  features:
  - name: context
    dtype: string
  - name: instruction
    dtype: string
  - name: output
    dtype: string
  - name: instruction_sentence
    sequence: string
  - name: context_sentence
    sequence: string
  splits:
  - name: train
    num_bytes: 863106151
    num_examples: 8937
  download_size: 437700336
  dataset_size: 863106151
- config_name: MultiNews
  features:
  - name: context
    dtype: string
  - name: output
    dtype: string
  - name: instruction
    dtype: string
  - name: instruction_sentence
    sequence: string
  - name: context_sentence
    sequence: string
  splits:
  - name: train
    num_bytes: 1068760374
    num_examples: 44972
  download_size: 618956763
  dataset_size: 1068760374
- config_name: MultiRC
  features:
  - name: context
    dtype: string
  - name: instruction
    dtype: string
  - name: output
    dtype: string
  - name: instruction_sentence
    sequence: string
  - name: context_sentence
    sequence: string
  splits:
  - name: train
    num_bytes: 40076422
    num_examples: 12025
  download_size: 1832158
  dataset_size: 40076422
- config_name: NarrativeQA
  features:
  - name: context
    dtype: string
  - name: instruction
    dtype: string
  - name: output
    dtype: string
  - name: instruction_sentence
    sequence: string
  - name: context_sentence
    sequence: string
  splits:
  - name: train
    num_bytes: 22628688513
    num_examples: 32747
  download_size: 10248838935
  dataset_size: 22628688513
- config_name: QMsum
  features:
  - name: context
    dtype: string
  - name: output
    dtype: string
  - name: instruction
    dtype: string
  - name: instruction_sentence
    sequence: string
  - name: context_sentence
    sequence: string
  splits:
  - name: train
    num_bytes: 131114039
    num_examples: 1257
  download_size: 43975950
  dataset_size: 131114039
- config_name: Qasper
  features:
  - name: context
    dtype: string
  - name: instruction
    dtype: string
  - name: output
    dtype: string
  - name: instruction_sentence
    sequence: string
  - name: context_sentence
    sequence: string
  splits:
  - name: train
    num_bytes: 133289850
    num_examples: 2567
  download_size: 41804240
  dataset_size: 133289850
- config_name: Quality
  features:
  - name: instruction
    dtype: string
  - name: output
    dtype: string
  - name: context
    dtype: string
  - name: instruction_sentence
    sequence: string
  - name: context_sentence
    sequence: string
  splits:
  - name: train
    num_bytes: 127865467
    num_examples: 2523
  download_size: 19667917
  dataset_size: 127865467
- config_name: ReCoRD
  features:
  - name: context
    dtype: string
  - name: instruction
    dtype: string
  - name: output
    dtype: string
  - name: instruction_sentence
    sequence: string
  - name: context_sentence
    sequence: string
  splits:
  - name: train
    num_bytes: 52232819.4182468
    num_examples: 20000
  - name: validation
    num_bytes: 25851230
    num_examples: 10000
  - name: test
    num_bytes: 25710390
    num_examples: 10000
  download_size: 53870259
  dataset_size: 103794439.4182468
- config_name: SQuAD
  features:
  - name: context
    dtype: string
  - name: instruction
    dtype: string
  - name: output
    dtype: string
  - name: instruction_sentence
    sequence: string
  - name: context_sentence
    sequence: string
  splits:
  - name: train
    num_bytes: 161568764
    num_examples: 87599
  - name: validation
    num_bytes: 20509388
    num_examples: 10570
  download_size: 30105194
  dataset_size: 182078152
configs:
- config_name: BigPatent
  data_files:
  - split: train
    path: BigPatent/train-*
- config_name: BookSum
  data_files:
  - split: train
    path: BookSum/train-*
  - split: validation
    path: BookSum/validation-*
  - split: test
    path: BookSum/test-*
- config_name: BoolQ
  data_files:
  - split: train
    path: BoolQ/train-*
  - split: validation
    path: BoolQ/validation-*
  - split: test
    path: BoolQ/test-*
- config_name: CNN-DM
  data_files:
  - split: train
    path: CNN-DM/train-*
  - split: validation
    path: CNN-DM/validation-*
  - split: test
    path: CNN-DM/test-*
- config_name: CosmosQA
  data_files:
  - split: train
    path: CosmosQA/train-*
  - split: test
    path: CosmosQA/test-*
  - split: validation
    path: CosmosQA/validation-*
- config_name: DROP
  data_files:
  - split: train
    path: DROP/train-*
  - split: validation
    path: DROP/validation-*
- config_name: GovReport
  data_files:
  - split: train
    path: GovReport/train-*
  - split: validation
    path: GovReport/validation-*
  - split: test
    path: GovReport/test-*
- config_name: HotpotQA
  data_files:
  - split: train
    path: HotpotQA/train-*
  - split: validation
    path: HotpotQA/validation-*
- config_name: LongAlpaca
  data_files:
  - split: train
    path: LongAlpaca/train-*
- config_name: MultiNews
  data_files:
  - split: train
    path: MultiNews/train-*
- config_name: MultiRC
  data_files:
  - split: train
    path: MultiRC/train-*
- config_name: NarrativeQA
  data_files:
  - split: train
    path: NarrativeQA/train-*
- config_name: QMsum
  data_files:
  - split: train
    path: QMsum/train-*
- config_name: Qasper
  data_files:
  - split: train
    path: Qasper/train-*
- config_name: Quality
  data_files:
  - split: train
    path: Quality/train-*
- config_name: ReCoRD
  data_files:
  - split: train
    path: ReCoRD/train-*
  - split: validation
    path: ReCoRD/validation-*
  - split: test
    path: ReCoRD/test-*
- config_name: SQuAD
  data_files:
  - split: train
    path: SQuAD/train-*
  - split: validation
    path: SQuAD/validation-*
- config_name: TriviaQA
  data_files:
  - split: train
    path: TriviaQA/train-*
  - split: validation
    path: TriviaQA/validation-*
  - split: test
    path: TriviaQA/test-*
- config_name: XSum
  data_files:
  - split: train
    path: XSum/train-*
  - split: validation
    path: XSum/validation-*
  - split: test
    path: XSum/test-*
---

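This dataset bundles multiple sub-datasets, each exposed as a named configuration with its own features and splits. The configurations described under dataset_info share five features: context, output, instruction, instruction_sentence, and context_sentence. For each of them, the metadata gives the number of examples and the byte size of every available split (train, and where present validation and test).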
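
As a minimal sketch, assuming the Hugging Face `datasets` library is installed and the repository is reachable, a single configuration could be loaded by name. The helper names `check_split` and `load_config` and the `CONFIG_SPLITS` table are illustrative and not part of the dataset; the split names are copied from the metadata above for a few configurations only.

```python
from typing import Dict, Tuple

REPO_ID = "MLP-SEMO/IT_data_old"

# Split names copied from the card metadata (a subset of the configurations).
CONFIG_SPLITS: Dict[str, Tuple[str, ...]] = {
    "BigPatent": ("train",),
    "BoolQ": ("train", "validation", "test"),
    "DROP": ("train", "validation"),
    "HotpotQA": ("train", "validation"),
}


def check_split(config: str, split: str) -> None:
    """Raise ValueError if the table above does not list `split` for `config`."""
    known = CONFIG_SPLITS.get(config)
    if known is not None and split not in known:
        raise ValueError(f"{config} has no {split!r} split; card lists {known}")


def load_config(config: str, split: str = "train", streaming: bool = True):
    """Load one configuration; streaming avoids downloading every shard up front."""
    check_split(config, split)
    # Imported lazily so the module can be used without `datasets` installed.
    from datasets import load_dataset
    return load_dataset(REPO_ID, config, split=split, streaming=streaming)


# Example use (requires network access):
#   ds = load_config("BoolQ", "validation")
#   print(sorted(next(iter(ds)).keys()))
```

Streaming is chosen as the default here because some configurations are several gigabytes in size; passing `streaming=False` would download and cache the selected split instead.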

Provider:

MLP-SEMO

Summary of original information

Dataset overview

BigPatent

Features:
- context: string
- output: string
- instruction: string
- instruction_sentence: sequence of strings
- context_sentence: sequence of strings
Splits:
- train: 50000 examples, 3170511042 bytes
Download size: 1285289620 bytes
Dataset size: 3170511042 bytes

BookSum

Features:
- output: string
- context: string
- instruction: string
- instruction_sentence: sequence of strings
- context_sentence: sequence of strings
Splits:
- train: 9600 examples, 478923658 bytes
- validation: 1484 examples, 63995551 bytes
- test: 1431 examples, 72245551 bytes
Download size: 364099731 bytes
Dataset size: 615164760 bytes
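
The per-split byte counts and the reported dataset size can be cross-checked with simple arithmetic. The sketch below uses the figures from the BookSum entry; the helper names `total_bytes` and `total_examples` are made up for illustration. It also prints the ratio of download size to dataset size, which gives only a rough idea of how compactly the files are stored.

```python
# Figures copied from the BookSum entry: split -> (examples, bytes).
BOOKSUM_SPLITS = {
    "train": (9600, 478923658),
    "validation": (1484, 63995551),
    "test": (1431, 72245551),
}
BOOKSUM_DOWNLOAD_SIZE = 364099731
BOOKSUM_DATASET_SIZE = 615164760


def total_bytes(splits):
    """Sum the byte counts over all splits."""
    return sum(num_bytes for _, num_bytes in splits.values())


def total_examples(splits):
    """Sum the example counts over all splits."""
    return sum(num_examples for num_examples, _ in splits.values())


print(total_bytes(BOOKSUM_SPLITS) == BOOKSUM_DATASET_SIZE)  # True
print(total_examples(BOOKSUM_SPLITS))                       # 12515
print(round(BOOKSUM_DOWNLOAD_SIZE / BOOKSUM_DATASET_SIZE, 3))
```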

BoolQ

Features:
- instruction: string
- context: string
- output: string
- instruction_sentence: sequence of strings
- context_sentence: sequence of strings
Splits:
- train: 9427 examples, 14345597 bytes
- validation: 3270 examples, 4924902 bytes
- test: 3245 examples, 4896812 bytes
Download size: 13149213 bytes
Dataset size: 24167311 bytes

CNN-DM

Features:
- context: string
- output: string
- instruction: string
- instruction_sentence: sequence of strings
- context_sentence: sequence of strings
Splits:
- train: 20000 examples, 174648422.8 bytes
- validation: 13368 examples, 113740739 bytes
- test: 11490 examples, 98568284 bytes
Download size: 229883492 bytes
Dataset size: 386957445.8 bytes

CosmosQA

Features:
- context: string
- instruction: string
- output: string
- instruction_sentence: sequence of strings
- context_sentence: sequence of strings
Splits:
- train: 25262 examples, 36305800 bytes
- validation: 2985 examples, 4634409 bytes
- test: 6963 examples, 10832999 bytes
Download size: 19897939 bytes
Dataset size: 51773208 bytes

DROP

Features:
- context: string
- instruction: string
- output: string
- instruction_sentence: sequence of strings
- context_sentence: sequence of strings
Splits:
- train: 10000 examples, 27847360 bytes
- validation: 9535 examples, 23480132 bytes
Download size: 15222763 bytes
Dataset size: 51327492 bytes

GovReport

Features:
- context: string
- output: string
- instruction: string
- instruction_sentence: sequence of strings
- context_sentence: sequence of strings
Splits:
- train: 17517 examples, 1868462340 bytes
- validation: 973 examples, 108907327 bytes
- test: 973 examples, 100365631 bytes
Download size: 1000490212 bytes
Dataset size: 2077735298 bytes

HotpotQA

Features:
- instruction: string
- output: string
- context: sequence of strings
- instruction_sentence: sequence of strings
- context_sentence: sequence of strings
Splits:
- train: 90447 examples, 1067020043 bytes
- validation: 7405 examples, 88218929 bytes
Download size: 678798579 bytes
Dataset size: 1155238972 bytes
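
Unlike the other configurations listed here, HotpotQA stores `context` as a sequence of strings rather than a single string. Code that handles several configurations uniformly may want to normalize the field first. A minimal sketch, assuming that joining the passages with blank lines suits the task at hand; the helper `context_as_text` and both sample records are made up for illustration and do not come from the dataset:

```python
from typing import Any, Mapping


def context_as_text(example: Mapping[str, Any], separator: str = "\n\n") -> str:
    """Return `context` as one string, whether stored as a str or a list of str."""
    context = example["context"]
    if isinstance(context, str):
        return context
    return separator.join(context)


# Made-up records that only mimic the two field layouts.
hotpot_like = {
    "instruction": "Who wrote it?",
    "context": ["Passage A.", "Passage B."],
    "output": "...",
}
single_string_like = {
    "instruction": "Who wrote it?",
    "context": "Passage A.",
    "output": "...",
}

print(context_as_text(hotpot_like))
print(context_as_text(single_string_like))
```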

LongAlpaca

Features:
- context: string
- instruction: string
- output: string
- instruction_sentence: sequence of strings
- context_sentence: sequence of strings
Splits:
- train: 8937 examples, 863106151 bytes
Download size: 437700336 bytes
Dataset size: 863106151 bytes

MultiNews

Features:
- context: string
- output: string
- instruction: string
- instruction_sentence: sequence of strings
- context_sentence: sequence of strings
Splits:
- train: 44972 examples, 1068760374 bytes
Download size: 618956763 bytes
Dataset size: 1068760374 bytes

MultiRC

Features:
- context: string
- instruction: string
- output: string
- instruction_sentence: sequence of strings
- context_sentence: sequence of strings
Splits:
- train: 12025 examples, 40076422 bytes
Download size: 1832158 bytes
Dataset size: 40076422 bytes

NarrativeQA

Features:
- context: string
- instruction: string
- output: string
- instruction_sentence: sequence of strings
- context_sentence: sequence of strings
Splits:
- train: 32747 examples, 22628688513 bytes
Download size: 10248838935 bytes
Dataset size: 22628688513 bytes
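
NarrativeQA is by far the largest configuration in this listing, so it can help to convert the raw byte counts into readable units before downloading. A small sketch using binary units (1 GiB = 1024^3 bytes); the helper name `human_size` is illustrative:

```python
def human_size(num_bytes: int) -> str:
    """Format a byte count using binary units (B, KiB, MiB, GiB, TiB)."""
    size = float(num_bytes)
    for unit in ("B", "KiB", "MiB", "GiB"):
        if size < 1024:
            return f"{size:.2f} {unit}"
        size /= 1024
    return f"{size:.2f} TiB"


NARRATIVEQA_DOWNLOAD_SIZE = 10248838935  # bytes, from the NarrativeQA entry
NARRATIVEQA_DATASET_SIZE = 22628688513   # bytes, from the NarrativeQA entry

print(human_size(NARRATIVEQA_DOWNLOAD_SIZE))
print(human_size(NARRATIVEQA_DATASET_SIZE))
```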

QMsum

Features:
- context: string
- output: string
- instruction: string
- instruction_sentence: sequence of strings
- context_sentence: sequence of strings
Splits:
- train: 1257 examples, 131114039 bytes
Download size: 43975950 bytes
Dataset size: 131114039 bytes

Qasper

Features:
- context: string
- instruction: string
- output: string
- instruction_sentence: sequence of strings
- context_sentence: sequence of strings
Splits:
- train: 2567 examples, 133289850 bytes
Download size: 41804240 bytes
Dataset size: 133289850 bytes

Quality

Features:
- instruction: string
- output: string
- context: string
- instruction_sentence: sequence of strings
- context_sentence: sequence of strings
Splits:
- train: 2523 examples, 127865467 bytes
Download size: 19667917 bytes
Dataset size: 127865467 bytes

ReCoRD

Features:
- context: string
- instruction: string
- output: string
- instruction_sentence: sequence of strings
- context_sentence: sequence of strings
Splits:
- train: 20000 examples, 52232819.4182468 bytes
- validation: 10000 examples, 25851230 bytes
- test: 10000 examples, 25710390 bytes
Download size: 53870259 bytes
Dataset size: 103794439.4182468 bytes

SQuAD

Features:
- context: string
- instruction: string
- output: string
- instruction_sentence: sequence of strings
- context_sentence: sequence of strings
Splits:
- train: 87599 examples, 161568764 bytes
- validation: 10570 examples, 20509388 bytes
Download size: 30105194 bytes
Dataset size: 182078152 bytes
