MLP-SEMO/IT_data_old
Source: Hugging Face. Updated 2024-05-23. Indexed 2024-06-12.
Download link:
https://hf-mirror.com/datasets/MLP-SEMO/IT_data_old
Resource description:
---
dataset_info:
- config_name: BigPatent
  features:
  - name: context
    dtype: string
  - name: output
    dtype: string
  - name: instruction
    dtype: string
  - name: instruction_sentence
    sequence: string
  - name: context_sentence
    sequence: string
  splits:
  - name: train
    num_bytes: 3170511042
    num_examples: 50000
  download_size: 1285289620
  dataset_size: 3170511042
- config_name: BookSum
  features:
  - name: output
    dtype: string
  - name: context
    dtype: string
  - name: instruction
    dtype: string
  - name: instruction_sentence
    sequence: string
  - name: context_sentence
    sequence: string
  splits:
  - name: train
    num_bytes: 478923658
    num_examples: 9600
  - name: validation
    num_bytes: 63995551
    num_examples: 1484
  - name: test
    num_bytes: 72245551
    num_examples: 1431
  download_size: 364099731
  dataset_size: 615164760
- config_name: BoolQ
  features:
  - name: instruction
    dtype: string
  - name: context
    dtype: string
  - name: output
    dtype: string
  - name: instruction_sentence
    sequence: string
  - name: context_sentence
    sequence: string
  splits:
  - name: train
    num_bytes: 14345597
    num_examples: 9427
  - name: validation
    num_bytes: 4924902
    num_examples: 3270
  - name: test
    num_bytes: 4896812
    num_examples: 3245
  download_size: 13149213
  dataset_size: 24167311
- config_name: CNN-DM
  features:
  - name: context
    dtype: string
  - name: output
    dtype: string
  - name: instruction
    dtype: string
  - name: instruction_sentence
    sequence: string
  - name: context_sentence
    sequence: string
  splits:
  - name: train
    num_bytes: 174648422.8
    num_examples: 20000
  - name: validation
    num_bytes: 113740739
    num_examples: 13368
  - name: test
    num_bytes: 98568284
    num_examples: 11490
  download_size: 229883492
  dataset_size: 386957445.8
- config_name: CosmosQA
  features:
  - name: context
    dtype: string
  - name: instruction
    dtype: string
  - name: output
    dtype: string
  - name: instruction_sentence
    sequence: string
  - name: context_sentence
    sequence: string
  splits:
  - name: train
    num_bytes: 36305800
    num_examples: 25262
  - name: test
    num_bytes: 10832999
    num_examples: 6963
  - name: validation
    num_bytes: 4634409
    num_examples: 2985
  download_size: 19897939
  dataset_size: 51773208
- config_name: DROP
  features:
  - name: context
    dtype: string
  - name: instruction
    dtype: string
  - name: output
    dtype: string
  - name: instruction_sentence
    sequence: string
  - name: context_sentence
    sequence: string
  splits:
  - name: train
    num_bytes: 27847360
    num_examples: 10000
  - name: validation
    num_bytes: 23480132
    num_examples: 9535
  download_size: 15222763
  dataset_size: 51327492
- config_name: GovReport
  features:
  - name: context
    dtype: string
  - name: output
    dtype: string
  - name: instruction
    dtype: string
  - name: instruction_sentence
    sequence: string
  - name: context_sentence
    sequence: string
  splits:
  - name: train
    num_bytes: 1868462340
    num_examples: 17517
  - name: validation
    num_bytes: 108907327
    num_examples: 973
  - name: test
    num_bytes: 100365631
    num_examples: 973
  download_size: 1000490212
  dataset_size: 2077735298
- config_name: HotpotQA
  features:
  - name: instruction
    dtype: string
  - name: output
    dtype: string
  - name: context
    sequence: string
  - name: instruction_sentence
    sequence: string
  - name: context_sentence
    sequence: string
  splits:
  - name: train
    num_bytes: 1067020043
    num_examples: 90447
  - name: validation
    num_bytes: 88218929
    num_examples: 7405
  download_size: 678798579
  dataset_size: 1155238972
- config_name: LongAlpaca
  features:
  - name: context
    dtype: string
  - name: instruction
    dtype: string
  - name: output
    dtype: string
  - name: instruction_sentence
    sequence: string
  - name: context_sentence
    sequence: string
  splits:
  - name: train
    num_bytes: 863106151
    num_examples: 8937
  download_size: 437700336
  dataset_size: 863106151
- config_name: MultiNews
  features:
  - name: context
    dtype: string
  - name: output
    dtype: string
  - name: instruction
    dtype: string
  - name: instruction_sentence
    sequence: string
  - name: context_sentence
    sequence: string
  splits:
  - name: train
    num_bytes: 1068760374
    num_examples: 44972
  download_size: 618956763
  dataset_size: 1068760374
- config_name: MultiRC
  features:
  - name: context
    dtype: string
  - name: instruction
    dtype: string
  - name: output
    dtype: string
  - name: instruction_sentence
    sequence: string
  - name: context_sentence
    sequence: string
  splits:
  - name: train
    num_bytes: 40076422
    num_examples: 12025
  download_size: 1832158
  dataset_size: 40076422
- config_name: NarrativeQA
  features:
  - name: context
    dtype: string
  - name: instruction
    dtype: string
  - name: output
    dtype: string
  - name: instruction_sentence
    sequence: string
  - name: context_sentence
    sequence: string
  splits:
  - name: train
    num_bytes: 22628688513
    num_examples: 32747
  download_size: 10248838935
  dataset_size: 22628688513
- config_name: QMsum
  features:
  - name: context
    dtype: string
  - name: output
    dtype: string
  - name: instruction
    dtype: string
  - name: instruction_sentence
    sequence: string
  - name: context_sentence
    sequence: string
  splits:
  - name: train
    num_bytes: 131114039
    num_examples: 1257
  download_size: 43975950
  dataset_size: 131114039
- config_name: Qasper
  features:
  - name: context
    dtype: string
  - name: instruction
    dtype: string
  - name: output
    dtype: string
  - name: instruction_sentence
    sequence: string
  - name: context_sentence
    sequence: string
  splits:
  - name: train
    num_bytes: 133289850
    num_examples: 2567
  download_size: 41804240
  dataset_size: 133289850
- config_name: Quality
  features:
  - name: instruction
    dtype: string
  - name: output
    dtype: string
  - name: context
    dtype: string
  - name: instruction_sentence
    sequence: string
  - name: context_sentence
    sequence: string
  splits:
  - name: train
    num_bytes: 127865467
    num_examples: 2523
  download_size: 19667917
  dataset_size: 127865467
- config_name: ReCoRD
  features:
  - name: context
    dtype: string
  - name: instruction
    dtype: string
  - name: output
    dtype: string
  - name: instruction_sentence
    sequence: string
  - name: context_sentence
    sequence: string
  splits:
  - name: train
    num_bytes: 52232819.4182468
    num_examples: 20000
  - name: validation
    num_bytes: 25851230
    num_examples: 10000
  - name: test
    num_bytes: 25710390
    num_examples: 10000
  download_size: 53870259
  dataset_size: 103794439.4182468
- config_name: SQuAD
  features:
  - name: context
    dtype: string
  - name: instruction
    dtype: string
  - name: output
    dtype: string
  - name: instruction_sentence
    sequence: string
  - name: context_sentence
    sequence: string
  splits:
  - name: train
    num_bytes: 161568764
    num_examples: 87599
  - name: validation
    num_bytes: 20509388
    num_examples: 10570
  download_size: 30105194
  dataset_size: 182078152
configs:
- config_name: BigPatent
  data_files:
  - split: train
    path: BigPatent/train-*
- config_name: BookSum
  data_files:
  - split: train
    path: BookSum/train-*
  - split: validation
    path: BookSum/validation-*
  - split: test
    path: BookSum/test-*
- config_name: BoolQ
  data_files:
  - split: train
    path: BoolQ/train-*
  - split: validation
    path: BoolQ/validation-*
  - split: test
    path: BoolQ/test-*
- config_name: CNN-DM
  data_files:
  - split: train
    path: CNN-DM/train-*
  - split: validation
    path: CNN-DM/validation-*
  - split: test
    path: CNN-DM/test-*
- config_name: CosmosQA
  data_files:
  - split: train
    path: CosmosQA/train-*
  - split: test
    path: CosmosQA/test-*
  - split: validation
    path: CosmosQA/validation-*
- config_name: DROP
  data_files:
  - split: train
    path: DROP/train-*
  - split: validation
    path: DROP/validation-*
- config_name: GovReport
  data_files:
  - split: train
    path: GovReport/train-*
  - split: validation
    path: GovReport/validation-*
  - split: test
    path: GovReport/test-*
- config_name: HotpotQA
  data_files:
  - split: train
    path: HotpotQA/train-*
  - split: validation
    path: HotpotQA/validation-*
- config_name: LongAlpaca
  data_files:
  - split: train
    path: LongAlpaca/train-*
- config_name: MultiNews
  data_files:
  - split: train
    path: MultiNews/train-*
- config_name: MultiRC
  data_files:
  - split: train
    path: MultiRC/train-*
- config_name: NarrativeQA
  data_files:
  - split: train
    path: NarrativeQA/train-*
- config_name: QMsum
  data_files:
  - split: train
    path: QMsum/train-*
- config_name: Qasper
  data_files:
  - split: train
    path: Qasper/train-*
- config_name: Quality
  data_files:
  - split: train
    path: Quality/train-*
- config_name: ReCoRD
  data_files:
  - split: train
    path: ReCoRD/train-*
  - split: validation
    path: ReCoRD/validation-*
  - split: test
    path: ReCoRD/test-*
- config_name: SQuAD
  data_files:
  - split: train
    path: SQuAD/train-*
  - split: validation
    path: SQuAD/validation-*
- config_name: TriviaQA
  data_files:
  - split: train
    path: TriviaQA/train-*
  - split: validation
    path: TriviaQA/validation-*
  - split: test
    path: TriviaQA/test-*
- config_name: XSum
  data_files:
  - split: train
    path: XSum/train-*
  - split: validation
    path: XSum/validation-*
  - split: test
    path: XSum/test-*
---
This dataset comprises multiple sub-datasets, each exposed as a named configuration with its own features and splits. Every configuration listed under dataset_info has the fields context, output, instruction, instruction_sentence, and context_sentence, and the card gives the number of examples and bytes for each split that configuration provides (always train, plus validation and test where available).
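
Because the configurations do not all ship the same splits, a loader has to pick a split that actually exists. The following is a minimal sketch, assuming the Hugging Face `datasets` package is installed and the repository is reachable. The `SPLITS` table copies only a subset of the `configs` section, and the helper names `resolve_split` and `load_config` are hypothetical, not part of the dataset or of any library.

```python
# Splits per configuration, copied from the card's `configs` section (subset only).
SPLITS = {
    "BigPatent": ["train"],
    "BoolQ": ["train", "validation", "test"],
    "DROP": ["train", "validation"],
    "SQuAD": ["train", "validation"],
}


def resolve_split(config_name, preferred="test"):
    """Return `preferred` if the config has it, else fall back to validation, then train."""
    available = SPLITS[config_name]
    for candidate in (preferred, "validation", "train"):
        if candidate in available:
            return candidate
    raise ValueError(f"No usable split for {config_name!r}")


def load_config(config_name, preferred="test"):
    """Load one configuration from the Hub (needs network access)."""
    # Imported lazily so resolve_split works without the `datasets` package.
    from datasets import load_dataset

    split = resolve_split(config_name, preferred)
    return load_dataset("MLP-SEMO/IT_data_old", config_name, split=split)
```

Calling `load_config("DROP")` would request the validation split under these assumptions, since DROP has no test split in the card.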
Provider:
MLP-SEMO
Summary of original information
Dataset overview
BigPatent
- Features:
  - context: string
  - output: string
  - instruction: string
  - instruction_sentence: sequence of strings
  - context_sentence: sequence of strings
- Splits:
  - train: 50000 examples, 3170511042 bytes
- Download size: 1285289620 bytes
- Dataset size: 3170511042 bytes
BookSum
- Features:
  - output: string
  - context: string
  - instruction: string
  - instruction_sentence: sequence of strings
  - context_sentence: sequence of strings
- Splits:
  - train: 9600 examples, 478923658 bytes
  - validation: 1484 examples, 63995551 bytes
  - test: 1431 examples, 72245551 bytes
- Download size: 364099731 bytes
- Dataset size: 615164760 bytes
BoolQ
- Features:
  - instruction: string
  - context: string
  - output: string
  - instruction_sentence: sequence of strings
  - context_sentence: sequence of strings
- Splits:
  - train: 9427 examples, 14345597 bytes
  - validation: 3270 examples, 4924902 bytes
  - test: 3245 examples, 4896812 bytes
- Download size: 13149213 bytes
- Dataset size: 24167311 bytes
CNN-DM
- Features:
  - context: string
  - output: string
  - instruction: string
  - instruction_sentence: sequence of strings
  - context_sentence: sequence of strings
- Splits:
  - train: 20000 examples, 174648422.8 bytes
  - validation: 13368 examples, 113740739 bytes
  - test: 11490 examples, 98568284 bytes
- Download size: 229883492 bytes
- Dataset size: 386957445.8 bytes
CosmosQA
- Features:
  - context: string
  - instruction: string
  - output: string
  - instruction_sentence: sequence of strings
  - context_sentence: sequence of strings
- Splits:
  - train: 25262 examples, 36305800 bytes
  - validation: 2985 examples, 4634409 bytes
  - test: 6963 examples, 10832999 bytes
- Download size: 19897939 bytes
- Dataset size: 51773208 bytes
DROP
- Features:
  - context: string
  - instruction: string
  - output: string
  - instruction_sentence: sequence of strings
  - context_sentence: sequence of strings
- Splits:
  - train: 10000 examples, 27847360 bytes
  - validation: 9535 examples, 23480132 bytes
- Download size: 15222763 bytes
- Dataset size: 51327492 bytes
GovReport
- Features:
  - context: string
  - output: string
  - instruction: string
  - instruction_sentence: sequence of strings
  - context_sentence: sequence of strings
- Splits:
  - train: 17517 examples, 1868462340 bytes
  - validation: 973 examples, 108907327 bytes
  - test: 973 examples, 100365631 bytes
- Download size: 1000490212 bytes
- Dataset size: 2077735298 bytes
HotpotQA
- Features:
  - instruction: string
  - output: string
  - context: sequence of strings
  - instruction_sentence: sequence of strings
  - context_sentence: sequence of strings
- Splits:
  - train: 90447 examples, 1067020043 bytes
  - validation: 7405 examples, 88218929 bytes
- Download size: 678798579 bytes
- Dataset size: 1155238972 bytes
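
HotpotQA is the only configuration in this overview whose context is a sequence of strings rather than a single string, so code that mixes configurations may want to normalize the field first. The sketch below is one possible approach, not the authors' preprocessing. The helper names, the newline separator, the prompt layout, and the sample rows are all assumptions made for illustration.

```python
def context_as_text(example, separator="\n"):
    """Return the context as one string, joining the parts if it is a sequence."""
    context = example["context"]
    if isinstance(context, str):
        return context
    return separator.join(context)


def build_prompt(example):
    """Combine instruction and context into one prompt (the layout is an assumption)."""
    return f"{example['instruction']}\n\n{context_as_text(example)}"


# Hypothetical rows shaped like the two kinds of configuration.
string_row = {"instruction": "Answer yes or no.", "context": "A single passage."}
sequence_row = {"instruction": "Answer the question.", "context": ["First part.", "Second part."]}

print(build_prompt(string_row))
print(build_prompt(sequence_row))
```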
LongAlpaca
- Features:
  - context: string
  - instruction: string
  - output: string
  - instruction_sentence: sequence of strings
  - context_sentence: sequence of strings
- Splits:
  - train: 8937 examples, 863106151 bytes
- Download size: 437700336 bytes
- Dataset size: 863106151 bytes
MultiNews
- Features:
  - context: string
  - output: string
  - instruction: string
  - instruction_sentence: sequence of strings
  - context_sentence: sequence of strings
- Splits:
  - train: 44972 examples, 1068760374 bytes
- Download size: 618956763 bytes
- Dataset size: 1068760374 bytes
MultiRC
- Features:
  - context: string
  - instruction: string
  - output: string
  - instruction_sentence: sequence of strings
  - context_sentence: sequence of strings
- Splits:
  - train: 12025 examples, 40076422 bytes
- Download size: 1832158 bytes
- Dataset size: 40076422 bytes
NarrativeQA
- Features:
  - context: string
  - instruction: string
  - output: string
  - instruction_sentence: sequence of strings
  - context_sentence: sequence of strings
- Splits:
  - train: 32747 examples, 22628688513 bytes
- Download size: 10248838935 bytes
- Dataset size: 22628688513 bytes
QMsum
- Features:
  - context: string
  - output: string
  - instruction: string
  - instruction_sentence: sequence of strings
  - context_sentence: sequence of strings
- Splits:
  - train: 1257 examples, 131114039 bytes
- Download size: 43975950 bytes
- Dataset size: 131114039 bytes
Qasper
- Features:
  - context: string
  - instruction: string
  - output: string
  - instruction_sentence: sequence of strings
  - context_sentence: sequence of strings
- Splits:
  - train: 2567 examples, 133289850 bytes
- Download size: 41804240 bytes
- Dataset size: 133289850 bytes
Quality
- Features:
  - instruction: string
  - output: string
  - context: string
  - instruction_sentence: sequence of strings
  - context_sentence: sequence of strings
- Splits:
  - train: 2523 examples, 127865467 bytes
- Download size: 19667917 bytes
- Dataset size: 127865467 bytes
ReCoRD
- Features:
  - context: string
  - instruction: string
  - output: string
  - instruction_sentence: sequence of strings
  - context_sentence: sequence of strings
- Splits:
  - train: 20000 examples, 52232819.4182468 bytes
  - validation: 10000 examples, 25851230 bytes
  - test: 10000 examples, 25710390 bytes
- Download size: 53870259 bytes
- Dataset size: 103794439.4182468 bytes
SQuAD
- Features:
  - context: string
  - instruction: string
  - output: string
  - instruction_sentence: sequence of strings
  - context_sentence: sequence of strings
- Splits:
  - train: 87599 examples, 161568764 bytes
  - validation: 10570 examples, 20509388 bytes
- Download size: 30105194 bytes
- Dataset size: 182078152 bytes
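
In the figures above, each dataset size appears to be the sum of the per-split byte counts, while the download size is the smaller, compressed footprint. The following sketch checks that relationship for a few configurations. The numbers are copied from the listing, and the `CARD` layout and helper names are hypothetical choices made for illustration.

```python
# (num_bytes, num_examples) per split, copied from the overview for four configurations.
CARD = {
    "BookSum": {
        "splits": {
            "train": (478923658, 9600),
            "validation": (63995551, 1484),
            "test": (72245551, 1431),
        },
        "download_size": 364099731,
        "dataset_size": 615164760,
    },
    "BoolQ": {
        "splits": {
            "train": (14345597, 9427),
            "validation": (4924902, 3270),
            "test": (4896812, 3245),
        },
        "download_size": 13149213,
        "dataset_size": 24167311,
    },
    "DROP": {
        "splits": {"train": (27847360, 10000), "validation": (23480132, 9535)},
        "download_size": 15222763,
        "dataset_size": 51327492,
    },
    "SQuAD": {
        "splits": {"train": (161568764, 87599), "validation": (20509388, 10570)},
        "download_size": 30105194,
        "dataset_size": 182078152,
    },
}


def sizes_consistent(info):
    """True when the split byte counts add up to the reported dataset size."""
    return sum(num_bytes for num_bytes, _ in info["splits"].values()) == info["dataset_size"]


def total_examples(info):
    """Number of examples across all splits of one configuration."""
    return sum(num_examples for _, num_examples in info["splits"].values())


def download_ratio(info):
    """Download size as a fraction of the in-memory dataset size."""
    return info["download_size"] / info["dataset_size"]


for name, info in CARD.items():
    print(name, sizes_consistent(info), total_examples(info), round(download_ratio(info), 3))
```

The same check could be extended to the remaining configurations. CNN-DM and ReCoRD report fractional byte counts, so an exact equality test on those would need a tolerance.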



