MLP-SEMO/IT_datasets
Hugging Face | Updated 2024-05-30 | Indexed 2024-06-12
Download link:
https://hf-mirror.com/datasets/MLP-SEMO/IT_datasets
Resource description:
---
dataset_info:
- config_name: BigPatent
  features:
  - name: context
    dtype: string
  - name: output
    dtype: string
  - name: instruction
    dtype: string
  - name: context_sentence
    sequence: string
  splits:
  - name: train
    num_bytes: 3056976240
    num_examples: 50000
  download_size: 1272481947
  dataset_size: 3056976240
- config_name: BookSum
  features:
  - name: output
    dtype: string
  - name: context
    dtype: string
  - name: instruction
    dtype: string
  - name: context_sentence
    sequence: string
  splits:
  - name: train
    num_bytes: 476317871
    num_examples: 9600
  - name: validation
    num_bytes: 63750022
    num_examples: 1484
  - name: test
    num_bytes: 71934433
    num_examples: 1431
  download_size: 363438025
  dataset_size: 612002326
- config_name: BoolQ
  features:
  - name: instruction
    dtype: string
  - name: context
    dtype: string
  - name: output
    dtype: string
  - name: context_sentence
    sequence: string
  splits:
  - name: train
    num_bytes: 12853661
    num_examples: 9427
  - name: validation
    num_bytes: 4410696
    num_examples: 3270
  - name: test
    num_bytes: 4386660
    num_examples: 3245
  download_size: 12607245
  dataset_size: 21651017
- config_name: CNN-DM
  features:
  - name: context
    dtype: string
  - name: output
    dtype: string
  - name: instruction
    dtype: string
  - name: context_sentence
    sequence: string
  splits:
  - name: train
    num_bytes: 863427270
    num_examples: 100000
  - name: validation
    num_bytes: 112409140
    num_examples: 13368
  - name: test
    num_bytes: 97428338
    num_examples: 11490
  download_size: 652210853
  dataset_size: 1073264748
- config_name: CosmosQA
  features:
  - name: context
    dtype: string
  - name: instruction
    dtype: string
  - name: output
    dtype: string
  - name: context_sentence
    sequence: string
  splits:
  - name: train
    num_bytes: 26971402
    num_examples: 25262
  - name: test
    num_bytes: 8004035
    num_examples: 6963
  - name: validation
    num_bytes: 3421792
    num_examples: 2985
  download_size: 15773346
  dataset_size: 38397229
- config_name: DROP
  features:
  - name: context
    dtype: string
  - name: instruction
    dtype: string
  - name: output
    dtype: string
  - name: context_sentence
    sequence: string
  splits:
  - name: train
    num_bytes: 26295805
    num_examples: 10000
  - name: validation
    num_bytes: 21971393
    num_examples: 9535
  download_size: 14620500
  dataset_size: 48267198
- config_name: GovReport
  features:
  - name: context
    dtype: string
  - name: output
    dtype: string
  - name: instruction
    dtype: string
  - name: context_sentence
    sequence: string
  splits:
  - name: train
    num_bytes: 1868279205
    num_examples: 17517
  - name: validation
    num_bytes: 108895887
    num_examples: 973
  - name: test
    num_bytes: 100349892
    num_examples: 973
  download_size: 1001067529
  dataset_size: 2077524984
- config_name: HotpotQA
  features:
  - name: instruction
    dtype: string
  - name: output
    dtype: string
  - name: context
    sequence: string
  - name: context_sentence
    sequence: string
  splits:
  - name: train
    num_bytes: 1048709644
    num_examples: 90447
  - name: validation
    num_bytes: 86820188
    num_examples: 7405
  download_size: 671874576
  dataset_size: 1135529832
- config_name: LongAlpaca
  features:
  - name: context
    dtype: string
  - name: instruction
    dtype: string
  - name: output
    dtype: string
  - name: context_sentence
    sequence: string
  splits:
  - name: train
    num_bytes: 853542732
    num_examples: 8937
  download_size: 436405773
  dataset_size: 853542732
- config_name: MultiNews
  features:
  - name: context
    dtype: string
  - name: output
    dtype: string
  - name: instruction
    dtype: string
  - name: context_sentence
    sequence: string
  splits:
  - name: train
    num_bytes: 1063566489
    num_examples: 44972
  download_size: 618228866
  dataset_size: 1063566489
- config_name: MultiRC
  features:
  - name: context
    dtype: string
  - name: instruction
    dtype: string
  - name: output
    dtype: string
  - name: context_sentence
    sequence: string
  splits:
  - name: train
    num_bytes: 39225163
    num_examples: 12025
  download_size: 1607874
  dataset_size: 39225163
- config_name: NarrativeQA
  features:
  - name: context
    dtype: string
  - name: instruction
    dtype: string
  - name: output
    dtype: string
  - name: context_sentence
    sequence: string
  splits:
  - name: train
    num_bytes: 21589984139
    num_examples: 32747
  download_size: 10012303798
  dataset_size: 21589984139
- config_name: QMsum
  features:
  - name: context
    dtype: string
  - name: output
    dtype: string
  - name: instruction
    dtype: string
  - name: context_sentence
    sequence: string
  splits:
  - name: train
    num_bytes: 130848491
    num_examples: 1257
  download_size: 43933066
  dataset_size: 130848491
- config_name: Qasper
  features:
  - name: context
    dtype: string
  - name: instruction
    dtype: string
  - name: output
    dtype: string
  - name: context_sentence
    sequence: string
  splits:
  - name: train
    num_bytes: 132355093
    num_examples: 2567
  download_size: 41697628
  dataset_size: 132355093
- config_name: Quality
  features:
  - name: instruction
    dtype: string
  - name: output
    dtype: string
  - name: context
    dtype: string
  - name: context_sentence
    sequence: string
  splits:
  - name: train
    num_bytes: 125840855
    num_examples: 2523
  download_size: 18496863
  dataset_size: 125840855
- config_name: ReCoRD
  features:
  - name: context
    dtype: string
  - name: instruction
    dtype: string
  - name: output
    dtype: string
  - name: context_sentence
    sequence: string
  splits:
  - name: train
    num_bytes: 238582023
    num_examples: 100730
  - name: validation
    num_bytes: 23362645
    num_examples: 10000
  - name: test
    num_bytes: 23217267
    num_examples: 10000
  download_size: 118874876
  dataset_size: 285161935
- config_name: SQuAD
  features:
  - name: context
    dtype: string
  - name: instruction
    dtype: string
  - name: output
    dtype: string
  - name: context_sentence
    sequence: string
  splits:
  - name: train
    num_bytes: 148929842
    num_examples: 87599
  - name: validation
    num_bytes: 18980126
    num_examples: 10570
  download_size: 26422546
  dataset_size: 167909968
- config_name: XSum
  features:
  - name: context
    dtype: string
  - name: output
    dtype: string
  - name: instruction
    dtype: string
  - name: context_sentence
    sequence: string
  splits:
  - name: train
    num_bytes: 467899254
    num_examples: 100000
  - name: validation
    num_bytes: 52521877
    num_examples: 11332
  - name: test
    num_bytes: 53464225
    num_examples: 11334
  download_size: 358953960
  dataset_size: 573885356
configs:
- config_name: BigPatent
  data_files:
  - split: train
    path: BigPatent/train-*
- config_name: BookSum
  data_files:
  - split: train
    path: BookSum/train-*
  - split: validation
    path: BookSum/validation-*
  - split: test
    path: BookSum/test-*
- config_name: BoolQ
  data_files:
  - split: train
    path: BoolQ/train-*
  - split: validation
    path: BoolQ/validation-*
  - split: test
    path: BoolQ/test-*
- config_name: CNN-DM
  data_files:
  - split: train
    path: CNN-DM/train-*
  - split: validation
    path: CNN-DM/validation-*
  - split: test
    path: CNN-DM/test-*
- config_name: CosmosQA
  data_files:
  - split: train
    path: CosmosQA/train-*
  - split: test
    path: CosmosQA/test-*
  - split: validation
    path: CosmosQA/validation-*
- config_name: DROP
  data_files:
  - split: train
    path: DROP/train-*
  - split: validation
    path: DROP/validation-*
- config_name: GovReport
  data_files:
  - split: train
    path: GovReport/train-*
  - split: validation
    path: GovReport/validation-*
  - split: test
    path: GovReport/test-*
- config_name: HotpotQA
  data_files:
  - split: train
    path: HotpotQA/train-*
  - split: validation
    path: HotpotQA/validation-*
- config_name: LongAlpaca
  data_files:
  - split: train
    path: LongAlpaca/train-*
- config_name: MultiNews
  data_files:
  - split: train
    path: MultiNews/train-*
- config_name: MultiRC
  data_files:
  - split: train
    path: MultiRC/train-*
- config_name: NarrativeQA
  data_files:
  - split: train
    path: NarrativeQA/train-*
- config_name: QMsum
  data_files:
  - split: train
    path: QMsum/train-*
- config_name: Qasper
  data_files:
  - split: train
    path: Qasper/train-*
- config_name: Quality
  data_files:
  - split: train
    path: Quality/train-*
- config_name: ReCoRD
  data_files:
  - split: train
    path: ReCoRD/train-*
  - split: validation
    path: ReCoRD/validation-*
  - split: test
    path: ReCoRD/test-*
- config_name: SQuAD
  data_files:
  - split: train
    path: SQuAD/train-*
  - split: validation
    path: SQuAD/validation-*
- config_name: XSum
  data_files:
  - split: train
    path: XSum/train-*
  - split: validation
    path: XSum/validation-*
  - split: test
    path: XSum/test-*
---
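The metadata above names each configuration and the splits it provides. As a minimal sketch of how one configuration might be loaded, assuming the Hugging Face `datasets` package is installed and that the repository ID matches the download link: the `SPLITS` table copies only a few configurations from the metadata, and `has_split` and `load_config` are hypothetical helper names, not part of the dataset.

```python
# Splits per configuration, copied from the README metadata (subset only).
SPLITS = {
    "BigPatent": ["train"],
    "BoolQ": ["train", "validation", "test"],
    "DROP": ["train", "validation"],
    "HotpotQA": ["train", "validation"],
    "XSum": ["train", "validation", "test"],
}

# Assumption: the Hub repository ID is the path shown in the download link.
REPO_ID = "MLP-SEMO/IT_datasets"


def has_split(config: str, split: str) -> bool:
    """Return True if the table above lists this config/split pair."""
    return split in SPLITS.get(config, [])


def load_config(config: str, split: str = "train"):
    """Load one configuration; needs network access and the datasets package."""
    if not has_split(config, split):
        raise ValueError(f"{config!r} has no {split!r} split in the table")
    # Imported lazily so has_split() stays usable without the package.
    from datasets import load_dataset

    return load_dataset(REPO_ID, config, split=split)


# Example call (downloads data, so it is left commented out):
# boolq_val = load_config("BoolQ", "validation")
```

Checking the split name before calling `load_dataset` gives an early, readable error for configurations such as DROP, which has no test split.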
The README describes 18 dataset configurations, from BigPatent to XSum. Every configuration exposes the same four features: context, output, instruction, and context_sentence. All are strings except context_sentence, which is a sequence of strings, and the context of HotpotQA, which is also a sequence of strings. Every configuration has a train split; seven also have validation and test splits, and three (DROP, HotpotQA, SQuAD) have validation only. For each split the README lists the number of examples and the size in bytes, and for each configuration it gives the download size and the total dataset size.
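In the metadata, HotpotQA declares context as a sequence of strings, while the other configurations declare a single string. A minimal sketch of a helper that handles both shapes follows; the function name, the newline separator, and the sample records are hypothetical and only mimic the two schemas.

```python
from typing import Any, Mapping


def context_as_text(record: Mapping[str, Any], sep: str = "\n") -> str:
    """Return the record's context as one string, joining list-valued contexts."""
    context = record["context"]
    if isinstance(context, str):
        return context
    return sep.join(context)


# Hypothetical records, not taken from the dataset.
plain = {
    "instruction": "Answer yes or no.",
    "context": "A short passage.",
    "output": "yes",
}
multi = {
    "instruction": "Answer the question.",
    "context": ["First passage.", "Second passage."],
    "output": "an answer",
}

print(context_as_text(plain))
print(context_as_text(multi))
```

Normalizing at read time keeps downstream prompt-building code identical across configurations.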
Provider:
MLP-SEMO
Original information summary
Dataset overview
BigPatent
- Features:
  - context: string
  - output: string
  - instruction: string
  - context_sentence: sequence of strings
- Splits:
  - train: 50000 examples, 3056976240 bytes
- Download size: 1272481947 bytes
- Dataset size: 3056976240 bytes
BookSum
- Features:
  - output: string
  - context: string
  - instruction: string
  - context_sentence: sequence of strings
- Splits:
  - train: 9600 examples, 476317871 bytes
  - validation: 1484 examples, 63750022 bytes
  - test: 1431 examples, 71934433 bytes
- Download size: 363438025 bytes
- Dataset size: 612002326 bytes
BoolQ
- Features:
  - instruction: string
  - context: string
  - output: string
  - context_sentence: sequence of strings
- Splits:
  - train: 9427 examples, 12853661 bytes
  - validation: 3270 examples, 4410696 bytes
  - test: 3245 examples, 4386660 bytes
- Download size: 12607245 bytes
- Dataset size: 21651017 bytes
CNN-DM
- Features:
  - context: string
  - output: string
  - instruction: string
  - context_sentence: sequence of strings
- Splits:
  - train: 100000 examples, 863427270 bytes
  - validation: 13368 examples, 112409140 bytes
  - test: 11490 examples, 97428338 bytes
- Download size: 652210853 bytes
- Dataset size: 1073264748 bytes
CosmosQA
- Features:
  - context: string
  - instruction: string
  - output: string
  - context_sentence: sequence of strings
- Splits:
  - train: 25262 examples, 26971402 bytes
  - validation: 2985 examples, 3421792 bytes
  - test: 6963 examples, 8004035 bytes
- Download size: 15773346 bytes
- Dataset size: 38397229 bytes
DROP
- Features:
  - context: string
  - instruction: string
  - output: string
  - context_sentence: sequence of strings
- Splits:
  - train: 10000 examples, 26295805 bytes
  - validation: 9535 examples, 21971393 bytes
- Download size: 14620500 bytes
- Dataset size: 48267198 bytes
GovReport
- Features:
  - context: string
  - output: string
  - instruction: string
  - context_sentence: sequence of strings
- Splits:
  - train: 17517 examples, 1868279205 bytes
  - validation: 973 examples, 108895887 bytes
  - test: 973 examples, 100349892 bytes
- Download size: 1001067529 bytes
- Dataset size: 2077524984 bytes
HotpotQA
- Features:
  - instruction: string
  - output: string
  - context: sequence of strings
  - context_sentence: sequence of strings
- Splits:
  - train: 90447 examples, 1048709644 bytes
  - validation: 7405 examples, 86820188 bytes
- Download size: 671874576 bytes
- Dataset size: 1135529832 bytes
LongAlpaca
- Features:
  - context: string
  - instruction: string
  - output: string
  - context_sentence: sequence of strings
- Splits:
  - train: 8937 examples, 853542732 bytes
- Download size: 436405773 bytes
- Dataset size: 853542732 bytes
MultiNews
- Features:
  - context: string
  - output: string
  - instruction: string
  - context_sentence: sequence of strings
- Splits:
  - train: 44972 examples, 1063566489 bytes
- Download size: 618228866 bytes
- Dataset size: 1063566489 bytes
MultiRC
- Features:
  - context: string
  - instruction: string
  - output: string
  - context_sentence: sequence of strings
- Splits:
  - train: 12025 examples, 39225163 bytes
- Download size: 1607874 bytes
- Dataset size: 39225163 bytes
NarrativeQA
- Features:
  - context: string
  - instruction: string
  - output: string
  - context_sentence: sequence of strings
- Splits:
  - train: 32747 examples, 21589984139 bytes
- Download size: 10012303798 bytes
- Dataset size: 21589984139 bytes
QMsum
- Features:
  - context: string
  - output: string
  - instruction: string
  - context_sentence: sequence of strings
- Splits:
  - train: 1257 examples, 130848491 bytes
- Download size: 43933066 bytes
- Dataset size: 130848491 bytes
Qasper
- Features:
  - context: string
  - instruction: string
  - output: string
  - context_sentence: sequence of strings
- Splits:
  - train: 2567 examples, 132355093 bytes
- Download size: 41697628 bytes
- Dataset size: 132355093 bytes
Quality
- Features:
  - instruction: string
  - output: string
  - context: string
  - context_sentence: sequence of strings
- Splits:
  - train: 2523 examples, 125840855 bytes
- Download size: 18496863 bytes
- Dataset size: 125840855 bytes
ReCoRD
- Features:
  - context: string
  - instruction: string
  - output: string
  - context_sentence: sequence of strings
- Splits:
  - train: 100730 examples, 238582023 bytes
  - validation: 10000 examples, 23362645 bytes
  - test: 10000 examples, 23217267 bytes
- Download size: 118874876 bytes
- Dataset size: 285161935 bytes
SQuAD
- Features:
  - context: string
  - instruction: string
  - output: string
  - context_sentence: sequence of strings
- Splits:
  - train: 87599 examples, 148929842 bytes
  - validation: 10570 examples, 18980126 bytes
- Download size: 26422546 bytes
- Dataset size: 167909968 bytes
XSum
- Features:
  - context: string
  - output: string
  - instruction: string
  - context_sentence: sequence of strings
- Splits:
  - train: 100000 examples, 467899254 bytes
  - validation: 11332 examples, 52521877 bytes
  - test: 11334 examples, 53464225 bytes
- Download size: 358953960 bytes
- Dataset size: 573885356 bytes
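In the figures above, the dataset size of a configuration equals the sum of its split sizes, while the download size is smaller, presumably because the hosted files are compressed. A minimal sketch that checks this for BookSum with the listed numbers; the variable and helper names are hypothetical, and the decimal units are an illustrative choice.

```python
# Figures for BookSum, copied from the listing (all values in bytes).
booksum_splits = {"train": 476317871, "validation": 63750022, "test": 71934433}
booksum_dataset_size = 612002326
booksum_download_size = 363438025


def human_bytes(n: int) -> str:
    """Format a byte count with decimal units (1 GB = 10**9 bytes)."""
    value = float(n)
    for unit in ("B", "kB", "MB", "GB", "TB"):
        # Stop at the first unit that keeps the value below 1000, or at TB.
        if value < 1000 or unit == "TB":
            return f"{value:.2f} {unit}"
        value /= 1000


total = sum(booksum_splits.values())
print("splits add up to dataset_size:", total == booksum_dataset_size)
print("dataset size:", human_bytes(booksum_dataset_size))
print("download size:", human_bytes(booksum_download_size))
print(f"download / dataset ratio: {booksum_download_size / booksum_dataset_size:.2f}")
```

The same check can be repeated for any other configuration by substituting its split sizes and totals from the listing.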
The content above was collected and summarized by 遇见数据集.



