MLP-SEMO/IT_datasets

Name: MLP-SEMO/IT_datasets
Creator: MLP-SEMO
Published: 2024-05-30 04:14:45
License: No description provided

Source: Hugging Face. Updated 2024-05-30. Indexed 2024-06-12.

Download link:

https://hf-mirror.com/datasets/MLP-SEMO/IT_datasets


Resource overview:

---
dataset_info:
- config_name: BigPatent
  features:
  - {name: context, dtype: string}
  - {name: output, dtype: string}
  - {name: instruction, dtype: string}
  - {name: context_sentence, sequence: string}
  splits:
  - {name: train, num_bytes: 3056976240, num_examples: 50000}
  download_size: 1272481947
  dataset_size: 3056976240
- config_name: BookSum
  features:
  - {name: output, dtype: string}
  - {name: context, dtype: string}
  - {name: instruction, dtype: string}
  - {name: context_sentence, sequence: string}
  splits:
  - {name: train, num_bytes: 476317871, num_examples: 9600}
  - {name: validation, num_bytes: 63750022, num_examples: 1484}
  - {name: test, num_bytes: 71934433, num_examples: 1431}
  download_size: 363438025
  dataset_size: 612002326
- config_name: BoolQ
  features:
  - {name: instruction, dtype: string}
  - {name: context, dtype: string}
  - {name: output, dtype: string}
  - {name: context_sentence, sequence: string}
  splits:
  - {name: train, num_bytes: 12853661, num_examples: 9427}
  - {name: validation, num_bytes: 4410696, num_examples: 3270}
  - {name: test, num_bytes: 4386660, num_examples: 3245}
  download_size: 12607245
  dataset_size: 21651017
- config_name: CNN-DM
  features:
  - {name: context, dtype: string}
  - {name: output, dtype: string}
  - {name: instruction, dtype: string}
  - {name: context_sentence, sequence: string}
  splits:
  - {name: train, num_bytes: 863427270, num_examples: 100000}
  - {name: validation, num_bytes: 112409140, num_examples: 13368}
  - {name: test, num_bytes: 97428338, num_examples: 11490}
  download_size: 652210853
  dataset_size: 1073264748
- config_name: CosmosQA
  features:
  - {name: context, dtype: string}
  - {name: instruction, dtype: string}
  - {name: output, dtype: string}
  - {name: context_sentence, sequence: string}
  splits:
  - {name: train, num_bytes: 26971402, num_examples: 25262}
  - {name: test, num_bytes: 8004035, num_examples: 6963}
  - {name: validation, num_bytes: 3421792, num_examples: 2985}
  download_size: 15773346
  dataset_size: 38397229
- config_name: DROP
  features:
  - {name: context, dtype: string}
  - {name: instruction, dtype: string}
  - {name: output, dtype: string}
  - {name: context_sentence, sequence: string}
  splits:
  - {name: train, num_bytes: 26295805, num_examples: 10000}
  - {name: validation, num_bytes: 21971393, num_examples: 9535}
  download_size: 14620500
  dataset_size: 48267198
- config_name: GovReport
  features:
  - {name: context, dtype: string}
  - {name: output, dtype: string}
  - {name: instruction, dtype: string}
  - {name: context_sentence, sequence: string}
  splits:
  - {name: train, num_bytes: 1868279205, num_examples: 17517}
  - {name: validation, num_bytes: 108895887, num_examples: 973}
  - {name: test, num_bytes: 100349892, num_examples: 973}
  download_size: 1001067529
  dataset_size: 2077524984
- config_name: HotpotQA
  features:
  - {name: instruction, dtype: string}
  - {name: output, dtype: string}
  - {name: context, sequence: string}
  - {name: context_sentence, sequence: string}
  splits:
  - {name: train, num_bytes: 1048709644, num_examples: 90447}
  - {name: validation, num_bytes: 86820188, num_examples: 7405}
  download_size: 671874576
  dataset_size: 1135529832
- config_name: LongAlpaca
  features:
  - {name: context, dtype: string}
  - {name: instruction, dtype: string}
  - {name: output, dtype: string}
  - {name: context_sentence, sequence: string}
  splits:
  - {name: train, num_bytes: 853542732, num_examples: 8937}
  download_size: 436405773
  dataset_size: 853542732
- config_name: MultiNews
  features:
  - {name: context, dtype: string}
  - {name: output, dtype: string}
  - {name: instruction, dtype: string}
  - {name: context_sentence, sequence: string}
  splits:
  - {name: train, num_bytes: 1063566489, num_examples: 44972}
  download_size: 618228866
  dataset_size: 1063566489
- config_name: MultiRC
  features:
  - {name: context, dtype: string}
  - {name: instruction, dtype: string}
  - {name: output, dtype: string}
  - {name: context_sentence, sequence: string}
  splits:
  - {name: train, num_bytes: 39225163, num_examples: 12025}
  download_size: 1607874
  dataset_size: 39225163
- config_name: NarrativeQA
  features:
  - {name: context, dtype: string}
  - {name: instruction, dtype: string}
  - {name: output, dtype: string}
  - {name: context_sentence, sequence: string}
  splits:
  - {name: train, num_bytes: 21589984139, num_examples: 32747}
  download_size: 10012303798
  dataset_size: 21589984139
- config_name: QMsum
  features:
  - {name: context, dtype: string}
  - {name: output, dtype: string}
  - {name: instruction, dtype: string}
  - {name: context_sentence, sequence: string}
  splits:
  - {name: train, num_bytes: 130848491, num_examples: 1257}
  download_size: 43933066
  dataset_size: 130848491
- config_name: Qasper
  features:
  - {name: context, dtype: string}
  - {name: instruction, dtype: string}
  - {name: output, dtype: string}
  - {name: context_sentence, sequence: string}
  splits:
  - {name: train, num_bytes: 132355093, num_examples: 2567}
  download_size: 41697628
  dataset_size: 132355093
- config_name: Quality
  features:
  - {name: instruction, dtype: string}
  - {name: output, dtype: string}
  - {name: context, dtype: string}
  - {name: context_sentence, sequence: string}
  splits:
  - {name: train, num_bytes: 125840855, num_examples: 2523}
  download_size: 18496863
  dataset_size: 125840855
- config_name: ReCoRD
  features:
  - {name: context, dtype: string}
  - {name: instruction, dtype: string}
  - {name: output, dtype: string}
  - {name: context_sentence, sequence: string}
  splits:
  - {name: train, num_bytes: 238582023, num_examples: 100730}
  - {name: validation, num_bytes: 23362645, num_examples: 10000}
  - {name: test, num_bytes: 23217267, num_examples: 10000}
  download_size: 118874876
  dataset_size: 285161935
- config_name: SQuAD
  features:
  - {name: context, dtype: string}
  - {name: instruction, dtype: string}
  - {name: output, dtype: string}
  - {name: context_sentence, sequence: string}
  splits:
  - {name: train, num_bytes: 148929842, num_examples: 87599}
  - {name: validation, num_bytes: 18980126, num_examples: 10570}
  download_size: 26422546
  dataset_size: 167909968
- config_name: XSum
  features:
  - {name: context, dtype: string}
  - {name: output, dtype: string}
  - {name: instruction, dtype: string}
  - {name: context_sentence, sequence: string}
  splits:
  - {name: train, num_bytes: 467899254, num_examples: 100000}
  - {name: validation, num_bytes: 52521877, num_examples: 11332}
  - {name: test, num_bytes: 53464225, num_examples: 11334}
  download_size: 358953960
  dataset_size: 573885356
configs:
- config_name: BigPatent
  data_files:
  - {split: train, path: "BigPatent/train-*"}
- config_name: BookSum
  data_files:
  - {split: train, path: "BookSum/train-*"}
  - {split: validation, path: "BookSum/validation-*"}
  - {split: test, path: "BookSum/test-*"}
- config_name: BoolQ
  data_files:
  - {split: train, path: "BoolQ/train-*"}
  - {split: validation, path: "BoolQ/validation-*"}
  - {split: test, path: "BoolQ/test-*"}
- config_name: CNN-DM
  data_files:
  - {split: train, path: "CNN-DM/train-*"}
  - {split: validation, path: "CNN-DM/validation-*"}
  - {split: test, path: "CNN-DM/test-*"}
- config_name: CosmosQA
  data_files:
  - {split: train, path: "CosmosQA/train-*"}
  - {split: test, path: "CosmosQA/test-*"}
  - {split: validation, path: "CosmosQA/validation-*"}
- config_name: DROP
  data_files:
  - {split: train, path: "DROP/train-*"}
  - {split: validation, path: "DROP/validation-*"}
- config_name: GovReport
  data_files:
  - {split: train, path: "GovReport/train-*"}
  - {split: validation, path: "GovReport/validation-*"}
  - {split: test, path: "GovReport/test-*"}
- config_name: HotpotQA
  data_files:
  - {split: train, path: "HotpotQA/train-*"}
  - {split: validation, path: "HotpotQA/validation-*"}
- config_name: LongAlpaca
  data_files:
  - {split: train, path: "LongAlpaca/train-*"}
- config_name: MultiNews
  data_files:
  - {split: train, path: "MultiNews/train-*"}
- config_name: MultiRC
  data_files:
  - {split: train, path: "MultiRC/train-*"}
- config_name: NarrativeQA
  data_files:
  - {split: train, path: "NarrativeQA/train-*"}
- config_name: QMsum
  data_files:
  - {split: train, path: "QMsum/train-*"}
- config_name: Qasper
  data_files:
  - {split: train, path: "Qasper/train-*"}
- config_name: Quality
  data_files:
  - {split: train, path: "Quality/train-*"}
- config_name: ReCoRD
  data_files:
  - {split: train, path: "ReCoRD/train-*"}
  - {split: validation, path: "ReCoRD/validation-*"}
  - {split: test, path: "ReCoRD/test-*"}
- config_name: SQuAD
  data_files:
  - {split: train, path: "SQuAD/train-*"}
  - {split: validation, path: "SQuAD/validation-*"}
- config_name: XSum
  data_files:
  - {split: train, path: "XSum/train-*"}
  - {split: validation, path: "XSum/validation-*"}
  - {split: test, path: "XSum/test-*"}
---
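
The configs section of the metadata maps each split to a glob pattern such as BookSum/train-*. As a rough illustration of how such a pattern selects files, here is a minimal sketch using Python's fnmatch module. The shard file names are hypothetical (the card does not list them), and the loader that actually reads the card may resolve globs differently; fnmatch only approximates the idea.

```python
from fnmatch import fnmatch

# Hypothetical repository listing; the real shard names are not given in the card.
files = [
    "BookSum/train-00000-of-00002.parquet",
    "BookSum/train-00001-of-00002.parquet",
    "BookSum/validation-00000-of-00001.parquet",
    "BookSum/test-00000-of-00001.parquet",
]


def files_for_split(pattern, listing):
    """Return the entries of `listing` that match a data_files glob pattern."""
    return [name for name in listing if fnmatch(name, pattern)]


train_files = files_for_split("BookSum/train-*", files)
print(train_files)
```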

The README defines 18 configurations, from BigPatent to XSum. Every configuration exposes the same four fields: context, output, instruction, and context_sentence. context_sentence is always a sequence of strings; context is a single string everywhere except HotpotQA, where it is also a sequence of strings. All configurations have a train split, and some add validation and test splits. For each split the README gives the number of examples and the size in bytes, and for each configuration it gives the download size and the total dataset size.
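
Since the configurations do not all have the same splits, it can help to check a request against the card before downloading anything. The following is a minimal sketch, assuming the Hugging Face `datasets` package is installed and the repository id MLP-SEMO/IT_datasets is reachable. The helper names `check_split` and `load_config` are made up for illustration; the split table is copied from the metadata.

```python
ALL_THREE = ("train", "validation", "test")
TRAIN_VAL = ("train", "validation")
TRAIN_ONLY = ("train",)

# Splits listed in the card for each configuration.
AVAILABLE_SPLITS = {
    "BigPatent": TRAIN_ONLY, "BookSum": ALL_THREE, "BoolQ": ALL_THREE,
    "CNN-DM": ALL_THREE, "CosmosQA": ALL_THREE, "DROP": TRAIN_VAL,
    "GovReport": ALL_THREE, "HotpotQA": TRAIN_VAL, "LongAlpaca": TRAIN_ONLY,
    "MultiNews": TRAIN_ONLY, "MultiRC": TRAIN_ONLY, "NarrativeQA": TRAIN_ONLY,
    "QMsum": TRAIN_ONLY, "Qasper": TRAIN_ONLY, "Quality": TRAIN_ONLY,
    "ReCoRD": ALL_THREE, "SQuAD": TRAIN_VAL, "XSum": ALL_THREE,
}


def check_split(config, split):
    """Raise ValueError if the card does not list `split` for `config`."""
    if config not in AVAILABLE_SPLITS:
        raise ValueError(f"unknown configuration: {config}")
    if split not in AVAILABLE_SPLITS[config]:
        raise ValueError(f"{config} has no {split!r} split")
    return True


def load_config(config, split="train"):
    """Load one configuration; needs the `datasets` package and network access."""
    check_split(config, split)
    from datasets import load_dataset  # imported lazily so the check works offline
    return load_dataset("MLP-SEMO/IT_datasets", config, split=split)


# Example call (not run here because it downloads data):
# boolq_validation = load_config("BoolQ", "validation")
```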

Provided by:

MLP-SEMO

Summary of original information

Dataset overview

BigPatent

Features:
- context: string
- output: string
- instruction: string
- context_sentence: sequence of strings
Splits:
- train: 50000 examples, 3056976240 bytes
Download size: 1272481947 bytes
Dataset size: 3056976240 bytes

BookSum

Features:
- output: string
- context: string
- instruction: string
- context_sentence: sequence of strings
Splits:
- train: 9600 examples, 476317871 bytes
- validation: 1484 examples, 63750022 bytes
- test: 1431 examples, 71934433 bytes
Download size: 363438025 bytes
Dataset size: 612002326 bytes
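
For configurations with several splits, the reported dataset size is the sum of the split sizes; for BookSum the three figures above add up to 612002326 bytes. A minimal sketch of that check, with the numbers copied from the listing and illustrative variable names:

```python
# Split sizes in bytes for BookSum, copied from the card.
booksum_splits = {
    "train": 476317871,
    "validation": 63750022,
    "test": 71934433,
}
reported_dataset_size = 612002326

total = sum(booksum_splits.values())
print(total == reported_dataset_size)

# Share of the data held by each split, as a percentage of the total bytes.
shares = {name: 100 * size / total for name, size in booksum_splits.items()}
for name, pct in shares.items():
    print(f"{name}: {pct:.1f}%")
```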

BoolQ

Features:
- instruction: string
- context: string
- output: string
- context_sentence: sequence of strings
Splits:
- train: 9427 examples, 12853661 bytes
- validation: 3270 examples, 4410696 bytes
- test: 3245 examples, 4386660 bytes
Download size: 12607245 bytes
Dataset size: 21651017 bytes

CNN-DM

Features:
- context: string
- output: string
- instruction: string
- context_sentence: sequence of strings
Splits:
- train: 100000 examples, 863427270 bytes
- validation: 13368 examples, 112409140 bytes
- test: 11490 examples, 97428338 bytes
Download size: 652210853 bytes
Dataset size: 1073264748 bytes

CosmosQA

Features:
- context: string
- instruction: string
- output: string
- context_sentence: sequence of strings
Splits:
- train: 25262 examples, 26971402 bytes
- validation: 2985 examples, 3421792 bytes
- test: 6963 examples, 8004035 bytes
Download size: 15773346 bytes
Dataset size: 38397229 bytes

DROP

Features:
- context: string
- instruction: string
- output: string
- context_sentence: sequence of strings
Splits:
- train: 10000 examples, 26295805 bytes
- validation: 9535 examples, 21971393 bytes
Download size: 14620500 bytes
Dataset size: 48267198 bytes

GovReport

Features:
- context: string
- output: string
- instruction: string
- context_sentence: sequence of strings
Splits:
- train: 17517 examples, 1868279205 bytes
- validation: 973 examples, 108895887 bytes
- test: 973 examples, 100349892 bytes
Download size: 1001067529 bytes
Dataset size: 2077524984 bytes

HotpotQA

Features:
- instruction: string
- output: string
- context: sequence of strings
- context_sentence: sequence of strings
Splits:
- train: 90447 examples, 1048709644 bytes
- validation: 7405 examples, 86820188 bytes
Download size: 671874576 bytes
Dataset size: 1135529832 bytes
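
HotpotQA is the only configuration whose context field is a sequence of strings rather than a single string, so code that handles several configurations may want to normalize it. A minimal sketch follows. The helper name `context_as_text`, the blank-line separator, and the sample records are all assumptions made for illustration; none of them come from the dataset.

```python
def context_as_text(example, separator="\n\n"):
    """Return `context` as one string, whether it is stored as a string or a list."""
    context = example["context"]
    if isinstance(context, str):
        return context
    return separator.join(context)


# Made-up records that mimic the two layouts; not taken from the dataset.
single_string_record = {
    "context": "A single passage.",
    "instruction": "question",
    "output": "answer",
}
hotpotqa_style_record = {
    "context": ["First passage.", "Second passage."],
    "instruction": "question",
    "output": "answer",
}

print(context_as_text(single_string_record))
print(context_as_text(hotpotqa_style_record))
```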

LongAlpaca

Features:
- context: string
- instruction: string
- output: string
- context_sentence: sequence of strings
Splits:
- train: 8937 examples, 853542732 bytes
Download size: 436405773 bytes
Dataset size: 853542732 bytes

MultiNews

Features:
- context: string
- output: string
- instruction: string
- context_sentence: sequence of strings
Splits:
- train: 44972 examples, 1063566489 bytes
Download size: 618228866 bytes
Dataset size: 1063566489 bytes

MultiRC

Features:
- context: string
- instruction: string
- output: string
- context_sentence: sequence of strings
Splits:
- train: 12025 examples, 39225163 bytes
Download size: 1607874 bytes
Dataset size: 39225163 bytes

NarrativeQA

Features:
- context: string
- instruction: string
- output: string
- context_sentence: sequence of strings
Splits:
- train: 32747 examples, 21589984139 bytes
Download size: 10012303798 bytes
Dataset size: 21589984139 bytes
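
NarrativeQA is by far the largest configuration, and raw byte counts of this magnitude are hard to read. The sketch below converts the figures above to binary units and estimates the average example size. The helper `human_size` is written for this illustration only.

```python
def human_size(num_bytes):
    """Format a byte count using binary units (1 GiB = 2**30 bytes)."""
    size = float(num_bytes)
    for unit in ("B", "KiB", "MiB", "GiB"):
        if size < 1024:
            return f"{size:.2f} {unit}"
        size /= 1024
    return f"{size:.2f} TiB"


# Figures copied from the NarrativeQA listing.
narrativeqa_bytes = 21589984139
narrativeqa_download = 10012303798
narrativeqa_examples = 32747

print("dataset size: ", human_size(narrativeqa_bytes))
print("download size:", human_size(narrativeqa_download))

mean_bytes = narrativeqa_bytes / narrativeqa_examples
print(f"{mean_bytes:,.0f} bytes per example on average")
```

With examples this large, passing `streaming=True` to `datasets.load_dataset` may be preferable to downloading the full configuration, depending on available disk space.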

QMsum

Features:
- context: string
- output: string
- instruction: string
- context_sentence: sequence of strings
Splits:
- train: 1257 examples, 130848491 bytes
Download size: 43933066 bytes
Dataset size: 130848491 bytes

Qasper

Features:
- context: string
- instruction: string
- output: string
- context_sentence: sequence of strings
Splits:
- train: 2567 examples, 132355093 bytes
Download size: 41697628 bytes
Dataset size: 132355093 bytes

Quality

Features:
- instruction: string
- output: string
- context: string
- context_sentence: sequence of strings
Splits:
- train: 2523 examples, 125840855 bytes
Download size: 18496863 bytes
Dataset size: 125840855 bytes

ReCoRD

Features:
- context: string
- instruction: string
- output: string
- context_sentence: sequence of strings
Splits:
- train: 100730 examples, 238582023 bytes
- validation: 10000 examples, 23362645 bytes
- test: 10000 examples, 23217267 bytes
Download size: 118874876 bytes
Dataset size: 285161935 bytes

SQuAD

Features:
- context: string
- instruction: string
- output: string
- context_sentence: sequence of strings
Splits:
- train: 87599 examples, 148929842 bytes
- validation: 10570 examples, 18980126 bytes
Download size: 26422546 bytes
Dataset size: 167909968 bytes

XSum

Features:
- context: string
- output: string
- instruction: string
- context_sentence: sequence of strings
Splits:
- train: 100000 examples, 467899254 bytes
- validation: 11332 examples, 52521877 bytes
- test: 11334 examples, 53464225 bytes
Download size: 358953960 bytes
Dataset size: 573885356 bytes
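
To compare the configurations at a glance, the train-split example counts listed above can be collected into one table. This is a minimal sketch with illustrative variable names; it only aggregates numbers already given in the card.

```python
# Train-split example counts, copied from the per-configuration listing.
train_examples = {
    "BigPatent": 50000, "BookSum": 9600, "BoolQ": 9427, "CNN-DM": 100000,
    "CosmosQA": 25262, "DROP": 10000, "GovReport": 17517, "HotpotQA": 90447,
    "LongAlpaca": 8937, "MultiNews": 44972, "MultiRC": 12025, "NarrativeQA": 32747,
    "QMsum": 1257, "Qasper": 2567, "Quality": 2523, "ReCoRD": 100730,
    "SQuAD": 87599, "XSum": 100000,
}

total_train = sum(train_examples.values())
largest = max(train_examples, key=train_examples.get)
smallest = min(train_examples, key=train_examples.get)

print(f"{len(train_examples)} configurations, {total_train} train examples in total")
print(f"largest train split: {largest} ({train_examples[largest]})")
print(f"smallest train split: {smallest} ({train_examples[smallest]})")
```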

