MLP-Lemma/Instruct-datasets-preprocessed-old
Hugging Face | Updated 2024-05-10 | Indexed 2024-06-12
Download link:
https://hf-mirror.com/datasets/MLP-Lemma/Instruct-datasets-preprocessed-old
Resource description:
---
dataset_info:
- config_name: BigPatent
  features:
  - name: input_ids
    sequence: int32
  - name: input_sentences_ids
    sequence:
      sequence: int64
  - name: labels
    sequence: int64
  splits:
  - name: train
    num_bytes: 2102177468
    num_examples: 41479
  download_size: 429751697
  dataset_size: 2102177468
- config_name: BookSum
  features:
  - name: input_ids
    sequence: int32
  - name: input_sentences_ids
    sequence:
      sequence: int64
  - name: labels
    sequence: int64
  splits:
  - name: train
    num_bytes: 543904716
    num_examples: 9409
  download_size: 128895816
  dataset_size: 543904716
- config_name: BoolQ
  features:
  - name: input_ids
    sequence: int32
  - name: input_sentences_ids
    sequence:
      sequence: int64
  - name: labels
    sequence: int64
  splits:
  - name: train
    num_bytes: 17514304
    num_examples: 9427
  download_size: 3477907
  dataset_size: 17514304
- config_name: CosmosQA
  features:
  - name: input_ids
    sequence: int32
  - name: input_sentences_ids
    sequence:
      sequence: int64
  - name: labels
    sequence: int64
  splits:
  - name: train
    num_bytes: 52847188
    num_examples: 25262
  download_size: 7706920
  dataset_size: 52847188
- config_name: DROP
  features:
  - name: input_ids
    sequence: int32
  - name: input_sentences_ids
    sequence:
      sequence: int64
  - name: labels
    sequence: int64
  splits:
  - name: train
    num_bytes: 245217428
    num_examples: 77204
  download_size: 14313116
  dataset_size: 245217428
- config_name: HotpotQA
  features:
  - name: input_ids
    sequence: int32
  - name: input_sentences_ids
    sequence:
      sequence: int64
  - name: labels
    sequence: int64
  splits:
  - name: train
    num_bytes: 1072803600
    num_examples: 90208
  download_size: 233619848
  dataset_size: 1072803600
- config_name: LongAlpaca
  features:
  - name: input_ids
    sequence: int32
  - name: input_sentences_ids
    sequence:
      sequence: int64
  - name: labels
    sequence: int64
  splits:
  - name: train
    num_bytes: 721290748
    num_examples: 7627
  download_size: 147562694
  dataset_size: 721290748
- config_name: MultiNews
  features:
  - name: input_ids
    sequence: int32
  - name: input_sentences_ids
    sequence:
      sequence: int64
  - name: labels
    sequence: int64
  splits:
  - name: train
    num_bytes: 1020525220
    num_examples: 44351
  download_size: 248426552
  dataset_size: 1020525220
- config_name: MultiRC
  features:
  - name: input_ids
    sequence: int32
  - name: input_sentences_ids
    sequence:
      sequence: int64
  - name: labels
    sequence: int64
  splits:
  - name: train
    num_bytes: 42141260
    num_examples: 12025
  download_size: 2282817
  dataset_size: 42141260
- config_name: NarrativeQA
  features:
  - name: input_ids
    sequence: int32
  - name: input_sentences_ids
    sequence:
      sequence: int64
  - name: labels
    sequence: int64
  splits:
  - name: train
    num_bytes: 7462691336
    num_examples: 13344
  download_size: 1598984098
  dataset_size: 7462691336
- config_name: QMsum
  features:
  - name: input_ids
    sequence: int32
  - name: input_sentences_ids
    sequence:
      sequence: int64
  - name: labels
    sequence: int64
  splits:
  - name: train
    num_bytes: 139541092
    num_examples: 1257
  download_size: 22489762
  dataset_size: 139541092
- config_name: Qasper
  features:
  - name: input_ids
    sequence: int32
  - name: input_sentences_ids
    sequence:
      sequence: int64
  - name: labels
    sequence: int64
  splits:
  - name: train
    num_bytes: 116752512
    num_examples: 2545
  download_size: 23681139
  dataset_size: 116752512
- config_name: Quality
  features:
  - name: input_ids
    sequence: int32
  - name: input_sentences_ids
    sequence:
      sequence: int64
  - name: labels
    sequence: int64
  splits:
  - name: train
    num_bytes: 134614736
    num_examples: 2523
  download_size: 24647486
  dataset_size: 134614736
- config_name: ReCoRD
  features:
  - name: input_ids
    sequence: int32
  - name: input_sentences_ids
    sequence:
      sequence: int64
  - name: labels
    sequence: int64
  splits:
  - name: train
    num_bytes: 289867824
    num_examples: 100684
  download_size: 59223539
  dataset_size: 289867824
- config_name: SQuAD
  features:
  - name: input_ids
    sequence: int32
  - name: input_sentences_ids
    sequence:
      sequence: int64
  - name: labels
    sequence: int64
  splits:
  - name: train
    num_bytes: 185012360
    num_examples: 87576
  download_size: 31396840
  dataset_size: 185012360
- config_name: TriviaQA
  features:
  - name: input_ids
    sequence: int32
  - name: input_sentences_ids
    sequence:
      sequence: int64
  - name: labels
    sequence: int64
  splits:
  - name: train
    num_bytes: 4902401628
    num_examples: 53503
  download_size: 1041259522
  dataset_size: 4902401628
configs:
- config_name: BigPatent
  data_files:
  - split: train
    path: BigPatent/train-*
- config_name: BookSum
  data_files:
  - split: train
    path: BookSum/train-*
- config_name: BoolQ
  data_files:
  - split: train
    path: BoolQ/train-*
- config_name: CosmosQA
  data_files:
  - split: train
    path: CosmosQA/train-*
- config_name: DROP
  data_files:
  - split: train
    path: DROP/train-*
- config_name: HotpotQA
  data_files:
  - split: train
    path: HotpotQA/train-*
- config_name: LongAlpaca
  data_files:
  - split: train
    path: LongAlpaca/train-*
- config_name: MultiNews
  data_files:
  - split: train
    path: MultiNews/train-*
- config_name: MultiRC
  data_files:
  - split: train
    path: MultiRC/train-*
- config_name: NarrativeQA
  data_files:
  - split: train
    path: NarrativeQA/train-*
- config_name: QMsum
  data_files:
  - split: train
    path: QMsum/train-*
- config_name: Qasper
  data_files:
  - split: train
    path: Qasper/train-*
- config_name: Quality
  data_files:
  - split: train
    path: Quality/train-*
- config_name: ReCoRD
  data_files:
  - split: train
    path: ReCoRD/train-*
- config_name: SQuAD
  data_files:
  - split: train
    path: SQuAD/train-*
- config_name: TriviaQA
  data_files:
  - split: train
    path: TriviaQA/train-*
---
The dataset consists of 16 sub-datasets, each exposed as a named configuration with a single train split. Every configuration has the same three features, which are used for training models: input_ids (a sequence of int32), input_sentences_ids (a sequence of int64 sequences), and labels (a sequence of int64). For each configuration, the metadata records the dataset size in bytes, the number of examples, and the download size.
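A minimal sketch of loading one configuration with the Hugging Face `datasets` library, assuming the repository is still reachable under the ID in the download link above. `load_config`, `CONFIG_NAMES`, and `REPO_ID` are illustrative names introduced here, not part of the dataset card. Streaming is used by default so that large configurations such as NarrativeQA do not have to be downloaded in full first.

```python
# Config names as listed in the card's metadata.
CONFIG_NAMES = [
    "BigPatent", "BookSum", "BoolQ", "CosmosQA", "DROP", "HotpotQA",
    "LongAlpaca", "MultiNews", "MultiRC", "NarrativeQA", "QMsum",
    "Qasper", "Quality", "ReCoRD", "SQuAD", "TriviaQA",
]

# Repository ID taken from the download link (assumed to still be valid).
REPO_ID = "MLP-Lemma/Instruct-datasets-preprocessed-old"


def load_config(config_name, streaming=True):
    """Load the train split of one config (hypothetical helper)."""
    if config_name not in CONFIG_NAMES:
        raise ValueError(f"unknown config: {config_name!r}")
    # Imported lazily so this module can be inspected without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, config_name, split="train", streaming=streaming)


# Example usage (requires network access and `pip install datasets`):
#   ds = load_config("BoolQ")
#   example = next(iter(ds))
#   print(len(example["input_ids"]), len(example["labels"]))
```

Each example should be a dict with the three keys named in the metadata: `input_ids`, `input_sentences_ids`, and `labels`.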
Provider:
MLP-Lemma
Summary of original information
Dataset overview
BigPatent
- Features:
  - input_ids: sequence of int32
  - input_sentences_ids: sequence of int64 sequences
  - labels: sequence of int64
- Train split:
  - Data size: 2102177468 bytes
  - Number of examples: 41479
  - Download size: 429751697 bytes
BookSum
- Features:
  - input_ids: sequence of int32
  - input_sentences_ids: sequence of int64 sequences
  - labels: sequence of int64
- Train split:
  - Data size: 543904716 bytes
  - Number of examples: 9409
  - Download size: 128895816 bytes
BoolQ
- Features:
  - input_ids: sequence of int32
  - input_sentences_ids: sequence of int64 sequences
  - labels: sequence of int64
- Train split:
  - Data size: 17514304 bytes
  - Number of examples: 9427
  - Download size: 3477907 bytes
CosmosQA
- Features:
  - input_ids: sequence of int32
  - input_sentences_ids: sequence of int64 sequences
  - labels: sequence of int64
- Train split:
  - Data size: 52847188 bytes
  - Number of examples: 25262
  - Download size: 7706920 bytes
DROP
- Features:
  - input_ids: sequence of int32
  - input_sentences_ids: sequence of int64 sequences
  - labels: sequence of int64
- Train split:
  - Data size: 245217428 bytes
  - Number of examples: 77204
  - Download size: 14313116 bytes
HotpotQA
- Features:
  - input_ids: sequence of int32
  - input_sentences_ids: sequence of int64 sequences
  - labels: sequence of int64
- Train split:
  - Data size: 1072803600 bytes
  - Number of examples: 90208
  - Download size: 233619848 bytes
LongAlpaca
- Features:
  - input_ids: sequence of int32
  - input_sentences_ids: sequence of int64 sequences
  - labels: sequence of int64
- Train split:
  - Data size: 721290748 bytes
  - Number of examples: 7627
  - Download size: 147562694 bytes
MultiNews
- Features:
  - input_ids: sequence of int32
  - input_sentences_ids: sequence of int64 sequences
  - labels: sequence of int64
- Train split:
  - Data size: 1020525220 bytes
  - Number of examples: 44351
  - Download size: 248426552 bytes
MultiRC
- Features:
  - input_ids: sequence of int32
  - input_sentences_ids: sequence of int64 sequences
  - labels: sequence of int64
- Train split:
  - Data size: 42141260 bytes
  - Number of examples: 12025
  - Download size: 2282817 bytes
NarrativeQA
- Features:
  - input_ids: sequence of int32
  - input_sentences_ids: sequence of int64 sequences
  - labels: sequence of int64
- Train split:
  - Data size: 7462691336 bytes
  - Number of examples: 13344
  - Download size: 1598984098 bytes
QMsum
- Features:
  - input_ids: sequence of int32
  - input_sentences_ids: sequence of int64 sequences
  - labels: sequence of int64
- Train split:
  - Data size: 139541092 bytes
  - Number of examples: 1257
  - Download size: 22489762 bytes
Qasper
- Features:
  - input_ids: sequence of int32
  - input_sentences_ids: sequence of int64 sequences
  - labels: sequence of int64
- Train split:
  - Data size: 116752512 bytes
  - Number of examples: 2545
  - Download size: 23681139 bytes
Quality
- Features:
  - input_ids: sequence of int32
  - input_sentences_ids: sequence of int64 sequences
  - labels: sequence of int64
- Train split:
  - Data size: 134614736 bytes
  - Number of examples: 2523
  - Download size: 24647486 bytes
ReCoRD
- Features:
  - input_ids: sequence of int32
  - input_sentences_ids: sequence of int64 sequences
  - labels: sequence of int64
- Train split:
  - Data size: 289867824 bytes
  - Number of examples: 100684
  - Download size: 59223539 bytes
SQuAD
- Features:
  - input_ids: sequence of int32
  - input_sentences_ids: sequence of int64 sequences
  - labels: sequence of int64
- Train split:
  - Data size: 185012360 bytes
  - Number of examples: 87576
  - Download size: 31396840 bytes
TriviaQA
- Features:
  - input_ids: sequence of int32
  - input_sentences_ids: sequence of int64 sequences
  - labels: sequence of int64
- Train split:
  - Data size: 4902401628 bytes
  - Number of examples: 53503
  - Download size: 1041259522 bytes
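The per-configuration figures above can be compared more easily after some simple arithmetic. The sketch below copies the train-split numbers from the metadata and derives totals, average bytes per example, and the ratio of download size to dataset size. `STATS`, `totals`, `bytes_per_example`, and `download_ratio` are illustrative names, not part of the dataset card.

```python
# (num_bytes, num_examples, download_size) for each train split,
# copied from the card's metadata.
STATS = {
    "BigPatent": (2102177468, 41479, 429751697),
    "BookSum": (543904716, 9409, 128895816),
    "BoolQ": (17514304, 9427, 3477907),
    "CosmosQA": (52847188, 25262, 7706920),
    "DROP": (245217428, 77204, 14313116),
    "HotpotQA": (1072803600, 90208, 233619848),
    "LongAlpaca": (721290748, 7627, 147562694),
    "MultiNews": (1020525220, 44351, 248426552),
    "MultiRC": (42141260, 12025, 2282817),
    "NarrativeQA": (7462691336, 13344, 1598984098),
    "QMsum": (139541092, 1257, 22489762),
    "Qasper": (116752512, 2545, 23681139),
    "Quality": (134614736, 2523, 24647486),
    "ReCoRD": (289867824, 100684, 59223539),
    "SQuAD": (185012360, 87576, 31396840),
    "TriviaQA": (4902401628, 53503, 1041259522),
}


def totals(stats):
    """Return (total_bytes, total_examples, total_download) over all configs."""
    total_bytes = sum(b for b, _, _ in stats.values())
    total_examples = sum(n for _, n, _ in stats.values())
    total_download = sum(d for _, _, d in stats.values())
    return total_bytes, total_examples, total_download


def bytes_per_example(stats):
    """Average stored bytes per example, a rough proxy for sequence length."""
    return {name: b / n for name, (b, n, _) in stats.items()}


def download_ratio(stats):
    """Download size as a fraction of the dataset size."""
    return {name: d / b for name, (b, _, d) in stats.items()}


if __name__ == "__main__":
    total_bytes, total_examples, total_download = totals(STATS)
    print(f"configs: {len(STATS)}, examples: {total_examples}")
    print(f"dataset size: {total_bytes / 1e9:.2f} GB, "
          f"download: {total_download / 1e9:.2f} GB")
    # List configs from longest to shortest average example.
    per_example = bytes_per_example(STATS)
    for name in sorted(per_example, key=per_example.get, reverse=True):
        print(f"{name:12s} {per_example[name] / 1024:10.1f} KiB/example")
```

Bytes per example is only a rough indicator of sequence length, since it mixes the int32 `input_ids` column with the two int64 columns.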



