MLP-Lemma/Instruct-datasets-preprocessed_
Source: Hugging Face. Updated: 2024-05-13. Indexed: 2024-06-12.
Download link:
https://hf-mirror.com/datasets/MLP-Lemma/Instruct-datasets-preprocessed_
Resource description:
---
dataset_info:
- config_name: BigPatent
  features:
  - name: input_ids
    sequence: int32
  - name: input_sentences_ids
    sequence:
      sequence: int64
  - name: labels
    sequence: int64
  splits:
  - name: train
    num_bytes: 2181931364
    num_examples: 41383
  download_size: 429873908
  dataset_size: 2181931364
- config_name: BookSum
  features:
  - name: input_ids
    sequence: int32
  - name: input_sentences_ids
    sequence:
      sequence: int64
  - name: labels
    sequence: int64
  splits:
  - name: train
    num_bytes: 559326356
    num_examples: 9371
  download_size: 131337226
  dataset_size: 559326356
- config_name: BoolQ
  features:
  - name: input_ids
    sequence: int32
  - name: input_sentences_ids
    sequence:
      sequence: int64
  - name: labels
    sequence: int64
  splits:
  - name: train
    num_bytes: 23606064
    num_examples: 9426
  download_size: 3602973
  dataset_size: 23606064
- config_name: CNN-DM
  features:
  - name: input_ids
    sequence: int32
  - name: input_sentences_ids
    sequence:
      sequence: int64
  - name: labels
    sequence: int64
  splits:
  - name: train
    num_bytes: 986484148
    num_examples: 99951
  download_size: 223079731
  dataset_size: 986484148
- config_name: CosmosQA
  features:
  - name: input_ids
    sequence: int32
  - name: input_sentences_ids
    sequence:
      sequence: int64
  - name: labels
    sequence: int64
  splits:
  - name: train
    num_bytes: 68985732
    num_examples: 25262
  download_size: 8201506
  dataset_size: 68985732
- config_name: DROP
  features:
  - name: input_ids
    sequence: int32
  - name: input_sentences_ids
    sequence:
      sequence: int64
  - name: labels
    sequence: int64
  splits:
  - name: train
    num_bytes: 38608468
    num_examples: 9977
  download_size: 6309064
  dataset_size: 38608468
- config_name: GovReport
  features:
  - name: input_ids
    sequence: int32
  - name: input_sentences_ids
    sequence:
      sequence: int64
  - name: labels
    sequence: int64
  splits:
  - name: train
    num_bytes: 1482525108
    num_examples: 16590
  download_size: 314318065
  dataset_size: 1482525108
- config_name: HotpotQA
  features:
  - name: input_ids
    sequence: int32
  - name: input_sentences_ids
    sequence:
      sequence: int64
  - name: labels
    sequence: int64
  splits:
  - name: train
    num_bytes: 1153215480
    num_examples: 90208
  download_size: 237017294
  dataset_size: 1153215480
- config_name: LongAlpaca
  features:
  - name: input_ids
    sequence: int32
  - name: input_sentences_ids
    sequence:
      sequence: int64
  - name: labels
    sequence: int64
  splits:
  - name: train
    num_bytes: 740226008
    num_examples: 7573
  download_size: 147558794
  dataset_size: 740226008
- config_name: MultiNews
  features:
  - name: input_ids
    sequence: int32
  - name: input_sentences_ids
    sequence:
      sequence: int64
  - name: labels
    sequence: int64
  splits:
  - name: train
    num_bytes: 1079685084
    num_examples: 44398
  download_size: 251453029
  dataset_size: 1079685084
- config_name: MultiRC
  features:
  - name: input_ids
    sequence: int32
  - name: input_sentences_ids
    sequence:
      sequence: int64
  - name: labels
    sequence: int64
  splits:
  - name: train
    num_bytes: 50935944
    num_examples: 12025
  download_size: 2318491
  dataset_size: 50935944
- config_name: NarrativeQA
  features:
  - name: input_ids
    sequence: int32
  - name: input_sentences_ids
    sequence:
      sequence: int64
  - name: labels
    sequence: int64
  splits:
  - name: train
    num_bytes: 8619352780
    num_examples: 12857
  download_size: 1844752860
  dataset_size: 8619352780
- config_name: QMsum
  features:
  - name: input_ids
    sequence: int32
  - name: input_sentences_ids
    sequence:
      sequence: int64
  - name: labels
    sequence: int64
  splits:
  - name: train
    num_bytes: 148773208
    num_examples: 1257
  download_size: 22545022
  dataset_size: 148773208
- config_name: Qasper
  features:
  - name: input_ids
    sequence: int32
  - name: input_sentences_ids
    sequence:
      sequence: int64
  - name: labels
    sequence: int64
  splits:
  - name: train
    num_bytes: 117966388
    num_examples: 2461
  download_size: 23118038
  dataset_size: 117966388
- config_name: Quality
  features:
  - name: input_ids
    sequence: int32
  - name: input_sentences_ids
    sequence:
      sequence: int64
  - name: labels
    sequence: int64
  splits:
  - name: train
    num_bytes: 142043368
    num_examples: 2523
  download_size: 23727320
  dataset_size: 142043368
- config_name: ReCoRD
  features:
  - name: input_ids
    sequence: int32
  - name: input_sentences_ids
    sequence:
      sequence: int64
  - name: labels
    sequence: int64
  splits:
  - name: train
    num_bytes: 356353852
    num_examples: 100682
  download_size: 60755952
  dataset_size: 356353852
- config_name: SQuAD
  features:
  - name: input_ids
    sequence: int32
  - name: input_sentences_ids
    sequence:
      sequence: int64
  - name: labels
    sequence: int64
  splits:
  - name: train
    num_bytes: 242141716
    num_examples: 87580
  download_size: 32077152
  dataset_size: 242141716
- config_name: TriviaQA
  features:
  - name: input_ids
    sequence: int32
  - name: input_sentences_ids
    sequence:
      sequence: int64
  - name: labels
    sequence: int64
  splits:
  - name: train
    num_bytes: 4921073600
    num_examples: 52359
  download_size: 1017926047
  dataset_size: 4921073600
- config_name: XSum
  features:
  - name: input_ids
    sequence: int32
  - name: input_sentences_ids
    sequence:
      sequence: int64
  - name: labels
    sequence: int64
  splits:
  - name: train
    num_bytes: 571447332
    num_examples: 99827
  download_size: 124725113
  dataset_size: 571447332
configs:
- config_name: BigPatent
  data_files:
  - split: train
    path: BigPatent/train-*
- config_name: BookSum
  data_files:
  - split: train
    path: BookSum/train-*
- config_name: BoolQ
  data_files:
  - split: train
    path: BoolQ/train-*
- config_name: CNN-DM
  data_files:
  - split: train
    path: CNN-DM/train-*
- config_name: CosmosQA
  data_files:
  - split: train
    path: CosmosQA/train-*
- config_name: DROP
  data_files:
  - split: train
    path: DROP/train-*
- config_name: GovReport
  data_files:
  - split: train
    path: GovReport/train-*
- config_name: HotpotQA
  data_files:
  - split: train
    path: HotpotQA/train-*
- config_name: LongAlpaca
  data_files:
  - split: train
    path: LongAlpaca/train-*
- config_name: MultiNews
  data_files:
  - split: train
    path: MultiNews/train-*
- config_name: MultiRC
  data_files:
  - split: train
    path: MultiRC/train-*
- config_name: NarrativeQA
  data_files:
  - split: train
    path: NarrativeQA/train-*
- config_name: QMsum
  data_files:
  - split: train
    path: QMsum/train-*
- config_name: Qasper
  data_files:
  - split: train
    path: Qasper/train-*
- config_name: Quality
  data_files:
  - split: train
    path: Quality/train-*
- config_name: ReCoRD
  data_files:
  - split: train
    path: ReCoRD/train-*
- config_name: SQuAD
  data_files:
  - split: train
    path: SQuAD/train-*
- config_name: TriviaQA
  data_files:
  - split: train
    path: TriviaQA/train-*
- config_name: XSum
  data_files:
  - split: train
    path: XSum/train-*
---
This dataset contains 19 sub-datasets, each exposed as a separate configuration with a single train split. Every configuration declares the same three features: input_ids (a sequence of int32), input_sentences_ids (a sequence of int64 sequences), and labels (a sequence of int64). For each configuration the metadata lists the number of training examples, the split size in bytes, the download size, and the total dataset size.
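To make the declared schema concrete, here is a minimal Python sketch, not the maintainers' own code. It assumes the repository id matches the page title and that the Hugging Face `datasets` library is installed when loading is attempted. `load_config` and `matches_schema` are hypothetical helper names introduced only for illustration, and whether the repository is publicly reachable is not confirmed by this page.

```python
from typing import Any, Mapping

# Repository id taken from the page title; config names copied from the metadata.
REPO_ID = "MLP-Lemma/Instruct-datasets-preprocessed_"
CONFIG_NAMES = (
    "BigPatent", "BookSum", "BoolQ", "CNN-DM", "CosmosQA", "DROP", "GovReport",
    "HotpotQA", "LongAlpaca", "MultiNews", "MultiRC", "NarrativeQA", "QMsum",
    "Qasper", "Quality", "ReCoRD", "SQuAD", "TriviaQA", "XSum",
)


def load_config(name: str, streaming: bool = True):
    """Load the train split of one configuration (hypothetical helper)."""
    if name not in CONFIG_NAMES:
        raise ValueError(f"unknown config {name!r}")
    # Imported lazily so the schema check below works without `datasets` installed.
    from datasets import load_dataset
    return load_dataset(REPO_ID, name, split="train", streaming=streaming)


def _is_int_list(value: Any) -> bool:
    """True if value is a list whose items are all ints (bools excluded)."""
    return isinstance(value, list) and all(
        isinstance(v, int) and not isinstance(v, bool) for v in value
    )


def matches_schema(record: Mapping[str, Any]) -> bool:
    """Check a record against the three features declared in the metadata."""
    sentences = record.get("input_sentences_ids")
    return (
        _is_int_list(record.get("input_ids"))
        and isinstance(sentences, list)
        and all(_is_int_list(s) for s in sentences)  # nested: one list per entry
        and _is_int_list(record.get("labels"))
    )


# A hand-written record with made-up token ids, only to show the expected shape.
example = {"input_ids": [1, 2, 3], "input_sentences_ids": [[1, 2], [3]], "labels": [1, 2, 3]}
print(matches_schema(example))
```

Streaming is used as the default in this sketch because several configurations are multiple gigabytes in size; passing `streaming=False` would download the full split instead.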
Provider:
MLP-Lemma
Summary of original information
Dataset overview
BigPatent
- Features:
  - input_ids: sequence of int32
  - input_sentences_ids: sequence of int64 sequences
  - labels: sequence of int64
- Splits:
  - train: 2181931364 bytes, 41383 examples
- Download size: 429873908 bytes
- Dataset size: 2181931364 bytes
BookSum
- Features:
  - input_ids: sequence of int32
  - input_sentences_ids: sequence of int64 sequences
  - labels: sequence of int64
- Splits:
  - train: 559326356 bytes, 9371 examples
- Download size: 131337226 bytes
- Dataset size: 559326356 bytes
BoolQ
- Features:
  - input_ids: sequence of int32
  - input_sentences_ids: sequence of int64 sequences
  - labels: sequence of int64
- Splits:
  - train: 23606064 bytes, 9426 examples
- Download size: 3602973 bytes
- Dataset size: 23606064 bytes
CNN-DM
- Features:
  - input_ids: sequence of int32
  - input_sentences_ids: sequence of int64 sequences
  - labels: sequence of int64
- Splits:
  - train: 986484148 bytes, 99951 examples
- Download size: 223079731 bytes
- Dataset size: 986484148 bytes
CosmosQA
- Features:
  - input_ids: sequence of int32
  - input_sentences_ids: sequence of int64 sequences
  - labels: sequence of int64
- Splits:
  - train: 68985732 bytes, 25262 examples
- Download size: 8201506 bytes
- Dataset size: 68985732 bytes
DROP
- Features:
  - input_ids: sequence of int32
  - input_sentences_ids: sequence of int64 sequences
  - labels: sequence of int64
- Splits:
  - train: 38608468 bytes, 9977 examples
- Download size: 6309064 bytes
- Dataset size: 38608468 bytes
GovReport
- Features:
  - input_ids: sequence of int32
  - input_sentences_ids: sequence of int64 sequences
  - labels: sequence of int64
- Splits:
  - train: 1482525108 bytes, 16590 examples
- Download size: 314318065 bytes
- Dataset size: 1482525108 bytes
HotpotQA
- Features:
  - input_ids: sequence of int32
  - input_sentences_ids: sequence of int64 sequences
  - labels: sequence of int64
- Splits:
  - train: 1153215480 bytes, 90208 examples
- Download size: 237017294 bytes
- Dataset size: 1153215480 bytes
LongAlpaca
- Features:
  - input_ids: sequence of int32
  - input_sentences_ids: sequence of int64 sequences
  - labels: sequence of int64
- Splits:
  - train: 740226008 bytes, 7573 examples
- Download size: 147558794 bytes
- Dataset size: 740226008 bytes
MultiNews
- Features:
  - input_ids: sequence of int32
  - input_sentences_ids: sequence of int64 sequences
  - labels: sequence of int64
- Splits:
  - train: 1079685084 bytes, 44398 examples
- Download size: 251453029 bytes
- Dataset size: 1079685084 bytes
MultiRC
- Features:
  - input_ids: sequence of int32
  - input_sentences_ids: sequence of int64 sequences
  - labels: sequence of int64
- Splits:
  - train: 50935944 bytes, 12025 examples
- Download size: 2318491 bytes
- Dataset size: 50935944 bytes
NarrativeQA
- Features:
  - input_ids: sequence of int32
  - input_sentences_ids: sequence of int64 sequences
  - labels: sequence of int64
- Splits:
  - train: 8619352780 bytes, 12857 examples
- Download size: 1844752860 bytes
- Dataset size: 8619352780 bytes
QMsum
- Features:
  - input_ids: sequence of int32
  - input_sentences_ids: sequence of int64 sequences
  - labels: sequence of int64
- Splits:
  - train: 148773208 bytes, 1257 examples
- Download size: 22545022 bytes
- Dataset size: 148773208 bytes
Qasper
- Features:
  - input_ids: sequence of int32
  - input_sentences_ids: sequence of int64 sequences
  - labels: sequence of int64
- Splits:
  - train: 117966388 bytes, 2461 examples
- Download size: 23118038 bytes
- Dataset size: 117966388 bytes
Quality
- Features:
  - input_ids: sequence of int32
  - input_sentences_ids: sequence of int64 sequences
  - labels: sequence of int64
- Splits:
  - train: 142043368 bytes, 2523 examples
- Download size: 23727320 bytes
- Dataset size: 142043368 bytes
ReCoRD
- Features:
  - input_ids: sequence of int32
  - input_sentences_ids: sequence of int64 sequences
  - labels: sequence of int64
- Splits:
  - train: 356353852 bytes, 100682 examples
- Download size: 60755952 bytes
- Dataset size: 356353852 bytes
SQuAD
- Features:
  - input_ids: sequence of int32
  - input_sentences_ids: sequence of int64 sequences
  - labels: sequence of int64
- Splits:
  - train: 242141716 bytes, 87580 examples
- Download size: 32077152 bytes
- Dataset size: 242141716 bytes
TriviaQA
- Features:
  - input_ids: sequence of int32
  - input_sentences_ids: sequence of int64 sequences
  - labels: sequence of int64
- Splits:
  - train: 4921073600 bytes, 52359 examples
- Download size: 1017926047 bytes
- Dataset size: 4921073600 bytes
XSum
- Features:
  - input_ids: sequence of int32
  - input_sentences_ids: sequence of int64 sequences
  - labels: sequence of int64
- Splits:
  - train: 571447332 bytes, 99827 examples
- Download size: 124725113 bytes
- Dataset size: 571447332 bytes
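The per-configuration sizes above can be compared more easily as derived figures. The sketch below is illustrative only: it copies three configurations from the listing (the others follow the same pattern) and computes the average bytes per example and the ratio of dataset size to download size. `SplitInfo`, `bytes_per_example`, and `size_ratio` are hypothetical names. Reading the ratio as a compression factor assumes that the download size refers to compressed files on disk, which this page does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SplitInfo:
    num_bytes: int      # train split size, equal to the dataset size in the listing
    num_examples: int   # number of training examples
    download_size: int  # download size in bytes


# Values copied from the listing above for three of the 19 configurations.
TRAIN_INFO = {
    "BoolQ": SplitInfo(23606064, 9426, 3602973),
    "QMsum": SplitInfo(148773208, 1257, 22545022),
    "NarrativeQA": SplitInfo(8619352780, 12857, 1844752860),
}


def bytes_per_example(info: SplitInfo) -> float:
    """Average size of one training example in bytes."""
    return info.num_bytes / info.num_examples


def size_ratio(info: SplitInfo) -> float:
    """Dataset size divided by download size."""
    return info.num_bytes / info.download_size


for name, info in TRAIN_INFO.items():
    print(
        f"{name}: {bytes_per_example(info):,.0f} bytes/example, "
        f"dataset is {size_ratio(info):.1f}x the download size"
    )
```

Figures like these give a rough sense of which configurations hold long documents (NarrativeQA averages far more bytes per example than BoolQ) before committing to a full download.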



