
MLP-Lemma/Instruct-datasets-preprocessed_

Hugging Face | Updated 2024-05-13 | Indexed 2024-06-12
Download link:
https://hf-mirror.com/datasets/MLP-Lemma/Instruct-datasets-preprocessed_
Resource description:
---
dataset_info:
- config_name: BigPatent
  features:
  - name: input_ids
    sequence: int32
  - name: input_sentences_ids
    sequence:
      sequence: int64
  - name: labels
    sequence: int64
  splits:
  - name: train
    num_bytes: 2181931364
    num_examples: 41383
  download_size: 429873908
  dataset_size: 2181931364
- config_name: BookSum
  features:
  - name: input_ids
    sequence: int32
  - name: input_sentences_ids
    sequence:
      sequence: int64
  - name: labels
    sequence: int64
  splits:
  - name: train
    num_bytes: 559326356
    num_examples: 9371
  download_size: 131337226
  dataset_size: 559326356
- config_name: BoolQ
  features:
  - name: input_ids
    sequence: int32
  - name: input_sentences_ids
    sequence:
      sequence: int64
  - name: labels
    sequence: int64
  splits:
  - name: train
    num_bytes: 23606064
    num_examples: 9426
  download_size: 3602973
  dataset_size: 23606064
- config_name: CNN-DM
  features:
  - name: input_ids
    sequence: int32
  - name: input_sentences_ids
    sequence:
      sequence: int64
  - name: labels
    sequence: int64
  splits:
  - name: train
    num_bytes: 986484148
    num_examples: 99951
  download_size: 223079731
  dataset_size: 986484148
- config_name: CosmosQA
  features:
  - name: input_ids
    sequence: int32
  - name: input_sentences_ids
    sequence:
      sequence: int64
  - name: labels
    sequence: int64
  splits:
  - name: train
    num_bytes: 68985732
    num_examples: 25262
  download_size: 8201506
  dataset_size: 68985732
- config_name: DROP
  features:
  - name: input_ids
    sequence: int32
  - name: input_sentences_ids
    sequence:
      sequence: int64
  - name: labels
    sequence: int64
  splits:
  - name: train
    num_bytes: 38608468
    num_examples: 9977
  download_size: 6309064
  dataset_size: 38608468
- config_name: GovReport
  features:
  - name: input_ids
    sequence: int32
  - name: input_sentences_ids
    sequence:
      sequence: int64
  - name: labels
    sequence: int64
  splits:
  - name: train
    num_bytes: 1482525108
    num_examples: 16590
  download_size: 314318065
  dataset_size: 1482525108
- config_name: HotpotQA
  features:
  - name: input_ids
    sequence: int32
  - name: input_sentences_ids
    sequence:
      sequence: int64
  - name: labels
    sequence: int64
  splits:
  - name: train
    num_bytes: 1153215480
    num_examples: 90208
  download_size: 237017294
  dataset_size: 1153215480
- config_name: LongAlpaca
  features:
  - name: input_ids
    sequence: int32
  - name: input_sentences_ids
    sequence:
      sequence: int64
  - name: labels
    sequence: int64
  splits:
  - name: train
    num_bytes: 740226008
    num_examples: 7573
  download_size: 147558794
  dataset_size: 740226008
- config_name: MultiNews
  features:
  - name: input_ids
    sequence: int32
  - name: input_sentences_ids
    sequence:
      sequence: int64
  - name: labels
    sequence: int64
  splits:
  - name: train
    num_bytes: 1079685084
    num_examples: 44398
  download_size: 251453029
  dataset_size: 1079685084
- config_name: MultiRC
  features:
  - name: input_ids
    sequence: int32
  - name: input_sentences_ids
    sequence:
      sequence: int64
  - name: labels
    sequence: int64
  splits:
  - name: train
    num_bytes: 50935944
    num_examples: 12025
  download_size: 2318491
  dataset_size: 50935944
- config_name: NarrativeQA
  features:
  - name: input_ids
    sequence: int32
  - name: input_sentences_ids
    sequence:
      sequence: int64
  - name: labels
    sequence: int64
  splits:
  - name: train
    num_bytes: 8619352780
    num_examples: 12857
  download_size: 1844752860
  dataset_size: 8619352780
- config_name: QMsum
  features:
  - name: input_ids
    sequence: int32
  - name: input_sentences_ids
    sequence:
      sequence: int64
  - name: labels
    sequence: int64
  splits:
  - name: train
    num_bytes: 148773208
    num_examples: 1257
  download_size: 22545022
  dataset_size: 148773208
- config_name: Qasper
  features:
  - name: input_ids
    sequence: int32
  - name: input_sentences_ids
    sequence:
      sequence: int64
  - name: labels
    sequence: int64
  splits:
  - name: train
    num_bytes: 117966388
    num_examples: 2461
  download_size: 23118038
  dataset_size: 117966388
- config_name: Quality
  features:
  - name: input_ids
    sequence: int32
  - name: input_sentences_ids
    sequence:
      sequence: int64
  - name: labels
    sequence: int64
  splits:
  - name: train
    num_bytes: 142043368
    num_examples: 2523
  download_size: 23727320
  dataset_size: 142043368
- config_name: ReCoRD
  features:
  - name: input_ids
    sequence: int32
  - name: input_sentences_ids
    sequence:
      sequence: int64
  - name: labels
    sequence: int64
  splits:
  - name: train
    num_bytes: 356353852
    num_examples: 100682
  download_size: 60755952
  dataset_size: 356353852
- config_name: SQuAD
  features:
  - name: input_ids
    sequence: int32
  - name: input_sentences_ids
    sequence:
      sequence: int64
  - name: labels
    sequence: int64
  splits:
  - name: train
    num_bytes: 242141716
    num_examples: 87580
  download_size: 32077152
  dataset_size: 242141716
- config_name: TriviaQA
  features:
  - name: input_ids
    sequence: int32
  - name: input_sentences_ids
    sequence:
      sequence: int64
  - name: labels
    sequence: int64
  splits:
  - name: train
    num_bytes: 4921073600
    num_examples: 52359
  download_size: 1017926047
  dataset_size: 4921073600
- config_name: XSum
  features:
  - name: input_ids
    sequence: int32
  - name: input_sentences_ids
    sequence:
      sequence: int64
  - name: labels
    sequence: int64
  splits:
  - name: train
    num_bytes: 571447332
    num_examples: 99827
  download_size: 124725113
  dataset_size: 571447332
configs:
- config_name: BigPatent
  data_files:
  - split: train
    path: BigPatent/train-*
- config_name: BookSum
  data_files:
  - split: train
    path: BookSum/train-*
- config_name: BoolQ
  data_files:
  - split: train
    path: BoolQ/train-*
- config_name: CNN-DM
  data_files:
  - split: train
    path: CNN-DM/train-*
- config_name: CosmosQA
  data_files:
  - split: train
    path: CosmosQA/train-*
- config_name: DROP
  data_files:
  - split: train
    path: DROP/train-*
- config_name: GovReport
  data_files:
  - split: train
    path: GovReport/train-*
- config_name: HotpotQA
  data_files:
  - split: train
    path: HotpotQA/train-*
- config_name: LongAlpaca
  data_files:
  - split: train
    path: LongAlpaca/train-*
- config_name: MultiNews
  data_files:
  - split: train
    path: MultiNews/train-*
- config_name: MultiRC
  data_files:
  - split: train
    path: MultiRC/train-*
- config_name: NarrativeQA
  data_files:
  - split: train
    path: NarrativeQA/train-*
- config_name: QMsum
  data_files:
  - split: train
    path: QMsum/train-*
- config_name: Qasper
  data_files:
  - split: train
    path: Qasper/train-*
- config_name: Quality
  data_files:
  - split: train
    path: Quality/train-*
- config_name: ReCoRD
  data_files:
  - split: train
    path: ReCoRD/train-*
- config_name: SQuAD
  data_files:
  - split: train
    path: SQuAD/train-*
- config_name: TriviaQA
  data_files:
  - split: train
    path: TriviaQA/train-*
- config_name: XSum
  data_files:
  - split: train
    path: XSum/train-*
---

This dataset comprises 19 sub-datasets, each exposed as a named configuration. Every configuration has the same three features (input_ids, input_sentences_ids, and labels) and a single train split. For each configuration, the metadata lists the number of examples, the split size in bytes, the download size, and the total dataset size.
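As a rough illustration of how the configurations described above might be used, the sketch below loads one configuration and checks that a record matches the declared feature types. It assumes the Hugging Face `datasets` library is installed and that the repository is reachable under the ID shown in the download link. The helper names `load_config` and `check_record` are hypothetical and not part of the dataset.

```python
from typing import Any, Mapping

# Repository ID taken from the download link; configuration names from the metadata.
REPO_ID = "MLP-Lemma/Instruct-datasets-preprocessed_"
CONFIG_NAMES = [
    "BigPatent", "BookSum", "BoolQ", "CNN-DM", "CosmosQA", "DROP", "GovReport",
    "HotpotQA", "LongAlpaca", "MultiNews", "MultiRC", "NarrativeQA", "QMsum",
    "Qasper", "Quality", "ReCoRD", "SQuAD", "TriviaQA", "XSum",
]


def load_config(name: str):
    """Load the train split of one configuration (hypothetical helper; needs network)."""
    if name not in CONFIG_NAMES:
        raise ValueError(f"unknown configuration: {name!r}")
    # Imported lazily so the rest of this sketch works without the library.
    from datasets import load_dataset
    return load_dataset(REPO_ID, name, split="train")


def _is_int_list(value: Any) -> bool:
    """True if value is a list whose items are all integers."""
    return isinstance(value, list) and all(isinstance(v, int) for v in value)


def check_record(record: Mapping[str, Any]) -> bool:
    """Check a record against the declared shapes (hypothetical helper).

    input_ids and labels are flat integer sequences;
    input_sentences_ids is a sequence of integer sequences.
    """
    nested = record.get("input_sentences_ids")
    return (
        _is_int_list(record.get("input_ids"))
        and _is_int_list(record.get("labels"))
        and isinstance(nested, list)
        and all(_is_int_list(item) for item in nested)
    )
```

For example, `check_record(load_config("BoolQ")[0])` would be one way to confirm that the first BoolQ record has the expected structure before feeding it to a training loop. The check only covers shapes and integer types; it does not verify the int32/int64 widths or what the token IDs mean.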
Provider:
MLP-Lemma
Summary of original information

Dataset overview

BigPatent

  • Features:
    • input_ids: sequence of int32
    • input_sentences_ids: sequence of sequences of int64
    • labels: sequence of int64
  • Splits:
    • train: 2181931364 bytes, 41383 examples
  • Download size: 429873908 bytes
  • Dataset size: 2181931364 bytes

BookSum

  • Features:
    • input_ids: sequence of int32
    • input_sentences_ids: sequence of sequences of int64
    • labels: sequence of int64
  • Splits:
    • train: 559326356 bytes, 9371 examples
  • Download size: 131337226 bytes
  • Dataset size: 559326356 bytes

BoolQ

  • Features:
    • input_ids: sequence of int32
    • input_sentences_ids: sequence of sequences of int64
    • labels: sequence of int64
  • Splits:
    • train: 23606064 bytes, 9426 examples
  • Download size: 3602973 bytes
  • Dataset size: 23606064 bytes

CNN-DM

  • Features:
    • input_ids: sequence of int32
    • input_sentences_ids: sequence of sequences of int64
    • labels: sequence of int64
  • Splits:
    • train: 986484148 bytes, 99951 examples
  • Download size: 223079731 bytes
  • Dataset size: 986484148 bytes

CosmosQA

  • Features:
    • input_ids: sequence of int32
    • input_sentences_ids: sequence of sequences of int64
    • labels: sequence of int64
  • Splits:
    • train: 68985732 bytes, 25262 examples
  • Download size: 8201506 bytes
  • Dataset size: 68985732 bytes

DROP

  • Features:
    • input_ids: sequence of int32
    • input_sentences_ids: sequence of sequences of int64
    • labels: sequence of int64
  • Splits:
    • train: 38608468 bytes, 9977 examples
  • Download size: 6309064 bytes
  • Dataset size: 38608468 bytes

GovReport

  • Features:
    • input_ids: sequence of int32
    • input_sentences_ids: sequence of sequences of int64
    • labels: sequence of int64
  • Splits:
    • train: 1482525108 bytes, 16590 examples
  • Download size: 314318065 bytes
  • Dataset size: 1482525108 bytes

HotpotQA

  • Features:
    • input_ids: sequence of int32
    • input_sentences_ids: sequence of sequences of int64
    • labels: sequence of int64
  • Splits:
    • train: 1153215480 bytes, 90208 examples
  • Download size: 237017294 bytes
  • Dataset size: 1153215480 bytes

LongAlpaca

  • Features:
    • input_ids: sequence of int32
    • input_sentences_ids: sequence of sequences of int64
    • labels: sequence of int64
  • Splits:
    • train: 740226008 bytes, 7573 examples
  • Download size: 147558794 bytes
  • Dataset size: 740226008 bytes

MultiNews

  • Features:
    • input_ids: sequence of int32
    • input_sentences_ids: sequence of sequences of int64
    • labels: sequence of int64
  • Splits:
    • train: 1079685084 bytes, 44398 examples
  • Download size: 251453029 bytes
  • Dataset size: 1079685084 bytes

MultiRC

  • Features:
    • input_ids: sequence of int32
    • input_sentences_ids: sequence of sequences of int64
    • labels: sequence of int64
  • Splits:
    • train: 50935944 bytes, 12025 examples
  • Download size: 2318491 bytes
  • Dataset size: 50935944 bytes

NarrativeQA

  • Features:
    • input_ids: sequence of int32
    • input_sentences_ids: sequence of sequences of int64
    • labels: sequence of int64
  • Splits:
    • train: 8619352780 bytes, 12857 examples
  • Download size: 1844752860 bytes
  • Dataset size: 8619352780 bytes

QMsum

  • Features:
    • input_ids: sequence of int32
    • input_sentences_ids: sequence of sequences of int64
    • labels: sequence of int64
  • Splits:
    • train: 148773208 bytes, 1257 examples
  • Download size: 22545022 bytes
  • Dataset size: 148773208 bytes

Qasper

  • Features:
    • input_ids: sequence of int32
    • input_sentences_ids: sequence of sequences of int64
    • labels: sequence of int64
  • Splits:
    • train: 117966388 bytes, 2461 examples
  • Download size: 23118038 bytes
  • Dataset size: 117966388 bytes

Quality

  • Features:
    • input_ids: sequence of int32
    • input_sentences_ids: sequence of sequences of int64
    • labels: sequence of int64
  • Splits:
    • train: 142043368 bytes, 2523 examples
  • Download size: 23727320 bytes
  • Dataset size: 142043368 bytes

ReCoRD

  • Features:
    • input_ids: sequence of int32
    • input_sentences_ids: sequence of sequences of int64
    • labels: sequence of int64
  • Splits:
    • train: 356353852 bytes, 100682 examples
  • Download size: 60755952 bytes
  • Dataset size: 356353852 bytes

SQuAD

  • Features:
    • input_ids: sequence of int32
    • input_sentences_ids: sequence of sequences of int64
    • labels: sequence of int64
  • Splits:
    • train: 242141716 bytes, 87580 examples
  • Download size: 32077152 bytes
  • Dataset size: 242141716 bytes

TriviaQA

  • Features:
    • input_ids: sequence of int32
    • input_sentences_ids: sequence of sequences of int64
    • labels: sequence of int64
  • Splits:
    • train: 4921073600 bytes, 52359 examples
  • Download size: 1017926047 bytes
  • Dataset size: 4921073600 bytes

XSum

  • Features:
    • input_ids: sequence of int32
    • input_sentences_ids: sequence of sequences of int64
    • labels: sequence of int64
  • Splits:
    • train: 571447332 bytes, 99827 examples
  • Download size: 124725113 bytes
  • Dataset size: 571447332 bytes
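
The per-configuration figures above are easier to compare once aggregated. The following is a minimal sketch that copies the example counts and sizes from the listing into a table and derives the total example count and the average stored bytes per example. The names `STATS`, `total_examples`, `bytes_per_example`, and `to_gib` are illustrative only and do not come from the dataset.

```python
# (config name, num_examples, dataset_size in bytes, download_size in bytes),
# copied from the overview listing.
STATS = [
    ("BigPatent", 41383, 2181931364, 429873908),
    ("BookSum", 9371, 559326356, 131337226),
    ("BoolQ", 9426, 23606064, 3602973),
    ("CNN-DM", 99951, 986484148, 223079731),
    ("CosmosQA", 25262, 68985732, 8201506),
    ("DROP", 9977, 38608468, 6309064),
    ("GovReport", 16590, 1482525108, 314318065),
    ("HotpotQA", 90208, 1153215480, 237017294),
    ("LongAlpaca", 7573, 740226008, 147558794),
    ("MultiNews", 44398, 1079685084, 251453029),
    ("MultiRC", 12025, 50935944, 2318491),
    ("NarrativeQA", 12857, 8619352780, 1844752860),
    ("QMsum", 1257, 148773208, 22545022),
    ("Qasper", 2461, 117966388, 23118038),
    ("Quality", 2523, 142043368, 23727320),
    ("ReCoRD", 100682, 356353852, 60755952),
    ("SQuAD", 87580, 242141716, 32077152),
    ("TriviaQA", 52359, 4921073600, 1017926047),
    ("XSum", 99827, 571447332, 124725113),
]


def total_examples(stats=STATS):
    """Sum the train-split example counts over all configurations."""
    return sum(num_examples for _, num_examples, _, _ in stats)


def bytes_per_example(stats=STATS):
    """Average stored bytes per example, keyed by configuration name."""
    return {name: size / num_examples for name, num_examples, size, _ in stats}


def to_gib(num_bytes):
    """Convert a byte count to GiB (2**30 bytes)."""
    return num_bytes / 2**30


if __name__ == "__main__":
    print("total examples:", total_examples())
    averages = bytes_per_example()
    for name, _, size, download in STATS:
        print(f"{name:12s} {to_gib(size):7.2f} GiB on disk, "
              f"{to_gib(download):6.2f} GiB download, "
              f"{averages[name]:10.0f} bytes/example")
```

Average bytes per example is only a rough proxy for sequence length, since each record stores three integer sequences at different widths, but it does separate the long-document configurations (such as NarrativeQA) from the short-passage ones (such as BoolQ).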