
MLP-Lemma/Instruct-datasets-preprocessed-old

Hugging Face | Updated 2024-05-10 | Indexed 2024-06-12
Download link:
https://hf-mirror.com/datasets/MLP-Lemma/Instruct-datasets-preprocessed-old
Resource description:
---
dataset_info:
- config_name: BigPatent
  features:
  - name: input_ids
    sequence: int32
  - name: input_sentences_ids
    sequence:
      sequence: int64
  - name: labels
    sequence: int64
  splits:
  - name: train
    num_bytes: 2102177468
    num_examples: 41479
  download_size: 429751697
  dataset_size: 2102177468
- config_name: BookSum
  features:
  - name: input_ids
    sequence: int32
  - name: input_sentences_ids
    sequence:
      sequence: int64
  - name: labels
    sequence: int64
  splits:
  - name: train
    num_bytes: 543904716
    num_examples: 9409
  download_size: 128895816
  dataset_size: 543904716
- config_name: BoolQ
  features:
  - name: input_ids
    sequence: int32
  - name: input_sentences_ids
    sequence:
      sequence: int64
  - name: labels
    sequence: int64
  splits:
  - name: train
    num_bytes: 17514304
    num_examples: 9427
  download_size: 3477907
  dataset_size: 17514304
- config_name: CosmosQA
  features:
  - name: input_ids
    sequence: int32
  - name: input_sentences_ids
    sequence:
      sequence: int64
  - name: labels
    sequence: int64
  splits:
  - name: train
    num_bytes: 52847188
    num_examples: 25262
  download_size: 7706920
  dataset_size: 52847188
- config_name: DROP
  features:
  - name: input_ids
    sequence: int32
  - name: input_sentences_ids
    sequence:
      sequence: int64
  - name: labels
    sequence: int64
  splits:
  - name: train
    num_bytes: 245217428
    num_examples: 77204
  download_size: 14313116
  dataset_size: 245217428
- config_name: HotpotQA
  features:
  - name: input_ids
    sequence: int32
  - name: input_sentences_ids
    sequence:
      sequence: int64
  - name: labels
    sequence: int64
  splits:
  - name: train
    num_bytes: 1072803600
    num_examples: 90208
  download_size: 233619848
  dataset_size: 1072803600
- config_name: LongAlpaca
  features:
  - name: input_ids
    sequence: int32
  - name: input_sentences_ids
    sequence:
      sequence: int64
  - name: labels
    sequence: int64
  splits:
  - name: train
    num_bytes: 721290748
    num_examples: 7627
  download_size: 147562694
  dataset_size: 721290748
- config_name: MultiNews
  features:
  - name: input_ids
    sequence: int32
  - name: input_sentences_ids
    sequence:
      sequence: int64
  - name: labels
    sequence: int64
  splits:
  - name: train
    num_bytes: 1020525220
    num_examples: 44351
  download_size: 248426552
  dataset_size: 1020525220
- config_name: MultiRC
  features:
  - name: input_ids
    sequence: int32
  - name: input_sentences_ids
    sequence:
      sequence: int64
  - name: labels
    sequence: int64
  splits:
  - name: train
    num_bytes: 42141260
    num_examples: 12025
  download_size: 2282817
  dataset_size: 42141260
- config_name: NarrativeQA
  features:
  - name: input_ids
    sequence: int32
  - name: input_sentences_ids
    sequence:
      sequence: int64
  - name: labels
    sequence: int64
  splits:
  - name: train
    num_bytes: 7462691336
    num_examples: 13344
  download_size: 1598984098
  dataset_size: 7462691336
- config_name: QMsum
  features:
  - name: input_ids
    sequence: int32
  - name: input_sentences_ids
    sequence:
      sequence: int64
  - name: labels
    sequence: int64
  splits:
  - name: train
    num_bytes: 139541092
    num_examples: 1257
  download_size: 22489762
  dataset_size: 139541092
- config_name: Qasper
  features:
  - name: input_ids
    sequence: int32
  - name: input_sentences_ids
    sequence:
      sequence: int64
  - name: labels
    sequence: int64
  splits:
  - name: train
    num_bytes: 116752512
    num_examples: 2545
  download_size: 23681139
  dataset_size: 116752512
- config_name: Quality
  features:
  - name: input_ids
    sequence: int32
  - name: input_sentences_ids
    sequence:
      sequence: int64
  - name: labels
    sequence: int64
  splits:
  - name: train
    num_bytes: 134614736
    num_examples: 2523
  download_size: 24647486
  dataset_size: 134614736
- config_name: ReCoRD
  features:
  - name: input_ids
    sequence: int32
  - name: input_sentences_ids
    sequence:
      sequence: int64
  - name: labels
    sequence: int64
  splits:
  - name: train
    num_bytes: 289867824
    num_examples: 100684
  download_size: 59223539
  dataset_size: 289867824
- config_name: SQuAD
  features:
  - name: input_ids
    sequence: int32
  - name: input_sentences_ids
    sequence:
      sequence: int64
  - name: labels
    sequence: int64
  splits:
  - name: train
    num_bytes: 185012360
    num_examples: 87576
  download_size: 31396840
  dataset_size: 185012360
- config_name: TriviaQA
  features:
  - name: input_ids
    sequence: int32
  - name: input_sentences_ids
    sequence:
      sequence: int64
  - name: labels
    sequence: int64
  splits:
  - name: train
    num_bytes: 4902401628
    num_examples: 53503
  download_size: 1041259522
  dataset_size: 4902401628
configs:
- config_name: BigPatent
  data_files:
  - split: train
    path: BigPatent/train-*
- config_name: BookSum
  data_files:
  - split: train
    path: BookSum/train-*
- config_name: BoolQ
  data_files:
  - split: train
    path: BoolQ/train-*
- config_name: CosmosQA
  data_files:
  - split: train
    path: CosmosQA/train-*
- config_name: DROP
  data_files:
  - split: train
    path: DROP/train-*
- config_name: HotpotQA
  data_files:
  - split: train
    path: HotpotQA/train-*
- config_name: LongAlpaca
  data_files:
  - split: train
    path: LongAlpaca/train-*
- config_name: MultiNews
  data_files:
  - split: train
    path: MultiNews/train-*
- config_name: MultiRC
  data_files:
  - split: train
    path: MultiRC/train-*
- config_name: NarrativeQA
  data_files:
  - split: train
    path: NarrativeQA/train-*
- config_name: QMsum
  data_files:
  - split: train
    path: QMsum/train-*
- config_name: Qasper
  data_files:
  - split: train
    path: Qasper/train-*
- config_name: Quality
  data_files:
  - split: train
    path: Quality/train-*
- config_name: ReCoRD
  data_files:
  - split: train
    path: ReCoRD/train-*
- config_name: SQuAD
  data_files:
  - split: train
    path: SQuAD/train-*
- config_name: TriviaQA
  data_files:
  - split: train
    path: TriviaQA/train-*
---

The dataset comprises 16 sub-datasets, each exposed as a named configuration with a single train split. All configurations share the same three features, which are used for model training: input_ids (a sequence of int32), input_sentences_ids (a sequence of int64 sequences), and labels (a sequence of int64). For each configuration, the card records the dataset size in bytes, the number of examples, and the download size.
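As an illustration of how the configurations described above might be consumed, the following is a minimal sketch using the Hugging Face `datasets` library. The helper name `load_config` and the choice of streaming by default are assumptions made for this example, not part of the dataset card; the sketch also assumes the repository is still publicly available under the ID shown.

```python
# Repository ID and configuration names as listed on the dataset card.
REPO_ID = "MLP-Lemma/Instruct-datasets-preprocessed-old"
CONFIG_NAMES = (
    "BigPatent", "BookSum", "BoolQ", "CosmosQA", "DROP", "HotpotQA",
    "LongAlpaca", "MultiNews", "MultiRC", "NarrativeQA", "QMsum",
    "Qasper", "Quality", "ReCoRD", "SQuAD", "TriviaQA",
)


def load_config(name, streaming=True):
    """Load the train split of one configuration (hypothetical helper).

    Streaming is the default here because some configurations are several
    gigabytes in size; pass streaming=False to download and cache instead.
    """
    if name not in CONFIG_NAMES:
        raise ValueError(f"unknown configuration: {name!r}")
    # Third-party dependency, imported lazily: pip install datasets
    from datasets import load_dataset
    return load_dataset(REPO_ID, name, split="train", streaming=streaming)


if __name__ == "__main__":
    # Requires network access; prints the field names of the first example.
    first = next(iter(load_config("BoolQ")))
    print(sorted(first.keys()))
```

Each example yielded this way should carry the three fields named on the card (input_ids, input_sentences_ids, labels); what the label values encode is not documented in the card itself.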
Provider:
MLP-Lemma
Summary of original information

Dataset overview

BigPatent

  • Features:
    • input_ids: sequence of int32
    • input_sentences_ids: sequence of int64 sequences
    • labels: sequence of int64
  • Train split:
    • Dataset size: 2102177468 bytes
    • Number of examples: 41479
    • Download size: 429751697 bytes

BookSum

  • Features:
    • input_ids: sequence of int32
    • input_sentences_ids: sequence of int64 sequences
    • labels: sequence of int64
  • Train split:
    • Dataset size: 543904716 bytes
    • Number of examples: 9409
    • Download size: 128895816 bytes

BoolQ

  • Features:
    • input_ids: sequence of int32
    • input_sentences_ids: sequence of int64 sequences
    • labels: sequence of int64
  • Train split:
    • Dataset size: 17514304 bytes
    • Number of examples: 9427
    • Download size: 3477907 bytes

CosmosQA

  • Features:
    • input_ids: sequence of int32
    • input_sentences_ids: sequence of int64 sequences
    • labels: sequence of int64
  • Train split:
    • Dataset size: 52847188 bytes
    • Number of examples: 25262
    • Download size: 7706920 bytes

DROP

  • Features:
    • input_ids: sequence of int32
    • input_sentences_ids: sequence of int64 sequences
    • labels: sequence of int64
  • Train split:
    • Dataset size: 245217428 bytes
    • Number of examples: 77204
    • Download size: 14313116 bytes

HotpotQA

  • Features:
    • input_ids: sequence of int32
    • input_sentences_ids: sequence of int64 sequences
    • labels: sequence of int64
  • Train split:
    • Dataset size: 1072803600 bytes
    • Number of examples: 90208
    • Download size: 233619848 bytes

LongAlpaca

  • Features:
    • input_ids: sequence of int32
    • input_sentences_ids: sequence of int64 sequences
    • labels: sequence of int64
  • Train split:
    • Dataset size: 721290748 bytes
    • Number of examples: 7627
    • Download size: 147562694 bytes

MultiNews

  • Features:
    • input_ids: sequence of int32
    • input_sentences_ids: sequence of int64 sequences
    • labels: sequence of int64
  • Train split:
    • Dataset size: 1020525220 bytes
    • Number of examples: 44351
    • Download size: 248426552 bytes

MultiRC

  • Features:
    • input_ids: sequence of int32
    • input_sentences_ids: sequence of int64 sequences
    • labels: sequence of int64
  • Train split:
    • Dataset size: 42141260 bytes
    • Number of examples: 12025
    • Download size: 2282817 bytes

NarrativeQA

  • Features:
    • input_ids: sequence of int32
    • input_sentences_ids: sequence of int64 sequences
    • labels: sequence of int64
  • Train split:
    • Dataset size: 7462691336 bytes
    • Number of examples: 13344
    • Download size: 1598984098 bytes

QMsum

  • Features:
    • input_ids: sequence of int32
    • input_sentences_ids: sequence of int64 sequences
    • labels: sequence of int64
  • Train split:
    • Dataset size: 139541092 bytes
    • Number of examples: 1257
    • Download size: 22489762 bytes

Qasper

  • Features:
    • input_ids: sequence of int32
    • input_sentences_ids: sequence of int64 sequences
    • labels: sequence of int64
  • Train split:
    • Dataset size: 116752512 bytes
    • Number of examples: 2545
    • Download size: 23681139 bytes

Quality

  • Features:
    • input_ids: sequence of int32
    • input_sentences_ids: sequence of int64 sequences
    • labels: sequence of int64
  • Train split:
    • Dataset size: 134614736 bytes
    • Number of examples: 2523
    • Download size: 24647486 bytes

ReCoRD

  • Features:
    • input_ids: sequence of int32
    • input_sentences_ids: sequence of int64 sequences
    • labels: sequence of int64
  • Train split:
    • Dataset size: 289867824 bytes
    • Number of examples: 100684
    • Download size: 59223539 bytes

SQuAD

  • Features:
    • input_ids: sequence of int32
    • input_sentences_ids: sequence of int64 sequences
    • labels: sequence of int64
  • Train split:
    • Dataset size: 185012360 bytes
    • Number of examples: 87576
    • Download size: 31396840 bytes

TriviaQA

  • Features:
    • input_ids: sequence of int32
    • input_sentences_ids: sequence of int64 sequences
    • labels: sequence of int64
  • Train split:
    • Dataset size: 4902401628 bytes
    • Number of examples: 53503
    • Download size: 1041259522 bytes
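
The per-configuration figures above can be combined to estimate overall storage needs and average example size. The following is a minimal sketch in plain Python; the numbers are copied from the listing, while the names `SPLITS`, `summarize`, and `bytes_per_example` are illustrative and not part of the dataset.

```python
# (config_name, dataset_size_bytes, num_examples, download_size_bytes),
# copied from the per-configuration listing above.
SPLITS = [
    ("BigPatent", 2102177468, 41479, 429751697),
    ("BookSum", 543904716, 9409, 128895816),
    ("BoolQ", 17514304, 9427, 3477907),
    ("CosmosQA", 52847188, 25262, 7706920),
    ("DROP", 245217428, 77204, 14313116),
    ("HotpotQA", 1072803600, 90208, 233619848),
    ("LongAlpaca", 721290748, 7627, 147562694),
    ("MultiNews", 1020525220, 44351, 248426552),
    ("MultiRC", 42141260, 12025, 2282817),
    ("NarrativeQA", 7462691336, 13344, 1598984098),
    ("QMsum", 139541092, 1257, 22489762),
    ("Qasper", 116752512, 2545, 23681139),
    ("Quality", 134614736, 2523, 24647486),
    ("ReCoRD", 289867824, 100684, 59223539),
    ("SQuAD", 185012360, 87576, 31396840),
    ("TriviaQA", 4902401628, 53503, 1041259522),
]


def bytes_per_example(name):
    """Average in-memory size of one example for the given configuration."""
    for config, size, count, _ in SPLITS:
        if config == name:
            return size / count
    raise KeyError(name)


def summarize(rows):
    """Aggregate totals across all configurations."""
    total_bytes = sum(size for _, size, _, _ in rows)
    total_download = sum(download for _, _, _, download in rows)
    return {
        "configs": len(rows),
        "examples": sum(count for _, _, count, _ in rows),
        "dataset_bytes": total_bytes,
        "download_bytes": total_download,
        # Fraction of the prepared size that must actually be downloaded.
        "download_ratio": total_download / total_bytes,
        "largest": max(rows, key=lambda r: r[1])[0],
    }


if __name__ == "__main__":
    for key, value in summarize(SPLITS).items():
        print(f"{key}: {value}")
```

Summed over all 16 configurations, this comes to 578,424 examples, roughly 19.05 GB of prepared data, and roughly 4.02 GB to download, with NarrativeQA the largest single configuration.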