
pratultandon/tokenized-recipe-nlg-gpt2

Source: Hugging Face. Updated: 2022-11-16. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/pratultandon/tokenized-recipe-nlg-gpt2
Resource description:
---
dataset_info:
  features:
  - name: input_ids
    sequence: int32
  - name: attention_mask
    sequence: int8
  splits:
  - name: test
    num_bytes: 135944246
    num_examples: 106202
  - name: train
    num_bytes: 2582090838
    num_examples: 2022671
  download_size: 805955428
  dataset_size: 2718035084
---

# Dataset Card for "tokenized-recipe-nlg-gpt2"

This is a tokenized version of the recipe-nlg dataset from https://recipenlg.cs.put.poznan.pl/. The original CSV was preprocessed following the methodology of the original paper (as best I could interpret it), with a similar 0.05 (5 percent) train/test split. The tokenizer uses some special tokens; all of these parameters are available at https://huggingface.co/pratultandon/recipe-nlg-gpt2 if you want to recreate the preprocessing. This dataset will save you a lot of time getting started if you want to experiment with training GPT2 on the data yourself.
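To make the card's two features concrete, here is a minimal sketch of streaming a few records and sanity-checking them. It assumes the Hugging Face `datasets` package is installed and that the dataset can be fetched under the ID shown in the title; the helper names `stream_examples` and `check_record` are hypothetical, and the expectation that `attention_mask` has the same length as `input_ids` is an assumption based on usual tokenizer output, not something the card states.

```python
def stream_examples(split="test", limit=2):
    """Yield the first `limit` records of a split without a full download."""
    # Imported lazily so this module can be imported without `datasets`.
    from datasets import load_dataset

    ds = load_dataset(
        "pratultandon/tokenized-recipe-nlg-gpt2",
        split=split,
        streaming=True,  # iterate over the remote files instead of caching them
    )
    for i, record in enumerate(ds):
        if i >= limit:
            break
        yield record


def check_record(record):
    """Return True if a record has both features, equal lengths, and a 0/1 mask.

    Assumption: attention_mask mirrors input_ids position by position.
    """
    ids = record["input_ids"]
    mask = record["attention_mask"]
    return len(ids) == len(mask) and all(m in (0, 1) for m in mask)


if __name__ == "__main__":
    # Requires network access and the `datasets` package.
    for rec in stream_examples("test", limit=2):
        print(len(rec["input_ids"]), check_record(rec))
```

Streaming is used here only because the full download is large; dropping `streaming=True` would cache the whole split locally instead.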
Provided by:
pratultandon
Summary of original information

Dataset overview

Dataset name

"tokenized-recipe-nlg-gpt2"

Dataset features

  • input_ids

    • Data type: int32
    • Sequence type: sequence
  • attention_mask

    • Data type: int8
    • Sequence type: sequence

Dataset splits

  • Training set (train)

    • Number of examples: 2022671
    • Data size: 2582090838 bytes
  • Test set (test)

    • Number of examples: 106202
    • Data size: 135944246 bytes
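As a quick check on the split figures above, the following sketch computes the share of examples held out for testing. The variable names are illustrative; only the two counts come from the listing.

```python
# Example counts copied from the split listing above.
train_examples = 2_022_671
test_examples = 106_202

total_examples = train_examples + test_examples
test_fraction = test_examples / total_examples

print(f"total examples: {total_examples}")
print(f"test fraction:  {test_fraction:.4f}")  # close to 0.05, i.e. about 5 percent
```

The test split therefore holds roughly one example in twenty.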

Dataset size

  • Download size: 805955428 bytes
  • Total dataset size: 2718035084 bytes
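The size figures can be cross-checked with a little arithmetic. The sketch below confirms that the two split sizes add up to the total, then derives a download-to-disk ratio and a rough token count per training example. The per-example estimate assumes each token costs 5 bytes (one int32 in `input_ids` plus one int8 in `attention_mask`) and ignores any storage overhead, so treat it as an approximation only.

```python
# Byte and example counts copied from the listing.
train_bytes = 2_582_090_838
test_bytes = 135_944_246
download_bytes = 805_955_428
dataset_bytes = 2_718_035_084
train_examples = 2_022_671

# The split sizes should sum to the reported total dataset size.
splits_sum_matches = (train_bytes + test_bytes) == dataset_bytes

# How much larger the prepared dataset is than the download.
expansion_ratio = dataset_bytes / download_bytes

# Assumption: 4 bytes (int32) + 1 byte (int8) per token, no overhead.
BYTES_PER_TOKEN = 4 + 1
approx_tokens_per_example = train_bytes / train_examples / BYTES_PER_TOKEN

GIB = 1024 ** 3
print(f"splits sum to total: {splits_sum_matches}")
print(f"download: {download_bytes / GIB:.2f} GiB, on disk: {dataset_bytes / GIB:.2f} GiB")
print(f"expansion ratio: {expansion_ratio:.2f}")
print(f"approx. tokens per training example: {approx_tokens_per_example:.0f}")
```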