nicholasKluge/Pt-Corpus-Instruct-tokenized-large

Name: nicholasKluge/Pt-Corpus-Instruct-tokenized-large
Creator: nicholasKluge
Published: 2024-06-18 12:07:34
License: no description available

Source: Hugging Face. Updated: 2024-06-18. Indexed: 2024-03-04.

Download link:

https://hf-mirror.com/datasets/nicholasKluge/Pt-Corpus-Instruct-tokenized-large

Overview:

The Portuguese-Corpus Instruct (tokenized large) dataset is a tokenized version of a Portuguese corpus using the TeenyTinyLlama tokenizer. All sequences are 2048 tokens long, and this dataset was used to train the TeenyTinyLlama model. The dataset includes a training set (~3M examples) and a test set (30K examples), with features such as input_ids (sequence of tokens), attention_mask (binary tensor indicating padded positions), and labels (sequence of tokens). The dataset is in Portuguese and is suitable for text-generation tasks.

Provider:

nicholasKluge

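Summary of Original Information

Portuguese-Corpus Instruct (tokenized large) Dataset Overview

Dataset Description

Dataset Summary

This dataset is a tokenized version of the Portuguese-Corpus Instruct dataset, processed with the TeenyTinyLlama tokenizer. Every sequence is 2048 tokens long. The dataset was used in the study "TeenyTinyLlama: open-source tiny language models trained in Brazilian Portuguese".

Languages

Portuguese.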

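Dataset Structure

Data Instances

The dataset consists of the following features:

input_ids: the token ids of the sequence.
attention_mask: a binary tensor indicating the positions of padded indices.
labels: the token ids of the sequence.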
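
As a rough illustration of the three features, the sketch below checks that a single example has the expected fields, that each field has the stated length of 2048, and that the mask is binary. The helper name `check_example` and the synthetic example are assumptions made for illustration; they are not part of the dataset or of any library.

```python
SEQ_LEN = 2048  # sequence length stated in the dataset summary

FIELDS = ("input_ids", "attention_mask", "labels")


def check_example(example, seq_len=SEQ_LEN):
    """Return a list of problems found in one example (empty list means none)."""
    problems = []
    for field in FIELDS:
        if field not in example:
            problems.append(f"missing field: {field}")
        elif len(example[field]) != seq_len:
            problems.append(
                f"{field} has length {len(example[field])}, expected {seq_len}"
            )
    # The mask is described as binary, so only 0 and 1 are accepted here.
    if any(v not in (0, 1) for v in example.get("attention_mask", [])):
        problems.append("attention_mask contains non-binary values")
    return problems


# Synthetic example (hypothetical values, not taken from the dataset).
fake_example = {
    "input_ids": [1026] * SEQ_LEN,
    "attention_mask": [1] * SEQ_LEN,
    "labels": [1026] * SEQ_LEN,
}
print(check_example(fake_example))  # prints []
```

The same function can be applied to rows yielded by the real dataset, since each row is a dictionary with these three keys.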

Data Fields

    {
        "input_ids": [ 1026, 1531, 1009, 8067,...],
        "attention_mask": [1, 1, 1, 1, ...],
        "labels": [ 1026, 1531, 1009, 8067,...]
    }

Data Splits

The dataset is divided into two splits: train (about 3 million examples) and test (30,000 examples).
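
Because every sequence is 2048 tokens long, the split sizes translate directly into token counts. The arithmetic below is a back-of-the-envelope estimate: the train size is only given as "about 3 million", so the train figure is approximate, and padded positions are counted as tokens.

```python
SEQ_LEN = 2048              # tokens per sequence, as stated in the summary
TRAIN_EXAMPLES = 3_000_000  # approximate ("about 3 million")
TEST_EXAMPLES = 30_000


def total_tokens(num_examples, seq_len=SEQ_LEN):
    """Number of token positions in a split of fixed-length sequences."""
    return num_examples * seq_len


train_tokens = total_tokens(TRAIN_EXAMPLES)
test_tokens = total_tokens(TEST_EXAMPLES)

print(f"train: ~{train_tokens:,} tokens")  # train: ~6,144,000,000 tokens
print(f"test:  {test_tokens:,} tokens")    # test:  61,440,000 tokens
```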

    from datasets import load_dataset

    # The split name must be passed as a string.
    dataset = load_dataset("nicholasKluge/Pt-Corpus-Instruct-tokenized-large", split="train")

If you don't want to download the entire dataset, set streaming to `True`:

    dataset = load_dataset("nicholasKluge/Pt-Corpus-Instruct-tokenized-large", split="train", streaming=True)
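
With `streaming=True`, `load_dataset` returns an iterable rather than an in-memory table, so examples are read by iterating. The sketch below shows one way to look at the first few examples without consuming the whole stream. The helper names `peek` and `open_train_stream` are hypothetical; `open_train_stream` needs the `datasets` package and network access, so it is defined but not called here.

```python
from itertools import islice


def peek(stream, n=3):
    """Return the first n items of any iterable without reading the rest."""
    return list(islice(stream, n))


def open_train_stream():
    """Open the train split in streaming mode (requires `datasets` and network)."""
    from datasets import load_dataset  # imported lazily so peek() works without it

    return load_dataset(
        "nicholasKluge/Pt-Corpus-Instruct-tokenized-large",
        split="train",
        streaming=True,
    )


# Offline demonstration with a stand-in generator of fake rows.
fake_stream = ({"input_ids": [i], "attention_mask": [1], "labels": [i]} for i in range(10))
first_rows = peek(fake_stream, n=2)
print([row["input_ids"] for row in first_rows])  # prints [[0], [1]]
```

Against the real dataset, usage would be `peek(open_train_stream(), n=3)`, which returns a list of three dictionaries.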

Additional Information

Dataset Curators

Nicholas Kluge Corrêa。

Citation Information

    @misc{correa24ttllama,
      title = {TeenyTinyLlama: open-source tiny language models trained in Brazilian Portuguese},
      author = {Corr{\^e}a, Nicholas Kluge and Falk, Sophia and Fatimah, Shiza and Sen, Aniket and De Oliveira, Nythamar},
      journal = {arXiv preprint arXiv:2401.16640},
      year = {2024}
    }

    @misc{correa24ttllama,
      doi = {10.1016/j.mlwa.2024.100558},
      url = {https://www.sciencedirect.com/science/article/pii/S2666827024000343},
      title = {TeenyTinyLlama: open-source tiny language models trained in Brazilian Portuguese},
      author = {Corr{\^e}a, Nicholas Kluge and Falk, Sophia and Fatimah, Shiza and Sen, Aniket and De Oliveira, Nythamar},
      journal = {Machine Learning With Applications},
      publisher = {Springer},
      year = {2024}
    }

Contributions

If you would like to contribute, please contact nicholas@airespucrs.org.
