nicholasKluge/Pt-Corpus-Instruct-tokenized-large
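Portuguese-Corpus Instruct (tokenized large): Dataset Overview
Dataset Description
Dataset Summary
This dataset is the tokenized version of the Portuguese-Corpus Instruct dataset, processed with the TeenyTinyLlama tokenizer. Every sequence is 2048 tokens long. The dataset was used in the study "TeenyTinyLlama: open-source tiny language models trained in Brazilian Portuguese".
Languages
Portuguese.
Dataset Structure
Data Instances
The dataset consists of the following features:
- input_ids: the token IDs of the sequence.
- attention_mask: a binary tensor indicating the positions of the padding indices.
- labels: the token IDs of the sequence.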
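Because `labels` repeats `input_ids`, the records look intended for causal language modeling, where position i is trained to predict the token at position i + 1 and the one-step shift is usually applied inside the model rather than in the data. The sketch below illustrates that pairing under this assumption. `next_token_pairs` is a hypothetical helper written for this example; it is not part of the dataset or of any library.

```python
def next_token_pairs(input_ids, labels):
    """Pair each input token with the label it is trained to predict.

    Assumes the usual causal-LM convention: position i predicts labels[i + 1].
    """
    if len(input_ids) != len(labels):
        raise ValueError("input_ids and labels must have the same length")
    return list(zip(input_ids[:-1], labels[1:]))


# Toy record built from the first four token IDs shown in this card.
record = {
    "input_ids": [1026, 1531, 1009, 8067],
    "attention_mask": [1, 1, 1, 1],
    "labels": [1026, 1531, 1009, 8067],
}

pairs = next_token_pairs(record["input_ids"], record["labels"])
print(pairs)  # [(1026, 1531), (1531, 1009), (1009, 8067)]
```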
Data Fields
{
  "input_ids": [1026, 1531, 1009, 8067, ...],
  "attention_mask": [1, 1, 1, 1, ...],
  "labels": [1026, 1531, 1009, 8067, ...]
}
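A quick sanity check on a record can catch a mismatched tokenizer or a truncated download early. The sketch below assumes what this card describes: three fields, each 2048 entries long, a binary `attention_mask`, and `labels` identical to `input_ids`. `check_record` and `SEQ_LEN` are hypothetical names introduced for this example.

```python
SEQ_LEN = 2048  # sequence length stated in the dataset summary


def check_record(record, seq_len=SEQ_LEN):
    """Return a list of problems found in one record (empty list means it looks fine)."""
    problems = []
    for field in ("input_ids", "attention_mask", "labels"):
        if field not in record:
            problems.append(f"missing field: {field}")
        elif len(record[field]) != seq_len:
            problems.append(
                f"{field} has length {len(record[field])}, expected {seq_len}"
            )
    if not problems:
        if any(m not in (0, 1) for m in record["attention_mask"]):
            problems.append("attention_mask is not binary")
        # Assumption taken from the example record: labels mirror input_ids.
        if record["labels"] != record["input_ids"]:
            problems.append("labels differ from input_ids")
    return problems


# Synthetic record with a placeholder token ID, used only to exercise the check.
example = {
    "input_ids": [7] * SEQ_LEN,
    "attention_mask": [1] * SEQ_LEN,
    "labels": [7] * SEQ_LEN,
}
print(check_record(example))  # []
```

Returning a list of messages instead of raising on the first failure makes it easy to log every issue when scanning many records.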
Data Splits
The dataset is divided into two splits: train (about 3 million rows) and test (30,000 rows).
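With a fixed length of 2048 tokens per sequence, the row counts translate directly into rough token budgets. The arithmetic below is a back-of-the-envelope estimate only: it uses the rounded row counts given above, not exact figures, and the variable names are illustrative.

```python
SEQ_LEN = 2048  # tokens per sequence, per the dataset summary

# Rounded row counts from this card; the train figure is approximate.
approx_rows = {"train": 3_000_000, "test": 30_000}

approx_tokens = {split: rows * SEQ_LEN for split, rows in approx_rows.items()}

for split, tokens in approx_tokens.items():
    print(f"{split}: about {tokens:,} tokens")
# train: about 6,144,000,000 tokens
# test: about 61,440,000 tokens
```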
from datasets import load_dataset

# Download the full train split
dataset = load_dataset("nicholasKluge/Pt-Corpus-Instruct-tokenized-large", split="train")

# If you don't want to download the entire dataset, set streaming to True
dataset = load_dataset("nicholasKluge/Pt-Corpus-Instruct-tokenized-large", split="train", streaming=True)
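A streamed dataset is consumed by iteration rather than loaded all at once, so a common first step is to peek at a few records. The sketch below shows one way to do that with `itertools.islice`. `take` and `preview_remote` are hypothetical helpers written for this example, and `preview_remote` needs the `datasets` package and network access, so the demonstration at the bottom uses a stand-in generator instead.

```python
from itertools import islice


def take(stream, n):
    """Return the first n records from any iterable without consuming the rest."""
    return list(islice(stream, n))


def preview_remote(n=2):
    """Stream the train split and return its first n records (needs network access)."""
    from datasets import load_dataset  # imported lazily so `take` works without it

    stream = load_dataset(
        "nicholasKluge/Pt-Corpus-Instruct-tokenized-large",
        split="train",
        streaming=True,
    )
    return take(stream, n)


# Offline demonstration with a stand-in generator instead of the real stream.
fake_stream = (
    {"input_ids": [i], "attention_mask": [1], "labels": [i]} for i in range(5)
)
first_two = take(fake_stream, 2)
print([r["input_ids"] for r in first_two])  # [[0], [1]]
```

Calling `preview_remote()` should fetch only as much data as is needed for the requested records, which is the point of streaming.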
Additional Information
Dataset Curators
Citation Information
@misc{correa24ttllama,
  title = {TeenyTinyLlama: open-source tiny language models trained in Brazilian Portuguese},
  author = {Corr{\^e}a, Nicholas Kluge and Falk, Sophia and Fatimah, Shiza and Sen, Aniket and De Oliveira, Nythamar},
  journal = {arXiv preprint arXiv:2401.16640},
  year = {2024}
}

@misc{correa24ttllama,
  doi = {10.1016/j.mlwa.2024.100558},
  url = {https://www.sciencedirect.com/science/article/pii/S2666827024000343},
  title = {TeenyTinyLlama: open-source tiny language models trained in Brazilian Portuguese},
  author = {Corr{\^e}a, Nicholas Kluge and Falk, Sophia and Fatimah, Shiza and Sen, Aniket and De Oliveira, Nythamar},
  journal = {Machine Learning With Applications},
  publisher = {Springer},
  year = {2024}
}
Contributions
If you would like to contribute, contact nicholas@airespucrs.org.



