TucanoBR/Tucano-SFT
Hugging Face | Updated 2024-11-13 | Indexed 2025-04-12
Download link:
https://hf-mirror.com/datasets/TucanoBR/Tucano-SFT
Resource description:
---
dataset_info:
  features:
  - name: conversations
    list:
    - name: content
      dtype: string
    - name: role
      dtype: string
  - name: metadata
    dtype: string
  splits:
  - name: train
    num_bytes: 1320671821.7453537
    num_examples: 679609
  download_size: 681136105
  dataset_size: 1320671821.7453537
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
license: other
task_categories:
- text-generation
language:
- pt
tags:
- portuguese
- language-modeling
- chat
- conversation
- instruction
pretty_name: Tucano-SFT
size_categories:
- 100K<n<1M
---
# Tucano-SFT
<img src="./logo-gigaverbo.png" height="200">
## Table of Contents
- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
## Dataset Description
- **Homepage:** https://huggingface.co/datasets/TucanoBR/Tucano-SFT
- **Repository:** https://huggingface.co/datasets/TucanoBR/Tucano-SFT
- **Paper:** [Tucano: Advancing Neural Text Generation for Portuguese](https://arxiv.org/abs/2411.07854)
- **Point of Contact:** [Nk-correa](mailto:kluge@uni-bonn.de)
### Dataset Summary
This dataset was used to train the "Instruct" versions of the Tucano series. It is a concatenation of three datasets:
- [cnmoro/GPT4-500k-Augmented-PTBR-Clean](https://huggingface.co/datasets/cnmoro/GPT4-500k-Augmented-PTBR-Clean)
- [rhaymison/orca-math-portuguese-64k](https://huggingface.co/datasets/rhaymison/orca-math-portuguese-64k)
- [nicholasKluge/instruct-aira-dataset-v3](https://huggingface.co/datasets/nicholasKluge/instruct-aira-dataset-v3)
### Supported Tasks and Leaderboards
This dataset can be used for tasks involving the alignment of language models.
### Languages
Portuguese
## Dataset Structure
### Data Instances
The dataset consists of the following features:
- **conversations:** a list of dictionaries following a [chat format](https://github.com/huggingface/blog/blob/main/chat-templates.md).
- **metadata:** a string identifying the source dataset the conversation came from.
### Data Fields
```python
{
    "conversations": [
        {"role": "user", "content": "What is a language model?"},
        {"role": "assistant", "content": "A language model is a probability distribution over a vocabulary."},
    ],
    "metadata": "source: https://huggingface.co/datasets/nicholasKluge/instruct-aira-dataset-v3",
}
```
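As a minimal sketch of how a record in this layout might be consumed, the snippet below checks that a record has the shape shown above and flattens its turns into plain text. The helper names `validate_record` and `render_conversation`, and the `role: content` line layout, are illustrative assumptions; they are not part of the dataset or of any Tucano chat template.

```python
def validate_record(record):
    """Return True if `record` has a list of role/content turns and a string `metadata`."""
    conversations = record.get("conversations")
    if not isinstance(conversations, list):
        return False
    if not isinstance(record.get("metadata"), str):
        return False
    return all(
        isinstance(turn, dict)
        and isinstance(turn.get("role"), str)
        and isinstance(turn.get("content"), str)
        for turn in conversations
    )


def render_conversation(record):
    """Render one `role: content` line per turn (hypothetical layout, for inspection only)."""
    return "\n".join(
        f"{turn['role']}: {turn['content']}" for turn in record["conversations"]
    )


example = {
    "conversations": [
        {"role": "user", "content": "What is a language model?"},
        {"role": "assistant", "content": "A language model is a probability distribution over a vocabulary."},
    ],
    "metadata": "source: https://huggingface.co/datasets/nicholasKluge/instruct-aira-dataset-v3",
}

if validate_record(example):
    print(render_conversation(example))
```

For actual fine-tuning, the model's own chat template would normally replace `render_conversation`.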
### Data Splits
The only available split is `train`.
```python
from datasets import load_dataset
dataset = load_dataset("TucanoBR/Tucano-SFT", split='train')
# If you don't want to download the entire dataset, set streaming to `True`
dataset = load_dataset("TucanoBR/Tucano-SFT", split='train', streaming=True)
```
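Once the data is loaded, the `metadata` field can be used to see how many records come from each source. The sketch below is one possible way to do this; `count_sources` is a hypothetical helper, and the stand-in records (including the assumption that every `metadata` value follows the `source: <url>` pattern) exist only so the example runs offline.

```python
from collections import Counter
from itertools import islice


def count_sources(records, limit=None):
    """Count records per `metadata` value, reading at most `limit` records."""
    return Counter(record["metadata"] for record in islice(records, limit))


# Stand-in values; only the first one appears verbatim in this card.
AIRA = "source: https://huggingface.co/datasets/nicholasKluge/instruct-aira-dataset-v3"
ORCA = "source: https://huggingface.co/datasets/rhaymison/orca-math-portuguese-64k"  # assumed format

sample = [
    {"conversations": [], "metadata": AIRA},
    {"conversations": [], "metadata": ORCA},
    {"conversations": [], "metadata": AIRA},
]

for source, n in count_sources(sample).most_common():
    print(n, source)
```

Because the helper only iterates, it should also accept the streaming `dataset` from the loading snippet above; passing a `limit` avoids reading the whole stream.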
## Dataset Creation
### Curation Rationale
This dataset was developed as part of the study "[Tucano: Advancing Neural Text Generation for Portuguese](https://arxiv.org/abs/2411.07854)".
### Source Data
#### Initial Data Collection and Normalization
This dataset is simply the concatenation of three other datasets:
- [cnmoro/GPT4-500k-Augmented-PTBR-Clean](https://huggingface.co/datasets/cnmoro/GPT4-500k-Augmented-PTBR-Clean)
- [rhaymison/orca-math-portuguese-64k](https://huggingface.co/datasets/rhaymison/orca-math-portuguese-64k)
- [nicholasKluge/instruct-aira-dataset-v3](https://huggingface.co/datasets/nicholasKluge/instruct-aira-dataset-v3)
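
The concatenation described above can be illustrated with a small, self-contained sketch. This is not the authors' actual pipeline: it assumes each source has already been converted to a list of `conversations` records, and the helper name `tag_with_source` and the toy records are hypothetical. The `source: <url>` string mirrors the `metadata` example shown in the Data Fields section.

```python
from itertools import chain


def tag_with_source(records, url):
    """Yield copies of `records` with a `metadata` string naming their origin."""
    for record in records:
        yield {
            "conversations": record["conversations"],
            "metadata": f"source: {url}",
        }


# Toy stand-ins for two already-converted sources (hypothetical content).
part_a = [{"conversations": [{"role": "user", "content": "Olá!"}]}]
part_b = [{"conversations": [{"role": "user", "content": "Quanto é 2 + 2?"}]}]

combined = list(
    chain(
        tag_with_source(part_a, "https://huggingface.co/datasets/nicholasKluge/instruct-aira-dataset-v3"),
        tag_with_source(part_b, "https://huggingface.co/datasets/rhaymison/orca-math-portuguese-64k"),
    )
)

print(len(combined), "records")
```

Tagging each record before concatenating keeps its provenance recoverable after the sources are merged.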
#### Who are the source language producers?
All text samples are either originally in Portuguese or translated into Portuguese from other languages (slight contamination from other languages should also be expected).
### Annotations
#### Annotation process
All conversations were generated by querying already-tuned models.
#### Who are the annotators?
[Nicholas Kluge Corrêa](mailto:kluge@uni-bonn.de).
### Personal and Sensitive Information
No personal or sensitive information is part of this dataset.
## Considerations for Using the Data
### Social Impact of Dataset
No considerations.
### Discussion of Biases
No considerations.
### Other Known Limitations
No considerations.
## Additional Information
### Dataset Curators
[Nicholas Kluge Corrêa](mailto:kluge@uni-bonn.de).
### Licensing Information
The following datasets and respective licenses apply:
- [GPT4-500k-Augmented-PTBR-Clean](https://huggingface.co/datasets/cnmoro/GPT4-500k-Augmented-PTBR-Clean) (License: [MIT License](https://mit-license.org/))
- [Orca-math-portuguese-64k](https://huggingface.co/datasets/rhaymison/orca-math-portuguese-64k) (License: [Apache License, version 2.0](https://www.apache.org/licenses/LICENSE-2.0.html))
- [Instruct-aira-dataset-v3](https://huggingface.co/datasets/nicholasKluge/instruct-aira-dataset-v3) (License: [Apache License, version 2.0](https://www.apache.org/licenses/LICENSE-2.0.html))
Provider:
TucanoBR



