TucanoBR/Tucano-SFT

Name: TucanoBR/Tucano-SFT
Creator: TucanoBR
Published: 2024-11-13 11:17:43
License: No description available

Hugging Face · Updated 2024-11-13 · Indexed 2025-04-12

Download link:

https://hf-mirror.com/datasets/TucanoBR/Tucano-SFT


Resource description:

---
dataset_info:
  features:
  - name: conversations
    list:
    - name: content
      dtype: string
    - name: role
      dtype: string
  - name: metadata
    dtype: string
  splits:
  - name: train
    num_bytes: 1320671821.7453537
    num_examples: 679609
  download_size: 681136105
  dataset_size: 1320671821.7453537
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
license: other
task_categories:
- text-generation
language:
- pt
tags:
- portuguese
- language-modeling
- chat
- conversation
- instruction
pretty_name: Tucano-SFT
size_categories:
- 100K<n<1M
---

# Tucano-SFT

<img src="./logo-gigaverbo.png" height="200">

## Table of Contents

- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)

## Dataset Description

- **Homepage:** https://huggingface.co/datasets/TucanoBR/Tucano-SFT
- **Repository:** https://huggingface.co/datasets/TucanoBR/Tucano-SFT
- **Paper:** [Tucano: Advancing Neural Text Generation for Portuguese](https://arxiv.org/abs/2411.07854)
- **Point of Contact:** [Nk-correa](mailto:kluge@uni-bonn.de)

### Dataset Summary

This dataset was used to train the "Instruct" versions of the Tucano series. It is a concatenation of three datasets:

- [cnmoro/GPT4-500k-Augmented-PTBR-Clean](https://huggingface.co/datasets/cnmoro/GPT4-500k-Augmented-PTBR-Clean)
- [rhaymison/orca-math-portuguese-64k](https://huggingface.co/datasets/rhaymison/orca-math-portuguese-64k)
- [nicholasKluge/instruct-aira-dataset-v3](https://huggingface.co/datasets/nicholasKluge/instruct-aira-dataset-v3)

### Supported Tasks and Leaderboards

This dataset can be used for tasks involving the alignment of language models.

### Languages

Portuguese

## Dataset Structure

### Data Instances

The dataset consists of the following features:

- **conversations:** a list of dictionaries following a [chat format](https://github.com/huggingface/blog/blob/main/chat-templates.md).
- **metadata:** a string naming the source dataset from which the conversation originated.

### Data Fields

```python
{
  "conversations": [
    {"role": "user", "content": "What is a language model?"},
    {"role": "assistant", "content": "A language model is a probability distribution over a vocabulary."},
  ],
  "metadata": "source: https://huggingface.co/datasets/nicholasKluge/instruct-aira-dataset-v3",
}
```

### Data Splits

The only available split is `train`.

```python
from datasets import load_dataset

dataset = load_dataset("TucanoBR/Tucano-SFT", split="train")

# If you don't want to download the entire dataset, set streaming to `True`
dataset = load_dataset("TucanoBR/Tucano-SFT", split="train", streaming=True)
```

## Dataset Creation

### Curation Rationale

This dataset was developed as part of the study "[Tucano: Advancing Neural Text Generation for Portuguese](https://arxiv.org/abs/2411.07854)".
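The record layout described under Data Structure can be checked with a small helper before records are passed to a chat template. The following is a minimal sketch, not part of the Tucano code base: `validate_record` and `ALLOWED_ROLES` are hypothetical names, and the set of accepted roles is an assumption (the card's example only shows `user` and `assistant`).

```python
from typing import Any

# Assumption: roles follow the common chat format. The card's example
# only shows "user" and "assistant"; "system" is included as a guess.
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_record(record: dict[str, Any]) -> list[str]:
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    conversations = record.get("conversations")
    if not isinstance(conversations, list) or not conversations:
        problems.append("conversations must be a non-empty list")
        conversations = []
    for i, turn in enumerate(conversations):
        if not isinstance(turn, dict):
            problems.append(f"turn {i} is not a dictionary")
            continue
        if turn.get("role") not in ALLOWED_ROLES:
            problems.append(f"turn {i} has unexpected role {turn.get('role')!r}")
        if not isinstance(turn.get("content"), str):
            problems.append(f"turn {i} content is not a string")
    if not isinstance(record.get("metadata"), str):
        problems.append("metadata must be a string")
    return problems


# The example record from the Data Fields section.
example = {
    "conversations": [
        {"role": "user", "content": "What is a language model?"},
        {"role": "assistant", "content": "A language model is a probability distribution over a vocabulary."},
    ],
    "metadata": "source: https://huggingface.co/datasets/nicholasKluge/instruct-aira-dataset-v3",
}

print(validate_record(example))  # prints []
```

Returning a list of problems rather than raising on the first one makes it easier to report everything wrong with a record in a single pass.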
### Source Data

#### Initial Data Collection and Normalization

This dataset is simply the concatenation of three other datasets:

- [cnmoro/GPT4-500k-Augmented-PTBR-Clean](https://huggingface.co/datasets/cnmoro/GPT4-500k-Augmented-PTBR-Clean)
- [rhaymison/orca-math-portuguese-64k](https://huggingface.co/datasets/rhaymison/orca-math-portuguese-64k)
- [nicholasKluge/instruct-aira-dataset-v3](https://huggingface.co/datasets/nicholasKluge/instruct-aira-dataset-v3)

#### Who are the source language producers?

All text samples were either written natively in Portuguese or translated into Portuguese from other languages. Slight contamination from other languages should be expected.

### Annotations

#### Annotation process

All conversations were generated by querying already-tuned models.

#### Who are the annotators?

[Nicholas Kluge Corrêa](mailto:kluge@uni-bonn.de).

### Personal and Sensitive Information

No personal or sensitive information is part of this dataset.

## Considerations for Using the Data

### Social Impact of Dataset

No considerations.

### Discussion of Biases

No considerations.

### Other Known Limitations

No considerations.

## Additional Information

### Dataset Curators

[Nicholas Kluge Corrêa](mailto:kluge@uni-bonn.de).

### Licensing Information

The following datasets and respective licenses apply:

- [GPT4-500k-Augmented-PTBR-Clean](https://huggingface.co/datasets/cnmoro/GPT4-500k-Augmented-PTBR-Clean) (License: [MIT License](https://mit-license.org/))
- [Orca-math-portuguese-64k](https://huggingface.co/datasets/rhaymison/orca-math-portuguese-64k) (License: [Apache License, version 2.0](https://www.apache.org/licenses/LICENSE-2.0.html))
- [Instruct-aira-dataset-v3](https://huggingface.co/datasets/nicholasKluge/instruct-aira-dataset-v3) (License: [Apache License, version 2.0](https://www.apache.org/licenses/LICENSE-2.0.html))
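Because each source dataset carries its own license, it can be useful to look up the license that applies to an individual record from its `metadata` field. The sketch below is illustrative only: `SOURCE_LICENSES`, `source_repo`, and `license_for` are hypothetical names, and it assumes every record's `metadata` follows the `source: <URL>` layout shown in the card's single example, which may not hold for all records.

```python
# Licenses as listed in the Licensing Information section, keyed by repository ID.
SOURCE_LICENSES = {
    "cnmoro/GPT4-500k-Augmented-PTBR-Clean": "MIT",
    "rhaymison/orca-math-portuguese-64k": "Apache-2.0",
    "nicholasKluge/instruct-aira-dataset-v3": "Apache-2.0",
}

HUB_PREFIX = "https://huggingface.co/datasets/"


def source_repo(metadata: str) -> str:
    """Extract the repository ID from a metadata string.

    Assumption: the string looks like "source: <URL>", as in the card's example.
    """
    url = metadata.removeprefix("source:").strip()
    return url.removeprefix(HUB_PREFIX).rstrip("/")


def license_for(metadata: str) -> str:
    """Return the license of the record's source, or "unknown" if unrecognized."""
    return SOURCE_LICENSES.get(source_repo(metadata), "unknown")


metadata = "source: https://huggingface.co/datasets/nicholasKluge/instruct-aira-dataset-v3"
print(license_for(metadata))  # prints Apache-2.0
```

The same function could be applied to the `metadata` field of each record while iterating over the streamed dataset. Records that fall through to `"unknown"` would indicate that the assumed metadata layout does not hold everywhere.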

Provided by:

TucanoBR
