catallama/Catalan-Instruct
收藏Hugging Face2024-05-26 更新2024-06-12 收录
下载链接:
https://hf-mirror.com/datasets/catallama/Catalan-Instruct
下载链接
链接失效反馈官方服务:
资源简介:
---
dataset_info:
features:
- name: messages
list:
- name: content
dtype: string
- name: role
dtype: string
- name: category
dtype: string
- name: language
dtype: string
splits:
- name: train
num_bytes: 450534851
num_examples: 311849
- name: test
num_bytes: 23785759
num_examples: 16414
download_size: 241922171
dataset_size: 474320610
configs:
- config_name: default
data_files:
- split: train
path: data/train-*
- split: test
path: data/test-*
license: cc-by-sa-4.0
task_categories:
- text-generation
language:
- ca
- en
size_categories:
- 100K<n<1M
pretty_name: Catalan Instruct
---
### Dataset Summary
The Catalan Instruct Dataset contains **328k sample instructions** totalling **114M tokens** after tokenizing it with the [Llama-3 Tokenizer](https://huggingface.co/meta-llama/Meta-Llama-3-8B).
The dataset is a collection of **samples from existing datasets** and **new data generated synthetically** with ChatGPT 3.5
Some sampled datasets were used as is, and some were **augmented** with ChatGPT 3.5
It is licensed under a [Creative Commons Attribution 4.0 International](https://creativecommons.org/licenses/by/4.0/) license since many instructions are an augmentation of datasets with this license.
### Tasks
- Information extraction (suitable for RAG)
- Named Entity Recognition (NER)
- Translation from English to Catalan and Catalan to English
- Summarization - both short form and long form
- Chat
- Sentiment analysis
- Open question answering
#### Data Sources - Notable Mentions
- [projecte-aina/InstruCAT](https://huggingface.co/datasets/projecte-aina/InstruCAT)
- This dataset was split by category, and some of the categories were augmented with ChatGPT 3.5, others were kept as is and some were discarded
- [projecte-aina/RAG_Multilingual](https://huggingface.co/datasets/projecte-aina/RAG_Multilingual)
- This entire dataset was augmented with ChatGPT 3.5 to make the answers more verbose and `chat-like`
- Only examples in Catalan were selected
- Other notable datasets from [projecte-aina](https://huggingface.co/projecte-aina) are `sentiment analysis`, `summarization`, `NER`
- **Wizard Dataset** is where the English instructions were sampled from
### Languages
Catalan (`ca-ES`) - 70%
English (`en-US`) - 30%
### Data Splits
The dataset contains two splits: `train` and `test`.
### Contributions
Thanks to [projecte-aina](https://huggingface.co/projecte-aina) for providing parts of the original dataset. Please visit their page to see all their available datasets.
Thanks to the Wizard team for providing the English samples.
提供机构:
catallama
原始信息汇总
数据集概述
名称: Catalan Instruct
样本数量: 总计约328k样本
令牌数量: 总计约114M令牌
语言: 主要为Catalan (ca-ES) - 70%,其次为English (en-US) - 30%
数据来源: 包含现有数据集样本及使用ChatGPT 3.5生成的合成新数据
数据增强: 部分数据集样本使用ChatGPT 3.5进行增强
许可证: Creative Commons Attribution 4.0 International
任务类别: 信息提取、命名实体识别、翻译、摘要、聊天、情感分析、开放式问答
数据分割: 分为train和test两个部分
数据集特征
- messages:
- content: 字符串类型
- role: 字符串类型
- category: 字符串类型
- language: 字符串类型
数据集大小
- 下载大小: 241922171字节
- 数据集大小: 474320610字节
- 训练集:
- 大小: 450534851字节
- 样本数: 311849
- 测试集:
- 大小: 23785759字节
- 样本数: 16414
数据文件配置
- 默认配置:
- 训练数据路径: data/train-*
- 测试数据路径: data/test-*
贡献者
- projecte-aina
- Wizard团队



