catallama/Catalan-Instruct

Name: catallama/Catalan-Instruct
Creator: catallama
Published: 2024-05-26 09:52:35
License: 暂无描述

Hugging Face2024-05-26 更新2024-06-12 收录

下载链接：

https://hf-mirror.com/datasets/catallama/Catalan-Instruct

下载链接

链接失效反馈

官方服务：

资源简介：

--- dataset_info: features: - name: messages list: - name: content dtype: string - name: role dtype: string - name: category dtype: string - name: language dtype: string splits: - name: train num_bytes: 450534851 num_examples: 311849 - name: test num_bytes: 23785759 num_examples: 16414 download_size: 241922171 dataset_size: 474320610 configs: - config_name: default data_files: - split: train path: data/train-* - split: test path: data/test-* license: cc-by-sa-4.0 task_categories: - text-generation language: - ca - en size_categories: - 100K<n<1M pretty_name: Catalan Instruct --- ### Dataset Summary The Catalan Instruct Dataset contains **328k sample instructions** totalling **114M tokens** after tokenizing it with the [Llama-3 Tokenizer](https://huggingface.co/meta-llama/Meta-Llama-3-8B). The dataset is a collection of **samples from existing datasets** and **new data generated synthetically** with ChatGPT 3.5 Some sampled datasets were used as is, and some were **augmented** with ChatGPT 3.5 It is licensed under a [Creative Commons Attribution 4.0 International](https://creativecommons.org/licenses/by/4.0/) license since many instructions are an augmentation of datasets with this license. ### Tasks - Information extraction (suitable for RAG) - Named Entity Recognition (NER) - Translation from English to Catalan and Catalan to English - Summarization - both short form and long form - Chat - Sentiment analysis - Open question answering #### Data Sources - Notable Mentions - [projecte-aina/InstruCAT](https://huggingface.co/datasets/projecte-aina/InstruCAT) - This dataset was split by category, and some of the categories were augmented with ChatGPT 3.5, others were kept as is and some were discarded - [projecte-aina/RAG_Multilingual](https://huggingface.co/datasets/projecte-aina/RAG_Multilingual) - This entire dataset was augmented with ChatGPT 3.5 to make the answers more verbose and `chat-like` - Only examples in Catalan were selected - Other notable datasets from [projecte-aina](https://huggingface.co/projecte-aina) are `sentiment analysis`, `summarization`, `NER` - **Wizard Dataset** is where the English instructions were sampled from ### Languages Catalan (`ca-ES`) - 70% English (`en-US`) - 30% ### Data Splits The dataset contains two splits: `train` and `test`. ### Contributions Thanks to [projecte-aina](https://huggingface.co/projecte-aina) for providing parts of the original dataset. Please visit their page to see all their available datasets. Thanks to the Wizard team for providing the English samples.

提供机构：

catallama

原始信息汇总

数据集概述

名称: Catalan Instruct
样本数量: 总计约328k样本
令牌数量: 总计约114M令牌
语言: 主要为Catalan (ca-ES) - 70%，其次为English (en-US) - 30%
数据来源: 包含现有数据集样本及使用ChatGPT 3.5生成的合成新数据
数据增强: 部分数据集样本使用ChatGPT 3.5进行增强
许可证: Creative Commons Attribution 4.0 International
任务类别: 信息提取、命名实体识别、翻译、摘要、聊天、情感分析、开放式问答
数据分割: 分为train和test两个部分

数据集特征

messages:
- content: 字符串类型
- role: 字符串类型
category: 字符串类型
language: 字符串类型

数据集大小

下载大小: 241922171字节
数据集大小: 474320610字节
训练集:
- 大小: 450534851字节
- 样本数: 311849
测试集:
- 大小: 23785759字节
- 样本数: 16414

数据文件配置

默认配置:
- 训练数据路径: data/train-*
- 测试数据路径: data/test-*

贡献者

projecte-aina
Wizard团队

5,000+

优质数据集

54 个

任务类型

进入经典数据集