five

catallama/Catalan-Instruct

收藏
Hugging Face2024-05-26 更新2024-06-12 收录
下载链接:
https://hf-mirror.com/datasets/catallama/Catalan-Instruct
下载链接
链接失效反馈
官方服务:
资源简介:
--- dataset_info: features: - name: messages list: - name: content dtype: string - name: role dtype: string - name: category dtype: string - name: language dtype: string splits: - name: train num_bytes: 450534851 num_examples: 311849 - name: test num_bytes: 23785759 num_examples: 16414 download_size: 241922171 dataset_size: 474320610 configs: - config_name: default data_files: - split: train path: data/train-* - split: test path: data/test-* license: cc-by-sa-4.0 task_categories: - text-generation language: - ca - en size_categories: - 100K<n<1M pretty_name: Catalan Instruct --- ### Dataset Summary The Catalan Instruct Dataset contains **328k sample instructions** totalling **114M tokens** after tokenizing it with the [Llama-3 Tokenizer](https://huggingface.co/meta-llama/Meta-Llama-3-8B). The dataset is a collection of **samples from existing datasets** and **new data generated synthetically** with ChatGPT 3.5 Some sampled datasets were used as is, and some were **augmented** with ChatGPT 3.5 It is licensed under a [Creative Commons Attribution 4.0 International](https://creativecommons.org/licenses/by/4.0/) license since many instructions are an augmentation of datasets with this license. ### Tasks - Information extraction (suitable for RAG) - Named Entity Recognition (NER) - Translation from English to Catalan and Catalan to English - Summarization - both short form and long form - Chat - Sentiment analysis - Open question answering #### Data Sources - Notable Mentions - [projecte-aina/InstruCAT](https://huggingface.co/datasets/projecte-aina/InstruCAT) - This dataset was split by category, and some of the categories were augmented with ChatGPT 3.5, others were kept as is and some were discarded - [projecte-aina/RAG_Multilingual](https://huggingface.co/datasets/projecte-aina/RAG_Multilingual) - This entire dataset was augmented with ChatGPT 3.5 to make the answers more verbose and `chat-like` - Only examples in Catalan were selected - Other notable datasets from [projecte-aina](https://huggingface.co/projecte-aina) are `sentiment analysis`, `summarization`, `NER` - **Wizard Dataset** is where the English instructions were sampled from ### Languages Catalan (`ca-ES`) - 70% English (`en-US`) - 30% ### Data Splits The dataset contains two splits: `train` and `test`. ### Contributions Thanks to [projecte-aina](https://huggingface.co/projecte-aina) for providing parts of the original dataset. Please visit their page to see all their available datasets. Thanks to the Wizard team for providing the English samples.
提供机构:
catallama
原始信息汇总

数据集概述

名称: Catalan Instruct
样本数量: 总计约328k样本
令牌数量: 总计约114M令牌
语言: 主要为Catalan (ca-ES) - 70%,其次为English (en-US) - 30%
数据来源: 包含现有数据集样本及使用ChatGPT 3.5生成的合成新数据
数据增强: 部分数据集样本使用ChatGPT 3.5进行增强
许可证: Creative Commons Attribution 4.0 International
任务类别: 信息提取、命名实体识别、翻译、摘要、聊天、情感分析、开放式问答
数据分割: 分为traintest两个部分

数据集特征

  • messages:
    • content: 字符串类型
    • role: 字符串类型
  • category: 字符串类型
  • language: 字符串类型

数据集大小

  • 下载大小: 241922171字节
  • 数据集大小: 474320610字节
  • 训练集:
    • 大小: 450534851字节
    • 样本数: 311849
  • 测试集:
    • 大小: 23785759字节
    • 样本数: 16414

数据文件配置

  • 默认配置:
    • 训练数据路径: data/train-*
    • 测试数据路径: data/test-*

贡献者

5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作