rohansolo/BB-Ultrachat-IndicLingual6-12k

Name: rohansolo/BB-Ultrachat-IndicLingual6-12k
Creator: rohansolo
Published: 2023-12-28 22:46:58
License: 暂无描述

Hugging Face2023-12-28 更新2024-03-04 收录

下载链接：

https://hf-mirror.com/datasets/rohansolo/BB-Ultrachat-IndicLingual6-12k

下载链接

链接失效反馈

官方服务：

资源简介：

--- dataset_info: features: - name: prompt dtype: string - name: prompt_id dtype: string - name: messages list: - name: content dtype: string - name: role dtype: string - name: lang dtype: string splits: - name: train num_bytes: 174391775 num_examples: 12000 download_size: 62179568 dataset_size: 174391775 configs: - config_name: default data_files: - split: train path: data/train-* license: mit task_categories: - question-answering - text-generation size_categories: - 10K<n<100K language: - hi - ml - ta - kn - mr - en --- # BB-Ultrachat-IndicLingual6-12k This dataset is created by [bhaiyabot ai](bhaiyabot.com) to enrich language model training data, especially in the context of Indic languages. code for creation is also open source at https://github.com/ro-hansolo/IndicTrans2HuggingFaceDatasets ## Overview `BB-Ultrachat-IndicLingual6-12k` is a curated dataset comprising 12,000 multi-turn conversations, which are a subset of the larger [HuggingFaceH4/ultrachat_200k](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) dataset. These conversations have been evenly distributed across six prominent Indic languages, namely English, Hindi, Tamil, Malayalam, Marathi, and Kannada. ## Data Creation The Indic language data in this dataset was generated by translating the chat data from the `HuggingFaceH4/ultrachat_200k` dataset using the advanced translation model IndicTrans2 by AI4Bharat ## Dataset Structure The dataset is structured as follows: - Total Conversations: 12,000 - Languages Covered: 6 (English, Hindi, Tamil, Malayalam, Marathi, Kannada) - Each language: 2,000 conversations ## Objective Goal is to create a dataset with unique conversations, to ensure that model during training is generalising accross lanuages, and not learning tasks such as translation to aid in multi-lingual abiltiies, but to natively solve problems in any language, and hence be lanuage agnostic, and able to generalise better. Hence the focus on 12,000 unique pairs in different lanuages, to ensure no duplication in the dataset, even across languages. Dataset was consequences of various tests and experiments to optimise for peak GPU performance and Efficient Memory usage during translations. ## Usage This dataset is intended for use in fine-tuning models for various experimental purposes ## Acknowledgements Special thanks to the Hugging Face team for providing the original `ultrachat_200k` dataset, and to AI4Bharat of `IndicTrans2` for their state-of-the-art translation model. ``` @article{gala2023indictrans2, title = {IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages}, author = {Jay Gala and Pranjal A. Chitale and Raghavan AK and Varun Gumma and Sumanth Doddapaneni and Aswanth Kumar and Janki Nawale and Anupama Sujatha and Ratish Puduppully and Vivek Raghavan and Pratyush Kumar and Mitesh M. Khapra and Raj Dabre and Anoop Kunchukuttan}, year = {2023}, journal = {Transactions on Machine Learning Research}, url = {https://openreview.net/forum?id=vfT4YuzAYA} } ``` ``` @misc{ding2023enhancing, title={Enhancing Chat Language Models by Scaling High-quality Instructional Conversations}, author={Ning Ding and Yulin Chen and Bokai Xu and Yujia Qin and Zhi Zheng and Shengding Hu and Zhiyuan Liu and Maosong Sun and Bowen Zhou}, year={2023}, eprint={2305.14233}, archivePrefix={arXiv}, primaryClass={cs.CL} } ```

提供机构：

rohansolo

原始信息汇总

BB-Ultrachat-IndicLingual6-12k 数据集概述

数据集结构

特征

prompt: 字符串类型
prompt_id: 字符串类型
messages: 列表类型，包含以下子特征：
- content: 字符串类型
- role: 字符串类型
lang: 字符串类型

数据分割

train: 包含12,000个样本，总字节数为174,391,775

数据大小

下载大小: 62,179,568字节
数据集大小: 174,391,775字节

配置

default: 包含训练数据文件路径为data/train-*

许可证

MIT

任务类别

问答
文本生成

数据规模

10K<n<100K

语言

hi (印地语)
ml (马拉雅拉姆语)
ta (泰米尔语)
kn (卡纳达语)
mr (马拉地语)
en (英语)

数据集详情

概述

BB-Ultrachat-IndicLingual6-12k 是一个包含12,000个多轮对话的数据集，涵盖六种主要印度语言：英语、印地语、泰米尔语、马拉雅拉姆语、马拉地语和卡纳达语。

数据创建

该数据集中的印度语言数据是通过使用AI4Bharat的先进翻译模型IndicTrans2，从HuggingFaceH4/ultrachat_200k数据集中翻译得到的。

数据结构

总对话数: 12,000
涵盖语言: 6种（英语、印地语、泰米尔语、马拉雅拉姆语、马拉地语、卡纳达语）
每种语言对话数: 2,000

目标

旨在创建一个包含独特对话的数据集，确保模型在训练时能够跨语言泛化，而不是学习翻译任务来辅助多语言能力，而是能够本机解决任何语言的问题，从而更好地泛化。

使用

该数据集旨在用于微调模型，进行各种实验。

5,000+

优质数据集

54 个

任务类型

进入经典数据集