rohansolo/BB-Ultrachat-IndicLingual6-12k
收藏Hugging Face2023-12-28 更新2024-03-04 收录
下载链接:
https://hf-mirror.com/datasets/rohansolo/BB-Ultrachat-IndicLingual6-12k
下载链接
链接失效反馈官方服务:
资源简介:
---
dataset_info:
features:
- name: prompt
dtype: string
- name: prompt_id
dtype: string
- name: messages
list:
- name: content
dtype: string
- name: role
dtype: string
- name: lang
dtype: string
splits:
- name: train
num_bytes: 174391775
num_examples: 12000
download_size: 62179568
dataset_size: 174391775
configs:
- config_name: default
data_files:
- split: train
path: data/train-*
license: mit
task_categories:
- question-answering
- text-generation
size_categories:
- 10K<n<100K
language:
- hi
- ml
- ta
- kn
- mr
- en
---
# BB-Ultrachat-IndicLingual6-12k
This dataset is created by [bhaiyabot ai](bhaiyabot.com) to enrich language model training data, especially in the context of Indic languages. code for creation is also open source at https://github.com/ro-hansolo/IndicTrans2HuggingFaceDatasets
## Overview
`BB-Ultrachat-IndicLingual6-12k` is a curated dataset comprising 12,000 multi-turn conversations, which are a subset of the larger [HuggingFaceH4/ultrachat_200k](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) dataset. These conversations have been evenly distributed across six prominent Indic languages, namely English, Hindi, Tamil, Malayalam, Marathi, and Kannada.
## Data Creation
The Indic language data in this dataset was generated by translating the chat data from the `HuggingFaceH4/ultrachat_200k` dataset using the advanced translation model IndicTrans2 by AI4Bharat
## Dataset Structure
The dataset is structured as follows:
- Total Conversations: 12,000
- Languages Covered: 6 (English, Hindi, Tamil, Malayalam, Marathi, Kannada)
- Each language: 2,000 conversations
## Objective
Goal is to create a dataset with unique conversations, to ensure that model during training is generalising accross lanuages, and not learning tasks such as translation to aid in multi-lingual abiltiies, but to natively solve problems in any language, and hence be lanuage agnostic, and able to generalise better. Hence the focus on 12,000 unique pairs in different lanuages, to ensure no duplication in the dataset, even across languages.
Dataset was consequences of various tests and experiments to optimise for peak GPU performance and Efficient Memory usage during translations.
## Usage
This dataset is intended for use in fine-tuning models for various experimental purposes
## Acknowledgements
Special thanks to the Hugging Face team for providing the original `ultrachat_200k` dataset, and to AI4Bharat of `IndicTrans2` for their state-of-the-art translation model.
```
@article{gala2023indictrans2,
title = {IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages},
author = {Jay Gala and Pranjal A. Chitale and Raghavan AK and Varun Gumma and Sumanth Doddapaneni and Aswanth Kumar and Janki Nawale and Anupama Sujatha and Ratish Puduppully and Vivek Raghavan and Pratyush Kumar and Mitesh M. Khapra and Raj Dabre and Anoop Kunchukuttan},
year = {2023},
journal = {Transactions on Machine Learning Research},
url = {https://openreview.net/forum?id=vfT4YuzAYA}
}
```
```
@misc{ding2023enhancing,
title={Enhancing Chat Language Models by Scaling High-quality Instructional Conversations},
author={Ning Ding and Yulin Chen and Bokai Xu and Yujia Qin and Zhi Zheng and Shengding Hu and Zhiyuan Liu and Maosong Sun and Bowen Zhou},
year={2023},
eprint={2305.14233},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
提供机构:
rohansolo
原始信息汇总
BB-Ultrachat-IndicLingual6-12k 数据集概述
数据集结构
特征
- prompt: 字符串类型
- prompt_id: 字符串类型
- messages: 列表类型,包含以下子特征:
- content: 字符串类型
- role: 字符串类型
- lang: 字符串类型
数据分割
- train: 包含12,000个样本,总字节数为174,391,775
数据大小
- 下载大小: 62,179,568字节
- 数据集大小: 174,391,775字节
配置
- default: 包含训练数据文件路径为
data/train-*
许可证
- MIT
任务类别
- 问答
- 文本生成
数据规模
- 10K<n<100K
语言
- hi (印地语)
- ml (马拉雅拉姆语)
- ta (泰米尔语)
- kn (卡纳达语)
- mr (马拉地语)
- en (英语)
数据集详情
概述
- BB-Ultrachat-IndicLingual6-12k 是一个包含12,000个多轮对话的数据集,涵盖六种主要印度语言:英语、印地语、泰米尔语、马拉雅拉姆语、马拉地语和卡纳达语。
数据创建
- 该数据集中的印度语言数据是通过使用AI4Bharat的先进翻译模型IndicTrans2,从
HuggingFaceH4/ultrachat_200k数据集中翻译得到的。
数据结构
- 总对话数: 12,000
- 涵盖语言: 6种(英语、印地语、泰米尔语、马拉雅拉姆语、马拉地语、卡纳达语)
- 每种语言对话数: 2,000
目标
- 旨在创建一个包含独特对话的数据集,确保模型在训练时能够跨语言泛化,而不是学习翻译任务来辅助多语言能力,而是能够本机解决任何语言的问题,从而更好地泛化。
使用
- 该数据集旨在用于微调模型,进行各种实验。



