rohansolo/BB_HindiHinglishV2
收藏Hugging Face2023-12-31 更新2024-03-04 收录
下载链接:
https://hf-mirror.com/datasets/rohansolo/BB_HindiHinglishV2
下载链接
链接失效反馈官方服务:
资源简介:
---
dataset_info:
features:
- name: id
dtype: string
- name: messages
list:
- name: content
dtype: string
- name: role
dtype: string
- name: category
dtype: string
- name: __index_level_0__
dtype: int64
splits:
- name: train_sft
num_bytes: 533044539
num_examples: 199137
- name: test_sft
num_bytes: 132486609
num_examples: 49785
download_size: 263949334
dataset_size: 665531148
configs:
- config_name: default
data_files:
- split: train_sft
path: data/train_sft-*
- split: test_sft
path: data/test_sft-*
license: cc-by-nc-4.0
language:
- hi
- en
---
Overview
This dataset is a comprehensive collection of popular Hindi instruction-type datasets. It has been meticulously curated and merged into a unified format, making it ideal for use with Hugging Face's alignment notebook. The primary objective of creating this dataset is to offer a single, standardized resource for training models in understanding and generating Hindi and Hinglish (Hindi-English) conversations.
Data Sources
The dataset is an amalgamation of several individual datasets, each sourced from the Hugging Face datasets library. These include:
FreedomIntelligence/evol-instruct-hindi (Train Split)
NebulaByte/alpaca-gpt4-hindi-hinglish (Train Split)
FreedomIntelligence/evol-instruct-hindi (Train Split, used twice in the script)
smangrul/hindi_instruct_v1 (Train and Test Splits)
SherryT997/HelpSteer-hindi (Train Split)
Data Processing
The datasets were processed using custom Python scripts. The process involved:
Loading each dataset from Hugging Face.
Applying specific conversion functions (convert_dataset1 and convert_dataset2) to standardize the format of the datasets. These functions were designed to handle different data formats and unify them under a common structure.
Merging the converted datasets into a single Pandas DataFrame.
Splitting the merged dataset into training and testing sets using a 80/20 split.
Converting these splits back into Hugging Face Dataset format for ease of use in training and evaluation.
Dataset Structure
The final dataset is structured as follows:
Each entry consists of a unique id and a series of messages.
Each message contains content and a role (either 'user' or 'assistant') indicating the speaker.
Purpose
The dataset is intended for research and development in natural language processing, specifically for:
Training models on Hindi and Hinglish conversation understanding.
Enhancing conversational AI capabilities in Hindi and mixed-language contexts.
Usage
This dataset is particularly suited for use with Hugging Face's alignment notebook. It can be utilized for training language models that cater to Hindi-speaking users, offering a rich source of conversational data in both Hindi and Hinglish.
提供机构:
rohansolo
原始信息汇总
数据集概述
数据集信息
- 特征:
id: 数据类型为字符串。messages: 列表类型,包含以下子特征:content: 数据类型为字符串。role: 数据类型为字符串。
category: 数据类型为字符串。__index_level_0__: 数据类型为int64。
- 分割:
train_sft: 字节数为533044539,样本数为199137。test_sft: 字节数为132486609,样本数为49785。
- 下载大小: 263949334字节。
- 数据集大小: 665531148字节。
- 配置:
default: 包含训练和测试数据文件路径。
- 许可证: cc-by-nc-4.0。
- 语言: 包含印地语和英语。
数据来源
- 数据集由多个来自Hugging Face数据集库的独立数据集合并而成,包括:
FreedomIntelligence/evol-instruct-hindi(训练分割)NebulaByte/alpaca-gpt4-hindi-hinglish(训练分割)smangrul/hindi_instruct_v1(训练和测试分割)SherryT997/HelpSteer-hindi(训练分割)
数据处理
- 使用自定义Python脚本处理数据集,包括:
- 从Hugging Face加载每个数据集。
- 应用特定的转换函数(
convert_dataset1和convert_dataset2)以标准化数据格式。 - 合并转换后的数据集为一个Pandas DataFrame。
- 按80/20比例分割合并后的数据集为训练和测试集。
- 将分割后的数据集转换回Hugging Face数据集格式。
数据集结构
- 每个条目包含一个唯一的
id和一系列messages。 - 每个
message包含content和role(用户或助手)。
目的
- 用于自然语言处理研究和发展,特别是:
- 训练模型理解印地语和印英混合语言对话。
- 增强印地语和混合语言环境下的对话AI能力。
使用
- 适用于Hugging Face的对齐笔记本,可用于训练面向印地语用户的语言模型。



