rohansolo/BB_HindiHinglishV2

Name: rohansolo/BB_HindiHinglishV2
Creator: rohansolo
Published: 2023-12-31 09:04:46
License: 暂无描述

Hugging Face2023-12-31 更新2024-03-04 收录

下载链接：

https://hf-mirror.com/datasets/rohansolo/BB_HindiHinglishV2

下载链接

链接失效反馈

官方服务：

资源简介：

--- dataset_info: features: - name: id dtype: string - name: messages list: - name: content dtype: string - name: role dtype: string - name: category dtype: string - name: __index_level_0__ dtype: int64 splits: - name: train_sft num_bytes: 533044539 num_examples: 199137 - name: test_sft num_bytes: 132486609 num_examples: 49785 download_size: 263949334 dataset_size: 665531148 configs: - config_name: default data_files: - split: train_sft path: data/train_sft-* - split: test_sft path: data/test_sft-* license: cc-by-nc-4.0 language: - hi - en --- Overview This dataset is a comprehensive collection of popular Hindi instruction-type datasets. It has been meticulously curated and merged into a unified format, making it ideal for use with Hugging Face's alignment notebook. The primary objective of creating this dataset is to offer a single, standardized resource for training models in understanding and generating Hindi and Hinglish (Hindi-English) conversations. Data Sources The dataset is an amalgamation of several individual datasets, each sourced from the Hugging Face datasets library. These include: FreedomIntelligence/evol-instruct-hindi (Train Split) NebulaByte/alpaca-gpt4-hindi-hinglish (Train Split) FreedomIntelligence/evol-instruct-hindi (Train Split, used twice in the script) smangrul/hindi_instruct_v1 (Train and Test Splits) SherryT997/HelpSteer-hindi (Train Split) Data Processing The datasets were processed using custom Python scripts. The process involved: Loading each dataset from Hugging Face. Applying specific conversion functions (convert_dataset1 and convert_dataset2) to standardize the format of the datasets. These functions were designed to handle different data formats and unify them under a common structure. Merging the converted datasets into a single Pandas DataFrame. Splitting the merged dataset into training and testing sets using a 80/20 split. Converting these splits back into Hugging Face Dataset format for ease of use in training and evaluation. Dataset Structure The final dataset is structured as follows: Each entry consists of a unique id and a series of messages. Each message contains content and a role (either 'user' or 'assistant') indicating the speaker. Purpose The dataset is intended for research and development in natural language processing, specifically for: Training models on Hindi and Hinglish conversation understanding. Enhancing conversational AI capabilities in Hindi and mixed-language contexts. Usage This dataset is particularly suited for use with Hugging Face's alignment notebook. It can be utilized for training language models that cater to Hindi-speaking users, offering a rich source of conversational data in both Hindi and Hinglish.

提供机构：

rohansolo

原始信息汇总

数据集概述

数据集信息

特征:
- id: 数据类型为字符串。
- messages: 列表类型，包含以下子特征:
  - content: 数据类型为字符串。
  - role: 数据类型为字符串。
- category: 数据类型为字符串。
- __index_level_0__: 数据类型为int64。
分割:
- train_sft: 字节数为533044539，样本数为199137。
- test_sft: 字节数为132486609，样本数为49785。
下载大小: 263949334字节。
数据集大小: 665531148字节。
配置:
- default: 包含训练和测试数据文件路径。
许可证: cc-by-nc-4.0。
语言: 包含印地语和英语。

数据来源

数据集由多个来自Hugging Face数据集库的独立数据集合并而成，包括:
- FreedomIntelligence/evol-instruct-hindi (训练分割)
- NebulaByte/alpaca-gpt4-hindi-hinglish (训练分割)
- smangrul/hindi_instruct_v1 (训练和测试分割)
- SherryT997/HelpSteer-hindi (训练分割)

数据处理

使用自定义Python脚本处理数据集，包括:
- 从Hugging Face加载每个数据集。
- 应用特定的转换函数(convert_dataset1和convert_dataset2)以标准化数据格式。
- 合并转换后的数据集为一个Pandas DataFrame。
- 按80/20比例分割合并后的数据集为训练和测试集。
- 将分割后的数据集转换回Hugging Face数据集格式。

数据集结构

每个条目包含一个唯一的id和一系列messages。
每个message包含content和role(用户或助手)。

目的

用于自然语言处理研究和发展，特别是:
- 训练模型理解印地语和印英混合语言对话。
- 增强印地语和混合语言环境下的对话AI能力。

使用

适用于Hugging Face的对齐笔记本，可用于训练面向印地语用户的语言模型。

5,000+

优质数据集

54 个

任务类型

进入经典数据集