five

rohansolo/BB_HindiHinglishV2

收藏
Hugging Face2023-12-31 更新2024-03-04 收录
下载链接:
https://hf-mirror.com/datasets/rohansolo/BB_HindiHinglishV2
下载链接
链接失效反馈
官方服务:
资源简介:
--- dataset_info: features: - name: id dtype: string - name: messages list: - name: content dtype: string - name: role dtype: string - name: category dtype: string - name: __index_level_0__ dtype: int64 splits: - name: train_sft num_bytes: 533044539 num_examples: 199137 - name: test_sft num_bytes: 132486609 num_examples: 49785 download_size: 263949334 dataset_size: 665531148 configs: - config_name: default data_files: - split: train_sft path: data/train_sft-* - split: test_sft path: data/test_sft-* license: cc-by-nc-4.0 language: - hi - en --- Overview This dataset is a comprehensive collection of popular Hindi instruction-type datasets. It has been meticulously curated and merged into a unified format, making it ideal for use with Hugging Face's alignment notebook. The primary objective of creating this dataset is to offer a single, standardized resource for training models in understanding and generating Hindi and Hinglish (Hindi-English) conversations. Data Sources The dataset is an amalgamation of several individual datasets, each sourced from the Hugging Face datasets library. These include: FreedomIntelligence/evol-instruct-hindi (Train Split) NebulaByte/alpaca-gpt4-hindi-hinglish (Train Split) FreedomIntelligence/evol-instruct-hindi (Train Split, used twice in the script) smangrul/hindi_instruct_v1 (Train and Test Splits) SherryT997/HelpSteer-hindi (Train Split) Data Processing The datasets were processed using custom Python scripts. The process involved: Loading each dataset from Hugging Face. Applying specific conversion functions (convert_dataset1 and convert_dataset2) to standardize the format of the datasets. These functions were designed to handle different data formats and unify them under a common structure. Merging the converted datasets into a single Pandas DataFrame. Splitting the merged dataset into training and testing sets using a 80/20 split. Converting these splits back into Hugging Face Dataset format for ease of use in training and evaluation. Dataset Structure The final dataset is structured as follows: Each entry consists of a unique id and a series of messages. Each message contains content and a role (either 'user' or 'assistant') indicating the speaker. Purpose The dataset is intended for research and development in natural language processing, specifically for: Training models on Hindi and Hinglish conversation understanding. Enhancing conversational AI capabilities in Hindi and mixed-language contexts. Usage This dataset is particularly suited for use with Hugging Face's alignment notebook. It can be utilized for training language models that cater to Hindi-speaking users, offering a rich source of conversational data in both Hindi and Hinglish.
提供机构:
rohansolo
原始信息汇总

数据集概述

数据集信息

  • 特征:
    • id: 数据类型为字符串。
    • messages: 列表类型,包含以下子特征:
      • content: 数据类型为字符串。
      • role: 数据类型为字符串。
    • category: 数据类型为字符串。
    • __index_level_0__: 数据类型为int64。
  • 分割:
    • train_sft: 字节数为533044539,样本数为199137。
    • test_sft: 字节数为132486609,样本数为49785。
  • 下载大小: 263949334字节。
  • 数据集大小: 665531148字节。
  • 配置:
    • default: 包含训练和测试数据文件路径。
  • 许可证: cc-by-nc-4.0。
  • 语言: 包含印地语和英语。

数据来源

  • 数据集由多个来自Hugging Face数据集库的独立数据集合并而成,包括:
    • FreedomIntelligence/evol-instruct-hindi (训练分割)
    • NebulaByte/alpaca-gpt4-hindi-hinglish (训练分割)
    • smangrul/hindi_instruct_v1 (训练和测试分割)
    • SherryT997/HelpSteer-hindi (训练分割)

数据处理

  • 使用自定义Python脚本处理数据集,包括:
    • 从Hugging Face加载每个数据集。
    • 应用特定的转换函数(convert_dataset1convert_dataset2)以标准化数据格式。
    • 合并转换后的数据集为一个Pandas DataFrame。
    • 按80/20比例分割合并后的数据集为训练和测试集。
    • 将分割后的数据集转换回Hugging Face数据集格式。

数据集结构

  • 每个条目包含一个唯一的id和一系列messages
  • 每个message包含contentrole(用户或助手)。

目的

  • 用于自然语言处理研究和发展,特别是:
    • 训练模型理解印地语和印英混合语言对话。
    • 增强印地语和混合语言环境下的对话AI能力。

使用

  • 适用于Hugging Face的对齐笔记本,可用于训练面向印地语用户的语言模型。
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作