karanverma19/CodeMix_Query_Normalization_India
Source: Hugging Face. Updated 2026-04-10; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/karanverma19/CodeMix_Query_Normalization_India
Description:
---
license: apache-2.0
---
# CodeMix Query Normalization (India)
## Overview
This dataset contains code-mixed user queries from Indian contexts, primarily in Hinglish and Punjabi, normalized into clean English. It reflects how users naturally communicate in real-world scenarios by mixing local languages with English.
## Features
- 100 high-quality samples
- Code-mixed queries (Hinglish, Punjabi)
- Clean normalized English outputs
- Real-world, informal user language patterns
- Covers domains such as banking, ecommerce, telecom, and daily interactions
## Dataset Structure
| Column | Description |
|--------|------------|
| noisy_query | Code-mixed user input (informal, real-world text) |
| normalized_query | Clean, structured English version |
| language | Language type (Hinglish / Punjabi) |
| domain | Use case domain (banking, ecommerce, telecom, etc.) |
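As a minimal sketch of how the four columns above might be handled in code, the snippet below represents rows as plain dictionaries and runs a basic schema check. The queries are the ones shown in this card's example table; the `language` and `domain` labels attached to them, and the helper names `check_row` and `domain_counts`, are assumptions made for illustration and are not taken from the dataset itself.

```python
from collections import Counter

# Column names as listed in the Dataset Structure table.
EXPECTED_COLUMNS = ("noisy_query", "normalized_query", "language", "domain")

# Queries come from this card's examples; the language/domain labels
# are ASSUMED for illustration and may differ from the real rows.
rows = [
    {"noisy_query": "mera order late hai kya karu",
     "normalized_query": "My order is delayed, what should I do?",
     "language": "Hinglish", "domain": "ecommerce"},
    {"noisy_query": "paise transfer kaise kare UPI se",
     "normalized_query": "How to transfer money using UPI?",
     "language": "Hinglish", "domain": "banking"},
    {"noisy_query": "recharge ho gaya par net nahi chal raha",
     "normalized_query": "Recharge completed but internet is not working",
     "language": "Hinglish", "domain": "telecom"},
]

def check_row(row):
    """Return a list of problems found in one row (empty list means OK)."""
    problems = [f"missing column: {c}" for c in EXPECTED_COLUMNS if c not in row]
    problems += [f"empty value: {c}" for c in EXPECTED_COLUMNS
                 if c in row and not str(row[c]).strip()]
    return problems

def domain_counts(rows):
    """Count how many rows fall into each domain."""
    return Counter(r["domain"] for r in rows)

if __name__ == "__main__":
    for r in rows:
        print(r["noisy_query"], "->", check_row(r) or "ok")
    print(dict(domain_counts(rows)))
```

A check like this is cheap to run over all 100 samples before training and catches rows with missing or blank fields early.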
## Example
| noisy_query | normalized_query |
|------------|-----------------|
| mera order late hai kya karu | My order is delayed, what should I do? |
| paise transfer kaise kare UPI se | How to transfer money using UPI? |
| recharge ho gaya par net nahi chal raha | Recharge completed but internet is not working |
## Use Cases
- Multilingual chatbot training
- Query normalization for search systems
- Intent understanding in low-resource languages
- AI assistants for Indian users
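
For the chatbot-training and query-normalization use cases, one plausible preparation step is to turn each row into an input/target pair for a sequence-to-sequence model. The sketch below assumes that framing; the `normalize: ` task prefix and the function name `to_training_pair` are hypothetical choices, not something the dataset prescribes.

```python
def to_training_pair(row, prefix="normalize: "):
    """Map one dataset row to an input/target pair.

    The prefix is a hypothetical task marker; adjust or drop it to
    match whatever model and prompt format you actually use.
    """
    return {
        "input": prefix + row["noisy_query"].strip(),
        "target": row["normalized_query"].strip(),
    }

# One row taken from this card's example table.
sample = {
    "noisy_query": "mera order late hai kya karu",
    "normalized_query": "My order is delayed, what should I do?",
}

pair = to_training_pair(sample)

if __name__ == "__main__":
    print(pair["input"])
    print(pair["target"])
```

Keeping the mapping in a single small function makes it easy to swap the prompt format without touching the rest of a training pipeline.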
## Motivation
Code-mixed language is widely used in India, but most AI systems struggle to interpret it correctly. This dataset helps bridge that gap by converting informal, mixed-language queries into clean, structured English suitable for machine learning models.
## License
Apache-2.0
Provider:
karanverma19



