karanverma19/Advanced_CodeMix_Normalization_Dataset_India
Hugging Face, updated 2026-04-10, indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/karanverma19/Advanced_CodeMix_Normalization_Dataset_India
Resource description:
---
license: apache-2.0
---
## Evaluation & Benchmarking
To validate the dataset's usefulness, normalization accuracy can be evaluated with:
- Exact Match Accuracy
- BLEU Score for text similarity
- Human evaluation for real-world correctness
This dataset is designed to improve the performance of multilingual NLP systems on noisy, code-mixed Indian queries.
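The two automatic metrics above can be sketched in plain Python. This is a minimal illustration, not the dataset author's evaluation script: the function names are hypothetical, the lowercase-and-whitespace normalization is an assumption, the BLEU variant is unsmoothed corpus-level BLEU with a single reference per prediction, and the sample strings are invented for the example rather than taken from the dataset.

```python
from collections import Counter
import math


def normalize(text):
    """Lowercase and collapse whitespace (an assumed, deliberately light normalization)."""
    return " ".join(text.lower().split())


def exact_match_accuracy(predictions, references):
    """Fraction of predictions identical to their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(predictions, references, max_n=4):
    """Unsmoothed corpus-level BLEU with one reference per prediction."""
    matched = [0] * max_n
    total = [0] * max_n
    pred_len = ref_len = 0
    for p, r in zip(predictions, references):
        p_tok, r_tok = normalize(p).split(), normalize(r).split()
        pred_len += len(p_tok)
        ref_len += len(r_tok)
        for n in range(1, max_n + 1):
            p_counts, r_counts = ngrams(p_tok, n), ngrams(r_tok, n)
            # Counter intersection keeps the minimum count, i.e. clipped matches.
            matched[n - 1] += sum((p_counts & r_counts).values())
            total[n - 1] += sum(p_counts.values())
    if pred_len == 0 or min(matched) == 0:
        return 0.0  # without smoothing, any empty n-gram order gives zero
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    brevity_penalty = 1.0 if pred_len > ref_len else math.exp(1 - ref_len / pred_len)
    return brevity_penalty * math.exp(log_precision)


# Invented model outputs and reference normalizations, for illustration only.
preds = ["i want to check my account balance", "book a train ticket to delhi"]
refs = ["I want to check my account balance", "book a train ticket to Amritsar"]

print("exact match:", exact_match_accuracy(preds, refs))
print("BLEU:", corpus_bleu(preds, refs))
```

Exact match is strict, so a single wrong word fails the whole query, while BLEU gives partial credit for overlapping n-grams. For reported results, an established implementation such as sacreBLEU is likely a better choice than this sketch, since tokenization and smoothing choices change BLEU values.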
## Data Transformation Approach
The dataset was created by transforming real-world code-mixed queries into structured English. The source queries cover these variations:
- Informal phrasing
- Code-mixed Hindi-English (Hinglish)
- Regional language influence (Punjabi)
- Short and noisy user inputs
This transformation is intended to simulate the kind of input that AI systems receive in production.
Provider:
karanverma19



