karanverma19/Advanced_CodeMix_Normalization_Dataset_India

Name: karanverma19/Advanced_CodeMix_Normalization_Dataset_India
Creator: karanverma19
Published: 2026-04-10 07:09:16
License: apache-2.0 (as declared in the dataset card)

Source: Hugging Face. Updated 2026-04-10; indexed 2026-04-12.

Download link:

https://hf-mirror.com/datasets/karanverma19/Advanced_CodeMix_Normalization_Dataset_India


Description:

License: apache-2.0

## Evaluation & Benchmarking

To validate dataset usefulness, normalization accuracy can be evaluated using:

- Exact Match Accuracy
- BLEU Score for text similarity
- Human evaluation for real-world correctness

This dataset is designed to improve performance of multilingual NLP systems in handling noisy, code-mixed Indian queries.

## Data Transformation Approach

The dataset was created by transforming real-world code-mixed queries into structured English. Variations include:

- Informal phrasing
- Code-mixed Hindi-English (Hinglish)
- Regional language influence (Punjabi)
- Short and noisy user inputs

This transformation simulates real production-level input data for AI systems.
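The two automatic metrics named under Evaluation & Benchmarking can be sketched in plain Python as follows. This is a minimal illustration, not the dataset author's evaluation script: the function names, the lowercase/whitespace normalization, and the sample sentences are all assumptions made for this example, and the BLEU here is a simplified single-reference, unsmoothed variant. For reported results, an established implementation such as sacreBLEU would be the safer choice.

```python
import math
from collections import Counter


def exact_match_accuracy(predictions, references):
    """Fraction of predictions equal to their reference after trimming and lowercasing."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return hits / len(predictions)


def _ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(predictions, references, max_n=4):
    """Simplified corpus-level BLEU: one reference each, whitespace tokens, no smoothing."""
    matches = [0] * max_n
    totals = [0] * max_n
    pred_len = ref_len = 0
    for pred, ref in zip(predictions, references):
        p_tok, r_tok = pred.lower().split(), ref.lower().split()
        pred_len += len(p_tok)
        ref_len += len(r_tok)
        for n in range(1, max_n + 1):
            p_counts, r_counts = _ngrams(p_tok, n), _ngrams(r_tok, n)
            # Clipped counts: an n-gram is credited at most as often as it occurs in the reference.
            matches[n - 1] += sum((p_counts & r_counts).values())
            totals[n - 1] += sum(p_counts.values())
    if pred_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, any empty n-gram order gives a score of zero
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if pred_len >= ref_len else math.exp(1 - ref_len / pred_len)
    return brevity_penalty * math.exp(log_precision)


# Hypothetical model outputs and gold normalizations (not taken from the dataset).
preds = ["tell me the weather in delhi", "book a cab to the airport"]
refs = ["Tell me the weather in Delhi", "book a taxi to the airport"]

print("exact match:", exact_match_accuracy(preds, refs))
print("BLEU:", corpus_bleu(preds, refs))
```

Exact match is strict: the second pair differs by a single word and scores zero, while BLEU still gives partial credit for the overlapping n-grams. That gap is one reason the description also lists human evaluation.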


Provider:

karanverma19
