ShreyaDev/HinGE

Source: Hugging Face. Updated: 2026-04-09. Indexed: 2026-04-12.

Download link:

https://hf-mirror.com/datasets/ShreyaDev/HinGE

Resource description:

---
license: cc-by-4.0
task_categories:
- translation
language:
- hi
- en
tags:
- code-mixing
- code-switching
- NLG
size_categories:
- 1K<n<10K
---

<h1 style="text-align: center;">Abstract</h1>

<p>Text generation is a highly active area of research in the computational linguistic community. The evaluation of the generated text is a challenging task and multiple theories and metrics have been proposed over the years. Unfortunately, text generation and evaluation are relatively understudied due to the scarcity of high-quality resources in code-mixed languages where the words and phrases from multiple languages are mixed in a single utterance of text and speech. To address this challenge, we present a corpus (HinGE) for a widely popular code-mixed language Hinglish (code-mixing of Hindi and English languages). HinGE has Hinglish sentences generated by humans as well as two rule-based algorithms corresponding to the parallel Hindi-English sentences. In addition, we demonstrate the inefficacy of widely-used evaluation metrics on the code-mixed data. The HinGE dataset will facilitate the progress of natural language generation research in code-mixed languages.</p><br>

## Dataset Details

**HinGE:** A Dataset for Generation and Evaluation of Code-Mixed Hinglish Text is a high-quality Hindi-English code-mixed dataset for NLG tasks, manually annotated by five annotators. The dataset contains the following columns:

* **A. English, Hindi:** The parallel source sentences from the IITB English-Hindi parallel corpus.
* **B. Human-generated Hinglish:** A list of Hinglish sentences generated by the human annotators.
* **C. WAC:** The Hinglish sentence generated by the WAC algorithm (see the paper for details).
* **D. WAC rating1, WAC rating2:** Quality ratings for the Hinglish sentence generated by the WAC algorithm. Ratings range from 1 to 10.
* **E. PAC:** The Hinglish sentence generated by the PAC algorithm (see the paper for details).
* **F. PAC rating1, PAC rating2:** Quality ratings for the Hinglish sentence generated by the PAC algorithm. Ratings range from 1 to 10.

### Dataset Description

- **Curated by:** [Lingo Research Group at IIT Gandhinagar](https://lingo.iitgn.ac.in/)
- **Language(s) (NLP):** Bilingual (Hindi [hi], English [en])
- **License:** cc-by-4.0

## Citation

If you use this dataset, please cite the following work:

```
@inproceedings{srivastava-singh-2021-hinge,
    title = "{H}in{GE}: A Dataset for Generation and Evaluation of Code-Mixed {H}inglish Text",
    author = "Srivastava, Vivek and Singh, Mayank",
    booktitle = "Proceedings of the 2nd Workshop on Evaluation and Comparison of NLP Systems",
    month = nov,
    year = "2021",
    address = "Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.eval4nlp-1.20/",
    doi = "10.18653/v1/2021.eval4nlp-1.20"
}
```
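As a rough illustration of how the rating columns described under Dataset Details might be used, the sketch below averages the two ratings for each algorithm in a single row. The column keys are copied from the card's wording and may not match the exact field names in the released files; the helper `average_ratings` and the placeholder row are hypothetical and not part of the dataset.

```python
from statistics import mean

# Column names follow the card's description. The actual keys in the
# released files may be spelled differently (assumption).
RATING_COLUMNS = {
    "WAC": ("WAC rating1", "WAC rating2"),
    "PAC": ("PAC rating1", "PAC rating2"),
}


def average_ratings(row):
    """Return the mean of the two quality ratings for each algorithm.

    Raises ValueError if a rating falls outside the 1-10 range
    stated in the dataset card.
    """
    averages = {}
    for algorithm, columns in RATING_COLUMNS.items():
        scores = [row[column] for column in columns]
        for score in scores:
            if not 1 <= score <= 10:
                raise ValueError(f"{algorithm} rating out of range: {score}")
        averages[algorithm] = mean(scores)
    return averages


# Illustrative placeholder row; the values are NOT taken from the dataset.
example_row = {
    "English": "<English source sentence>",
    "Hindi": "<Hindi source sentence>",
    "Human-generated Hinglish": ["<Hinglish sentence>"],
    "WAC": "<WAC output>",
    "WAC rating1": 7,
    "WAC rating2": 8,
    "PAC": "<PAC output>",
    "PAC rating1": 5,
    "PAC rating2": 6,
}

print(average_ratings(example_row))
```

Real rows could presumably be obtained with the Hugging Face `datasets` library, for example `load_dataset("ShreyaDev/HinGE")`, after checking the loaded column names against `RATING_COLUMNS`.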

Provided by:

ShreyaDev
