ShreyaDev/HinGE
Source: Hugging Face. Updated 2026-04-09; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/ShreyaDev/HinGE
Resource description:
---
license: cc-by-4.0
task_categories:
- translation
language:
- hi
- en
tags:
- code-mixing
- code-switching
- NLG
size_categories:
- 1K<n<10K
---
<h1 style="text-align: center;">Abstract</h1>
<p>Text generation is a highly active area of research in the computational linguistics community. Evaluating generated text is a challenging task, and multiple theories and metrics have been proposed over the years. Unfortunately, text generation and evaluation are relatively understudied in code-mixed languages, where words and phrases from multiple languages are mixed in a single utterance of text or speech, because high-quality resources are scarce. To address this challenge, we present a corpus (HinGE) for a widely popular code-mixed language, Hinglish (code-mixing of Hindi and English). HinGE contains Hinglish sentences generated by humans as well as by two rule-based algorithms, corresponding to parallel Hindi-English sentences. In addition, we demonstrate the inefficacy of widely used evaluation metrics on code-mixed data. The HinGE dataset will facilitate progress in natural language generation research on code-mixed languages.</p><br>
## Dataset Details
**HinGE: A Dataset for Generation and Evaluation of Code-Mixed Hinglish Text** is a high-quality Hindi-English code-mixed dataset for NLG tasks, manually annotated by five annotators.
The dataset contains the following columns:
* **A. English, Hindi:** The parallel source sentences from the IITB English-Hindi parallel corpus.
* **B. Human-generated Hinglish:** A list of Hinglish sentences generated by the human annotators.
* **C. WAC:** Hinglish sentence generated by the WAC algorithm (see paper for more details).
* **D. WAC rating1, WAC rating2:** Quality ratings for the Hinglish sentence generated by the WAC algorithm, each on a scale of 1 to 10.
* **E. PAC:** Hinglish sentence generated by the PAC algorithm (see paper for more details).
* **F. PAC rating1, PAC rating2:** Quality ratings for the Hinglish sentence generated by the PAC algorithm, each on a scale of 1 to 10.
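As a rough illustration of how these columns fit together, the sketch below averages the two ratings for each algorithm over a couple of rows. It assumes the column names match the list above exactly, which may not hold for the released file. The rows are hypothetical placeholders, not real HinGE data, and `mean_rating` and `summarize` are helper names introduced only for this example.

```python
from statistics import mean

# Hypothetical rows: keys follow the column list above, values are placeholders
# and illustrative ratings, NOT actual entries from HinGE.
rows = [
    {
        "English": "<English sentence 1>",
        "Hindi": "<Hindi sentence 1>",
        "Human-generated Hinglish": ["<Hinglish variant 1a>", "<Hinglish variant 1b>"],
        "WAC": "<WAC output 1>",
        "WAC rating1": 7,
        "WAC rating2": 9,
        "PAC": "<PAC output 1>",
        "PAC rating1": 4,
        "PAC rating2": 6,
    },
    {
        "English": "<English sentence 2>",
        "Hindi": "<Hindi sentence 2>",
        "Human-generated Hinglish": ["<Hinglish variant 2a>"],
        "WAC": "<WAC output 2>",
        "WAC rating1": 5,
        "WAC rating2": 6,
        "PAC": "<PAC output 2>",
        "PAC rating1": 8,
        "PAC rating2": 8,
    },
]


def mean_rating(row, system):
    """Average the two quality ratings of one row for `system` ("WAC" or "PAC")."""
    ratings = [row[f"{system} rating1"], row[f"{system} rating2"]]
    for rating in ratings:
        # The card states that ratings range from 1 to 10.
        if not 1 <= rating <= 10:
            raise ValueError(f"{system} rating out of range: {rating}")
    return mean(ratings)


def summarize(rows):
    """Return the mean of the per-row average ratings for each algorithm."""
    return {
        system: mean(mean_rating(row, system) for row in rows)
        for system in ("WAC", "PAC")
    }


summary = summarize(rows)
print(summary)
```

Averaging the two ratings is only one possible way to combine them; depending on the analysis, keeping them separate (for example, to measure agreement between the two ratings) may be more appropriate.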
### Dataset Description
- **Curated by:** [Lingo Research Group at IIT Gandhinagar](https://lingo.iitgn.ac.in/)
- **Language(s) (NLP):** Bilingual (Hindi [hi], English [en])
- **License:** cc-by-4.0
## Citation
If you use this dataset, please cite the following work:
```
@inproceedings{srivastava-singh-2021-hinge,
title = "{H}in{GE}: A Dataset for Generation and Evaluation of Code-Mixed {H}inglish Text",
author = "Srivastava, Vivek and
Singh, Mayank",
booktitle = "Proceedings of the 2nd Workshop on Evaluation and Comparison of NLP Systems",
month = nov,
year = "2021",
address = "Punta Cana, Dominican Republic",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.eval4nlp-1.20/",
doi = "10.18653/v1/2021.eval4nlp-1.20"
}
```
Provided by:
ShreyaDev



