Hinglish Language Corpus: A Blend of Synthetically Generated and Manually Written Sentences for NLP Research
Source: Mendeley Data (indexed 2026-04-18)
Download link:
https://data.mendeley.com/datasets/vdtcp2yt9n
Description:
This dataset is a collection of Hinglish (mixed Hindi and English) sentences. It combines text generated synthetically with several Large Language Models (LLMs), such as ChatGPT, Gemini AI, Claude, Groq, and DeepSeek, with manually written sentences. The text spans a range of genres, including meeting minutes, debates, articles, short essays, emails, letters, tweets, communications, and quotes, all composed in Hinglish.
The primary objective of this dataset is to support Natural Language Processing (NLP) research on code-mixed languages such as Hinglish. By providing a substantial corpus of Hinglish text from varied domains and sources, it aims to help researchers develop and test NLP techniques, models, and applications that handle the particular challenges of code-mixed text.
The synthetic portion of the dataset, generated using state-of-the-art LLMs, offers a large volume of diverse Hinglish text that can be used for training and fine-tuning NLP models. The manually written sentences, on the other hand, provide a valuable benchmark for evaluating the performance of these models on human-generated Hinglish text.
Created:
2024-06-24



