Hinglish Language Corpus: A Blend of Synthetically Generated and Manually Written Sentences for NLP Research

Source: Mendeley Data (indexed 2026-04-18)
Download link:
https://data.mendeley.com/datasets/vdtcp2yt9n
Description:
This dataset is a collection of Hinglish (mixed Hindi and English) sentences. It combines text generated synthetically with several Large Language Model (LLM) services, including ChatGPT, Gemini AI, Claude, Groq, and DeepSeek, with manually written sentences. The text covers a range of genres, all composed in Hinglish: meeting minutes, debates, articles, short essays, emails, letters, tweets, communications, and quotes. The primary objective is to support Natural Language Processing (NLP) research on code-mixed languages such as Hinglish. By providing Hinglish text from multiple domains and sources, the dataset aims to let researchers develop and test NLP techniques, models, and applications designed for the challenges of code-mixed text. The synthetic portion offers a large volume of varied Hinglish text for training and fine-tuning NLP models. The manually written sentences serve as a benchmark for evaluating those models on human-written Hinglish.
Created:
2024-06-24