
NNEngine/English-Hindi_Translation

Source: Hugging Face · Updated: 2026-01-22 · Indexed: 2026-03-29
Download link:
https://hf-mirror.com/datasets/NNEngine/English-Hindi_Translation
Description:
---
license: mit
task_categories:
- translation
language:
- en
- hi
tags:
- machine-translation
- english-hindi
- parallel-corpus
- synthetic-data
- large-scale
- nlp
- benchmark
- seq2seq
- huggingface-dataset
size_categories:
- 1M<n<10M
---

# English–Hindi Massive Synthetic Translation Dataset

## 🧠 Overview

This dataset is a large-scale synthetic parallel corpus for **English → Hindi machine translation**, designed to stress-test modern sequence-to-sequence models, tokenizers, and large-scale training pipelines.

The corpus contains **10 million aligned sentence pairs** generated using a high-entropy template engine with:

* 100+ subjects
* 100+ verbs
* 100+ objects
* 100+ adjectives, adverbs, metrics, conditions, and scales
* Structured bilingual phrase composition
* Deterministic alignment between English and Hindi

This produces **trillions of possible combinations**, ensuring minimal repetition even at massive scale.
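
The template-based composition described above can be sketched roughly as follows. This is a minimal illustration, not the author's generator: the tiny vocabularies, the `make_pair` and `combination_count` helpers, and the single sentence template are all hypothetical, loosely modeled on the sample record in the Data Format section.

```python
import math
import random

# Hypothetical miniature vocabularies as (English, Hindi) pairs.
# The dataset itself reports 100+ entries per category.
SUBJECTS = [("AI engineer", "एआई इंजीनियर"), ("data scientist", "डेटा वैज्ञानिक")]
VERBS = [("build", "निर्माण करते हैं"), ("test", "परीक्षण करते हैं")]
OBJECTS = [("systems", "सिस्टम"), ("models", "मॉडल")]
ADVERBS = [("efficiently", "कुशलता")]


def make_pair(rng):
    """Fill one assumed template on both sides from the same random choices."""
    subj_en, subj_hi = rng.choice(SUBJECTS)
    verb_en, verb_hi = rng.choice(VERBS)
    obj_en, obj_hi = rng.choice(OBJECTS)
    adv_en, adv_hi = rng.choice(ADVERBS)
    tag = rng.randrange(100)  # numeric suffix shared by both languages
    # English uses subject-adverb-verb-object order, Hindi puts the verb last.
    en = f"{subj_en} {adv_en}_{tag} {verb_en} {obj_en}"
    hi = f"{subj_hi} {obj_hi} को {adv_hi}_{tag} {verb_hi}"
    return {"en": en, "hi": hi}


def combination_count(category_sizes):
    """Number of distinct fillings when each slot is chosen independently."""
    return math.prod(category_sizes)


# The same seed always yields the same pair, so alignment is deterministic.
pair = make_pair(random.Random(0))
print(pair["en"])
print(pair["hi"])

# Seven independent slots with 100 options each already give 10**14 fillings.
print(combination_count([100] * 7))
```

Because both sides are built from the same sampled entries, the English and Hindi strings stay aligned by construction, and the size of the combinatorial space is simply the product of the category sizes.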

---

## 📦 Dataset Structure

```
hf_translation_dataset/
├── train.jsonl   (8,000,000 sentence pairs)
├── test.jsonl    (2,000,000 sentence pairs)
└── README.md
```

Split ratio:

* **Training:** 80%
* **Testing:** 20%

---

## 🧾 Data Format

Each line is a JSON object:

```json
{
  "id": 934221,
  "en": "AI engineer efficiently_42 build systems condition_17 metric_88 remains optimized_12 and optimized_91 scale_55",
  "hi": "एआई इंजीनियर सिस्टम को कुशलता_42 निर्माण करते हैं स्थिति_17 मेट्रिक_88 अनुकूलित_12 और अनुकूलित_91 पैमाना_55"
}
```

### Fields

| Field | Type    | Description              |
| ----- | ------- | ------------------------ |
| `id`  | Integer | Unique sample identifier |
| `en`  | String  | English sentence         |
| `hi`  | String  | Hindi translation        |

All text is UTF-8 encoded and Unicode safe.

---

## 📊 Dataset Characteristics

* ✔️ Total samples: **10,000,000**
* ✔️ Language pair: **English → Hindi**
* ✔️ Vocabulary size: **100+ per lexical category**
* ✔️ Combinatorial space: **>10¹⁴ unique pairs**
* ✔️ Grammar-driven generation
* ✔️ Balanced template distribution
* ✔️ Deterministic alignment
* ✔️ Streaming-friendly JSONL format

---

## 🎯 Intended Use

This dataset is suitable for:

* Machine translation benchmarking
* Seq2Seq model stress testing
* Tokenizer robustness analysis
* Curriculum learning experiments
* Large-scale distributed training validation
* Synthetic data research
* Parallel corpus augmentation

---

## ⚠️ Limitations

* Synthetic grammar (not natural conversational Hindi).
* No discourse-level coherence.
* No idiomatic expressions or cultural nuance.
* Artificial tokens (`optimized_42`, etc.) are symbolic placeholders.
* Not suitable for production translation systems.

This dataset is intended for **algorithmic benchmarking and scaling research**.
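
The record format and the placeholder tokens described above lend themselves to a quick sanity check. The sketch below parses one JSONL line, validates the three documented fields, and compares the numeric suffixes on both sides. The helper names `parse_record` and `suffixes_aligned` are hypothetical, and the assumption that suffixes appear in the same order in English and Hindi is only verified here against the sample record, not the full corpus.

```python
import json
import re

# Matches the numeric part of placeholder tokens such as "optimized_42".
SUFFIX = re.compile(r"_(\d+)")


def parse_record(line):
    """Parse one JSONL line and check the documented field types."""
    record = json.loads(line)
    if not isinstance(record.get("id"), int):
        raise ValueError("'id' must be an integer")
    for key in ("en", "hi"):
        if not isinstance(record.get(key), str):
            raise ValueError(f"'{key}' must be a string")
    return record


def suffixes_aligned(record):
    """Assumed invariant: both sides carry the same suffixes in the same order."""
    return SUFFIX.findall(record["en"]) == SUFFIX.findall(record["hi"])


# The sample record from the Data Format section, serialized as one line.
sample_line = json.dumps(
    {
        "id": 934221,
        "en": "AI engineer efficiently_42 build systems condition_17 metric_88 "
              "remains optimized_12 and optimized_91 scale_55",
        "hi": "एआई इंजीनियर सिस्टम को कुशलता_42 निर्माण करते हैं स्थिति_17 "
              "मेट्रिक_88 अनुकूलित_12 और अनुकूलित_91 पैमाना_55",
    },
    ensure_ascii=False,
)

record = parse_record(sample_line)
print(suffixes_aligned(record))  # True for the sample record
```

A check like this can run line by line over `train.jsonl` or `test.jsonl` without loading the whole file into memory.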

---

## 🤗 How to Load

```python
from datasets import load_dataset

# Repository ID taken from the dataset page.
dataset = load_dataset("NNEngine/English-Hindi_Translation")
print(dataset)
```

Streaming mode:

```python
from datasets import load_dataset

# Streaming iterates over the data without downloading it all up front.
dataset = load_dataset(
    "NNEngine/English-Hindi_Translation",
    streaming=True
)
```

---

## 📜 License

MIT License. Free for research and educational use.

---

## ✨ Author

Created by **NNEngine** for large-scale NLP benchmarking and synthetic data research.
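
As a follow-up to the streaming mode shown under How to Load, a streamed split can be sampled without reading all 10 million records. The sketch below is an illustration under assumptions: `peek` and `fake_stream` are hypothetical helpers, and the generator stands in for a streamed split so the example runs offline.

```python
from itertools import islice


def peek(records, n=3):
    """Return the first n records from any iterable without consuming the rest."""
    return list(islice(records, n))


def fake_stream():
    """Stand-in for a streamed split: an endless iterator of record dicts."""
    i = 0
    while True:
        yield {"id": i, "en": f"sentence_{i}", "hi": f"वाक्य_{i}"}
        i += 1


first = peek(fake_stream(), 2)
for record in first:
    print(record["id"], record["en"], record["hi"])
```

With the `datasets` library installed, the same `peek` call should work on the iterable returned by `load_dataset("NNEngine/English-Hindi_Translation", split="train", streaming=True)`; the split name `train` is an assumption based on the `train.jsonl` file name.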
Provider:
NNEngine