Eamin-sust/BanglaEng-SynCorpus

Name: Eamin-sust/BanglaEng-SynCorpus
Creator: Eamin-sust
Published: 2025-12-22 01:49:33
License: CC BY 4.0

Source: Hugging Face. Updated: 2025-12-22. Indexed: 2026-04-12.

Download link:

https://hf-mirror.com/datasets/Eamin-sust/BanglaEng-SynCorpus


Description:

---
language:
- bn
- en
license: cc-by-4.0
multilinguality: bilingual
task_categories:
- translation
- text-generation
tags:
- bangla
- bengali
- english
- synthetic-data
- parallel-corpus
- neural-machine-translation
- low-resource
- nmt
size_categories:
- 1M<n<10M
- 10M<n<100M
- 100M<n<1B
- 1B<n<10B
pretty_name: Bangla–English Synthetic Parallel Corpus
---

# BanglaEng-SynCorpus

## Dataset Summary

**BanglaEng-SynCorpus** is a large-scale **synthetic Bangla–English parallel corpus** designed to support research in **Neural Machine Translation (NMT)** and other Bangla–English bilingual NLP tasks.

The corpus is generated using **linguistically validated sentence templates** combined with **topic-wise curated vocabularies**, covering **all 12 English/Bangla tense structures**. Due to extreme scale (trillions of possible sentence pairs), the dataset is released in **multiple representative subsets**, packaged as compressed `.tar` archives.

This dataset is particularly suitable for:

- Low-resource Bangla–English NMT
- Controlled synthetic data experiments
- Curriculum learning and scaling studies
- Sentence-structure-aware translation models

---

## Supported Tasks

- **Machine Translation (Bangla ↔ English)**
- Text-to-text generation
- Controlled synthetic data modeling
- Linguistic and grammatical analysis

---

## Languages

- **Bangla (bn)**
- **English (en)**

---

## Dataset Structure

The repository contains **tar-archived datasets** generated by selecting different numbers of words per topic across all sentence structures.

### Available Subsets

| Subset | Word Selection | Approx. Sentence Pairs |
|------|---------------|------------------------|
| `BanglaEng-SynCorpus-10Words/` | 10 words per topic | ~4.1 million |
| `BanglaEng-SynCorpus-20Words/` | 20 words per topic | ~51.8 million |
| `BanglaEng-SynCorpus-50Words/` | 50 words per topic | ~2.7 billion |
| `BanglaEng-SynCorpus-80Words/` | 80 words per topic | ~24.6 billion |

Each folder contains one or more `.parquet` files due to size constraints.

---

## Data Generation Methodology

### Vocabulary

- **3,990 parallel Bangla–English words**
- **26 semantic categories**, including:
  - Names, relations (with gender)
  - Animals, food, fruits, places, professions
  - Countries, cities, activities, objects, etc.

### Sentence Structures

- **9,648 validated parallel sentence templates**
- Coverage of:
  - All **12 tense forms**
  - Positive and negative sentences
- Templates use **category tags** (e.g., `<name>`, `<place/country/city>`)

### Generation Process

1. Topic-wise word selection
2. Template-based sentence expansion
3. Automated grammatical validation
4. Manual expert review of structures
5. Duplicate and language-mixing checks

> The full combinatorial corpus exceeds **2.9 trillion** sentence pairs and is **not fully stored** due to storage limitations.

---

## Data Format

The dataset is distributed as compressed `.tar` archives due to its large size. Each `.tar` file contains one or more **Apache Parquet (`.parquet`) files**.

### Parquet Schema

Each Parquet file consists of **two columns**:

| Column Name | Data Type | Description |
|------------|-----------|-------------|
| `bn` | string | Bangla (Bengali) sentence |
| `en` | string | Corresponding English sentence |

Each row represents a **single Bangla–English parallel sentence pair**.
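
The archive layout and two-column schema described above could be handled roughly as in the following sketch. It is an illustration rather than an official loader: the function names (`list_parquet_members`, `extract_parquet_files`, `iter_sentence_pairs`) are hypothetical, the archive file names are not specified in this card, and the Parquet reader assumes the third-party `pyarrow` package is installed.

```python
import shutil
import tarfile
from pathlib import Path


def list_parquet_members(tar_path):
    """Return the names of all .parquet members inside a .tar archive."""
    # "r:*" lets tarfile detect compression (none, gzip, bz2, xz) on its own.
    with tarfile.open(tar_path, "r:*") as tar:
        return [m.name for m in tar.getmembers()
                if m.isfile() and m.name.endswith(".parquet")]


def extract_parquet_files(tar_path, dest_dir):
    """Copy every .parquet member of the archive into dest_dir (flat layout)."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    written = []
    with tarfile.open(tar_path, "r:*") as tar:
        for member in tar.getmembers():
            if not (member.isfile() and member.name.endswith(".parquet")):
                continue
            # Keep only the base name so archive paths cannot escape dest_dir.
            # Assumption: base names are unique within one archive.
            target = dest / Path(member.name).name
            with tar.extractfile(member) as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            written.append(target)
    return written


def iter_sentence_pairs(parquet_path, batch_size=10_000):
    """Yield (bn, en) tuples batch by batch, without loading the whole file."""
    import pyarrow.parquet as pq  # third-party dependency, imported lazily

    parquet_file = pq.ParquetFile(parquet_path)
    for batch in parquet_file.iter_batches(batch_size=batch_size,
                                           columns=["bn", "en"]):
        columns = batch.to_pydict()
        yield from zip(columns["bn"], columns["en"])
```

A typical flow would be to call `extract_parquet_files` on a downloaded archive, then iterate over `iter_sentence_pairs` for each returned path. Streaming in batches is a deliberate choice here, since the larger subsets contain billions of rows and are unlikely to fit in memory.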

### File Organization

- Subsets are organized based on the number of words selected per topic:
  - `BanglaEng-SynCorpus-10Words/`
  - `BanglaEng-SynCorpus-20Words/`
  - `BanglaEng-SynCorpus-50Words/`
  - `BanglaEng-SynCorpus-80Words/`
- Large subsets are split into multiple `.parquet` files for easier storage and transfer.
- All Parquet files inside a subset follow the same schema.

### Accessing the Data

Users must first extract the `.tar` archives and then load the Parquet files using standard data-processing libraries such as **PyArrow**, **pandas**, or **Apache Spark**.

---

## Intended Use

This dataset is intended for **research and academic use**, including:

- Training and evaluating Bangla–English NMT models
- Studying synthetic data scaling effects
- Grammar-aware translation experiments

⚠️ **Not recommended** as a replacement for fully natural parallel corpora in real-world production systems without fine-tuning on human-translated data.

---

## Limitations

- Sentences are **synthetically generated**
- Limited to **simple sentence structures**
- Does not include discourse-level context
- Vocabulary-driven, not frequency-driven

---

## Ethical Considerations

- No personal or sensitive data
- No scraped or copyrighted text
- Fully synthetic and template-generated

---

## Citation

If you use this dataset, please cite:

```bibtex
@techreport{rahman2025banglaengsyncorpus,
  title={Developing a Synthetic Bangla-English Parallel Corpus for Neural Machine Translation},
  author={Rahman, Md. Eamin and Selim, Mohammad Reza},
  institution={Shahjalal University of Science and Technology},
  year={2025}
}
```

---

## Authors & Contributors

- Md. Eamin Rahman, Assistant Professor, Dept. of CSE, SUST (Hugging Face: Eamin-sust)
- Dr. Mohammad Reza Selim, Professor, Dept. of CSE, SUST

---

## License

This dataset is released under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.


---

## Acknowledgements

This dataset was developed at Shahjalal University of Science and Technology (SUST) under the SUST Research Centre Project (ID: AS/2024/1/25), titled “Developing a Synthetic Bangla–English Parallel Corpus for Neural Machine Translation.”

We sincerely thank all research assistants and annotators who contributed to the creation and validation of the corpus.

---

## Contact

For questions or collaboration inquiries, please open an issue on the Hugging Face repository.

Provider:

Eamin-sust
