Eamin-sust/BanglaEng-SynCorpus
Source: Hugging Face | Updated: 2025-12-22 | Indexed: 2026-04-12
Download link: https://hf-mirror.com/datasets/Eamin-sust/BanglaEng-SynCorpus
Description:
---
language:
- bn
- en
license: cc-by-4.0
multilinguality: bilingual
task_categories:
- translation
- text-generation
tags:
- bangla
- bengali
- english
- synthetic-data
- parallel-corpus
- neural-machine-translation
- low-resource
- nmt
size_categories:
- 1M<n<10M
- 10M<n<100M
- 100M<n<1B
- 1B<n<10B
pretty_name: Bangla–English Synthetic Parallel Corpus
---
# BanglaEng-SynCorpus
## Dataset Summary
**BanglaEng-SynCorpus** is a large-scale **synthetic Bangla–English parallel corpus** designed to support research in **Neural Machine Translation (NMT)** and other Bangla–English bilingual NLP tasks.
The corpus is generated using **linguistically validated sentence templates** combined with **topic-wise curated vocabularies**, covering **all 12 English/Bangla tense structures**.
Because the full combinatorial space is extremely large (trillions of possible sentence pairs), the dataset is released as **multiple representative subsets**, packaged as compressed `.tar` archives.
This dataset is particularly suitable for:
- Low-resource Bangla–English NMT
- Controlled synthetic data experiments
- Curriculum learning and scaling studies
- Sentence-structure-aware translation models
---
## Supported Tasks
- **Machine Translation (Bangla ↔ English)**
- Text-to-text generation
- Controlled synthetic data modeling
- Linguistic and grammatical analysis
---
## Languages
- **Bangla (bn)**
- **English (en)**
---
## Dataset Structure
The repository contains **tar-archived datasets** generated by selecting different numbers of words per topic across all sentence structures.
### Available Subsets
| Subset | Word Selection | Approx. Sentence Pairs |
|------|---------------|------------------------|
| `BanglaEng-SynCorpus-10Words/` | 10 words per topic | ~4.1 million |
| `BanglaEng-SynCorpus-20Words/` | 20 words per topic | ~51.8 million |
| `BanglaEng-SynCorpus-50Words/` | 50 words per topic | ~2.7 billion |
| `BanglaEng-SynCorpus-80Words/` | 80 words per topic | ~24.6 billion |
Each subset is split into one or more `.parquet` files (packaged inside `.tar` archives) because of size constraints.
---
## Data Generation Methodology
### Vocabulary
- **3,990 parallel Bangla–English words**
- **26 semantic categories**, including:
  - Names, relations (with gender)
  - Animals, food, fruits, places, professions
  - Countries, cities, activities, objects, etc.
### Sentence Structures
- **9,648 validated parallel sentence templates**
- Coverage of:
  - All **12 tense forms**
  - Positive and negative sentences
- Templates use **category tags** (e.g., `<name>`, `<place/country/city>`)
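
As a rough illustration of how category-tagged templates can expand into parallel pairs, the sketch below fills each tag with every selected word from the matching category. The template, the vocabulary entries, and the helper names (`expand`, `count_pairs`) are hypothetical and not taken from the corpus; the sketch also assumes each tag appears once per template and uses the same name on both sides. It is not the authors' generator.

```python
import itertools
import re

# Hypothetical toy vocabulary: category tag -> list of (Bangla, English) word pairs.
VOCAB = {
    "name": [("রহিম", "Rahim"), ("করিম", "Karim")],
    "city": [("ঢাকা", "Dhaka"), ("সিলেট", "Sylhet")],
}

# Hypothetical parallel template; both sides use the same category tags.
TEMPLATE = ("<name> <city> যাবে।", "<name> will go to <city>.")

TAG = re.compile(r"<([^<>]+)>")


def expand(template, vocab):
    """Yield (bn, en) pairs for every combination of words filling the tags."""
    bn_tpl, en_tpl = template
    tags = TAG.findall(bn_tpl)  # slot order is taken from the Bangla side
    for combo in itertools.product(*(vocab[t] for t in tags)):
        bn, en = bn_tpl, en_tpl
        for tag, (bn_word, en_word) in zip(tags, combo):
            # Assumes each tag occurs once per template side.
            bn = bn.replace(f"<{tag}>", bn_word, 1)
            en = en.replace(f"<{tag}>", en_word, 1)
        yield bn, en


def count_pairs(template, vocab):
    """Pairs produced by one template: the product of its slot vocabulary sizes."""
    n = 1
    for tag in TAG.findall(template[0]):
        n *= len(vocab[tag])
    return n


pairs = list(expand(TEMPLATE, VOCAB))
```

Because the count is a product over slots, it grows multiplicatively as more words per category are selected, which is one plausible reason the released subsets grow so steeply with the word-selection size.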
### Generation Process
1. Topic-wise word selection
2. Template-based sentence expansion
3. Automated grammatical validation
4. Manual expert review of structures
5. Duplicate and language-mixing checks
> The full combinatorial corpus exceeds **2.9 trillion** sentence pairs and is **not fully stored** due to storage limitations.
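
The duplicate and language-mixing checks in step 5 can be approximated with simple script tests. The sketch below is an assumption about what such a check might look like, not the authors' validation code: it flags a pair when the Bangla side contains basic Latin letters or the English side contains characters from the Unicode Bengali block (U+0980 to U+09FF), and it drops exact duplicate pairs. The function names and sample sentences are illustrative, and the in-memory `seen` set would not scale to billions of pairs without sharding or hashing.

```python
import re

BENGALI = re.compile(r"[\u0980-\u09FF]")  # Unicode Bengali block
LATIN = re.compile(r"[A-Za-z]")           # basic Latin letters only


def is_mixed(bn: str, en: str) -> bool:
    """True if either side contains script belonging to the other language."""
    return bool(LATIN.search(bn)) or bool(BENGALI.search(en))


def clean_pairs(pairs):
    """Yield pairs in order, skipping exact duplicates and mixed-script pairs."""
    seen = set()
    for bn, en in pairs:
        if (bn, en) in seen or is_mixed(bn, en):
            continue
        seen.add((bn, en))
        yield bn, en


# Illustrative sample, not drawn from the corpus.
sample = [
    ("সে ভাত খায়।", "He eats rice."),
    ("সে ভাত খায়।", "He eats rice."),   # exact duplicate
    ("সে rice খায়।", "He eats rice."),  # Latin word on the Bangla side
]
cleaned = list(clean_pairs(sample))
```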
---
## Data Format
The dataset is distributed as compressed `.tar` archives due to its large size.
Each `.tar` file contains one or more **Apache Parquet (`.parquet`) files**.
### Parquet Schema
Each Parquet file consists of **two columns**:
| Column Name | Data Type | Description |
|------------|-----------|-------------|
| `bn` | string | Bangla (Bengali) sentence |
| `en` | string | Corresponding English sentence |
Each row represents a **single Bangla–English parallel sentence pair**.
### File Organization
- Subsets are organized based on the number of words selected per topic:
  - `BanglaEng-SynCorpus-10Words/`
  - `BanglaEng-SynCorpus-20Words/`
  - `BanglaEng-SynCorpus-50Words/`
  - `BanglaEng-SynCorpus-80Words/`
- Large subsets are split into multiple `.parquet` files for easier storage and transfer.
- All Parquet files inside a subset follow the same schema.
### Accessing the Data
Users must first extract the `.tar` archives and then load the Parquet files using standard data-processing libraries such as **PyArrow**, **pandas**, or **Apache Spark**.
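
A minimal sketch of that workflow follows, assuming a locally downloaded archive and the two-column schema (`bn`, `en`) described above. The helper names `extract_parquet` and `iter_pairs` are illustrative, and the actual archive file names inside each subset folder are not specified here. Extraction uses only the standard library; reading uses PyArrow's `ParquetFile.iter_batches`, so a file is streamed in batches instead of being loaded whole, which matters for the larger subsets.

```python
import shutil
import tarfile
from pathlib import Path


def extract_parquet(tar_path, out_dir):
    """Extract every .parquet member of a tar archive into out_dir.

    Members are written under their base name only, which avoids
    path-traversal issues but assumes base names are unique per archive.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    # Mode "r:*" lets tarfile auto-detect plain, gzip, bz2, or xz archives.
    with tarfile.open(tar_path, "r:*") as tar:
        for member in tar:
            if not (member.isfile() and member.name.endswith(".parquet")):
                continue
            target = out_dir / Path(member.name).name
            with tar.extractfile(member) as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            extracted.append(target)
    return sorted(extracted)


def iter_pairs(parquet_path, batch_size=10_000):
    """Stream (bn, en) pairs from one Parquet file, one batch at a time."""
    import pyarrow.parquet as pq  # third-party; imported lazily

    parquet_file = pq.ParquetFile(parquet_path)
    for batch in parquet_file.iter_batches(batch_size=batch_size,
                                           columns=["bn", "en"]):
        cols = batch.to_pydict()
        yield from zip(cols["bn"], cols["en"])
```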
---
## Intended Use
This dataset is intended for **research and academic use**, including:
- Training and evaluating Bangla–English NMT models
- Studying synthetic data scaling effects
- Grammar-aware translation experiments
⚠️ **Not recommended** as a replacement for fully natural parallel corpora in real-world production systems without fine-tuning on human-translated data.
---
## Limitations
- Sentences are **synthetically generated**
- Limited to **simple sentence structures**
- Does not include discourse-level context
- Vocabulary-driven, not frequency-driven
---
## Ethical Considerations
- No personal or sensitive data
- No scraped or copyrighted text
- Fully synthetic and template-generated
---
## Citation
If you use this dataset, please cite:
```bibtex
@techreport{rahman2025banglaengsyncorpus,
  title={Developing a Synthetic Bangla-English Parallel Corpus for Neural Machine Translation},
  author={Rahman, Md. Eamin and Selim, Mohammad Reza},
  institution={Shahjalal University of Science and Technology},
  year={2025}
}
```
---
## Authors & Contributors
- **Md. Eamin Rahman**, Assistant Professor, Dept. of CSE, SUST (Hugging Face: Eamin-sust)
- **Dr. Mohammad Reza Selim**, Professor, Dept. of CSE, SUST
---
## License
This dataset is released under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.
---
## Acknowledgements
This dataset was developed at Shahjalal University of Science and Technology (SUST) under the SUST Research Centre Project (ID: AS/2024/1/25), titled “Developing a Synthetic Bangla–English Parallel Corpus for Neural Machine Translation.”
We sincerely thank all research assistants and annotators who contributed to the creation and validation of the corpus.
---
## Contact
For questions or collaboration inquiries, please open an issue on the Hugging Face repository.
Provider: Eamin-sust



