srv-space/parlekha-aligned
Hugging Face | Updated 2026-01-18 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/srv-space/parlekha-aligned
Resource description:
# Parlekha Aligned Dataset
> **Extracted from:** [ai4bharat/Pralekha](https://huggingface.co/datasets/ai4bharat/Pralekha)
A multilingual parallel corpus containing aligned sentences across 11 Indian languages and English, extracted and curated from the Pralekha dataset.
## Pipeline Overview
<!-- TODO: Add pipeline diagram image here -->
<!--  -->
*An image illustrating the extraction and alignment pipeline will be added here.*
---
## Dataset Overview
| File | Rows | Languages Present | Missing Languages |
|------|------|-------------------|-------------------|
| `original/0_null.parquet` | 361,986 | 12 (all) | None |
| `original/1_null.parquet` | 254,153 | 11 | 1 language missing per row |
| `original/2_null.parquet` | 119,026 | 10 | 2 languages missing per row |
| `original/3_null.parquet` | 58,715 | 9 | 3 languages missing per row |
| `translated/1_null_translated.parquet` | 254,133 | 12 (all) | Filled via translation |
| `translated/2_null_translated.parquet` | 118,937 | 12 (all) | Filled via translation |
**Note:** Missing languages in the `translated/` files have been filled using Google Translate.
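The `k_null` file names in the table group rows by how many of the 12 language columns are empty. A minimal sketch of that bucketing on toy rows follows; the row contents and the `count_missing` helper are hypothetical, and it assumes missing values surface as `None` when rows are read as dictionaries.

```python
from collections import Counter

# The 12 language columns listed in this card.
LANGS = ["eng", "ben", "guj", "hin", "kan", "mal",
         "mar", "ori", "pan", "tam", "tel", "urd"]


def count_missing(row):
    """Return how many of the 12 language columns are None in this row."""
    return sum(1 for lang in LANGS if row.get(lang) is None)


# Toy rows (hypothetical content, not taken from the dataset).
complete = {"sentence_id": "s1", **{lang: f"text-{lang}" for lang in LANGS}}
one_gap = {**complete, "sentence_id": "s2", "urd": None}
two_gaps = {**complete, "sentence_id": "s3", "urd": None, "ori": None}

# Count how many rows fall into each "number of missing languages" bucket.
buckets = Counter(count_missing(r) for r in [complete, one_gap, two_gaps])
print(dict(buckets))
```

Running the same helper over rows loaded from `original/1_null.parquet` would be one way to check that every row there has exactly one missing language.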
---
## Languages
The dataset includes **12 languages** (11 Indian languages + English):
| Code | Language | Script |
|------|----------|--------|
| `eng` | English | Latin |
| `ben` | Bengali | বাংলা |
| `guj` | Gujarati | ગુજરાતી |
| `hin` | Hindi | हिन्दी |
| `kan` | Kannada | ಕನ್ನಡ |
| `mal` | Malayalam | മലയാളം |
| `mar` | Marathi | मराठी |
| `ori` | Oriya/Odia | ଓଡ଼ିଆ |
| `pan` | Punjabi | ਪੰਜਾਬੀ |
| `tam` | Tamil | தமிழ் |
| `tel` | Telugu | తెలుగు |
| `urd` | Urdu | اردو |
---
## Dataset Structure
### Columns
- `sentence_id` - Unique identifier for each sentence
- `eng`, `ben`, `guj`, `hin`, `kan`, `mal`, `mar`, `ori`, `pan`, `tam`, `tel`, `urd` - Sentence text in the respective language (may be `null` in the `original/` files)
### Folders
- **`original/`** - Original aligned data with varying completeness (may contain `null` values)
- **`translated/`** - Missing translations filled using Google Translate
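Because the `translated/` files mix aligned text with machine-translated text, it can help to know which cells were filled in. The sketch below compares an original row with its translated counterpart. It assumes that `sentence_id` values are shared between `original/` and `translated/` files, which this card does not state; `filled_by_translation` and the toy rows are hypothetical.

```python
LANGS = ["eng", "ben", "guj", "hin", "kan", "mal",
         "mar", "ori", "pan", "tam", "tel", "urd"]


def filled_by_translation(original_rows, translated_rows):
    """Map sentence_id -> languages that are None in the original row
    but present in the translated row."""
    originals = {row["sentence_id"]: row for row in original_rows}
    filled = {}
    for row in translated_rows:
        source = originals.get(row["sentence_id"])
        if source is None:
            continue  # no original row with this id; nothing to compare
        filled[row["sentence_id"]] = [
            lang for lang in LANGS
            if source.get(lang) is None and row.get(lang) is not None
        ]
    return filled


# Toy rows (hypothetical content, only two language columns shown).
original_rows = [
    {"sentence_id": "a", "eng": "Hello", "hin": None},
    {"sentence_id": "b", "eng": "Thanks", "hin": "धन्यवाद"},
]
translated_rows = [
    {"sentence_id": "a", "eng": "Hello", "hin": "नमस्ते"},
    {"sentence_id": "c", "eng": "Bye", "hin": "अलविदा"},
]

filled = filled_by_translation(original_rows, translated_rows)
print(filled)
```

Rows without a matching id are skipped rather than treated as errors, since the row counts of the original and translated files differ slightly.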
---
## Download
```python
from huggingface_hub import hf_hub_download
from datasets import load_dataset
# Download all files
dataset = load_dataset("srv-space/parlekha-aligned")
# Download only original folder
original_data = load_dataset("srv-space/parlekha-aligned", data_dir="original")
# Download only translated folder
translated_data = load_dataset("srv-space/parlekha-aligned", data_dir="translated")
# Download specific file (e.g., 0_null.parquet)
zero_null = load_dataset("srv-space/parlekha-aligned", data_files="original/0_null.parquet")
# Download specific file using hf_hub_download
file_path = hf_hub_download(
    repo_id="srv-space/parlekha-aligned",
    filename="original/0_null.parquet",
    repo_type="dataset",
)
```
---
## Usage Example
```python
from datasets import load_dataset
# Load complete aligned data (no nulls)
data = load_dataset("srv-space/parlekha-aligned", data_files="original/0_null.parquet", split="train")
# Access translations
for row in data:
    print(f"ID: {row['sentence_id']}")
    print(f"English: {row['eng']}")
    print(f"Hindi: {row['hin']}")
    print(f"Bengali: {row['ben']}")
    print("---")
```
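For files that may contain `null` values, a common next step is to pull out sentence pairs for one language direction and skip rows where either side is missing. The following is a sketch on toy dictionary rows, shaped like the rows yielded when iterating a loaded split; `iter_pairs` is a hypothetical helper, not part of the dataset or of `datasets`.

```python
def iter_pairs(rows, src, tgt):
    """Yield (sentence_id, source_text, target_text) for rows where
    both language columns are present."""
    for row in rows:
        source_text, target_text = row.get(src), row.get(tgt)
        if source_text is None or target_text is None:
            continue  # skip rows missing either side of the pair
        yield row["sentence_id"], source_text, target_text


# Toy rows (hypothetical content, not taken from the dataset).
rows = [
    {"sentence_id": "a", "eng": "Hello", "hin": "नमस्ते"},
    {"sentence_id": "b", "eng": "Thanks", "hin": None},
]

pairs = list(iter_pairs(rows, "eng", "hin"))
for sentence_id, english, hindi in pairs:
    print(sentence_id, english, hindi)
```

Passing a loaded split in place of `rows` should work the same way, since each row is accessed as a dictionary keyed by column name.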
---
## Direct Links
### Original Files
- [0_null.parquet](https://huggingface.co/datasets/srv-space/parlekha-aligned/blob/main/original/0_null.parquet) - Complete aligned data (all 12 languages present)
- [1_null.parquet](https://huggingface.co/datasets/srv-space/parlekha-aligned/blob/main/original/1_null.parquet) - Data with 1 language missing per row
- [2_null.parquet](https://huggingface.co/datasets/srv-space/parlekha-aligned/blob/main/original/2_null.parquet) - Data with 2 languages missing per row
- [3_null.parquet](https://huggingface.co/datasets/srv-space/parlekha-aligned/blob/main/original/3_null.parquet) - Data with 3 languages missing per row
### Translated Files
- [1_null_translated.parquet](https://huggingface.co/datasets/srv-space/parlekha-aligned/blob/main/translated/1_null_translated.parquet) - 1 missing language filled via Google Translate
- [2_null_translated.parquet](https://huggingface.co/datasets/srv-space/parlekha-aligned/blob/main/translated/2_null_translated.parquet) - 2 missing languages filled via Google Translate
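The links above use the Hub's `/blob/` path, which opens the file viewer page. The Hub serves the raw file from the matching `/resolve/` path, so a small conversion is enough when a tool needs a direct download URL. `blob_to_resolve` is a hypothetical helper written for illustration.

```python
def blob_to_resolve(url):
    """Turn a Hub file-viewer URL (/blob/) into a raw-file URL (/resolve/)."""
    marker = "/blob/"
    if marker not in url:
        raise ValueError(f"not a /blob/ URL: {url}")
    # Replace only the first occurrence so file paths are left untouched.
    return url.replace(marker, "/resolve/", 1)


viewer_url = (
    "https://huggingface.co/datasets/srv-space/parlekha-aligned"
    "/blob/main/original/0_null.parquet"
)
download_url = blob_to_resolve(viewer_url)
print(download_url)
```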
---
## Citation
If you use this dataset, please cite both this dataset and the original Pralekha dataset:
```bibtex
@misc{parlekha-aligned-2025,
title={Parlekha Aligned Dataset: Multilingual Parallel Corpus for Indian Languages},
author={srv-space},
year={2025},
howpublished={\url{https://huggingface.co/datasets/srv-space/parlekha-aligned}}
}
@inproceedings{doddapaneni2023pralekha,
title={Pralekha: A Comprehensive Benchmark for Evaluating LLMs on Indian Languages},
author={Doddapaneni, Sumanth and Aralikatte, Rahul and Khapra, Mitesh M},
booktitle={Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing},
year={2023},
organization={Association for Computational Linguistics}
}
```
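**Note:** The Pralekha entry above has not been verified against the published paper. Check the original Pralekha dataset page for the correct publication details and update the entry before citing.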
---
## Acknowledgments
This dataset is extracted and aligned from the [Pralekha dataset](https://huggingface.co/datasets/ai4bharat/Pralekha) created by AI4Bharat. We thank the original authors for making their work publicly available.
---
## License
This dataset follows the same license as the parent [Pralekha dataset](https://huggingface.co/datasets/ai4bharat/Pralekha). Please refer to the original dataset for licensing information.
---
## Contact & Contributions
For questions, issues, or contributions, please open an issue on the dataset repository or contact the maintainers.
Provider:
srv-space



