srv-space/parlekha-aligned

Name: srv-space/parlekha-aligned
Creator: srv-space
Published: 2026-01-18 16:51:01
License: No description available

Source: Hugging Face. Updated 2026-01-18; indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/srv-space/parlekha-aligned

Description:

# Parlekha Aligned Dataset

> **Extracted from:** [ai4bharat/Pralekha](https://huggingface.co/datasets/ai4bharat/Pralekha)

A multilingual parallel corpus containing aligned sentences across 11 Indian languages and English, extracted and curated from the Pralekha dataset.

## Pipeline Overview

*An image illustrating the extraction and alignment pipeline will be added here.*

---

## Dataset Overview

| File | Rows | Languages Present | Missing Languages |
|------|------|-------------------|-------------------|
| `original/0_null.parquet` | 361,986 | 12 (all) | None |
| `original/1_null.parquet` | 254,153 | 11 | 1 language missing per row |
| `original/2_null.parquet` | 119,026 | 10 | 2 languages missing per row |
| `original/3_null.parquet` | 58,715 | 9 | 3 languages missing per row |
| `translated/1_null_translated.parquet` | 254,133 | 12 (all) | Filled via translation |
| `translated/2_null_translated.parquet` | 118,937 | 12 (all) | Filled via translation |

**Note:** Missing languages in the `translated/` files have been filled using Google Translate.
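
The file naming in the table above groups rows by how many language cells are empty. The sketch below illustrates that idea on toy rows; it is not the maintainers' pipeline. `LANG_COLUMNS`, `count_missing`, and `bucket_name` are hypothetical names introduced here, the language codes are the ones listed in this card, and the code assumes a missing translation is stored as `None`.

```python
# Language codes as listed in this card; the helper names are illustrative only.
LANG_COLUMNS = ["eng", "ben", "guj", "hin", "kan", "mal",
                "mar", "ori", "pan", "tam", "tel", "urd"]


def count_missing(row):
    """Count language columns whose value is None (or absent) in a row dict."""
    return sum(1 for lang in LANG_COLUMNS if row.get(lang) is None)


def bucket_name(row):
    """Build the original/ file name matching the row's missing-language count.

    The card only lists files for 0 to 3 missing languages, so other counts
    would produce a name with no corresponding file.
    """
    return f"original/{count_missing(row)}_null.parquet"


# Toy rows made up for illustration; they are not taken from the dataset.
complete = {lang: "text" for lang in LANG_COLUMNS}
complete["sentence_id"] = "demo-0"
one_gap = dict(complete, sentence_id="demo-1", urd=None)

print(bucket_name(complete))
print(bucket_name(one_gap))
```

The same `count_missing` helper could be passed to a filter over a loaded split to check that every row in a given file has the expected number of gaps.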
---

## Languages

The dataset includes **12 languages** (11 Indian languages + English):

| Code | Language | Script |
|------|----------|--------|
| `eng` | English | Latin |
| `ben` | Bengali | বাংলা |
| `guj` | Gujarati | ગુજરાતી |
| `hin` | Hindi | हिन्दी |
| `kan` | Kannada | ಕನ್ನಡ |
| `mal` | Malayalam | മലയാളം |
| `mar` | Marathi | मराठी |
| `ori` | Oriya/Odia | ଓଡ଼ିଆ |
| `pan` | Punjabi | ਪੰਜਾਬੀ |
| `tam` | Tamil | தமிழ் |
| `tel` | Telugu | తెలుగు |
| `urd` | Urdu | اردو |

---

## Dataset Structure

### Columns

- `sentence_id` - Unique identifier for each sentence
- `eng`, `ben`, `guj`, `hin`, `kan`, `mal`, `mar`, `ori`, `pan`, `tam`, `tel`, `urd` - Sentence translations in respective languages

### Folders

- **`original/`** - Original aligned data with varying completeness (may contain `null` values)
- **`translated/`** - Missing translations filled using Google Translate

---

## Download

```python
from huggingface_hub import hf_hub_download
from datasets import load_dataset

# Download all files
dataset = load_dataset("srv-space/parlekha-aligned")

# Download only original folder
original_data = load_dataset("srv-space/parlekha-aligned", data_dir="original")

# Download only translated folder
translated_data = load_dataset("srv-space/parlekha-aligned", data_dir="translated")

# Download specific file (e.g., 0_null.parquet)
zero_null = load_dataset("srv-space/parlekha-aligned", data_files="original/0_null.parquet")

# Download specific file using hf_hub_download
file_path = hf_hub_download(
    repo_id="srv-space/parlekha-aligned",
    filename="original/0_null.parquet",
    repo_type="dataset"
)
```

---

## Usage Example

```python
from datasets import load_dataset

# Load complete aligned data (no nulls)
data = load_dataset("srv-space/parlekha-aligned", data_files="original/0_null.parquet", split="train")

# Access translations
for row in data:
    print(f"ID: {row['sentence_id']}")
    print(f"English: {row['eng']}")
    print(f"Hindi: {row['hin']}")
    print(f"Bengali: {row['ben']}")
    print("---")
```

---

## Direct Links

### Original Files

- [0_null.parquet](https://huggingface.co/datasets/srv-space/parlekha-aligned/blob/main/original/0_null.parquet) - Complete aligned data (all 12 languages present)
- [1_null.parquet](https://huggingface.co/datasets/srv-space/parlekha-aligned/blob/main/original/1_null.parquet) - Data with 1 language missing per row
- [2_null.parquet](https://huggingface.co/datasets/srv-space/parlekha-aligned/blob/main/original/2_null.parquet) - Data with 2 languages missing per row
- [3_null.parquet](https://huggingface.co/datasets/srv-space/parlekha-aligned/blob/main/original/3_null.parquet) - Data with 3 languages missing per row

### Translated Files

- [1_null_translated.parquet](https://huggingface.co/datasets/srv-space/parlekha-aligned/blob/main/translated/1_null_translated.parquet) - 1 missing language filled via Google Translate
- [2_null_translated.parquet](https://huggingface.co/datasets/srv-space/parlekha-aligned/blob/main/translated/2_null_translated.parquet) - 2 missing languages filled via Google Translate

---

## Citation

If you use this dataset, please cite both this dataset and the original Pralekha dataset:

```bibtex
@misc{parlekha-aligned-2025,
  title={Parlekha Aligned Dataset: Multilingual Parallel Corpus for Indian Languages},
  author={srv-space},
  year={2025},
  howpublished={\url{https://huggingface.co/datasets/srv-space/parlekha-aligned}}
}

@inproceedings{doddapaneni2023pralekha,
  title={Pralekha: A Comprehensive Benchmark for Evaluating LLMs on Indian Languages},
  author={Doddapaneni, Sumanth and Aralikatte, Rahul and Khapra, Mitesh M},
  booktitle={Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing},
  year={2023},
  organization={Association for Computational Linguistics}
}
```

**Note:** Please update the Pralekha citation with the correct publication details if available.
---

## Acknowledgments

This dataset is extracted and aligned from the [Pralekha dataset](https://huggingface.co/datasets/ai4bharat/Pralekha) created by AI4Bharat. We thank the original authors for making their work publicly available.

---

## License

This dataset follows the same license as the parent [Pralekha dataset](https://huggingface.co/datasets/ai4bharat/Pralekha). Please refer to the original dataset for licensing information.

---

## Contact & Contributions

For questions, issues, or contributions, please open an issue on the dataset repository or contact the maintainers.

Provider:

srv-space