Humair332/Vast-Urdu

Name: Humair332/Vast-Urdu
Creator: Humair332
Published: 2026-01-17 06:39:43
License: No description available

Hugging Face | Updated: 2026-01-17 | Indexed: 2026-03-29

Download link:

https://hf-mirror.com/datasets/Humair332/Vast-Urdu

Resource description:

---
license: mit
task_categories:
- translation
- token-classification
language:
- ur
- en
- zh
- ar
- hy
- ak
tags:
- nmt
- parallel-corpus
- multilingual
- urdu
- large-scale
- bitext
- synthetic-data
pretty_name: Vast Urdu Parallel Corpus
---

# Vast Urdu Parallel Corpus

## Dataset Description

**Vast-Urdu** is a large-scale collection of parallel text corpora specifically filtered to support Urdu (UR) language research. This dataset was extracted from the `liboaccn/nmt-parallel-corpus` to provide a dedicated resource for Neural Machine Translation (NMT), cross-lingual understanding, and token-classification tasks involving Urdu.

### Source Data

The data is sourced from a massive web-scale crawl, containing sentence-aligned pairs between Urdu and several other languages including:

* **English (en)**
* **Chinese (zh)**
* **Arabic (ar)**
* **Armenian (hy)**
* **Akan (ak)**

## Dataset Structure

The files are provided in `.parquet` format for efficient storage and fast loading. Each file represents a language pair (e.g., `en-ur.parquet`), containing:

- **Source text**: The text in the primary language.
- **Target text**: The corresponding translation in Urdu (or vice-versa).

## Usage

You can load this dataset directly using the Hugging Face `datasets` library:

```python
from datasets import load_dataset

dataset = load_dataset("ReySajju742/Vast-Urdu", data_files="en-ur.parquet")
print(dataset['train'][0])
```
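The card names only one file, `en-ur.parquet`, so the file names for the other language pairs are not confirmed. The following is a minimal sketch that assumes every pair follows the same `<lang>-ur.parquet` pattern. The helper names `pair_file` and `data_files_for` are hypothetical and not part of the dataset or of the `datasets` library.

```python
# Language codes the dataset card lists as paired with Urdu.
PAIRED_LANGUAGES = ("en", "zh", "ar", "hy", "ak")


def pair_file(lang: str, urdu_first: bool = False) -> str:
    """Return the assumed Parquet file name for a language pair with Urdu.

    Only `en-ur.parquet` appears in the card; every other name produced
    here is an assumption that follows the same pattern.
    """
    if lang not in PAIRED_LANGUAGES:
        raise ValueError(f"{lang!r} is not listed as paired with Urdu")
    codes = ("ur", lang) if urdu_first else (lang, "ur")
    return "-".join(codes) + ".parquet"


def data_files_for(langs):
    """Map each language code to its assumed file name.

    The resulting dict has the split-name -> file-name shape that
    `load_dataset(..., data_files=...)` accepts.
    """
    return {lang: pair_file(lang) for lang in langs}


if __name__ == "__main__":
    print(pair_file("en"))
    print(data_files_for(["en", "ar"]))
```

If the assumed names match the repository contents, the mapping from `data_files_for` could be passed as the `data_files` argument of `load_dataset`, which would expose each language pair as its own split. It is worth checking the repository's file listing first, since a wrong file name makes the load fail.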

Provider:

Humair332
