
Humair332/Vast-Urdu

Source: Hugging Face. Updated: 2026-01-17. Indexed: 2026-03-29.
Download link:
https://hf-mirror.com/datasets/Humair332/Vast-Urdu
Resource description:

---
license: mit
task_categories:
- translation
- token-classification
language:
- ur
- en
- zh
- ar
- hy
- ak
tags:
- nmt
- parallel-corpus
- multilingual
- urdu
- large-scale
- bitext
- synthetic-data
pretty_name: Vast Urdu Parallel Corpus
---

# Vast Urdu Parallel Corpus

## Dataset Description

**Vast-Urdu** is a large-scale collection of parallel text corpora filtered to support Urdu (UR) language research. The dataset was extracted from `liboaccn/nmt-parallel-corpus` to provide a dedicated resource for Neural Machine Translation (NMT), cross-lingual understanding, and token-classification tasks involving Urdu.

### Source Data

The data is sourced from a massive web-scale crawl and contains sentence-aligned pairs between Urdu and several other languages, including:

* **English (en)**
* **Chinese (zh)**
* **Arabic (ar)**
* **Armenian (hy)**
* **Akan (ak)**

## Dataset Structure

The files are provided in `.parquet` format for efficient storage and fast loading. Each file represents a language pair (e.g., `en-ur.parquet`), containing:

- **Source text**: The text in the primary language.
- **Target text**: The corresponding translation in Urdu (or vice-versa).

## Usage

You can load this dataset directly using the Hugging Face `datasets` library:

```python
from datasets import load_dataset

# Repository ID as given in the dataset card; this listing's own ID is Humair332/Vast-Urdu.
# A single data file passed via data_files is exposed as the "train" split.
dataset = load_dataset("ReySajju742/Vast-Urdu", data_files="en-ur.parquet")
print(dataset['train'][0])
```
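To work with several language pairs at once, one option is to pass a mapping of split names to files through `data_files`. The sketch below is illustrative rather than part of the dataset card: the helper names `build_data_files` and `load_pairs` are hypothetical, and it assumes every pair follows the `<lang>-ur.parquet` naming of the card's single example, `en-ur.parquet`, which the card does not confirm for the other languages.

```python
# Language codes listed on the dataset card as paired with Urdu.
PAIRED_LANGS = ["en", "zh", "ar", "hy", "ak"]


def build_data_files(langs, target="ur"):
    """Map a split name (e.g. "en_ur") to its assumed parquet file name.

    Assumption: every pair follows the "<lang>-ur.parquet" pattern of the
    card's only example, "en-ur.parquet".
    """
    return {f"{lang}_{target}": f"{lang}-{target}.parquet" for lang in langs}


def load_pairs(repo_id, langs):
    """Load the selected pairs as named splits (requires network access)."""
    # Imported lazily so build_data_files works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(repo_id, data_files=build_data_files(langs))


print(build_data_files(["en", "zh"]))
# {'en_ur': 'en-ur.parquet', 'zh_ur': 'zh-ur.parquet'}
```

Calling `load_pairs("Humair332/Vast-Urdu", ["en", "zh"])` would then return one split per pair, provided the assumed file names exist in the repository. Split names use an underscore because `datasets` does not accept hyphens in split names.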
Provider:
Humair332