ReySajju742/Vast-Urdu
Source: Hugging Face. Updated 2026-01-16; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/ReySajju742/Vast-Urdu
Resource description:
---
license: mit
task_categories:
- translation
- token-classification
language:
- ur
- en
- zh
- ar
- hy
- ak
tags:
- nmt
- parallel-corpus
- multilingual
- urdu
- large-scale
- bitext
- synthetic-data
pretty_name: Vast Urdu Parallel Corpus
---
# Vast Urdu Parallel Corpus
## Dataset Description
**Vast-Urdu** is a large-scale collection of parallel text corpora filtered to support Urdu (ur) language research. It was extracted from `liboaccn/nmt-parallel-corpus` to provide a dedicated resource for neural machine translation (NMT), cross-lingual understanding, and token-classification tasks involving Urdu.
### Source Data
The data is sourced from a web-scale crawl and contains sentence-aligned pairs between Urdu and several other languages, including:
* **English (en)**
* **Chinese (zh)**
* **Arabic (ar)**
* **Armenian (hy)**
* **Akan (ak)**
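As a rough sketch, the partner languages above can be mapped to the per-pair files a user might request. This card only confirms the name `en-ur.parquet`; the `<code>-ur.parquet` pattern for the other languages is an assumption, and `PARTNER_LANGUAGES` and `candidate_data_file` are hypothetical names introduced for illustration. Check the repository's file listing before relying on any of these names.

```python
# Partner languages exactly as listed in this card (ISO 639-1 codes).
PARTNER_LANGUAGES = {
    "en": "English",
    "zh": "Chinese",
    "ar": "Arabic",
    "hy": "Armenian",
    "ak": "Akan",
}


def candidate_data_file(lang_code: str) -> str:
    """Return the assumed parquet file name for one partner language.

    ASSUMPTION: every pair follows the `<code>-ur.parquet` pattern.
    Only `en-ur.parquet` is confirmed by the card itself.
    """
    if lang_code not in PARTNER_LANGUAGES:
        raise ValueError(
            f"{lang_code!r} is not a partner language listed in this card"
        )
    return f"{lang_code}-ur.parquet"


# List the candidate file name for each partner language.
for code, name in PARTNER_LANGUAGES.items():
    print(f"{name}: {candidate_data_file(code)}")
```

A name produced this way could be passed as the `data_files` argument of `load_dataset`, provided the file actually exists in the repository.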
## Dataset Structure
The files are provided in `.parquet` format for efficient storage and fast loading. Each file represents a language pair (e.g., `en-ur.parquet`), containing:
- **Source text**: The text in the primary language.
- **Target text**: The corresponding translation in Urdu (or vice-versa).
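For NMT training it is often convenient to reshape each sentence pair into the nested `{"translation": {lang: text}}` layout used by several translation datasets on the Hugging Face Hub. The sketch below shows one way to do that. The card does not state the actual column names, so `"source"` and `"target"` are placeholders, and `to_translation_example` is a hypothetical helper rather than part of this dataset or of the `datasets` library.

```python
from typing import Any, Dict


def to_translation_example(
    row: Dict[str, Any],
    source_column: str,
    target_column: str,
    source_lang: str,
    target_lang: str = "ur",
) -> Dict[str, Dict[str, str]]:
    """Wrap one sentence pair as {"translation": {lang: text, ...}}."""
    # Fail early with a helpful message if the assumed columns are absent.
    missing = [c for c in (source_column, target_column) if c not in row]
    if missing:
        raise KeyError(
            f"columns not found in row: {missing}; available: {sorted(row)}"
        )
    return {
        "translation": {
            source_lang: str(row[source_column]).strip(),
            target_lang: str(row[target_column]).strip(),
        }
    }


# Hypothetical row: the column names "source" and "target" are placeholders.
row = {"source": "Hello, world. ", "target": "ہیلو، دنیا۔"}
example = to_translation_example(row, "source", "target", source_lang="en")
print(example)
```

Because the Urdu side may be either the source or the target, inspect a few rows of each file first and pass the column names and language codes accordingly.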
## Usage
You can load this dataset directly using the Hugging Face `datasets` library:
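```python
from datasets import load_dataset

# Load only the English-Urdu file; a single data file is exposed as the "train" split.
dataset = load_dataset("ReySajju742/Vast-Urdu", data_files="en-ur.parquet")

# Print the first sentence pair.
print(dataset["train"][0])
```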
Provider:
ReySajju742



