surrey-nlp/Low-resource-QE-DA-dataset
Source: Hugging Face. Updated 2025-11-27; indexed 2026-01-03.
Download link: https://hf-mirror.com/datasets/surrey-nlp/Low-resource-QE-DA-dataset
Description:
---
language:
- en
- gu
- hi
- mr
- ta
- te
- et
- ne
- si
license: cc
license_name: cc-by-sa-4.0
license_link: LICENSE
tags:
- nlp
- machine-translation
- quality-estimation
- translation-quality-estimation
- low-resource
annotations_creators:
- expert-generated
language_creators:
- expert-generated
language_details: English, Gujarati, Hindi, Marathi, Tamil, Telugu, Estonian, Nepali, Sinhala
pretty_name: Low-resource QE-DA
size_categories:
- 10K<n<100K
source_datasets:
- original
task_categories:
- other
- sentence-similarity
task_ids:
- semantic-similarity-scoring
configs:
- config_name: engu
  default: true
  data_files:
  - split: train
    path: parquet/engu/train.parquet
  - split: validation
    path: parquet/engu/dev.parquet
  - split: test
    path: parquet/engu/test.parquet
- config_name: enhi
  data_files:
  - split: train
    path: parquet/enhi/train.parquet
  - split: validation
    path: parquet/enhi/dev.parquet
  - split: test
    path: parquet/enhi/test.parquet
- config_name: enmr
  data_files:
  - split: train
    path: parquet/enmr/train.parquet
  - split: validation
    path: parquet/enmr/dev.parquet
  - split: test
    path: parquet/enmr/test.parquet
- config_name: enta
  data_files:
  - split: train
    path: parquet/enta/train.parquet
  - split: validation
    path: parquet/enta/dev.parquet
  - split: test
    path: parquet/enta/test.parquet
- config_name: ente
  data_files:
  - split: train
    path: parquet/ente/train.parquet
  - split: validation
    path: parquet/ente/dev.parquet
  - split: test
    path: parquet/ente/test.parquet
- config_name: eten
  data_files:
  - split: train
    path: parquet/eten/train.parquet
  - split: validation
    path: parquet/eten/dev.parquet
  - split: test
    path: parquet/eten/test.parquet
- config_name: neen
  data_files:
  - split: train
    path: parquet/neen/train.parquet
  - split: validation
    path: parquet/neen/dev.parquet
  - split: test
    path: parquet/neen/test.parquet
- config_name: sien
  data_files:
  - split: train
    path: parquet/sien/train.parquet
  - split: validation
    path: parquet/sien/dev.parquet
  - split: test
    path: parquet/sien/test.parquet
- config_name: multilingual
  data_files:
  - split: train
    path: parquet/multilingual/train.parquet
  - split: validation
    path: parquet/multilingual/dev.parquet
  - split: test
    path:
    - parquet/multilingual/test_engu.parquet
    - parquet/multilingual/test_enhi.parquet
    - parquet/multilingual/test_enmr.parquet
    - parquet/multilingual/test_enta.parquet
    - parquet/multilingual/test_ente.parquet
    - parquet/multilingual/test_eten.parquet
    - parquet/multilingual/test_neen.parquet
    - parquet/multilingual/test_sien.parquet
---
# Low-resource QE-DA Dataset
Direct Assessment (DA) quality estimation data for English→Indic pairs (Gujarati, Hindi, Marathi, Tamil, Telugu) and for Estonian→English, Nepali→English, and Sinhala→English, released with the ALOPE work on LLM-based QE.
- **Paper**: Sindhujan, A., Qian, S., Matthew, C.C.C., Orasan, C., and Kanojia, D. (2025). *ALOPE: Adaptive Layer Optimization for Translation Quality Estimation using Large Language Models.* In Second Conference on Language Modeling. ([arXiv](https://arxiv.org/html/2508.07484v1))
- **Task**: Sentence-level quality estimation with human DA scores and z-scores; some splits also include model scores and post-edited (PE) strings.
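- **Format**: Three splits (`train`, `dev`, `test`) per language pair, with columns `index`, `original`, `translation`, `scores`, `mean`, `z_scores`, `z_mean`. The loader configs point at Parquet files under `parquet/<pair>/`, and the `dev` file is exposed as the `validation` split when loading with 🤗 Datasets.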
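The card does not state how `z_scores` and `z_mean` were derived. The usual DA convention is to standardise each annotator's raw scores and then average per segment; the sketch below illustrates that convention on made-up numbers. The annotator ids, the scores, and the helper name `z_normalise` are all hypothetical, and the dataset's actual procedure may differ.

```python
from statistics import mean, pstdev


def z_normalise(scores_by_annotator):
    """Standardise each annotator's raw DA scores to zero mean and unit variance."""
    z = {}
    for annotator, scores in scores_by_annotator.items():
        mu = mean(scores)
        sigma = pstdev(scores)
        # Guard against an annotator who gave the same score to every segment.
        z[annotator] = [0.0 if sigma == 0 else (s - mu) / sigma for s in scores]
    return z


# Hypothetical raw DA scores (0-100) from two annotators over the same three segments.
raw = {"a1": [90, 60, 30], "a2": [100, 80, 60]}
z = z_normalise(raw)

# Segment-level aggregates, analogous to the `mean` and `z_mean` columns.
da_mean = [mean(col) for col in zip(*raw.values())]
z_mean = [mean(col) for col in zip(*z.values())]

print(da_mean)
print([round(v, 3) for v in z_mean])
```

Under this convention a strict annotator and a lenient one contribute comparable z-scores, which is why `z_mean` is often preferred over the raw `mean` as a regression target.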
## Loading with 🤗 Datasets
Single language pair:
```python
from datasets import load_dataset

ds = load_dataset(
    "surrey-nlp/Low-resource-QE-DA-dataset",
    name="engu",  # choose: engu, enhi, enmr, enta, ente, eten, neen, sien
)
```
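Multilingual (train/validation mix all pairs and carry a `lang_pair` label; per the `multilingual` config, the eight per-language test files load as one `test` split):
```python
from datasets import load_dataset

repo = "surrey-nlp/Low-resource-QE-DA-dataset"
ds_multi = load_dataset(repo, name="multilingual")
train = ds_multi["train"]  # has field lang_pair
val = ds_multi["validation"]
test = ds_multi["test"]  # all eight test files combined; no "test_engu" split is defined

# For a single pair's test set, load that pair's own config instead.
test_engu = load_dataset(repo, name="engu", split="test")
```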
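Since the multilingual training data mixes all pairs, a per-pair view can be recovered by grouping on the `lang_pair` field. The following is a minimal sketch on stand-in rows; the label values (`"engu"`, `"neen"`) are assumed to match the config names, and the helper `split_by_pair` is hypothetical rather than part of the dataset or of 🤗 Datasets.

```python
from collections import defaultdict


def split_by_pair(rows, key="lang_pair"):
    """Group row dicts by their language-pair label."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return dict(groups)


# Stand-in rows with made-up values; real rows would come from
# iterating over the multilingual train split.
rows = [
    {"lang_pair": "engu", "z_mean": 0.27},
    {"lang_pair": "neen", "z_mean": -0.18},
    {"lang_pair": "engu", "z_mean": 0.31},
]
by_pair = split_by_pair(rows)
print({pair: len(items) for pair, items in by_pair.items()})
```

On a loaded `Dataset`, `Dataset.filter(lambda r: r["lang_pair"] == "engu")` gives the same subset for one pair without materialising every group.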
## Data Statistics (DA mean, z_mean)
All pairs combined (Parquet):
- `[multilingual]` split=train | total=75,000 | DA mean 0.500–100.000 | z_mean -5.955–8.454
- `[multilingual]` split=dev | total=8,000 | DA mean 2.500–100.000 | z_mean -5.220–2.671
- `[multilingual]` split=test | total=7,699 | DA mean 2.333–100.000 | z_mean -4.610–2.576
Per language pair (Parquet):
- `engu` train n=7,000 | DA 9.000–100.000 (median 88.667) | z_mean -4.026–0.888 (median 0.269)
- `enhi` train n=7,000 | DA 14.250–98.250 (median 82.000) | z_mean -5.955–1.624 (median 0.097)
- `enmr` train n=26,000 | DA 0.500–95.000 (median 71.750) | z_mean -5.551–8.454 (median 0.100)
- `enta` train n=7,000 | DA 8.000–100.000 (median 90.333) | z_mean -4.540–0.883 (median 0.318)
- `ente` train n=7,000 | DA 5.000–100.000 (median 80.000) | z_mean -2.452–0.749 (median 0.075)
- `eten` train n=7,000 | DA 1.000–100.000 (median 75.333) | z_mean -2.754–1.382 (median 0.308)
- `neen` train n=7,000 | DA 1.000–100.000 (median 34.000) | z_mean -2.125–3.177 (median -0.182)
- `sien` train n=7,000 | DA 1.000–100.000 (median 48.667) | z_mean -2.008–1.888 (median -0.053)
- Dev (n=1,000 each), median DA / median z_mean:
  - `engu`: 90.000 / 0.273
  - `enhi`: 82.250 / 0.129
  - `enmr`: 71.250 / 0.087
  - `enta`: 90.333 / 0.337
  - `ente`: 80.000 / 0.181
  - `eten`: 61.500 / -0.069
  - `neen`: 35.333 / -0.272
  - `sien`: 50.333 / -0.248
- Test (n=1,000 each, except `enmr` with n=699), median DA / median z_mean:
  - `engu`: 88.333 / 0.256
  - `enhi`: 82.000 / 0.088
  - `enmr`: 71.750 / 0.135
  - `enta`: 85.667 / 0.164
  - `ente`: 80.000 / 0.177
  - `eten`: 54.500 / -0.285
  - `neen`: 34.000 / -0.275
  - `sien`: 52.250 / -0.233
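Figures in this style can be recomputed from the `mean` and `z_mean` columns with a few lines of standard-library Python. This is a sketch only: the input values are made up, and `describe` and `format_line` are hypothetical helpers, not the script used to produce the numbers above. The last lines check that the per-pair counts add up to the combined totals.

```python
from statistics import median


def describe(values):
    """Return (n, min, max, median) for a list of numbers."""
    return len(values), min(values), max(values), median(values)


def format_line(pair, split, da, z):
    """Render one statistics line in the same layout as the list above."""
    n, da_lo, da_hi, da_med = describe(da)
    _, z_lo, z_hi, z_med = describe(z)
    return (
        f"`{pair}` {split} n={n:,} | DA {da_lo:.3f}–{da_hi:.3f} (median {da_med:.3f}) "
        f"| z_mean {z_lo:.3f}–{z_hi:.3f} (median {z_med:.3f})"
    )


# Made-up values standing in for the `mean` and `z_mean` columns of one split.
da = [88.0, 91.5, 40.0, 76.25]
z = [0.25, 0.4, -1.9, -0.05]
line = format_line("demo", "train", da, z)
print(line)

# Consistency check on the published counts: seven pairs of 7,000 plus enmr.
train_total = 7 * 7_000 + 26_000
dev_total = 8 * 1_000
test_total = 7 * 1_000 + 699
print(train_total, dev_total, test_total)
```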
Notes:
- Statistics reflect the latest Parquet regeneration. If an older copy is cached locally, pass `download_mode="force_redownload"` to `load_dataset` so the regenerated files are fetched instead of the cached ones.
Provider:
surrey-nlp



