surrey-nlp/Low-resource-QE-DA-dataset

Hugging Face | Updated 2025-11-27 | Indexed 2026-01-03
Download link:
https://hf-mirror.com/datasets/surrey-nlp/Low-resource-QE-DA-dataset
Description:
---
language:
- en
- gu
- hi
- mr
- ta
- te
- et
- ne
- si
license: cc
license_name: cc-by-sa-4.0
license_link: LICENSE
tags:
- nlp
- machine-translation
- quality-estimation
- translation-quality-estimation
- low-resource
annotations_creators:
- expert-generated
language_creators:
- expert-generated
language_details: English, Gujarati, Hindi, Marathi, Tamil, Telugu, Estonian, Nepali, Sinhala
pretty_name: Low-resource QE-DA
size_categories:
- 10K<n<100K
source_datasets:
- original
task_categories:
- other
- sentence-similarity
task_ids:
- semantic-similarity-scoring
configs:
- config_name: engu
  default: true
  data_files:
  - split: train
    path: parquet/engu/train.parquet
  - split: validation
    path: parquet/engu/dev.parquet
  - split: test
    path: parquet/engu/test.parquet
- config_name: enhi
  data_files:
  - split: train
    path: parquet/enhi/train.parquet
  - split: validation
    path: parquet/enhi/dev.parquet
  - split: test
    path: parquet/enhi/test.parquet
- config_name: enmr
  data_files:
  - split: train
    path: parquet/enmr/train.parquet
  - split: validation
    path: parquet/enmr/dev.parquet
  - split: test
    path: parquet/enmr/test.parquet
- config_name: enta
  data_files:
  - split: train
    path: parquet/enta/train.parquet
  - split: validation
    path: parquet/enta/dev.parquet
  - split: test
    path: parquet/enta/test.parquet
- config_name: ente
  data_files:
  - split: train
    path: parquet/ente/train.parquet
  - split: validation
    path: parquet/ente/dev.parquet
  - split: test
    path: parquet/ente/test.parquet
- config_name: eten
  data_files:
  - split: train
    path: parquet/eten/train.parquet
  - split: validation
    path: parquet/eten/dev.parquet
  - split: test
    path: parquet/eten/test.parquet
- config_name: neen
  data_files:
  - split: train
    path: parquet/neen/train.parquet
  - split: validation
    path: parquet/neen/dev.parquet
  - split: test
    path: parquet/neen/test.parquet
- config_name: sien
  data_files:
  - split: train
    path: parquet/sien/train.parquet
  - split: validation
    path: parquet/sien/dev.parquet
  - split: test
    path: parquet/sien/test.parquet
- config_name: multilingual
  data_files:
  - split: train
    path: parquet/multilingual/train.parquet
  - split: validation
    path: parquet/multilingual/dev.parquet
  - split: test
    path:
    - parquet/multilingual/test_engu.parquet
    - parquet/multilingual/test_enhi.parquet
    - parquet/multilingual/test_enmr.parquet
    - parquet/multilingual/test_enta.parquet
    - parquet/multilingual/test_ente.parquet
    - parquet/multilingual/test_eten.parquet
    - parquet/multilingual/test_neen.parquet
    - parquet/multilingual/test_sien.parquet
---

# Low-resource QE-DA Dataset

Direct Assessment (DA) quality estimation data for English→Indic (Gujarati, Hindi, Marathi, Tamil, Telugu) and related Estonian/Nepali/Sinhala pairs (configs `eten`, `neen`, `sien`), released with the ALOPE work on LLM-based QE.

- **Paper**: Sindhujan, A., Qian, S., Matthew, C.C.C., Orasan, C., and Kanojia, D. (2025). *ALOPE: Adaptive Layer Optimization for Translation Quality Estimation using Large Language Models.* In Second Conference on Language Modeling. ([arXiv](https://arxiv.org/html/2508.07484v1))
- **Task**: Sentence-level quality estimation with human DA scores and z-scores; some splits also include model scores and post-edited (PE) strings.
- **Format**: Three splits (`train`, `dev`, `test`) per language pair, TSV with columns `index`, `original`, `translation`, `scores`, `mean`, `z_scores`, `z_mean`. The Parquet configs in the metadata above expose `dev` as the `validation` split.
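As a rough illustration of the column layout listed under **Format**, the sketch below reads a TSV file object and checks its header against those seven column names. The helper name `read_qe_tsv` and the sample row are hypothetical and made up for the example; the serialization of `scores` and `z_scores` is not specified in this card, so both are left as raw strings.

```python
import csv
import io

# Column names as listed in the Format description.
EXPECTED_COLUMNS = [
    "index", "original", "translation", "scores", "mean", "z_scores", "z_mean",
]


def read_qe_tsv(handle):
    """Read rows from a TSV file object and check the header.

    Hypothetical helper: `mean` and `z_mean` are converted to float;
    `scores` and `z_scores` stay as raw strings because their exact
    serialization is an unknown here.
    """
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = reader.fieldnames or []
    missing = [col for col in EXPECTED_COLUMNS if col not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    rows = []
    for row in reader:
        row["mean"] = float(row["mean"])
        row["z_mean"] = float(row["z_mean"])
        rows.append(row)
    return rows


# Toy row, invented for illustration only (not taken from the dataset).
sample = (
    "index\toriginal\ttranslation\tscores\tmean\tz_scores\tz_mean\n"
    "0\tsource sentence\ttarget sentence\t[80, 90]\t85.0\t[0.1, 0.3]\t0.2\n"
)
rows = read_qe_tsv(io.StringIO(sample))
print(rows[0]["mean"], rows[0]["z_mean"])
```

For a file on disk, the same function can be called with `open(path, encoding="utf-8", newline="")`.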
## Loading with 🤗 Datasets

Single language pair:

```python
from datasets import load_dataset

ds = load_dataset(
    "surrey-nlp/Low-resource-QE-DA-dataset",
    name="engu",  # choose: engu, enhi, enmr, enta, ente, eten, neen, sien
)
```

Multilingual (train/validation mix all pairs with language labels; the test split is built from the per-language test files):

```python
ds_multi = load_dataset(
    "surrey-nlp/Low-resource-QE-DA-dataset",
    name="multilingual",
)
train = ds_multi["train"]  # has field lang_pair
val = ds_multi["validation"]
test = ds_multi["test"]  # the config lists all eight per-language test files under "test"

# To keep one language pair separate, load its test file directly.
test_engu = load_dataset(
    "surrey-nlp/Low-resource-QE-DA-dataset",
    data_files={"test": "parquet/multilingual/test_engu.parquet"},
    split="test",
)
```

## Data Statistics (DA mean, z_mean)

All pairs combined (Parquet):

- `[multilingual]` split=train | total=75,000 | DA mean 0.500–100.000 | z_mean -5.955–8.454
- `[multilingual]` split=dev | total=8,000 | DA mean 2.500–100.000 | z_mean -5.220–2.671
- `[multilingual]` split=test | total=7,699 | DA mean 2.333–100.000 | z_mean -4.610–2.576

Per language pair (Parquet):

- `engu` train n=7,000 | DA 9.000–100.000 (median 88.667) | z_mean -4.026–0.888 (median 0.269)
- `enhi` train n=7,000 | DA 14.250–98.250 (median 82.000) | z_mean -5.955–1.624 (median 0.097)
- `enmr` train n=26,000 | DA 0.500–95.000 (median 71.750) | z_mean -5.551–8.454 (median 0.100)
- `enta` train n=7,000 | DA 8.000–100.000 (median 90.333) | z_mean -4.540–0.883 (median 0.318)
- `ente` train n=7,000 | DA 5.000–100.000 (median 80.000) | z_mean -2.452–0.749 (median 0.075)
- `eten` train n=7,000 | DA 1.000–100.000 (median 75.333) | z_mean -2.754–1.382 (median 0.308)
- `neen` train n=7,000 | DA 1.000–100.000 (median 34.000) | z_mean -2.125–3.177 (median -0.182)
- `sien` train n=7,000 | DA 1.000–100.000 (median 48.667) | z_mean -2.008–1.888 (median -0.053)
- Dev (n=1,000 each): DA medians 90.000/82.250/71.250/90.333/80.000/61.500/35.333/50.333 for engu/enhi/enmr/enta/ente/eten/neen/sien; z_mean medians 0.273/0.129/0.087/0.337/0.181/-0.069/-0.272/-0.248.
- Test: engu/enhi/enta/ente/eten/neen/sien each n=1,000; enmr n=699. DA medians, in that order, 88.333/82.000/85.667/80.000/54.500/34.000/52.250 (enmr 71.750). z_mean medians 0.256/0.088/0.164/0.177/-0.285/-0.275/-0.233 (enmr 0.135).

Notes:

- Statistics reflect the latest Parquet regeneration. If an older copy is cached locally, pass `download_mode="force_redownload"` to `load_dataset` to fetch the current files before recomputing them.
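The ranges and medians reported in the statistics can be recomputed from the `mean` and `z_mean` columns. The following is a minimal sketch of one way to do that, not the script used to produce the numbers in this card; `summarize` and `summarize_split` are hypothetical helper names, and the toy values are invented so the block runs without downloading anything.

```python
from statistics import median


def summarize(values):
    """Return (min, max, median) for a non-empty sequence of numbers."""
    vals = [float(v) for v in values]
    if not vals:
        raise ValueError("cannot summarize an empty column")
    return min(vals), max(vals), median(vals)


def summarize_split(split, columns=("mean", "z_mean")):
    """Summarize the DA columns of one split.

    `split` may be a `datasets.Dataset` or any mapping from column
    name to a sequence of numbers (as in the toy example below).
    """
    return {col: summarize(split[col]) for col in columns}


# Toy values, invented for illustration only (not taken from the dataset).
toy_split = {
    "mean": [10.0, 50.0, 90.0, 70.0],
    "z_mean": [-1.0, 0.0, 1.0, 0.5],
}
for col, (lo, hi, med) in summarize_split(toy_split).items():
    print(f"{col}: {lo:.3f}–{hi:.3f} (median {med:.3f})")
```

With the real data, a split returned by `load_dataset("surrey-nlp/Low-resource-QE-DA-dataset", name="engu")["train"]` can be passed to `summarize_split` in place of `toy_split`.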
Provider:
surrey-nlp