
takehika/wanli-ja-nli

Source: Hugging Face. Updated 2026-03-27; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/takehika/wanli-ja-nli
Description:
---
language:
- ja
- en
license: cc-by-4.0
configs:
- config_name: ja_only
  data_files:
  - split: train
    path: data/ja_only/train.parquet
  - split: test
    path: data/ja_only/test.parquet
- config_name: bilingual
  data_files:
  - split: train
    path: data/bilingual/train.parquet
  - split: test
    path: data/bilingual/test.parquet
task_categories:
- text-classification
task_ids:
- natural-language-inference
tags:
- nli
- wanli
- japanese
- translation
size_categories:
- 10K<n<100K
---

# wanli-ja-nli

**wanli-ja-nli** is a Japanese NLI dataset derived from [WANLI](https://huggingface.co/datasets/alisawuffles/WANLI), created by translating English premise-hypothesis pairs into Japanese and applying quality filtering.

Each record keeps source linkage fields (`source_id`, `source_pairID`) so users can trace back to the original WANLI example.

This repository provides two dataset configs:

- `ja_only`: training-oriented Japanese-only fields
- `bilingual`: English + Japanese parallel fields

## Quickstart

```python
from datasets import load_dataset

# Japanese-only
ja = load_dataset("takehika/wanli-ja-nli", "ja_only")
print(ja["train"][0])

# English-Japanese parallel
bi = load_dataset("takehika/wanli-ja-nli", "bilingual")
print(bi["train"][0])
```

## Dataset Overview

- Source dataset: `alisawuffles/WANLI`
- Source split sizes:
  - `train`: 102,885
  - `test`: 5,000
- This derived dataset contains accepted rows only:
  - `train`: 73,942
  - `test`: 3,505
- Record-level linkage fields to source WANLI:
  - `source_id` (WANLI `id`)
  - `source_pairID` (WANLI `pairID`)

## Configs

### `ja_only`

Files:

- `data/ja_only/train.parquet` (73,942 rows)
- `data/ja_only/test.parquet` (3,505 rows)

Fields:

- `source_id`
- `source_pairID`
- `source_split`
- `source_row_id_internal`
- `premise`
- `hypothesis`
- `gold`

### `bilingual`

Files:

- `data/bilingual/train.parquet` (73,942 rows)
- `data/bilingual/test.parquet` (3,505 rows)

Fields:

- `source_id`
- `source_pairID`
- `source_split`
- `source_row_id_internal`
- `premise_en`
- `hypothesis_en`
- `premise_ja`
- `hypothesis_ja`
- `gold`

## Label Space

- `entailment`
- `neutral`
- `contradiction`

## Processing

1. Translate WANLI English premise/hypothesis pairs into Japanese.
2. Stage-1 filtering:
   - hard constraints: no numeric mismatch flags in premise/hypothesis
   - length ratio constraint: `0.30 <= len_ratio <= 2.40`
   - self-score thresholds
3. Stage-2 judge audit:
   - Input: `premise_en`, `hypothesis_en`, `gold`, `premise_ja`, `hypothesis_ja`
   - Decision: whether the translation preserves NLI validity (`pass=true/false`)
4. Final acceptance rule:
   - accept only rows that pass both Stage-1 and Stage-2

Notes:

- Translation is LLM-based, and final acceptance combines rule-based Stage-1 checks with LLM-based signals and LLM-based Stage-2 judging.
- Stage-1 uses fixed thresholds in this release.

## Label Distribution Shift (Source vs Accepted)

This release publishes accepted rows only, so label proportions are shifted from source WANLI.

Train split:

- Source WANLI (102,885): entailment 37.43% (38,511), neutral 47.60% (48,977), contradiction 14.97% (15,397)
- This dataset (73,942): entailment 41.42% (30,626), neutral 42.13% (31,155), contradiction 16.45% (12,161)
- Retention by label vs source: entailment 79.53%, neutral 63.61%, contradiction 78.98%

Test split:

- Source WANLI (5,000): entailment 37.16% (1,858), neutral 47.94% (2,397), contradiction 14.90% (745)
- This dataset (3,505): entailment 41.74% (1,463), neutral 41.31% (1,448), contradiction 16.95% (594)
- Retention by label vs source: entailment 78.74%, neutral 60.41%, contradiction 79.73%

Practical implication:

- Neutral examples are relatively more likely to be filtered out than entailment/contradiction.
- Use caution when comparing absolute scores against models trained/evaluated on original WANLI.

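The retention and share percentages quoted above follow directly from the per-label counts. The sketch below recomputes them as a sanity check; the counts are copied from this card, while the names `COUNTS`, `retention_by_label`, and `label_shares` are illustrative and not part of the dataset or any library.

```python
# Per-label row counts copied from the card: (source WANLI, accepted in this dataset).
COUNTS = {
    "train": {
        "entailment": (38_511, 30_626),
        "neutral": (48_977, 31_155),
        "contradiction": (15_397, 12_161),
    },
    "test": {
        "entailment": (1_858, 1_463),
        "neutral": (2_397, 1_448),
        "contradiction": (745, 594),
    },
}


def retention_by_label(split_counts):
    """Return the percentage of source rows kept for each label."""
    return {
        label: 100.0 * accepted / source
        for label, (source, accepted) in split_counts.items()
    }


def label_shares(split_counts, index):
    """Return each label's percentage share; index 0 = source, 1 = accepted."""
    total = sum(pair[index] for pair in split_counts.values())
    return {
        label: 100.0 * pair[index] / total
        for label, pair in split_counts.items()
    }


for split, split_counts in COUNTS.items():
    retention = retention_by_label(split_counts)
    shares = label_shares(split_counts, 1)
    for label in split_counts:
        print(
            f"{split:5s} {label:13s} "
            f"retained {retention[label]:.2f}%  share {shares[label]:.2f}%"
        )
```

Comparing `label_shares(split_counts, 0)` with `label_shares(split_counts, 1)` shows the shift in one place: the neutral share drops by several points in both splits, while entailment and contradiction rise.
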
## Source and Attribution

- Original dataset: [alisawuffles/WANLI](https://huggingface.co/datasets/alisawuffles/WANLI), licensed under CC BY 4.0
- This dataset is an adapted/translated derivative of WANLI.
- Modifications made in this derivative:
  - translated `premise` / `hypothesis` from English to Japanese
  - applied two-stage quality filtering
  - released accepted subset only
  - preserved record-level linkage fields (`source_id`, `source_pairID`) to the original WANLI records

## License

- This dataset is licensed under CC BY 4.0.

## Limitations

- This dataset is machine-translated and automatically filtered/judged; residual translation and label-consistency errors may remain.
- Domain and style follow WANLI characteristics; transfer to other domains may vary.

## Citation

```bibtex
@misc{liu-etal-2022-wanli,
  title = "WANLI: Worker and AI Collaboration for Natural Language Inference Dataset Creation",
  author = "Liu, Alisa and Swayamdipta, Swabha and Smith, Noah A. and Choi, Yejin",
  month = jan,
  year = "2022",
  url = "https://arxiv.org/pdf/2201.05955",
}
```
Provider:
takehika