takehika/wanli-ja-nli
Source: Hugging Face. Updated 2026-03-27. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/takehika/wanli-ja-nli
Description:
---
language:
- ja
- en
license: cc-by-4.0
configs:
- config_name: ja_only
  data_files:
  - split: train
    path: data/ja_only/train.parquet
  - split: test
    path: data/ja_only/test.parquet
- config_name: bilingual
  data_files:
  - split: train
    path: data/bilingual/train.parquet
  - split: test
    path: data/bilingual/test.parquet
task_categories:
- text-classification
task_ids:
- natural-language-inference
tags:
- nli
- wanli
- japanese
- translation
size_categories:
- 10K<n<100K
---
# wanli-ja-nli
**wanli-ja-nli** is a Japanese NLI dataset derived from [WANLI](https://huggingface.co/datasets/alisawuffles/WANLI), created by translating English premise-hypothesis pairs into Japanese and applying quality filtering.
Each record keeps source linkage fields (`source_id`, `source_pairID`) so users can trace back to the original WANLI example.
This repository provides two dataset configs:
- `ja_only`: training-oriented Japanese-only fields
- `bilingual`: English + Japanese parallel fields
## Quickstart
```python
from datasets import load_dataset
# Japanese-only
ja = load_dataset("takehika/wanli-ja-nli", "ja_only")
print(ja["train"][0])
# English-Japanese parallel
bi = load_dataset("takehika/wanli-ja-nli", "bilingual")
print(bi["train"][0])
```
## Dataset Overview
- Source dataset: `alisawuffles/WANLI`
- Source split sizes:
  - `train`: 102,885
  - `test`: 5,000
- This derived dataset contains accepted rows only:
  - `train`: 73,942
  - `test`: 3,505
- Record-level linkage fields to source WANLI:
  - `source_id` (WANLI `id`)
  - `source_pairID` (WANLI `pairID`)
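The linkage fields make it possible to look up the original WANLI example for any accepted row. The following is a minimal sketch, not part of the released tooling: the helper names `build_source_index` and `trace_to_source` are hypothetical, and the rows are made-up stand-ins for real records.

```python
def build_source_index(source_rows, key="id"):
    """Map each WANLI `id` to its row for constant-time lookup."""
    return {row[key]: row for row in source_rows}


def trace_to_source(record, source_index):
    """Return the WANLI row a derived record came from, or None if it is absent."""
    return source_index.get(record["source_id"])


# Toy rows standing in for the real datasets (all values are made up).
wanli_rows = [
    {"id": 101, "pairID": "101", "premise": "A dog runs.",
     "hypothesis": "An animal moves.", "gold": "entailment"},
    {"id": 102, "pairID": "102", "premise": "It is raining.",
     "hypothesis": "The sky is clear.", "gold": "contradiction"},
]
derived_row = {"source_id": 102, "source_pairID": "102", "gold": "contradiction"}

index = build_source_index(wanli_rows)
original = trace_to_source(derived_row, index)
print(original["premise"], "|", original["hypothesis"])
```

With real data, the same functions should work on rows iterated from a split of `alisawuffles/WANLI` and a split of this dataset.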
## Configs
### `ja_only`
Files:
- `data/ja_only/train.parquet` (73,942 rows)
- `data/ja_only/test.parquet` (3,505 rows)
Fields:
- `source_id`
- `source_pairID`
- `source_split`
- `source_row_id_internal`
- `premise`
- `hypothesis`
- `gold`
### `bilingual`
Files:
- `data/bilingual/train.parquet` (73,942 rows)
- `data/bilingual/test.parquet` (3,505 rows)
Fields:
- `source_id`
- `source_pairID`
- `source_split`
- `source_row_id_internal`
- `premise_en`
- `hypothesis_en`
- `premise_ja`
- `hypothesis_ja`
- `gold`
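Because the two configs expose different column names, a quick schema check can catch code that loads one config while expecting the other. This is a sketch under the assumption that each record is a plain `dict`, as returned by indexing a split; `missing_fields` is a hypothetical helper and the sample row is invented.

```python
# Field lists copied from the config descriptions above.
JA_ONLY_FIELDS = (
    "source_id", "source_pairID", "source_split", "source_row_id_internal",
    "premise", "hypothesis", "gold",
)
BILINGUAL_FIELDS = (
    "source_id", "source_pairID", "source_split", "source_row_id_internal",
    "premise_en", "hypothesis_en", "premise_ja", "hypothesis_ja", "gold",
)


def missing_fields(record, expected):
    """List the expected field names that are not present in `record`."""
    return [name for name in expected if name not in record]


# Invented row shaped like a `ja_only` record.
row = {
    "source_id": 1, "source_pairID": "1", "source_split": "train",
    "source_row_id_internal": 0, "premise": "...", "hypothesis": "...",
    "gold": "neutral",
}
print(missing_fields(row, JA_ONLY_FIELDS))    # nothing missing
print(missing_fields(row, BILINGUAL_FIELDS))  # the four *_en / *_ja columns
```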
## Label Space
- `entailment`
- `neutral`
- `contradiction`
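Classifier training usually needs integer class ids rather than label strings. A minimal sketch, assuming the `gold` field holds one of the three strings above; the id order is an arbitrary choice here, not something the dataset defines.

```python
# Assumed id order; pick whatever order matches your model's config.
LABELS = ("entailment", "neutral", "contradiction")
LABEL2ID = {name: i for i, name in enumerate(LABELS)}
ID2LABEL = {i: name for name, i in LABEL2ID.items()}


def encode_label(record):
    """Return the integer id for a record's `gold` label string."""
    return LABEL2ID[record["gold"]]


print(encode_label({"gold": "neutral"}))
print(ID2LABEL[2])
```

An unknown label raises `KeyError`, which surfaces unexpected values early instead of silently mapping them.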
## Processing
1. Translate WANLI English premise/hypothesis pairs into Japanese.
2. Stage-1 filtering:
   - hard constraints: no numeric mismatch flags in premise/hypothesis
   - length ratio constraint: `0.30 <= len_ratio <= 2.40`
   - self-score thresholds
3. Stage-2 judge audit:
   - Input: `premise_en`, `hypothesis_en`, `gold`, `premise_ja`, `hypothesis_ja`
   - Decision: whether the translation preserves NLI validity (`pass=true/false`)
4. Final acceptance rule:
   - accept only rows that pass both Stage-1 and Stage-2
Notes:
- Translation is LLM-based. Final acceptance combines rule-based Stage-1 checks and LLM-based signals with an LLM-based Stage-2 judge.
- Stage-1 uses fixed thresholds in this release.
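The acceptance logic described above can be sketched as a pair of predicates. This is an illustration, not the release pipeline: the `Candidate` class and its field names are hypothetical, how `len_ratio` is computed is not specified in this card, and the self-score threshold values are not published, so the threshold is passed in as a parameter.

```python
from dataclasses import dataclass

# Length-ratio bounds as stated in the Stage-1 description.
LEN_RATIO_MIN = 0.30
LEN_RATIO_MAX = 2.40


@dataclass
class Candidate:
    """One translated pair with its filter signals (field names are hypothetical)."""
    numeric_mismatch: bool  # True if a numeric mismatch flag was raised
    len_ratio: float        # length ratio, computed upstream
    self_score: float       # LLM self-score; scale and threshold are assumptions
    judge_pass: bool        # Stage-2 decision (`pass=true/false`)


def passes_stage1(c: Candidate, min_self_score: float) -> bool:
    """Hard constraint, length-ratio bound, and self-score threshold."""
    return (
        not c.numeric_mismatch
        and LEN_RATIO_MIN <= c.len_ratio <= LEN_RATIO_MAX
        and c.self_score >= min_self_score
    )


def accept(c: Candidate, min_self_score: float) -> bool:
    """Final rule: keep a row only if it passes both Stage-1 and Stage-2."""
    return passes_stage1(c, min_self_score) and c.judge_pass


# 0.5 is a placeholder threshold chosen only for this example.
good = Candidate(numeric_mismatch=False, len_ratio=1.1, self_score=0.9, judge_pass=True)
too_long = Candidate(numeric_mismatch=False, len_ratio=2.9, self_score=0.9, judge_pass=True)
print(accept(good, 0.5), accept(too_long, 0.5))
```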
## Label Distribution Shift (Source vs Accepted)
This release publishes accepted rows only, so its label proportions differ from those of source WANLI.
Train split:
- Source WANLI (102,885): entailment 37.43% (38,511), neutral 47.60% (48,977), contradiction 14.97% (15,397)
- This dataset (73,942): entailment 41.42% (30,626), neutral 42.13% (31,155), contradiction 16.45% (12,161)
- Retention by label vs source: entailment 79.53%, neutral 63.61%, contradiction 78.98%
Test split:
- Source WANLI (5,000): entailment 37.16% (1,858), neutral 47.94% (2,397), contradiction 14.90% (745)
- This dataset (3,505): entailment 41.74% (1,463), neutral 41.31% (1,448), contradiction 16.95% (594)
- Retention by label vs source: entailment 78.74%, neutral 60.41%, contradiction 79.73%
Practical implication:
- Neutral examples are filtered out more often than entailment or contradiction examples: roughly 60-64% of neutral rows are retained, versus roughly 79-80% for the other two labels.
- Use caution when comparing absolute scores against models trained/evaluated on original WANLI.
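The percentages above follow directly from the published counts. As a worked check for the train split, the sketch below recomputes label shares and per-label retention; the function names `shares` and `retention` are illustrative.

```python
# Train-split label counts copied from the lists above.
SOURCE_TRAIN = {"entailment": 38511, "neutral": 48977, "contradiction": 15397}
ACCEPTED_TRAIN = {"entailment": 30626, "neutral": 31155, "contradiction": 12161}


def shares(counts):
    """Percentage of each label within one dataset, rounded to two decimals."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 2) for label, n in counts.items()}


def retention(source, accepted):
    """Percentage of source rows kept per label, rounded to two decimals."""
    return {label: round(100 * accepted[label] / source[label], 2) for label in source}


print(shares(ACCEPTED_TRAIN))
# {'entailment': 41.42, 'neutral': 42.13, 'contradiction': 16.45}
print(retention(SOURCE_TRAIN, ACCEPTED_TRAIN))
# {'entailment': 79.53, 'neutral': 63.61, 'contradiction': 78.98}
```

The same two functions applied to the test-split counts reproduce the test figures.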
## Source and Attribution
- Original dataset: [alisawuffles/WANLI](https://huggingface.co/datasets/alisawuffles/WANLI) — CC BY 4.0
- This dataset is an adapted/translated derivative of WANLI.
- Modifications made in this derivative:
  - translated `premise` / `hypothesis` from English to Japanese
  - applied two-stage quality filtering
  - released accepted subset only
  - preserved record-level linkage fields (`source_id`, `source_pairID`) to the original WANLI records
## License
- This dataset is licensed under CC BY 4.0.
## Limitations
- This dataset is machine-translated and automatically filtered/judged; residual translation and label-consistency errors may remain.
- Domain and style follow WANLI characteristics; transfer to other domains may vary.
## Citation
```bibtex
@misc{liu-etal-2022-wanli,
title = "WANLI: Worker and AI Collaboration for Natural Language Inference Dataset Creation",
author = "Liu, Alisa and
Swayamdipta, Swabha and
Smith, Noah A. and
Choi, Yejin",
month = jan,
year = "2022",
url = "https://arxiv.org/pdf/2201.05955",
}
```
Provider: takehika



