Peach23333/simcse-nli-multilingual
Source: Hugging Face. Updated 2025-11-26; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/Peach23333/simcse-nli-multilingual
Description:
---
dataset_info:
task_categories:
- text-classification
languages:
- zh
- hi
license: mit
tags:
- simcse
- nli
- multilingual
---
# simcse-nli-multilingual
A unified multilingual NLI dataset designed for **supervised SimCSE training**, covering:
- **Chinese (ch)**
- **Hindi (hi)**
Each language file is formatted consistently as:
```
sent0, sent1, hard_neg
```
This structure is directly compatible with SimCSE supervised learning.
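As a quick sanity check, the three-column header can be verified before training. The sketch below uses only the Python standard library and an inline one-row sample (the Chinese example row from this card) rather than the real files. The `read_triples` helper is illustrative and not part of the dataset, and the exact header spacing in the CSV files is an assumption.

```python
import csv
import io

EXPECTED_COLUMNS = ["sent0", "sent1", "hard_neg"]

def read_triples(handle):
    """Yield (sent0, sent1, hard_neg) tuples from an open CSV file object."""
    # skipinitialspace tolerates a header written as "sent0, sent1, hard_neg".
    reader = csv.DictReader(handle, skipinitialspace=True)
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    for row in reader:
        yield row["sent0"], row["sent1"], row["hard_neg"]

# Inline sample standing in for chinese.csv or hindi.csv.
sample = io.StringIO(
    "sent0,sent1,hard_neg\n"
    "这个问题很容易回答,这是值得解决的问题,他们把任务搞砸了\n"
)
triples = list(read_triples(sample))
print(triples[0])
```

For the real files, open them with `encoding="utf-8"` and `newline=""` and pass the file object to `read_triples`.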
---
## Dataset Structure
```
simcse-nli-multilingual/
│
├── chinese.csv # Chinese NLI pairs from OCNLI + CMNLI
├── hindi.csv # Hindi NLI pairs from XNLI-hi
└── README.md
```
---
## Languages
### Chinese (ch)
Data sources:
- **OCNLI** (Original Chinese Natural Language Inference dataset)
- **CMNLI** (CLUE benchmark’s machine-translated MNLI variant)
Labels used:
- `entailment` → positive pairs
- `contradiction` → negative pool
Chinese JSON files were converted into SimCSE-style CSV format.
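The conversion can be sketched roughly as follows. This is not the maintainer's script. The field names `sentence1`, `sentence2`, and `label` are assumed from the CLUE-style JSON-lines layout. The pairing rule is one plausible reading of "negative pool": prefer a contradiction hypothesis for the same premise, otherwise fall back to a seeded random draw from all contradiction hypotheses.

```python
import json
import random
from collections import defaultdict

def build_triples(json_lines, seed=0):
    """Turn NLI records into (sent0, sent1, hard_neg) triples."""
    rng = random.Random(seed)
    entailments = []                    # (premise, hypothesis) pairs
    contradictions = defaultdict(list)  # premise -> contradiction hypotheses
    pool = []                           # every contradiction hypothesis

    for line in json_lines:
        if not line.strip():
            continue
        record = json.loads(line)
        premise, hypothesis = record["sentence1"], record["sentence2"]
        label = record.get("label")
        if label == "entailment":
            entailments.append((premise, hypothesis))
        elif label == "contradiction":
            contradictions[premise].append(hypothesis)
            pool.append(hypothesis)
        # Neutral and unlabeled records are ignored.

    triples = []
    for premise, hypothesis in entailments:
        # Prefer negatives that share the premise; otherwise use the pool.
        candidates = contradictions.get(premise) or pool
        if not candidates:
            continue  # no negative available at all
        triples.append((premise, hypothesis, rng.choice(candidates)))
    return triples

# Hypothetical records for illustration only; not taken from OCNLI or CMNLI.
records = [
    {"sentence1": "A dog is running.", "sentence2": "An animal is moving.", "label": "entailment"},
    {"sentence1": "A dog is running.", "sentence2": "The dog is asleep.", "label": "contradiction"},
    {"sentence1": "A dog is running.", "sentence2": "The dog is brown.", "label": "neutral"},
]
lines = [json.dumps(r, ensure_ascii=False) for r in records]
print(build_triples(lines))
```

Fixing the seed keeps the fallback sampling reproducible across runs.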
---
### Hindi (hi)
Source: **XNLI Hindi split (train + validation only)**
The test set is *not included* because it has no labels.
All Hindi text is preserved exactly as in the official XNLI dataset, including:
- Devanagari combining characters
- Zero-width joiners / non-joiners
- Code-mixing (Hindi + English words)
This is intentional and matches what the tokenizers of multilingual models expect.
---
## Format (for SimCSE-supervised)
Each row follows:
- **sent0**: original sentence (premise)
- **sent1**: entailment sentence (hypothesis)
- **hard_neg**: a semantically distant contradiction or unrelated sentence
Example (Hindi):
| sent0 | sent1 | hard_neg |
|-------|-------|-----------|
| मेरे वॉकमैन टूट गए … | मैं परेशान हूँ कि … | हम आपका अनुसरण नहीं करेंगे . |
Example (Chinese):
| sent0 | sent1 | hard_neg |
|-------|-------|-----------|
| 这个问题很容易回答 | 这是值得解决的问题 | 他们把任务搞砸了 |
---
## Why this dataset?
Supervised SimCSE requires **entailment triples**, but NLI datasets across languages differ widely:
- Chinese data uses heterogeneous JSON formats
- XNLI Hindi contains decomposed Unicode
- No existing dataset matches the SimCSE CSV triple format
This repository unifies:
- Format
- Label mappings
- Hard negative construction
- Structural cleaning (text unchanged)
The result is plug-and-play for multilingual SimCSE training.
---
## How to Load via HuggingFace Datasets
Chinese:
```python
from datasets import load_dataset

# Load only the Chinese file; a single CSV is exposed as the "train" split.
ds = load_dataset(
    "Peach23333/simcse-nli-multilingual",
    data_files="chinese.csv",
)
```
Hindi:
```python
from datasets import load_dataset

# Load only the Hindi file; access the rows via ds["train"].
ds = load_dataset(
    "Peach23333/simcse-nli-multilingual",
    data_files="hindi.csv",
)
```
---
## Notes on Hindi Unicode
Hindi text includes:
- Combining characters (`्`, `ि`, `ी`, etc.)
- Zero-width joiner (ZWJ) and non-joiner (ZWNJ)
- Mixed Hindi–English tokens
These come from the official XNLI-hi dataset.
**Do not normalize or strip them**, as this may harm model performance.
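To confirm that a preprocessing step has left these characters intact, the text can be inspected without modifying it. The following is a minimal sketch using Python's standard `unicodedata` module. The `unicode_profile` helper and the sample string are illustrative and not part of the dataset.

```python
import unicodedata

ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER
ZWJ = "\u200d"   # ZERO WIDTH JOINER

def unicode_profile(text):
    """Count the character classes that normalization or stripping could alter."""
    return {
        # Mn = nonspacing marks (e.g. the virama), Mc = spacing marks (e.g. ि, ी)
        "combining_marks": sum(
            unicodedata.category(ch) in ("Mn", "Mc") for ch in text
        ),
        "zwj": text.count(ZWJ),
        "zwnj": text.count(ZWNJ),
        "latin_letters": sum(ch.isascii() and ch.isalpha() for ch in text),
    }

# Illustrative sample: a conjunct written with an explicit ZWJ plus an English word.
sample = "क्\u200dष walkman"
before = unicode_profile(sample)

# Stand-in for a real cleaning step; whitespace trimming should change nothing here.
after = unicode_profile(sample.strip())
print(before, before == after)
```

Comparing the profile before and after each cleaning step is a cheap way to catch accidental normalization.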
---
## License
- Chinese sources follow their respective OCNLI / CMNLI licenses
- Hindi source follows the XNLI license
- This repo redistributes only processed CSV files for research use
---
Maintained by **Zian Shang**.
For questions or issues, please get in touch via Hugging Face or GitHub.
Provider:
Peach23333



