Peach23333/simcse-nli-multilingual

Name: Peach23333/simcse-nli-multilingual
Creator: Peach23333
Published: 2025-11-26 15:25:13
License: No description available

Hugging Face · Updated 2025-11-26 · Indexed 2025-12-20

Download link:

https://hf-mirror.com/datasets/Peach23333/simcse-nli-multilingual

Resource description:

---
dataset_info:
  task_categories:
    - text-classification
  languages:
    - ch
    - hi
  license: mit
  tags:
    - simcse
    - nli
    - multilingual
---

# simcse-nli-multilingual

A unified multilingual NLI dataset designed for **supervised SimCSE training**, covering:

- **Chinese (ch)**
- **Hindi (hi)**

Each language file is formatted consistently as:

```
sent0, sent1, hard_neg
```

This structure is directly compatible with SimCSE supervised learning.

---

## Dataset Structure

```
simcse-nli-multilingual/
│
├── chinese.csv     # Chinese NLI pairs from OCNLI + CMNLI
├── hindi.csv       # Hindi NLI pairs from XNLI-hi
└── README.md
```

---

## Languages

### Chinese (ch)

Data sources:

- **OCNLI** (Original Chinese Natural Language Inference dataset)
- **CMNLI** (CLUE benchmark’s machine-translated MNLI variant)

Labels used:

- `entailment` → positive pairs
- `contradiction` → negative pool

Chinese JSON files were converted into SimCSE-style CSV format.

---

### Hindi (hi)

Source: **XNLI Hindi split (train + validation only)**

The test set is *not included* because it has no labels.

All Hindi text is preserved exactly as in the official XNLI dataset, including:

- Devanagari combining characters
- Zero-width joiners / non-joiners
- Code-mixing (Hindi + English words)

This is intentional and matches tokenizer expectations of multilingual models.

---

## Format (for SimCSE-supervised)

Each row follows:

- **sent0**: original sentence (premise)
- **sent1**: entailment sentence (hypothesis)
- **hard_neg**: a semantically distant contradiction or unrelated sentence

Example (Hindi):

| sent0 | sent1 | hard_neg |
|-------|-------|-----------|
| मेरे वॉकमैन टूट गए … | मैं परेशान हूँ कि … | हम आपका अनुसरण नहीं करेंगे . |

Example (Chinese):

| sent0 | sent1 | hard_neg |
|-------|-------|-----------|
| 这个问题很容易回答 | 这是值得解决的问题 | 他们把任务搞砸了 |

---

## Why this dataset?

Supervised SimCSE requires **entailment triples**, but NLI datasets across languages differ widely:

- Chinese data uses heterogeneous JSON formats
- XNLI Hindi contains decomposed Unicode
- No existing dataset matches the SimCSE CSV triple format

This repository unifies the following, making it plug-and-play for multilingual SimCSE training:

- Format
- Label mappings
- Hard negative construction
- Structural cleaning (text unchanged)

---

## How to Load via HuggingFace Datasets

Chinese:

```python
from datasets import load_dataset

# Repo ID as written in this card. This page lists the dataset as
# "Peach23333/simcse-nli-multilingual"; try that ID if this one does not resolve.
ds = load_dataset(
    "zianshang/simcse-nli-multilingual",
    data_files="chinese.csv"
)
```

Hindi:

```python
from datasets import load_dataset

# Same repo ID caveat as in the Chinese example.
ds = load_dataset(
    "zianshang/simcse-nli-multilingual",
    data_files="hindi.csv"
)
```

---

## Notes on Hindi Unicode

Hindi text includes:

- Combining characters (`्`, `ि`, `ी`, etc.)
- Zero-width joiner (ZWJ) and non-joiner (ZWNJ)
- Mixed Hindi–English tokens

These come from the official XNLI-hi dataset. **Do not normalize or strip them**, as this may harm model performance.

---

## License

- Chinese sources follow their respective OCNLI / CMNLI licenses
- Hindi source follows the XNLI license
- This repo redistributes only processed CSV files for research use

---

Maintained by **Zian Shang**. For questions or issues, please contact via HuggingFace or GitHub.
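The label mapping described above (`entailment` → positive, `contradiction` → negative) can be illustrated with a small sketch. This is one plausible way to build `sent0, sent1, hard_neg` rows from NLI records, not necessarily the procedure used for this repository. The function names `build_triples` and `triples_to_csv`, the record keys `premise`, `hypothesis`, and `label`, and the toy records (apart from the Chinese example row reused from the card) are assumptions made for illustration. Text is written as-is, with no Unicode normalization.

```python
import csv
import io
from collections import defaultdict


def build_triples(records):
    """Turn NLI records into (sent0, sent1, hard_neg) triples.

    `records` is an iterable of dicts with the hypothetical keys
    "premise", "hypothesis", and "label".
    """
    entailments = defaultdict(list)
    contradictions = defaultdict(list)
    for rec in records:
        if rec["label"] == "entailment":
            entailments[rec["premise"]].append(rec["hypothesis"])
        elif rec["label"] == "contradiction":
            contradictions[rec["premise"]].append(rec["hypothesis"])
        # Rows with any other label (e.g. "neutral") are ignored.

    triples = []
    for premise, positives in entailments.items():
        negatives = contradictions.get(premise, [])
        # Pair the i-th entailment with the i-th contradiction of the same
        # premise; premises missing either kind produce no triple.
        for pos, neg in zip(positives, negatives):
            triples.append((premise, pos, neg))
    return triples


def triples_to_csv(triples):
    """Serialize triples to CSV text with the header sent0,sent1,hard_neg."""
    buf = io.StringIO(newline="")
    writer = csv.writer(buf)
    writer.writerow(["sent0", "sent1", "hard_neg"])
    writer.writerows(triples)  # strings are written unchanged (no normalization)
    return buf.getvalue()


# Toy input: the first premise reuses the Chinese example row from the card;
# the remaining records are made up for this sketch.
records = [
    {"premise": "这个问题很容易回答", "hypothesis": "这是值得解决的问题", "label": "entailment"},
    {"premise": "这个问题很容易回答", "hypothesis": "他们把任务搞砸了", "label": "contradiction"},
    {"premise": "这个问题很容易回答", "hypothesis": "toy neutral hypothesis", "label": "neutral"},
    {"premise": "toy premise without a contradiction", "hypothesis": "toy hypothesis", "label": "entailment"},
]

triples = build_triples(records)
print(triples_to_csv(triples))
```

Only the first premise yields a row here, because it is the only one with both an entailment and a contradiction. A pipeline that samples negatives from a shared contradiction pool would behave differently for premises that lack their own contradiction.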

Provided by:

Peach23333
