nli-for-simcse
ModelScope community · Updated 2025-11-27 · Indexed 2025-01-11
Download link:
https://modelscope.cn/datasets/sentence-transformers/nli-for-simcse
Resource description:
# Dataset Card for NLI for SimCSE
This is a reformatted version of the NLI for SimCSE dataset that was used to train the [BGE-M3 model](https://huggingface.co/BAAI/bge-m3). See the full BGE-M3 dataset in [Shitao/bge-m3-data](https://huggingface.co/datasets/Shitao/bge-m3-data).
Although it is labeled as Natural Language Inference (NLI) data, this dataset can be used to train or fine-tune an embedding model for semantic textual similarity.
## Dataset Subsets
### `triplet` subset
* Columns: "anchor", "positive", "negative"
* Column types: `str`, `str`, `str`
* Examples:
```python
{
'anchor': 'One of our number will carry out your instructions minutely.',
'positive': 'A member of my team will execute your orders with immense precision.',
'negative': 'We have no one free at the moment so you have to take action yourself.'
}
```
* Collection strategy: Reading the jsonl file in the `en_NLI_data` directory in [Shitao/bge-m3-data](https://huggingface.co/datasets/Shitao/bge-m3-data) and taking only the first negative.
* Deduplicated: No
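
The collection strategy above can be sketched in a few lines of Python. This is a minimal illustration, not the script that produced the subset. It assumes each source jsonl line is an object with `query`, `pos`, and `neg` fields (the latter two being lists), which this card does not confirm. The helper name `to_triplet` is hypothetical, and taking the first positive is also an assumption.

```python
import json


def to_triplet(line: str) -> dict:
    """Turn one source jsonl line into an anchor/positive/negative row.

    Assumed input layout (not confirmed by this card):
    {"query": str, "pos": [str, ...], "neg": [str, ...]}
    """
    record = json.loads(line)
    return {
        "anchor": record["query"],
        "positive": record["pos"][0],  # assumption: use the first positive
        "negative": record["neg"][0],  # keep only the first negative
    }


# A hand-written line in the assumed layout, using sentences from the example above.
sample_line = json.dumps({
    "query": "One of our number will carry out your instructions minutely.",
    "pos": ["A member of my team will execute your orders with immense precision."],
    "neg": [
        "We have no one free at the moment so you have to take action yourself.",
        "A poodle is running through the grass.",
    ],
})
row = to_triplet(sample_line)
```

Under these assumptions, every negative after the first is dropped, so each source line yields exactly one row.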
### `triplet-7` subset
* Columns: "anchor", "positive", "negative_1", "negative_2", "negative_3", "negative_4", "negative_5", "negative_6", "negative_7"
* Column types: `str`, `str`, `str`, `str`, `str`, `str`, `str`, `str`, `str`
* Examples:
```python
{
'anchor': 'One of our number will carry out your instructions minutely.',
'positive': 'A member of my team will execute your orders with immense precision.',
'negative_1': 'We have no one free at the moment so you have to take action yourself.',
'negative_2': 'A poodle is running through the grass.',
'negative_3': 'Investment and planning are growing industries in Jamaica.',
'negative_4': 'A bearded man is rocking out on an acoustic guitar',
'negative_5': 'The people are sunbathing on the beach.',
'negative_6': 'A construction worker installs a door.',
'negative_7': 'A crowd has gathered because of a dangerous situation.'
}
```
* Collection strategy: Reading the jsonl file in the `en_NLI_data` directory in [Shitao/bge-m3-data](https://huggingface.co/datasets/Shitao/bge-m3-data) and taking all samples that have 7 negatives (which is by far the majority).
* Deduplicated: No
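
As a rough sketch of the filtering described above, the following keeps only records with exactly 7 negatives and spreads them over the `negative_1` to `negative_7` columns. The `query`/`pos`/`neg` field names are an assumption about the source jsonl layout, and `to_triplet_7` is a hypothetical helper, not part of any library.

```python
from typing import Optional


def to_triplet_7(record: dict) -> Optional[dict]:
    """Build a 9-column row, or return None if the record lacks exactly 7 negatives.

    Assumed input layout (not confirmed by this card):
    {"query": str, "pos": [str, ...], "neg": [str, ...]}
    """
    negatives = record["neg"]
    if len(negatives) != 7:
        return None  # records with any other number of negatives are skipped
    row = {
        "anchor": record["query"],
        "positive": record["pos"][0],  # assumption: use the first positive
    }
    for index, negative in enumerate(negatives, start=1):
        row[f"negative_{index}"] = negative
    return row


# Two hand-written records: one with 7 negatives, one with only 2.
full_record = {
    "query": "anchor sentence",
    "pos": ["positive sentence"],
    "neg": [f"negative sentence {i}" for i in range(1, 8)],
}
short_record = {
    "query": "anchor sentence",
    "pos": ["positive sentence"],
    "neg": ["negative sentence 1", "negative sentence 2"],
}
kept = to_triplet_7(full_record)
skipped = to_triplet_7(short_record)
```

Here `kept` has nine string columns and `skipped` is `None`, which mirrors why this subset can be smaller than the source file.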
### `triplet-all` subset
* Columns: "anchor", "positive", "negative"
* Column types: `str`, `str`, `str`
* Examples:
```python
{
'anchor': 'One of our number will carry out your instructions minutely.',
'positive': 'A member of my team will execute your orders with immense precision.',
'negative': 'We have no one free at the moment so you have to take action yourself.'
}
```
* Collection strategy: Reading the jsonl file in the `en_NLI_data` directory in [Shitao/bge-m3-data](https://huggingface.co/datasets/Shitao/bge-m3-data) and creating a separate sample for each negative, so an anchor with several negatives yields several rows.
* Deduplicated: No
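
The expansion used for this subset can be illustrated with a short generator. This is a sketch under the same assumptions as before: the `query`/`pos`/`neg` field names are a guess at the source jsonl layout, and `to_triplet_all` is a hypothetical name.

```python
from typing import Iterator


def to_triplet_all(record: dict) -> Iterator[dict]:
    """Yield one anchor/positive/negative row per negative in the record.

    Assumed input layout (not confirmed by this card):
    {"query": str, "pos": [str, ...], "neg": [str, ...]}
    """
    for negative in record["neg"]:
        yield {
            "anchor": record["query"],
            "positive": record["pos"][0],  # assumption: use the first positive
            "negative": negative,
        }


# A hand-written record with three negatives expands into three rows.
record = {
    "query": "One of our number will carry out your instructions minutely.",
    "pos": ["A member of my team will execute your orders with immense precision."],
    "neg": [
        "We have no one free at the moment so you have to take action yourself.",
        "A poodle is running through the grass.",
        "The people are sunbathing on the beach.",
    ],
}
rows = list(to_triplet_all(record))
```

Because the anchor and positive repeat across the expanded rows, this subset contains many rows that differ only in the `negative` column.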
Provider:
maas
Created:
2025-01-06



