
nli-for-simcse

ModelScope Community | Updated: 2025-11-27 | Indexed: 2025-01-11
Download link:
https://modelscope.cn/datasets/sentence-transformers/nli-for-simcse
Resource description:
# Dataset Card for NLI for SimCSE

This is a reformatting of the NLI for SimCSE Dataset used to train the [BGE-M3 model](https://huggingface.co/BAAI/bge-m3). See the full BGE-M3 dataset in [Shitao/bge-m3-data](https://huggingface.co/datasets/Shitao/bge-m3-data).

Despite being labeled as Natural Language Inference (NLI), this dataset can be used to train or finetune an embedding model for semantic textual similarity.

## Dataset Subsets

### `triplet` subset

* Columns: "anchor", "positive", "negative"
* Column types: `str`, `str`, `str`
* Examples:
  ```python
  {
    'anchor': 'One of our number will carry out your instructions minutely.',
    'positive': 'A member of my team will execute your orders with immense precision.',
    'negative': 'We have no one free at the moment so you have to take action yourself.'
  }
  ```
* Collection strategy: Reading the jsonl file in the `en_NLI_data` directory in [Shitao/bge-m3-data](https://huggingface.co/datasets/Shitao/bge-m3-data) and taking only the first negative.
* Deduplicated: No

### `triplet-7` subset

* Columns: "anchor", "positive", "negative_1", "negative_2", "negative_3", "negative_4", "negative_5", "negative_6", "negative_7"
* Column types: `str`, `str`, `str`, `str`, `str`, `str`, `str`, `str`, `str`
* Examples:
  ```python
  {
    'anchor': 'One of our number will carry out your instructions minutely.',
    'positive': 'A member of my team will execute your orders with immense precision.',
    'negative_1': 'We have no one free at the moment so you have to take action yourself.',
    'negative_2': 'A poodle is running through the grass.',
    'negative_3': 'Investment and planning are growing industries in Jamaica.',
    'negative_4': 'A bearded man is rocking out on an acoustic guitar',
    'negative_5': 'The people are sunbathing on the beach.',
    'negative_6': 'A construction worker installs a door.',
    'negative_7': 'A crowd has gathered because of a dangerous situation.'
  }
  ```
* Collection strategy: Reading the jsonl file in the `en_NLI_data` directory in [Shitao/bge-m3-data](https://huggingface.co/datasets/Shitao/bge-m3-data) and taking all samples that have exactly 7 negatives (by far the majority).
* Deduplicated: No

### `triplet-all` subset

* Columns: "anchor", "positive", "negative"
* Column types: `str`, `str`, `str`
* Examples:
  ```python
  {
    'anchor': 'One of our number will carry out your instructions minutely.',
    'positive': 'A member of my team will execute your orders with immense precision.',
    'negative': 'We have no one free at the moment so you have to take action yourself.'
  }
  ```
* Collection strategy: Reading the jsonl file in the `en_NLI_data` directory in [Shitao/bge-m3-data](https://huggingface.co/datasets/Shitao/bge-m3-data) and emitting one separate sample per negative, so each source record yields as many triplets as it has negatives.
* Deduplicated: No
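
The three collection strategies differ only in how they handle the list of negatives attached to each source record. The sketch below illustrates that difference on two made-up records. It is not the script used to build this dataset: the raw field names `query`, `pos`, and `neg`, the choice of the first entry in `pos` as the positive, and the helper names `to_triplet`, `to_triplet_7`, and `to_triplet_all` are all assumptions made for illustration.

```python
from typing import Dict, List

# Hypothetical raw records. The field names "query", "pos", "neg" are an
# assumption about the jsonl layout, not something stated in this card.
RAW: List[dict] = [
    {"query": "a1", "pos": ["p1"], "neg": [f"n{i}" for i in range(1, 8)]},  # 7 negatives
    {"query": "a2", "pos": ["p2"], "neg": ["m1", "m2"]},                    # 2 negatives
]


def to_triplet(records: List[dict]) -> List[Dict[str, str]]:
    """`triplet`: keep only the first negative of each record."""
    return [
        {"anchor": r["query"], "positive": r["pos"][0], "negative": r["neg"][0]}
        for r in records
        if r["neg"]  # skip records with no negatives at all
    ]


def to_triplet_7(records: List[dict]) -> List[Dict[str, str]]:
    """`triplet-7`: keep only records with exactly 7 negatives, one column each."""
    rows = []
    for r in records:
        if len(r["neg"]) != 7:
            continue
        row = {"anchor": r["query"], "positive": r["pos"][0]}
        for i, neg in enumerate(r["neg"], start=1):
            row[f"negative_{i}"] = neg
        rows.append(row)
    return rows


def to_triplet_all(records: List[dict]) -> List[Dict[str, str]]:
    """`triplet-all`: emit one separate sample per negative."""
    return [
        {"anchor": r["query"], "positive": r["pos"][0], "negative": neg}
        for r in records
        for neg in r["neg"]
    ]


if __name__ == "__main__":
    print(len(to_triplet(RAW)), len(to_triplet_7(RAW)), len(to_triplet_all(RAW)))
```

On these two toy records, `triplet` keeps one row per record, `triplet-7` drops the record with only two negatives, and `triplet-all` expands every negative into its own row. Since none of the subsets is deduplicated, `triplet-all` repeats each anchor/positive pair once per negative.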

Provider:
maas
Created:
2025-01-06