quora-duplicates

Name: quora-duplicates
Creator: maas
Published: 2025-11-01 16:19:30
License: No description provided

ModelScope community; updated 2025-11-01; indexed 2025-01-11

Download link:

https://modelscope.cn/datasets/sentence-transformers/quora-duplicates

Resource overview:

# Dataset Card for Quora Duplicate Questions

This dataset contains the [Quora](https://huggingface.co/datasets/quora) Question Pairs dataset in four formats that are easy to use with Sentence Transformers for training embedding models. The data was originally created by Quora for [this Kaggle Competition](https://www.kaggle.com/c/quora-question-pairs).

## Dataset Subsets

### `pair-class` subset

* Columns: "sentence1", "sentence2", "label"
* Column types: `str`, `str`, `class` with `{"0": "different", "1": "duplicate"}`
* Examples:
    ```python
    {
      'sentence1': 'What is the step by step guide to invest in share market in india?',
      'sentence2': 'What is the step by step guide to invest in share market?',
      'label': 0,
    }
    ```
* Collection strategy: A direct copy of [Quora](https://huggingface.co/datasets/quora), but with more conveniently parsable columns.
* Deduplicated: No

### `pair` subset

* Columns: "anchor", "positive"
* Column types: `str`, `str`
* Examples:
    ```python
    {
      'anchor': 'Astrology: I am a Capricorn Sun Cap moon and cap rising...what does that say about me?',
      'positive': "I'm a triple Capricorn (Sun, Moon and ascendant in Capricorn) What does this say about me?",
    }
    ```
* Collection strategy: Filtering out the "different" rows from the `pair-class` subset, removing the label column, and renaming the remaining columns.
* Deduplicated: No

### `triplet-all` subset

* Columns: "anchor", "positive", "negative"
* Column types: `str`, `str`, `str`
* Examples:
    ```python
    {
      'anchor': 'Why in India do we not have one on one political debate as in USA?',
      'positive': 'Why cant we have a public debate between politicians in India like the one in US?',
      'negative': 'Can people on Quora stop India Pakistan debate? We are sick and tired seeing this everyday in bulk?',
    }
    ```
* Collection strategy: Taken from [embedding-training-data](https://huggingface.co/datasets/sentence-transformers/embedding-training-data), which states: "Duplicate question pairs from Quora with additional hard negatives (mined & denoised by cross-encoder)". Then, all possible triplets are taken.
* Deduplicated: No

### `triplet` subset

* Columns: "anchor", "positive", "negative"
* Column types: `str`, `str`, `str`
* Examples:
    ```python
    {
      'anchor': 'Why in India do we not have one on one political debate as in USA?',
      'positive': 'Why cant we have a public debate between politicians in India like the one in US?',
      'negative': 'Can people on Quora stop India Pakistan debate? We are sick and tired seeing this everyday in bulk?',
    }
    ```
* Collection strategy: Taken from [embedding-training-data](https://huggingface.co/datasets/sentence-transformers/embedding-training-data), which states: "Duplicate question pairs from Quora with additional hard negatives (mined & denoised by cross-encoder)". Then, the anchor, the positive and the first negative of each sample are taken.
* Deduplicated: No
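The collection strategies described for the subsets can be illustrated with a minimal sketch in plain Python. This is not the script used to build the dataset. The toy rows, the helper names (`to_pairs`, `to_triplets_all`, `to_triplet`) and the source record layout with `query`, `pos` and `neg` keys are all assumptions made for illustration, as is the choice of the first positive in `to_triplet`.

```python
from itertools import product

# Toy rows shaped like the `pair-class` subset (0 = different, 1 = duplicate).
# These rows are made up for illustration; they are not taken from the dataset.
pair_class_rows = [
    {"sentence1": "How do I learn Python?",
     "sentence2": "What is the best way to learn Python?",
     "label": 1},
    {"sentence1": "How do I learn Python?",
     "sentence2": "How do I learn to cook?",
     "label": 0},
]


def to_pairs(rows):
    """Keep duplicate rows only, drop the label, rename the columns."""
    return [
        {"anchor": row["sentence1"], "positive": row["sentence2"]}
        for row in rows
        if row["label"] == 1
    ]


# Hypothetical source record: one query with several positives and
# several hard negatives. The key names are an assumption.
source_sample = {
    "query": "anchor question",
    "pos": ["positive 1", "positive 2"],
    "neg": ["negative 1", "negative 2", "negative 3"],
}


def to_triplets_all(sample):
    """Form every (anchor, positive, negative) combination of one sample."""
    return [
        {"anchor": sample["query"], "positive": pos, "negative": neg}
        for pos, neg in product(sample["pos"], sample["neg"])
    ]


def to_triplet(sample):
    """Keep a single triplet: anchor, first positive (assumed), first negative."""
    return {
        "anchor": sample["query"],
        "positive": sample["pos"][0],
        "negative": sample["neg"][0],
    }


pairs = to_pairs(pair_class_rows)
triplets_all = to_triplets_all(source_sample)
triplet = to_triplet(source_sample)
```

Under these assumptions, a sample with 2 positives and 3 negatives contributes 6 rows to a `triplet-all`-style output but only 1 row to a `triplet`-style output, which is why the two subsets differ in size.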


Provider:

maas

Created:

2025-01-06
