quora-duplicates
Source: ModelScope community. Updated 2025-11-01, indexed 2025-01-11.
Download link:
https://modelscope.cn/datasets/sentence-transformers/quora-duplicates
Resource description:
# Dataset Card for Quora Duplicate Questions
This dataset contains the [Quora](https://huggingface.co/datasets/quora) Question Pairs dataset in four formats that are easily used with Sentence Transformers to train embedding models. The data was originally created by Quora for [this Kaggle Competition](https://www.kaggle.com/c/quora-question-pairs).
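As a rough sketch of how one of the four formats might be loaded, the helper below wraps `load_dataset` from the Hugging Face `datasets` library. The Hub repository id `sentence-transformers/quora-duplicates` mirrors the path in the download link above, and the `train` split name is an assumption; `load_subset` and `SUBSETS` are hypothetical names introduced only for this example.

```python
# Subset (configuration) names as listed in this card.
SUBSETS = ("pair-class", "pair", "triplet-all", "triplet")


def load_subset(name: str, split: str = "train"):
    """Load one subset from the Hugging Face Hub (requires network access).

    The repository id and the default split name are assumptions.
    """
    if name not in SUBSETS:
        raise ValueError(f"unknown subset {name!r}; expected one of {SUBSETS}")
    # Imported lazily so the helper can be defined without `datasets` installed.
    from datasets import load_dataset

    return load_dataset("sentence-transformers/quora-duplicates", name, split=split)
```

Calling `load_subset("pair")` would then return a dataset whose rows should carry the columns described for the `pair` subset.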
## Dataset Subsets
### `pair-class` subset
* Columns: "sentence1", "sentence2", "label"
* Column types: `str`, `str`, `class` with `{"0": "different", "1": "duplicate"}`
* Examples:
```python
{
'sentence1': 'What is the step by step guide to invest in share market in india?',
'sentence2': 'What is the step by step guide to invest in share market?',
'label': 0,
}
```
* Collection strategy: A direct copy of [Quora](https://huggingface.co/datasets/quora), but with more conveniently parsable columns.
* Deduplicated: No
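To illustrate how the integer `label` column maps onto the class names above, here is a minimal sketch that counts rows per class. The two rows are the example pairs quoted in this card, and `label_counts` is a hypothetical helper, not part of any library.

```python
from collections import Counter

# Class mapping as stated in the card.
LABEL_NAMES = {0: "different", 1: "duplicate"}

# Two rows in the `pair-class` layout, taken from the examples in this card.
rows = [
    {
        "sentence1": "What is the step by step guide to invest in share market in india?",
        "sentence2": "What is the step by step guide to invest in share market?",
        "label": 0,
    },
    {
        "sentence1": "Astrology: I am a Capricorn Sun Cap moon and cap rising...what does that say about me?",
        "sentence2": "I'm a triple Capricorn (Sun, Moon and ascendant in Capricorn) What does this say about me?",
        "label": 1,
    },
]


def label_counts(records):
    """Count how many records fall into each named class."""
    return Counter(LABEL_NAMES[record["label"]] for record in records)


counts = label_counts(rows)
```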
### `pair` subset
* Columns: "anchor", "positive"
* Column types: `str`, `str`
* Examples:
```python
{
'anchor': 'Astrology: I am a Capricorn Sun Cap moon and cap rising...what does that say about me?',
'positive': "I'm a triple Capricorn (Sun, Moon and ascendant in Capricorn) What does this say about me?",
}
```
* Collection strategy: Filter out the "different" pairs from the `pair-class` subset, remove the label column, and rename the remaining columns to "anchor" and "positive".
* Deduplicated: No
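The collection strategy above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the script used to build the subset: `to_pairs` is a hypothetical helper, and it assumes `sentence1` becomes `anchor` and `sentence2` becomes `positive`.

```python
# Rows in the `pair-class` layout, taken from the examples in this card.
rows = [
    {
        "sentence1": "What is the step by step guide to invest in share market in india?",
        "sentence2": "What is the step by step guide to invest in share market?",
        "label": 0,
    },
    {
        "sentence1": "Astrology: I am a Capricorn Sun Cap moon and cap rising...what does that say about me?",
        "sentence2": "I'm a triple Capricorn (Sun, Moon and ascendant in Capricorn) What does this say about me?",
        "label": 1,
    },
]


def to_pairs(records):
    """Keep only duplicate pairs (label 1), drop the label, rename the columns."""
    return [
        {"anchor": record["sentence1"], "positive": record["sentence2"]}
        for record in records
        if record["label"] == 1
    ]


pairs = to_pairs(rows)
```

Only the second row survives the filter, and the result no longer carries a `label` key.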
### `triplet-all` subset
* Columns: "anchor", "positive", "negative"
* Column types: `str`, `str`, `str`
* Examples:
```python
{
'anchor': 'Why in India do we not have one on one political debate as in USA?',
'positive': 'Why cant we have a public debate between politicians in India like the one in US?',
'negative': 'Can people on Quora stop India Pakistan debate? We are sick and tired seeing this everyday in bulk?',
}
```
* Collection strategy: Taken from [embedding-training-data](https://huggingface.co/datasets/sentence-transformers/embedding-training-data), which states: "Duplicate question pairs from Quora with additional hard negatives (mined & denoised by cross-encoder)". All possible (anchor, positive, negative) triplets are then generated from each sample.
* Deduplicated: No
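One way to read "all possible triplets" is as the Cartesian product of a sample's positives and negatives. The sketch below assumes a hypothetical source record with an `anchor` string plus `positives` and `negatives` lists; the card does not show the raw format, so these field names and the placeholder second negative are illustrative only.

```python
from itertools import product

# Hypothetical record shape; the raw source format is not shown in the card.
sample = {
    "anchor": "Why in India do we not have one on one political debate as in USA?",
    "positives": [
        "Why cant we have a public debate between politicians in India like the one in US?",
    ],
    "negatives": [
        "Can people on Quora stop India Pakistan debate? We are sick and tired seeing this everyday in bulk?",
        "<second hard negative, placeholder>",
    ],
}


def all_triplets(record):
    """Return one triplet per (positive, negative) combination."""
    return [
        {"anchor": record["anchor"], "positive": pos, "negative": neg}
        for pos, neg in product(record["positives"], record["negatives"])
    ]


triplets = all_triplets(sample)
```

With one positive and two negatives, this yields two triplets that share the same anchor and positive.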
### `triplet` subset
* Columns: "anchor", "positive", "negative"
* Column types: `str`, `str`, `str`
* Examples:
```python
{
'anchor': 'Why in India do we not have one on one political debate as in USA?',
'positive': 'Why cant we have a public debate between politicians in India like the one in US?',
'negative': 'Can people on Quora stop India Pakistan debate? We are sick and tired seeing this everyday in bulk?',
}
```
* Collection strategy: Taken from [embedding-training-data](https://huggingface.co/datasets/sentence-transformers/embedding-training-data), which states: "Duplicate question pairs from Quora with additional hard negatives (mined & denoised by cross-encoder)". Then, take the anchor, positive and the first negative of each sample.
* Deduplicated: No
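By contrast with `triplet-all`, this subset keeps a single triplet per sample. A minimal sketch, again assuming a hypothetical record with `anchor`, `positives`, and `negatives` fields; taking the first positive is an assumption, since the card only specifies the first negative.

```python
# Hypothetical record shape; the raw source format is not shown in the card.
sample = {
    "anchor": "Why in India do we not have one on one political debate as in USA?",
    "positives": [
        "Why cant we have a public debate between politicians in India like the one in US?",
    ],
    "negatives": [
        "Can people on Quora stop India Pakistan debate? We are sick and tired seeing this everyday in bulk?",
        "<second hard negative, placeholder>",
    ],
}


def first_triplet(record):
    """Build one triplet from the anchor, first positive, and first negative."""
    return {
        "anchor": record["anchor"],
        "positive": record["positives"][0],  # assumption: first positive
        "negative": record["negatives"][0],  # per the card: first negative
    }


triplet = first_triplet(sample)
```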
Provider:
maas
Created:
2025-01-06



