quora-duplicates-mining
Source: ModelScope community. Updated 2025-11-01; indexed 2025-01-11.
Download link:
https://modelscope.cn/datasets/sentence-transformers/quora-duplicates-mining
Description:
# Dataset Card for Quora Duplicate Questions
This dataset contains the [Quora](https://huggingface.co/datasets/quora) Question Pairs dataset in a format that is easily used with the [`ParaphraseMiningEvaluator`](https://sbert.net/docs/package_reference/evaluation.html#sentence_transformers.evaluation.ParaphraseMiningEvaluator) evaluator in Sentence Transformers. The data was originally created by Quora for [this Kaggle Competition](https://www.kaggle.com/c/quora-question-pairs).
## Usage
```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import ParaphraseMiningEvaluator
# Load the Quora Duplicates Mining dataset
questions_dataset = load_dataset("sentence-transformers/quora-duplicates-mining", "questions", split="dev")
duplicates_dataset = load_dataset("sentence-transformers/quora-duplicates-mining", "duplicates", split="dev")
# Create a mapping from qid to question & a list of duplicates (qid1, qid2)
qid_to_questions = dict(zip(questions_dataset["qid"], questions_dataset["question"]))
duplicates = list(zip(duplicates_dataset["qid1"], duplicates_dataset["qid2"]))
# Initialize the paraphrase mining evaluator
paraphrase_mining_evaluator = ParaphraseMiningEvaluator(qid_to_questions, duplicates, name="quora-duplicates-dev")
# Load a model to evaluate
model = SentenceTransformer("all-MiniLM-L6-v2")
results = paraphrase_mining_evaluator(model)
print(results)
```
```
{
'quora-duplicates-dev_average_precision': 0.5537837023752262,
'quora-duplicates-dev_f1': 0.542585123346778,
'quora-duplicates-dev_precision': 0.5112918195076678,
'quora-duplicates-dev_recall': 0.5779587350751861,
'quora-duplicates-dev_threshold': 0.8290803134441376,
}
```
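The reported precision, recall, and F1 can be cross-checked against each other. The following is a minimal sketch, assuming the F1 value is the usual harmonic mean of the precision and recall reported at the same threshold; the helper name `f1_from` is illustrative and not part of Sentence Transformers.

```python
# Values copied from the example output above.
results = {
    "quora-duplicates-dev_precision": 0.5112918195076678,
    "quora-duplicates-dev_recall": 0.5779587350751861,
    "quora-duplicates-dev_f1": 0.542585123346778,
}


def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 0.0 if total == 0 else 2 * precision * recall / total


prefix = "quora-duplicates-dev"
recomputed = f1_from(results[f"{prefix}_precision"], results[f"{prefix}_recall"])

# Print both values side by side for a quick visual comparison.
print(round(recomputed, 6), round(results[f"{prefix}_f1"], 6))
```

A mismatch between the recomputed and reported F1 would suggest that the metrics were read from different thresholds or different runs.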
## Dataset Subsets
### `questions` subset
* Columns: "question", "qid"
* Column types: `str`, `str`
* Examples:
```python
{
'question': 'How do I prepare for TCS IT Wiz?',
'qid': '107646',
}
```
* Collection strategy: A direct copy of the `quora-IR-dataset/duplicate-mining` data generated by [`create_splits.py`](https://github.com/UKPLab/sentence-transformers/tree/master/examples/training/quora_duplicate_questions/create_splits.py).
* Deduplicated: No
### `duplicates` subset
* Columns: "qid1", "qid2"
* Column types: `str`, `str`
* Examples:
```python
{
'qid1': '43345',
'qid2': '43346',
}
```
* Collection strategy: A direct copy of the `quora-IR-dataset/duplicate-mining` data generated by [`create_splits.py`](https://github.com/UKPLab/sentence-transformers/tree/master/examples/training/quora_duplicate_questions/create_splits.py).
* Deduplicated: No
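
The two subsets are linked through their string-valued `qid` columns: each row of `duplicates` points at two rows of `questions`. The sketch below shows that join on toy rows shaped like the examples above, without downloading anything. The question texts for qids `43345` and `43346` are placeholders invented for illustration, and `resolve_pairs` is a hypothetical helper, not part of any library.

```python
# Toy rows shaped like the two subsets. The texts for qids 43345 and 43346
# are made-up placeholders, not taken from the dataset.
questions_rows = [
    {"question": "How do I prepare for TCS IT Wiz?", "qid": "107646"},
    {"question": "Hypothetical question A?", "qid": "43345"},
    {"question": "Hypothetical question B?", "qid": "43346"},
]
duplicates_rows = [{"qid1": "43345", "qid2": "43346"}]

# Map each qid (a string) to its question text.
qid_to_question = {row["qid"]: row["question"] for row in questions_rows}


def resolve_pairs(pairs, lookup):
    """Turn (qid1, qid2) rows into question-text pairs, skipping unknown qids."""
    resolved = []
    for row in pairs:
        q1, q2 = lookup.get(row["qid1"]), lookup.get(row["qid2"])
        if q1 is not None and q2 is not None:
            resolved.append((q1, q2))
    return resolved


text_pairs = resolve_pairs(duplicates_rows, qid_to_question)
for q1, q2 in text_pairs:
    print(f"{q1!r} <-> {q2!r}")
```

Because both columns are typed `str`, the lookup keys must stay strings; converting qids to integers on one side only would make every lookup miss.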
Provider:
maas
Created:
2025-01-06



