quora-duplicates-mining

ModelScope Community | Updated 2025-11-01 | Indexed 2025-01-11
Download link:
https://modelscope.cn/datasets/sentence-transformers/quora-duplicates-mining
Resource description:
# Dataset Card for Quora Duplicate Questions

This dataset contains the [Quora](https://huggingface.co/datasets/quora) Question Pairs dataset in a format that is easily used with the [`ParaphraseMiningEvaluator`](https://sbert.net/docs/package_reference/evaluation.html#sentence_transformers.evaluation.ParaphraseMiningEvaluator) evaluator in Sentence Transformers. The data was originally created by Quora for [this Kaggle Competition](https://www.kaggle.com/c/quora-question-pairs).

## Usage

```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import ParaphraseMiningEvaluator

# Load the Quora Duplicates Mining dataset
questions_dataset = load_dataset("sentence-transformers/quora-duplicates-mining", "questions", split="dev")
duplicates_dataset = load_dataset("sentence-transformers/quora-duplicates-mining", "duplicates", split="dev")

# Create a mapping from qid to question & a list of duplicates (qid1, qid2)
qid_to_questions = dict(zip(questions_dataset["qid"], questions_dataset["question"]))
duplicates = list(zip(duplicates_dataset["qid1"], duplicates_dataset["qid2"]))

# Initialize the paraphrase mining evaluator
paraphrase_mining_evaluator = ParaphraseMiningEvaluator(qid_to_questions, duplicates, name="quora-duplicates-dev")

# Load a model to evaluate
model = SentenceTransformer("all-MiniLM-L6-v2")
results = paraphrase_mining_evaluator(model)
print(results)
```

```
{
    'quora-duplicates-dev_average_precision': 0.5537837023752262,
    'quora-duplicates-dev_f1': 0.542585123346778,
    'quora-duplicates-dev_precision': 0.5112918195076678,
    'quora-duplicates-dev_recall': 0.5779587350751861,
    'quora-duplicates-dev_threshold': 0.8290803134441376,
}
```

## Dataset Subsets

### `questions` subset

* Columns: "question", "qid"
* Column types: `str`, `str`
* Examples:
    ```python
    {
        'question': 'How do I prepare for TCS IT Wiz?',
        'qid': '107646',
    }
    ```
* Collection strategy: A direct copy of the `quora-IR-dataset/duplicate-mining` as generated from [`create_splits.py`](https://github.com/UKPLab/sentence-transformers/tree/master/examples/training/quora_duplicate_questions/create_splits.py).
* Deduplicated: No

### `duplicates` subset

* Columns: "qid1", "qid2"
* Column types: `str`, `str`
* Examples:
    ```python
    {
        'qid1': '43345',
        'qid2': '43346',
    }
    ```
* Collection strategy: A direct copy of the `quora-IR-dataset/duplicate-mining` as generated from [`create_splits.py`](https://github.com/UKPLab/sentence-transformers/tree/master/examples/training/quora_duplicate_questions/create_splits.py).
* Deduplicated: No
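
To make the precision, recall, F1, and threshold values in the usage output easier to interpret, the following is a minimal sketch of how such numbers can be derived from a ranked list of candidate pairs and the gold `duplicates` list. It is a simplified illustration, not the `ParaphraseMiningEvaluator` implementation: the function name `best_f1_threshold`, the toy qids, and the similarity scores are all hypothetical, and the threshold here is simply the score of the last accepted pair.

```python
def best_f1_threshold(scored_pairs, gold_duplicates):
    """Sweep a ranked list of candidate pairs and return the best-F1 cutoff.

    scored_pairs: list of (score, qid_a, qid_b) tuples (hypothetical model output).
    gold_duplicates: iterable of (qid1, qid2) tuples, as in the `duplicates` subset.
    """
    # Duplicate pairs are unordered, so normalise each pair by sorting its qids.
    gold = {tuple(sorted(pair)) for pair in gold_duplicates}
    ranked = sorted(scored_pairs, key=lambda item: item[0], reverse=True)

    best = {"f1": 0.0, "precision": 0.0, "recall": 0.0, "threshold": None}
    true_positives = 0
    for n_extracted, (score, qid_a, qid_b) in enumerate(ranked, start=1):
        if tuple(sorted((qid_a, qid_b))) not in gold:
            continue  # a false positive can never improve F1, so skip it
        true_positives += 1
        precision = true_positives / n_extracted
        recall = true_positives / len(gold)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best["f1"]:
            # Pairs scoring >= threshold are predicted to be duplicates.
            best = {"f1": f1, "precision": precision, "recall": recall, "threshold": score}
    return best


# Toy data (assumed for illustration only, not taken from the dataset).
gold = [("1", "2"), ("3", "4"), ("5", "6")]
scored = [
    (0.95, "1", "2"),
    (0.90, "2", "7"),
    (0.85, "4", "3"),
    (0.60, "8", "9"),
    (0.55, "1", "9"),
    (0.50, "2", "9"),
    (0.40, "5", "6"),
]

result = best_f1_threshold(scored, gold)
print(result)
```

On this toy input the best cutoff is 0.85: two of the three pairs at or above that score are true duplicates, and two of the three gold pairs are recovered, so precision, recall, and F1 all equal 2/3.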

Provider:
maas
Created:
2025-01-06