quora-duplicates

ModelScope community · updated 2025-11-01 · indexed 2025-01-11
Download link:
https://modelscope.cn/datasets/sentence-transformers/quora-duplicates
Resource description:
# Dataset Card for Quora Duplicate Questions

This dataset contains the [Quora](https://huggingface.co/datasets/quora) Question Pairs dataset in four formats that are easily used with Sentence Transformers to train embedding models. The data was originally created by Quora for [this Kaggle Competition](https://www.kaggle.com/c/quora-question-pairs).

## Dataset Subsets

### `pair-class` subset

* Columns: "sentence1", "sentence2", "label"
* Column types: `str`, `str`, `class` with `{"0": "different", "1": "duplicate"}`
* Examples:
  ```python
  {
    'sentence1': 'What is the step by step guide to invest in share market in india?',
    'sentence2': 'What is the step by step guide to invest in share market?',
    'label': 0,
  }
  ```
* Collection strategy: A direct copy of [Quora](https://huggingface.co/datasets/quora), but with more conveniently parsable columns.
* Deduplicated: No

### `pair` subset

* Columns: "anchor", "positive"
* Column types: `str`, `str`
* Examples:
  ```python
  {
    'anchor': 'Astrology: I am a Capricorn Sun Cap moon and cap rising...what does that say about me?',
    'positive': "I'm a triple Capricorn (Sun, Moon and ascendant in Capricorn) What does this say about me?",
  }
  ```
* Collection strategy: Filtering away the "different" options from the `pair-class` subset, removing the label column, and renaming the columns.
* Deduplicated: No

### `triplet-all` subset

* Columns: "anchor", "positive", "negative"
* Column types: `str`, `str`, `str`
* Examples:
  ```python
  {
    'anchor': 'Why in India do we not have one on one political debate as in USA?',
    'positive': 'Why cant we have a public debate between politicians in India like the one in US?',
    'negative': 'Can people on Quora stop India Pakistan debate? We are sick and tired seeing this everyday in bulk?',
  }
  ```
* Collection strategy: Taken from [embedding-training-data](https://huggingface.co/datasets/sentence-transformers/embedding-training-data), which states: "Duplicate question pairs from Quora with additional hard negatives (mined & denoised by cross-encoder)". Then, take all possible triplets from each sample.
* Deduplicated: No

### `triplet` subset

* Columns: "anchor", "positive", "negative"
* Column types: `str`, `str`, `str`
* Examples:
  ```python
  {
    'anchor': 'Why in India do we not have one on one political debate as in USA?',
    'positive': 'Why cant we have a public debate between politicians in India like the one in US?',
    'negative': 'Can people on Quora stop India Pakistan debate? We are sick and tired seeing this everyday in bulk?',
  }
  ```
* Collection strategy: Taken from [embedding-training-data](https://huggingface.co/datasets/sentence-transformers/embedding-training-data), which states: "Duplicate question pairs from Quora with additional hard negatives (mined & denoised by cross-encoder)". Then, take the anchor, positive and the first negative of each sample.
* Deduplicated: No
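
The collection strategies described for the `pair`, `triplet-all` and `triplet` subsets can be sketched in plain Python. This is an illustration under stated assumptions, not the script that built the dataset: the helper names `to_pairs`, `to_triplets_all` and `to_triplet` are hypothetical, and the `query` / `pos` / `neg` layout of a mined sample is assumed, since the card does not describe the source record format. Taking the first positive in `to_triplet` is likewise an assumption.

```python
from itertools import product


def to_pairs(rows):
    """Keep only rows labelled 1 ("duplicate"), drop the label, rename columns."""
    return [
        {"anchor": row["sentence1"], "positive": row["sentence2"]}
        for row in rows
        if row["label"] == 1
    ]


def to_triplets_all(sample):
    """Build every (anchor, positive, negative) combination from one sample.

    Assumes a sample shaped like {"query": str, "pos": [str], "neg": [str]}.
    """
    return [
        {"anchor": sample["query"], "positive": pos, "negative": neg}
        for pos, neg in product(sample["pos"], sample["neg"])
    ]


def to_triplet(sample):
    """Build one triplet per sample: anchor, first positive, first negative."""
    return {
        "anchor": sample["query"],
        "positive": sample["pos"][0],
        "negative": sample["neg"][0],
    }


# Toy data, invented purely for illustration.
pair_class_rows = [
    {"sentence1": "How do I learn Python?", "sentence2": "How can I learn Python?", "label": 1},
    {"sentence1": "How do I learn Python?", "sentence2": "What is a python snake?", "label": 0},
]
mined_sample = {
    "query": "How do I learn Python?",
    "pos": ["How can I learn Python?", "What is the best way to learn Python?"],
    "neg": ["What is a python snake?", "How do I learn Java?"],
}

pairs = to_pairs(pair_class_rows)
all_triplets = to_triplets_all(mined_sample)
first_triplet = to_triplet(mined_sample)
print(len(pairs), len(all_triplets), first_triplet["negative"])
```

Under these assumptions, a sample with several positives and negatives contributes many rows to `triplet-all` but only one row to `triplet`, which is why the two subsets can share the same example while differing in size.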

Provider: maas
Created: 2025-01-06