wikianswers-duplicates

Name: wikianswers-duplicates
Creator: maas
Published: 2025-11-07 16:19:56
License: No description available

ModelScope Community. Updated 2025-11-07; indexed 2025-01-11.

Download link:

https://modelscope.cn/datasets/sentence-transformers/wikianswers-duplicates

Resource description:

# Dataset Card for WikiAnswers Duplicate Questions

This dataset contains duplicate questions from the [WikiAnswers Corpus](https://github.com/afader/oqa#wikianswers-corpus), formatted to be easily used with Sentence Transformers to train embedding models.

## Dataset Subsets

### `pair` subset

* Columns: "anchor", "positive"
* Column types: `str`, `str`
* Examples:
    ```python
    {
      'anchor': 'How many calories is in a handful of strawberries?',
      'positive': 'How many calories are in a strawberry popsickles?',
    }
    ```
* Collection strategy: Reading the WikiAnswers dataset from [embedding-training-data](https://huggingface.co/datasets/sentence-transformers/embedding-training-data), which has lists of duplicate questions. Each pair of adjacent questions in a list is treated as a positive pair, and the last question is additionally paired with the first. So, for example, a list of 5 duplicate questions results in 5 positive pairs.
* Deduplicated: No
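The collection strategy can be illustrated with a minimal sketch. This is not the script used to build the dataset; the function name `make_pairs` and the sample questions are hypothetical, and the handling of very short lists is an assumption, since the card does not describe it.

```python
from typing import Dict, List


def make_pairs(questions: List[str]) -> List[Dict[str, str]]:
    """Pair each question with its successor, wrapping from the last to the first.

    Hypothetical helper: it follows the card's description literally, so a
    list of n >= 2 questions yields n pairs. For n == 2 this produces both
    orderings (a, b) and (b, a); whether the original pipeline did the same
    is not stated in the card.
    """
    n = len(questions)
    if n < 2:
        # Assumption: a single question has no duplicate to pair with.
        return []
    return [
        {"anchor": questions[i], "positive": questions[(i + 1) % n]}
        for i in range(n)
    ]


# Illustrative input only; these strings are not taken from the dataset.
group = ["q1", "q2", "q3", "q4", "q5"]
pairs = make_pairs(group)
```

With five questions, the sketch produces four adjacent pairs plus one wrap-around pair from the last question back to the first, which matches the count given in the card. Because the subset is not deduplicated, identical pairs coming from different lists would all be kept.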

Provider:

maas

Created:

2025-01-06
