
wikianswers-duplicates

ModelScope community. Updated: 2025-11-07. Indexed: 2025-01-11.
Download link:
https://modelscope.cn/datasets/sentence-transformers/wikianswers-duplicates
Description:
# Dataset Card for WikiAnswers Duplicate Questions

This dataset contains duplicate questions from the [WikiAnswers Corpus](https://github.com/afader/oqa#wikianswers-corpus), formatted to be easily used with Sentence Transformers to train embedding models.

## Dataset Subsets

### `pair` subset

* Columns: "anchor", "positive"
* Column types: `str`, `str`
* Examples:
    ```python
    {
      'anchor': 'How many calories is in a handful of strawberries?',
      'positive': 'How many calories are in a strawberry popsickles?',
    }
    ```
* Collection strategy: Reading the WikiAnswers dataset from [embedding-training-data](https://huggingface.co/datasets/sentence-transformers/embedding-training-data), which has lists of duplicate questions. I've considered all adjacent questions as a positive pair, plus the last and first question. So, e.g., 5 duplicate questions result in 5 duplicate pairs.
* Deduplicated: No
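
The collection strategy described for the `pair` subset can be illustrated with a minimal sketch. This is not the script used to build the dataset: the function name `cyclic_adjacent_pairs` is hypothetical, and the direction of the wrap-around pair (last question as anchor, first as positive) is an assumption, since the card only says the last and first questions are paired.

```python
from typing import Dict, List


def cyclic_adjacent_pairs(questions: List[str]) -> List[Dict[str, str]]:
    """Pair each question with the next one, wrapping the last back to the first.

    Hypothetical helper for illustration only; the dataset's real build
    script is not shown on the card.
    """
    n = len(questions)
    if n < 2:
        # A single question has no duplicate to pair with.
        return []
    pairs = []
    for i in range(n):
        # (i + 1) % n wraps the final index back to 0, which yields the
        # "last and first" pair. Its direction is an assumption.
        pairs.append({"anchor": questions[i], "positive": questions[(i + 1) % n]})
    return pairs


group = ["q1", "q2", "q3", "q4", "q5"]
pairs = cyclic_adjacent_pairs(group)
print(len(pairs))
```

Under this scheme a group of n duplicates (n ≥ 2) produces n pairs, which matches the card's example of 5 questions giving 5 pairs. Because the subset is not deduplicated, the same pair may appear more than once across groups.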

Provider:
maas
Created:
2025-01-06