wikianswers-duplicates
ModelScope community · Updated 2025-11-07 · Indexed 2025-01-11
Download link:
https://modelscope.cn/datasets/sentence-transformers/wikianswers-duplicates
Resource description:
# Dataset Card for WikiAnswers Duplicate Questions
This dataset contains duplicate questions from the [WikiAnswers Corpus](https://github.com/afader/oqa#wikianswers-corpus), formatted to be easily used with Sentence Transformers to train embedding models.
## Dataset Subsets
### `pair` subset
* Columns: "anchor", "positive"
* Column types: `str`, `str`
* Examples:
```python
{
'anchor': 'How many calories is in a handful of strawberries?',
'positive': 'How many calories are in a strawberry popsickles?',
}
```
* Collection strategy: Reading the WikiAnswers dataset from [embedding-training-data](https://huggingface.co/datasets/sentence-transformers/embedding-training-data), which has lists of duplicate questions. I've considered every two adjacent questions in a list as a positive pair, plus the pair formed by the last and first questions. So, for example, a list of 5 duplicate questions results in 5 positive pairs.
* Deduplicated: No
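
The collection strategy above can be sketched as follows. This is a minimal illustration, not the script used to build the dataset: the function name `ring_pairs`, the direction of the wrap-around pair (last question as anchor, first as positive), and the handling of lists with fewer than two questions are all assumptions.

```python
from typing import Dict, List


def ring_pairs(duplicates: List[str]) -> List[Dict[str, str]]:
    """Pair each question with its successor, wrapping the last back to the first.

    Hypothetical helper: a list of n >= 2 duplicate questions yields n pairs.
    """
    n = len(duplicates)
    if n < 2:
        # Assumption: a single question cannot form a pair, so nothing is emitted.
        return []
    return [
        {"anchor": duplicates[i], "positive": duplicates[(i + 1) % n]}
        for i in range(n)
    ]


if __name__ == "__main__":
    questions = ["q1", "q2", "q3", "q4", "q5"]
    for pair in ring_pairs(questions):
        print(pair)
```

Because no deduplication is applied, identical pairs can appear more than once if the source lists overlap.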
Provider:
maas
Created:
2025-01-06



