sentence-transformers/wikianswers-duplicates
Source: Hugging Face. Updated 2024-05-01; indexed 2024-06-12.
Download link:
https://hf-mirror.com/datasets/sentence-transformers/wikianswers-duplicates
Resource description:
---
language:
- en
multilinguality:
- monolingual
size_categories:
- 100M<n<1B
task_categories:
- feature-extraction
- sentence-similarity
pretty_name: WikiAnswers Duplicate Questions
tags:
- sentence-transformers
dataset_info:
  config_name: pair
  features:
  - name: anchor
    dtype: string
  - name: positive
    dtype: string
  splits:
  - name: train
    num_bytes: 78825722188
    num_examples: 761379586
  download_size: 33891162136
  dataset_size: 78825722188
configs:
- config_name: pair
  data_files:
  - split: train
    path: pair/train-*
---
# Dataset Card for WikiAnswers Duplicate Questions
This dataset contains duplicate questions from the [WikiAnswers Corpus](https://github.com/afader/oqa#wikianswers-corpus), formatted to be easily used with Sentence Transformers to train embedding models.
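As a rough sketch of how the data might be inspected, the snippet below streams a few rows instead of downloading the whole `train` split, which the metadata lists at roughly 33.9 GB. It assumes the Hugging Face `datasets` library is installed and uses the repository name shown at the top of this page; the helper name `peek_pairs` is hypothetical and not part of any library.

```python
from itertools import islice

# Repository and config names as they appear in this dataset card.
DATASET_ID = "sentence-transformers/wikianswers-duplicates"
CONFIG_NAME = "pair"


def peek_pairs(n: int = 3):
    """Return the first n rows of the train split without a full download.

    Hypothetical helper for illustration; requires network access when called.
    """
    # Imported lazily so this module loads even if `datasets` is not installed.
    from datasets import load_dataset

    # streaming=True yields rows lazily instead of materializing the split.
    stream = load_dataset(DATASET_ID, CONFIG_NAME, split="train", streaming=True)
    return list(islice(stream, n))


# Example usage (needs network access):
# for row in peek_pairs():
#     print(row["anchor"], "->", row["positive"])
```

Each returned row should be a dict with the `anchor` and `positive` string columns declared in the metadata above.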
## Dataset Subsets
### `pair` subset
* Columns: "anchor", "positive"
* Column types: `str`, `str`
* Examples:
```python
{
    'anchor': 'How many calories is in a handful of strawberries?',
    'positive': 'How many calories are in a strawberry popsickles?',
}
```
* Collection strategy: Reading the WikiAnswers dataset from [embedding-training-data](https://huggingface.co/datasets/sentence-transformers/embedding-training-data), which has lists of duplicate questions. I've treated each pair of adjacent questions in a list as a positive pair, plus the pair formed by the last and first question. So, for example, a list of 5 duplicate questions results in 5 duplicate pairs.
* Deduplicated: No
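
The collection strategy described in the list above can be sketched as follows. This is a minimal illustration, not the script that was used to build the dataset: the function name `adjacent_pairs` and the sample questions are hypothetical, and both the ordering within the wrap-around pair and the handling of lists with fewer than three questions are assumptions.

```python
from typing import Dict, List


def adjacent_pairs(questions: List[str]) -> List[Dict[str, str]]:
    """Pair each question with its successor, wrapping the last back to the first.

    Hypothetical helper: a list of n >= 3 duplicate questions yields n pairs.
    """
    n = len(questions)
    if n < 2:
        # Assumption: a lone question has nothing to be paired with.
        return []
    if n == 2:
        # Assumption: avoid emitting the same two questions twice in swapped order.
        return [{"anchor": questions[0], "positive": questions[1]}]
    return [
        {"anchor": questions[i], "positive": questions[(i + 1) % n]}
        for i in range(n)
    ]


# Made-up cluster of duplicate questions, for illustration only.
cluster = ["q1", "q2", "q3", "q4", "q5"]
pairs = adjacent_pairs(cluster)
print(len(pairs))  # 5
```

Because the subset is not deduplicated, the same pair may appear more than once when source lists overlap.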
Provided by:
sentence-transformers
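Summary of original information
WikiAnswers Duplicate Questions dataset overview
Basic information
- Language: English
- Multilinguality: monolingual
- Size category: 100M<n<1B
- Task categories: feature extraction, sentence similarity
- Tags: sentence-transformers
Dataset details
- Config name: pair
- Features:
  - anchor: string
  - positive: string
- Splits:
  - train:
    - Bytes: 78825722188
    - Examples: 761379586
- Download size: 33891162136
- Dataset size: 78825722188
Dataset subsets
pair subset
- Columns: "anchor", "positive"
- Column types: string, string
- Example: `{'anchor': 'How many calories is in a handful of strawberries?', 'positive': 'How many calories are in a strawberry popsickles?'}`
- Collection strategy: The WikiAnswers dataset is read from embedding-training-data, which contains lists of duplicate questions. Each pair of adjacent questions is treated as a positive pair, plus the pair formed by the last and first question. For example, a list of 5 duplicate questions yields 5 duplicate pairs.
- Deduplicated: No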



