wikipedia-sections
Source: ModelScope community. Updated 2025-11-12; indexed 2025-01-11.
Download link:
https://modelscope.cn/datasets/sentence-transformers/wikipedia-sections
Resource description:
# Dataset Card for Wikipedia Sections
This dataset contains pairs and triplets that can be used to train and finetune Sentence Transformer embedding models. The dataset originates from [Dor et al.](https://aclanthology.org/P18-2009.pdf), and was downloaded from [this download link](https://sbert.net/datasets/wikipedia-sections-triplets.zip).
Notably, the "anchor" column contains sentences from Wikipedia, whereas the "positive" column contains other sentences from the same section. The "negative" column contains sentences from other sections.
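To make the column roles concrete, here is a minimal sketch of how such triplets could be assembled from text that is already grouped by section. It is not the procedure used by Dor et al. or by the dataset maintainers: the helper name `build_triplets`, the toy input, and the random sampling strategy are all assumptions made for illustration.

```python
import random


def build_triplets(sections, seed=0):
    """Build (anchor, positive, negative) rows from sentences grouped by section.

    `sections` maps a section title to its list of sentences. This is a
    hypothetical helper for illustration, not the original construction code.
    """
    rng = random.Random(seed)
    triplets = []
    for title, sentences in sections.items():
        # Sentences from every *other* section are the negative candidates.
        negatives = [
            s for other, sents in sections.items() if other != title for s in sents
        ]
        if len(sentences) < 2 or not negatives:
            continue  # need a same-section partner and at least one negative
        for i, anchor in enumerate(sentences):
            # Positive: a different sentence drawn from the same section.
            positive = rng.choice(sentences[:i] + sentences[i + 1:])
            negative = rng.choice(negatives)
            triplets.append(
                {"anchor": anchor, "positive": positive, "negative": negative}
            )
    return triplets


# Toy input invented for this sketch; it is not taken from the dataset.
toy_sections = {
    "History": ["The city was founded in 1200.", "It grew rapidly in the 1800s."],
    "Geography": ["The city lies on a river.", "Hills surround it to the north."],
}
rows = build_triplets(toy_sections)
```

Each resulting row has the same three keys as the `triplet` subset described below, so a model trained on it should learn to place same-section sentences closer together than cross-section ones.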
## Dataset Subsets
### `pair` subset
* Columns: "anchor", "positive"
* Column types: `str`, `str`
* Examples:
```python
```
* Collection strategy: Reading the Wikipedia Sections dataset from https://sbert.net.
* Deduplicated: Yes
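The card does not say how deduplication was performed. As one plausible reading, the sketch below drops exact repeats of an (anchor, positive) row while keeping first occurrences in their original order. The helper name `dedupe_pairs` and the sample rows are hypothetical.

```python
def dedupe_pairs(rows):
    """Drop exact duplicate (anchor, positive) rows, keeping the first occurrence."""
    seen = set()
    unique = []
    for row in rows:
        key = (row["anchor"], row["positive"])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Invented rows for illustration only.
sample = [
    {"anchor": "A", "positive": "B"},
    {"anchor": "A", "positive": "B"},  # exact duplicate, dropped
    {"anchor": "B", "positive": "A"},  # same sentences, roles swapped, kept
]
unique_rows = dedupe_pairs(sample)
```

Whether role-swapped rows should also count as duplicates is a design choice this sketch leaves open; it treats them as distinct.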
### `triplet` subset
* Columns: "anchor", "positive", "negative"
* Column types: `str`, `str`, `str`
* Examples:
```python
```
* Collection strategy: Reading the Wikipedia Sections dataset from https://sbert.net.
* Deduplicated: Yes
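Either subset could be loaded along the following lines. This is a sketch under stated assumptions rather than an official recipe: it assumes the Hugging Face `datasets` library is installed, that the dataset is available on the Hugging Face Hub under the same repository id as on ModelScope, and that each subset exposes a `train` split. The `SUBSETS` table and the `load_subset` helper are hypothetical names introduced here.

```python
# Column names per subset, as listed in this card.
SUBSETS = {
    "pair": ("anchor", "positive"),
    "triplet": ("anchor", "positive", "negative"),
}


def load_subset(name, split="train"):
    """Load one subset and check that it has the columns listed in this card."""
    if name not in SUBSETS:
        raise ValueError(f"unknown subset {name!r}; expected one of {sorted(SUBSETS)}")
    # Imported lazily so the name check above works without `datasets` installed.
    from datasets import load_dataset

    # Assumption: the Hub repository id mirrors the ModelScope one,
    # and a "train" split exists for each subset.
    ds = load_dataset("sentence-transformers/wikipedia-sections", name, split=split)
    missing = set(SUBSETS[name]) - set(ds.column_names)
    if missing:
        raise ValueError(f"subset {name!r} is missing columns: {sorted(missing)}")
    return ds
```

Calling `load_subset("triplet")` would then return a dataset whose rows are dictionaries with `anchor`, `positive`, and `negative` string fields, provided the assumptions above hold.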
Provider:
maas
Created:
2025-01-06



