wikipedia-sections

Name: wikipedia-sections
Creator: maas
Published: 2025-11-12 16:19:48
License: No description provided

ModelScope community, updated 2025-11-12, indexed 2025-01-11

Download link:

https://modelscope.cn/datasets/sentence-transformers/wikipedia-sections

Resource overview:

# Dataset Card for Wikipedia Sections

This dataset contains pairs and triplets that can be used to train and finetune Sentence Transformer embedding models. The dataset originates from [Dor et al.](https://aclanthology.org/P18-2009.pdf), and was downloaded from [this download link](https://sbert.net/datasets/wikipedia-sections-triplets.zip). Notably, the "anchor" column contains sentences from Wikipedia, whereas the "positive" column contains other sentences from the same section. The "negative" column contains sentences from other sections.

## Dataset Subsets

### `pair` subset

* Columns: "anchor", "positive"
* Column types: `str`, `str`
* Examples:
  ```python
  ```
* Collection strategy: Reading the Wikipedia Sections dataset from https://sbert.net.
* Deduplicated: Yes

### `triplet` subset

* Columns: "anchor", "positive", "negative"
* Column types: `str`, `str`, `str`
* Examples:
  ```python
  ```
* Collection strategy: Reading the Wikipedia Sections dataset from https://sbert.net.
* Deduplicated: Yes
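To make the column semantics concrete, here is a minimal sketch in plain Python that builds rows of the same shape from toy data. The helper names `build_triplets` and `to_pairs`, the toy sentences, and the random sampling are all assumptions made for illustration; the sketch does not reproduce how Dor et al. sampled the real data, and deriving pairs by dropping the "negative" column is only one plausible reading of how the two subsets relate.

```python
import random
from typing import Dict, List


def build_triplets(sections: Dict[str, List[str]], seed: int = 0) -> List[dict]:
    """Build toy (anchor, positive, negative) rows from {section title: sentences}.

    Hypothetical helper: illustrates the column semantics only.
    """
    rng = random.Random(seed)
    rows = []
    for title, sentences in sections.items():
        # Sentences from every other section are candidate negatives.
        others = [s for t, ss in sections.items() if t != title for s in ss]
        if len(sentences) < 2 or not others:
            continue  # need a same-section partner and at least one outsider
        for i, anchor in enumerate(sentences):
            # Positive: another sentence from the same section.
            positive = rng.choice(sentences[:i] + sentences[i + 1:])
            # Negative: a sentence from a different section.
            negative = rng.choice(others)
            rows.append({"anchor": anchor, "positive": positive, "negative": negative})
    return rows


def to_pairs(triplets: List[dict]) -> List[dict]:
    """Drop the negative column and remove duplicate pairs, keeping first-seen order."""
    seen = set()
    pairs = []
    for row in triplets:
        key = (row["anchor"], row["positive"])
        if key not in seen:
            seen.add(key)
            pairs.append({"anchor": key[0], "positive": key[1]})
    return pairs


# Toy stand-in for Wikipedia sections (invented sentences, not real data).
toy_sections = {
    "History": ["The city was founded in 1200.", "It grew rapidly after 1850."],
    "Geography": ["The city lies on a river.", "Hills surround it.", "Winters are mild."],
}

triplets = build_triplets(toy_sections)
pairs = to_pairs(triplets)
print(triplets[0])
print(len(triplets), len(pairs))
```

Each resulting row has `str` values under the keys "anchor", "positive", and "negative", matching the column layout listed for the `triplet` subset, while `to_pairs` yields rows with only "anchor" and "positive".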

Provider:

maas

Created:

2025-01-06

Collected summary

Dataset introduction

Background and challenges

Background overview

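This dataset contains sentence pairs and triplets for training Sentence Transformer embedding models. The data comes from Wikipedia: the 'anchor' column holds Wikipedia sentences, the 'positive' column holds other sentences from the same section, and the 'negative' column holds sentences from other sections. The dataset provides two subsets, pair and triplet, and is released under the Apache 2.0 license.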

The content above was collected and summarized by 遇见数据集.