FremyCompany/OS-STS-nl-Dataset
Source: Hugging Face. Updated 2023-09-22; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/FremyCompany/OS-STS-nl-Dataset
Resource description:
---
license: other
task_categories:
- sentence-similarity
language:
- nl
pretty_name: OpenSubtitles STS Dataset for Dutch
size_categories:
- 1M<n<10M
---
# OpenSubtitles STS Dataset for Dutch
OS-STS.nl is an extensive Dutch STS dataset containing over two million sentence pairs and similarity scores.
The dataset is automatically extracted from movie and documentary subtitles sourced from OpenSubtitles2018, a vast parallel corpus of aligned video subtitles.
Recognizing the high prevalence (>10%) of paraphrased statements and question-and-answer pairs in subtitled spoken language, we systematically extract the consecutive parallel sentence pairs from the subtitles that exhibit significant semantic overlap.
## Content of the dataset
The dataset contains Dutch sentence pairs, along with semantic similarity scores computed on their English translations using sentence-transformers/all-mpnet-base-v2.
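
As a rough illustration of how such a score might be obtained, the sketch below compares two embedding vectors by cosine similarity. This is an assumption: the card does not state which similarity measure was used, and cosine similarity is simply the usual choice for sentence-transformers embeddings. The vectors `emb_first` and `emb_second` are toy placeholders, not real model output.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Convention chosen for this sketch: a zero vector is similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical 3-dimensional stand-ins for the embeddings of the English
# translations of a Dutch sentence pair (real embeddings are much larger).
emb_first = [1.0, 0.0, 1.0]
emb_second = [1.0, 1.0, 0.0]

score = cosine_similarity(emb_first, emb_second)
print(f"similarity score: {score:.3f}")
```

With the sentence-transformers package installed, real embeddings could be produced by `SentenceTransformer("sentence-transformers/all-mpnet-base-v2").encode(...)` and passed to the same function.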
**Coming soon**
Provided by:
FremyCompany
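Summary of original information
OpenSubtitles STS Dataset for Dutch
Overview
OS-STS.nl is a Dutch STS (semantic textual similarity) dataset containing over two million sentence pairs and their similarity scores. It is extracted from OpenSubtitles2018, a large parallel corpus of aligned video subtitles.
Data source
The dataset is automatically extracted from movie and documentary subtitles sourced from OpenSubtitles2018. Because spoken language in subtitles has a high proportion (>10%) of paraphrases and question-and-answer pairs, consecutive parallel sentence pairs with significant semantic overlap were systematically extracted.
Data content
The dataset contains Dutch sentence pairs and their semantic similarity scores, which are computed on their English translations using the sentence-transformers/all-mpnet-base-v2 model.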



