FremyCompany/OS-STS-nl-Dataset

Name: FremyCompany/OS-STS-nl-Dataset
Creator: FremyCompany
Published: 2023-09-22 08:36:12
License: No description available

Source: Hugging Face. Updated: 2023-09-22. Indexed: 2024-03-04.

Download link:

https://hf-mirror.com/datasets/FremyCompany/OS-STS-nl-Dataset


Resource description:

---
license: other
task_categories:
- sentence-similarity
language:
- nl
pretty_name: OpenSubtitles STS Dataset for Dutch
size_categories:
- 1M<n<10M
---

# OpenSubtitles STS Dataset for Dutch

OS-STS.nl is an extensive Dutch STS dataset containing over two million sentence pairs and similarity scores. The dataset is automatically extracted from movie and documentary subtitles sourced from OpenSubtitles2018, a vast parallel corpus of aligned video subtitles. Recognizing the high prevalence (>10%) of paraphrased statements and question-and-answer pairs in subtitled spoken language, we systematically extract the consecutive parallel sentence pairs from the subtitles that exhibit significant semantic overlap.

## Content of the dataset

The dataset contains Dutch sentence pairs, as well as semantic similarity scores computed on their English translations with sentence-transformers/all-mpnet-base-v2.

<div style="max-width: 480px">

![Coming soon](https://www.wallpaperup.com/uploads/wallpapers/2014/09/26/457767/e1d423323979a1586dfc8c87cd3a5ee0.jpg)

</div>

**Coming soon**

Provided by:

FremyCompany

Summary of original information

OpenSubtitles STS Dataset for Dutch

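Overview

OS-STS.nl is a Dutch STS (semantic textual similarity) dataset containing over two million sentence pairs with similarity scores. It is extracted from OpenSubtitles2018, a large parallel corpus of aligned video subtitles.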

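Data source

The dataset is extracted automatically from movie and documentary subtitles in OpenSubtitles2018. Because paraphrased statements and question-and-answer pairs are highly prevalent (>10%) in subtitled spoken language, consecutive parallel sentence pairs with significant semantic overlap are extracted systematically.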
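
The selection of consecutive sentence pairs can be illustrated with a minimal sketch. This is not the authors' pipeline: the `consecutive_pairs` helper, the `word_overlap` scorer, and the 0.5 threshold are all hypothetical, chosen only to show the idea of keeping neighbouring subtitle lines whose similarity exceeds a cut-off. The dataset itself relies on embedding-based scores rather than word overlap.

```python
from typing import Callable, List, Tuple


def consecutive_pairs(
    sentences: List[str],
    score: Callable[[str, str], float],
    threshold: float = 0.5,  # hypothetical cut-off, not taken from the dataset card
) -> List[Tuple[str, str, float]]:
    """Keep each pair of neighbouring sentences whose score reaches the threshold."""
    pairs = []
    for first, second in zip(sentences, sentences[1:]):
        s = score(first, second)
        if s >= threshold:
            pairs.append((first, second, s))
    return pairs


def word_overlap(a: str, b: str) -> float:
    """Toy stand-in scorer: Jaccard overlap of the lowercase word sets."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


# Three consecutive subtitle lines (made-up examples).
subtitles = [
    "Waar ga je heen",
    "Waar ga je nu heen",
    "Het weer is mooi",
]
selected = consecutive_pairs(subtitles, word_overlap, threshold=0.5)
```

Only the first two lines form a pair here, since the third shares no words with its predecessor. Swapping `word_overlap` for an embedding-based scorer would bring the sketch closer to the approach the card describes.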

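Data content

The dataset contains Dutch sentence pairs and their semantic similarity scores; the scores are computed on the English translations of the sentences using the sentence-transformers/all-mpnet-base-v2 model.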
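
A rough sketch of how such a score could be computed is shown below. It assumes the score is the cosine similarity between the embeddings of the two English translations; the dataset card does not state the exact metric, so treat that as an assumption. `score_pair` and `ToyModel` are hypothetical names introduced for illustration, and `ToyModel` only stands in for a real encoder so the example runs without downloading a model.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def score_pair(model, english_a: str, english_b: str) -> float:
    """Embed two English sentences with `model.encode` and compare the embeddings."""
    emb_a, emb_b = model.encode([english_a, english_b])
    return cosine_similarity(emb_a, emb_b)


class ToyModel:
    """Hypothetical stand-in encoder: maps a sentence to (length, word count)."""

    def encode(self, sentences):
        return [[float(len(s)), float(s.count(" ") + 1)] for s in sentences]


toy_score = score_pair(ToyModel(), "Where are you going?", "Where are you headed?")

# With the sentence-transformers package installed, the real model could be
# plugged in instead of ToyModel (this downloads the model weights):
#   from sentence_transformers import SentenceTransformer
#   model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
#   score = score_pair(model, "Where are you going?", "Where are you headed?")
```

Passing the encoder in as a parameter keeps the scoring logic independent of any particular model, which makes it easy to test with a stub.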
