
FremyCompany/OS-STS-nl-Dataset

Hugging Face | Updated 2023-09-22 | Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/FremyCompany/OS-STS-nl-Dataset
Resource description:
---
license: other
task_categories:
- sentence-similarity
language:
- nl
pretty_name: OpenSubtitles STS Dataset for Dutch
size_categories:
- 1M<n<10M
---

# OpenSubtitles STS Dataset for Dutch

OS-STS.nl is an extensive Dutch STS dataset containing over two million sentence pairs and similarity scores. The dataset is automatically extracted from movie and documentary subtitles sourced from OpenSubtitles2018, a vast parallel corpus of aligned video subtitles. Recognizing the high prevalence (>10%) of paraphrased statements and question-and-answer pairs in subtitled spoken language, we systematically extract the consecutive parallel sentence pairs from the subtitles that exhibit significant semantic overlap.

## Content of the dataset

The dataset contains Dutch sentence pairs, as well as semantic similarity scores computed on their English translations using sentence-transformers/all-mpnet-base-v2.

<div style="max-width: 480px">

![Coming soon](https://www.wallpaperup.com/uploads/wallpapers/2014/09/26/457767/e1d423323979a1586dfc8c87cd3a5ee0.jpg)

</div>

**Coming soon**
Provider:
FremyCompany
Summary of original information

OpenSubtitles STS Dataset for Dutch

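Overview

OS-STS.nl is a Dutch STS (semantic textual similarity) dataset containing over two million sentence pairs with similarity scores. It is extracted from OpenSubtitles2018, a large parallel corpus of aligned video subtitles.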

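Data source

The dataset is automatically extracted from movie and documentary subtitles sourced from OpenSubtitles2018. Because paraphrased statements and question-and-answer pairs are highly prevalent (>10%) in subtitled spoken language, consecutive parallel sentence pairs with significant semantic overlap are extracted systematically.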
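
The listing does not describe the extraction procedure in detail, so the following is only a minimal sketch of the general idea: walk through consecutive subtitle lines and keep the pairs whose overlap passes a cutoff. The `token_overlap` function, the `threshold` value of 0.3, and the sample subtitles are all hypothetical stand-ins, not the authors' method. Scoring is done on the English side of each parallel line, and the Dutch sentences are what gets kept.

```python
from typing import Callable, List, Tuple

ParallelLine = Tuple[str, str]  # (Dutch sentence, English translation)


def token_overlap(a: str, b: str) -> float:
    """Stand-in similarity: Jaccard overlap of lowercase whitespace tokens.

    This is an assumption for illustration only, not the dataset's scorer.
    """
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def consecutive_pairs(
    lines: List[ParallelLine],
    similarity: Callable[[str, str], float] = token_overlap,
    threshold: float = 0.3,  # hypothetical cutoff
) -> List[Tuple[str, str, float]]:
    """Keep consecutive Dutch pairs whose English sides score >= threshold."""
    kept = []
    for (nl_prev, en_prev), (nl_curr, en_curr) in zip(lines, lines[1:]):
        score = similarity(en_prev, en_curr)
        if score >= threshold:
            kept.append((nl_prev, nl_curr, score))
    return kept


# Made-up subtitle lines, in on-screen order.
subtitles = [
    ("Waar ga je heen?", "Where are you going?"),
    ("Waar ga ik heen?", "Where am I going?"),
    ("Naar het station.", "To the station."),
    ("De trein vertrekt om twaalf uur.", "The train leaves at noon."),
]
pairs = consecutive_pairs(subtitles)
```

In this toy run only the first two lines form a kept pair. Any callable that maps two strings to a score can be passed as `similarity`, which keeps the pairing loop independent of the scoring method.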

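Data content

The dataset contains Dutch sentence pairs together with semantic similarity scores, which are derived from the sentences' English translations using the sentence-transformers/all-mpnet-base-v2 model.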
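
As a rough illustration of how an embedding model can yield a similarity score for a pair of English translations, the sketch below computes the cosine similarity between two sentence vectors. Cosine similarity is an assumption here, since the listing does not state which measure was used. The `toy_encode` bag-of-words encoder is a hypothetical stand-in so that the example runs without downloading a model.

```python
import math
from collections import Counter
from typing import Callable, List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def toy_encode(sentences: List[str]) -> List[List[float]]:
    """Stand-in encoder (assumption): word counts over the shared vocabulary."""
    vocab = sorted({tok for s in sentences for tok in s.lower().split()})
    counts = [Counter(s.lower().split()) for s in sentences]
    return [[float(c[tok]) for tok in vocab] for c in counts]


def score_pair(en_a: str, en_b: str, encode: Callable = toy_encode) -> float:
    """Embed both English sentences and return their cosine similarity."""
    vec_a, vec_b = encode([en_a, en_b])
    return cosine_similarity(vec_a, vec_b)


# Swapping in the model named in the dataset card (needs the
# sentence-transformers package and a model download):
#   from sentence_transformers import SentenceTransformer
#   model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
#   score = score_pair("Where are you going?", "Where am I going?",
#                      encode=model.encode)

score = score_pair("the cat sleeps", "the cat runs")
```

The encoder is passed in as a parameter so that the same scoring code works with the toy encoder or with a real sentence-embedding model.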
