Arabella25/OpenSubtitles

Name: Arabella25/OpenSubtitles
Creator: Arabella25
Published: 2025-08-03 20:06:30
License: No description available

Hugging Face, updated 2025-08-03, indexed 2025-10-25

Download link:

https://hf-mirror.com/datasets/Arabella25/OpenSubtitles

Description:

The OpenSubtitles Clean dataset is a cleaned and structured parallel corpus derived from the OpenSubtitles collection and prepared for fine-tuning and supervised instruction tuning of large language models (LLMs). It targets multilingual and English-centric generative models such as Qwen3-1.7B, and supports both translation and instruction-following tasks. The source is the OpenSubtitles project (OPUS), which contains movie and TV subtitles in multiple languages. The data has been deduplicated, stripped of non-textual markup, and cleaned of obvious noise (e.g. speaker tags, timecodes, HTML tags, and excessively short or long fragments). Incomplete, corrupted, or low-quality entries have been filtered out. Each entry is a JSON object with two keys: prompt (the source sentence or context, which may include system instructions for instruction tuning) and response (the target translation or model answer).
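
The cleaning steps and record layout described above can be sketched roughly as follows. This is only an illustration, not the dataset author's actual pipeline: the regular expressions, the length bounds (MIN_CHARS, MAX_CHARS), and the helper names clean_line and build_records are all assumptions introduced here. Only the prompt and response keys come from the description.

```python
import html
import json
import re

# Assumed patterns; the dataset card does not publish its actual rules.
TAG_RE = re.compile(r"<[^>]+>")  # HTML tags such as <i>...</i>
TIMECODE_RE = re.compile(
    r"\d{1,2}:\d{2}:\d{2}[,.]\d{3}(?:\s*-->\s*\d{1,2}:\d{2}:\d{2}[,.]\d{3})?"
)  # SRT-style timecodes, single or as a "start --> end" range
SPEAKER_RE = re.compile(r"^\s*(?:-\s*)?[A-Z][A-Z .'-]{1,30}:\s*")  # e.g. "JOHN: "

MIN_CHARS, MAX_CHARS = 3, 300  # hypothetical length bounds


def clean_line(text: str) -> str:
    """Strip HTML tags, timecodes, and a leading speaker tag; collapse whitespace."""
    text = html.unescape(text)
    text = TAG_RE.sub("", text)
    text = TIMECODE_RE.sub("", text)
    text = SPEAKER_RE.sub("", text)
    return " ".join(text.split())


def build_records(pairs):
    """Turn (source, target) pairs into deduplicated prompt/response records."""
    seen = set()
    records = []
    for src, tgt in pairs:
        src, tgt = clean_line(src), clean_line(tgt)
        # Drop fragments that are too short or too long after cleaning.
        if not (MIN_CHARS <= len(src) <= MAX_CHARS):
            continue
        if not (MIN_CHARS <= len(tgt) <= MAX_CHARS):
            continue
        # Case-insensitive exact-match deduplication on the cleaned pair.
        key = (src.casefold(), tgt.casefold())
        if key in seen:
            continue
        seen.add(key)
        records.append({"prompt": src, "response": tgt})
    return records


if __name__ == "__main__":
    sample = [
        ("<i>JOHN: Where are you going?</i>", "Wohin gehst du?"),
        ("Where are you going?", "Wohin gehst du?"),  # duplicate after cleaning
        ("Ok", "Ok"),  # too short, filtered out
    ]
    for record in build_records(sample):
        # One JSON object per line, with the two keys named in the description.
        print(json.dumps(record, ensure_ascii=False))
```

In this sketch the three sample pairs collapse to a single record, since the second pair duplicates the first once markup is removed and the third falls below the assumed minimum length.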

Provided by:

Arabella25
