Arabella25/OpenSubtitles
Source: Hugging Face | Updated: 2025-08-03 | Indexed: 2025-10-25
Download link:
https://hf-mirror.com/datasets/Arabella25/OpenSubtitles
Description:
OpenSubtitles Clean is a cleaned and structured parallel corpus derived from the OpenSubtitles collection, prepared for fine-tuning and supervised instruction tuning of large language models (LLMs). It targets multilingual and English-centric generative models such as Qwen3-1.7B and supports both translation and instruction-following tasks. The source data comes from the OpenSubtitles project (OPUS), which contains movie and TV subtitles in multiple languages. The data has been deduplicated, stripped of non-textual markup, and cleaned of obvious noise (e.g. speaker tags, timecodes, HTML tags, and excessively short or long fragments). Incomplete, corrupted, or low-quality entries have been filtered out. Each entry is a JSON object with two keys: prompt (the source sentence or context, which may include system instructions for instruction tuning) and response (the target translation or model answer).
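The cleaning steps and record layout described above can be illustrated with a minimal Python sketch. It is not the dataset author's actual pipeline: the regular expressions, the length thresholds (`MIN_CHARS`, `MAX_CHARS`), and the helper names `clean_line`, `build_records`, and `to_jsonl` are all assumptions made for illustration, and the one-object-per-line (JSON Lines) output is likewise assumed rather than stated by the description. Only the `prompt` and `response` keys come from the description itself.

```python
import html
import json
import re

# Hypothetical thresholds: the description only says "excessively short or
# long fragments" are removed and does not give the actual limits.
MIN_CHARS = 2
MAX_CHARS = 300

# Heuristic patterns (assumptions, not the author's rules).
TAG_RE = re.compile(r"<[^>]+>")  # HTML-style tags such as <i>...</i>
TIMECODE_RE = re.compile(
    r"\d{2}:\d{2}:\d{2}[,.]\d{3}(?:\s*-->\s*\d{2}:\d{2}:\d{2}[,.]\d{3})?"
)  # SRT-style timecodes, alone or as a "start --> end" pair
SPEAKER_RE = re.compile(r"^\s*(?:-\s*)?[A-Z][A-Z .'-]{1,30}:\s*")  # e.g. "JOHN: "


def clean_line(text: str) -> str:
    """Strip tags, timecodes, and a leading speaker label; collapse whitespace."""
    text = html.unescape(text)
    text = TAG_RE.sub("", text)
    text = TIMECODE_RE.sub("", text)
    text = SPEAKER_RE.sub("", text)
    return " ".join(text.split())


def build_records(pairs):
    """Turn (source, target) pairs into deduplicated prompt/response records."""
    seen = set()
    records = []
    for src, tgt in pairs:
        src, tgt = clean_line(src), clean_line(tgt)
        # Drop fragments outside the (assumed) length window.
        if not (MIN_CHARS <= len(src) <= MAX_CHARS):
            continue
        if not (MIN_CHARS <= len(tgt) <= MAX_CHARS):
            continue
        # Exact-match deduplication on the cleaned pair.
        key = (src, tgt)
        if key in seen:
            continue
        seen.add(key)
        records.append({"prompt": src, "response": tgt})
    return records


def to_jsonl(records) -> str:
    """Serialize records as one JSON object per line, keeping non-ASCII text."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)
```

Calling `to_jsonl(build_records(pairs))` on a list of aligned subtitle pairs yields one `{"prompt": ..., "response": ...}` object per line. The speaker-label pattern is deliberately conservative (upper-case labels only) and will miss other conventions, so a real pipeline would likely need language-specific rules.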
Provider:
Arabella25



