Arabella25/OpenSubtitles

Source: Hugging Face. Updated: 2025-08-03. Indexed: 2025-10-25.
Download link:
https://hf-mirror.com/datasets/Arabella25/OpenSubtitles
Description:
The OpenSubtitles Clean dataset is a cleaned and structured parallel corpus derived from the OpenSubtitles collection and prepared for fine-tuning and supervised instruction tuning of large language models (LLMs). It is intended for multilingual and English-centric generative models such as Qwen3-1.7B, and supports both translation and instruction-following tasks. The data come from the OpenSubtitles project (OPUS), which contains movie and TV subtitles in multiple languages. The data have been deduplicated, stripped of non-textual markup, and cleaned of obvious noise (e.g. speaker tags, timecodes, HTML tags, and excessively short or long fragments). Incomplete, corrupted, or low-quality entries have been filtered out. Each entry is a JSON object with two keys: prompt (the source sentence or context, which may include system instructions for instruction tuning) and response (the target translation or model answer).
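
The cleaning steps and record layout described above can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration, not the dataset author's actual pipeline: the regular expressions, the length thresholds (MIN_CHARS, MAX_CHARS), the helper names clean_line and build_records, and the "Translate to English:" instruction prefix are all hypothetical. Only the prompt and response keys come from the description.

```python
import json
import re

# NOTE: all patterns and thresholds are illustrative assumptions,
# not the rules actually used to build the dataset.
HTML_TAG = re.compile(r"<[^>]+>")
TIMECODE = re.compile(
    r"\d{2}:\d{2}:\d{2}[,.]\d{3}(\s*-->\s*\d{2}:\d{2}:\d{2}[,.]\d{3})?"
)
# Matches an upper-case speaker label such as "JOHN: " at the start of a line.
SPEAKER_TAG = re.compile(r"^\s*(?:-\s*)?[A-Z][A-Z .'-]{1,30}:\s*")
MIN_CHARS, MAX_CHARS = 3, 300  # hypothetical length bounds


def clean_line(text: str) -> str:
    """Strip HTML tags, timecodes, and a leading speaker tag; collapse spaces."""
    text = HTML_TAG.sub(" ", text)
    text = TIMECODE.sub(" ", text)
    text = SPEAKER_TAG.sub("", text)
    return " ".join(text.split())


def build_records(pairs, instruction="Translate to English:"):
    """Turn (source, target) pairs into deduplicated prompt/response records."""
    seen = set()
    records = []
    for source, target in pairs:
        src, tgt = clean_line(source), clean_line(target)
        # Drop fragments that are too short or too long after cleaning.
        if not all(MIN_CHARS <= len(s) <= MAX_CHARS for s in (src, tgt)):
            continue
        # Drop exact duplicates of an already-kept pair.
        if (src, tgt) in seen:
            continue
        seen.add((src, tgt))
        records.append({"prompt": f"{instruction} {src}", "response": tgt})
    return records


if __name__ == "__main__":
    sample = [
        ("<i>Bonjour, tout le monde.</i>", "JOHN: Hello, everyone."),
        ("<i>Bonjour, tout le monde.</i>", "JOHN: Hello, everyone."),  # duplicate
        ("Oh", "Oh"),  # too short under the assumed threshold
    ]
    for record in build_records(sample):
        # One JSON object per line, matching the prompt/response layout.
        print(json.dumps(record, ensure_ascii=False))
```

A real pipeline would likely need language-specific rules and a quality filter beyond length checks; the sketch only shows the order of operations (clean, filter by length, deduplicate, emit JSON).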
Provider:
Arabella25