five

vietgpt/open_subtitles_envi

收藏
Hugging Face2023-07-03 更新2024-03-04 收录
下载链接:
https://hf-mirror.com/datasets/vietgpt/open_subtitles_envi
下载链接
链接失效反馈
官方服务:
资源简介:
--- dataset_info: features: - name: en dtype: string - name: vi dtype: string splits: - name: train num_bytes: 280063489 num_examples: 3505276 download_size: 176803145 dataset_size: 280063489 task_categories: - translation language: - en - vi tags: - LM size_categories: - 1M<n<10M --- # OpenSubtitles - Source: https://huggingface.co/datasets/open_subtitles - Num examples: 3,505,276 - Language: English ```python from datasets import load_dataset load_dataset("tdtunlp/open_subtitles_envi") ``` - Format for Translation task ```python def preprocess( sample, instruction_key="### Instruction:", input_key="Input:", response_key="<|endofprompt|>", end_key="<|endoftext|>", en2vi=True, ): if en2vi: if random.random() < 0.5: instruction = "Translate the following sentences from English into Vietnamese." else: instruction = "Dịch các câu sau từ tiếng Anh sang tiếng Việt." input = sample['en'].strip() response = sample['vi'].strip() else: if random.random() < 0.5: instruction = "Translate the following sentences from Vietnamese into English." else: instruction = "Dịch các câu sau từ tiếng Việt sang tiếng Anh." input = sample['vi'].strip() response = sample['en'].strip() return {'text': """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request. {instruction_key} {instruction} {input_key} {input} {response_key} {response} {end_key}""".format( instruction_key=instruction_key, instruction=instruction, input_key=input_key, input=input, response_key=response_key, response=response, end_key=end_key, )} """ Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request. ### Instruction: Dịch các câu sau từ tiếng Anh sang tiếng Việt. Input: Line up, I say! <|endofprompt|> Sắp hàng, nghe chưa! <|endoftext|> """ ```
提供机构:
vietgpt
原始信息汇总

数据集概述

基本信息

  • 名称: OpenSubtitles
  • 任务类别: 翻译
  • 语言: 英语 (en), 越南语 (vi)
  • 标签: LM
  • 大小类别: 1M<n<10M

数据结构

  • 特征:
    • en: 字符串类型
    • vi: 字符串类型

数据划分

  • 训练集:
    • 示例数量: 3,505,276
    • 数据大小: 280,063,489字节

下载信息

  • 下载大小: 176,803,145字节
  • 数据集大小: 280,063,489字节
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作