vietgpt/open_subtitles_envi
收藏Hugging Face2023-07-03 更新2024-03-04 收录
下载链接:
https://hf-mirror.com/datasets/vietgpt/open_subtitles_envi
下载链接
链接失效反馈官方服务:
资源简介:
---
dataset_info:
features:
- name: en
dtype: string
- name: vi
dtype: string
splits:
- name: train
num_bytes: 280063489
num_examples: 3505276
download_size: 176803145
dataset_size: 280063489
task_categories:
- translation
language:
- en
- vi
tags:
- LM
size_categories:
- 1M<n<10M
---
# OpenSubtitles
- Source: https://huggingface.co/datasets/open_subtitles
- Num examples: 3,505,276
- Language: English
```python
from datasets import load_dataset
load_dataset("tdtunlp/open_subtitles_envi")
```
- Format for Translation task
```python
def preprocess(
sample,
instruction_key="### Instruction:",
input_key="Input:",
response_key="<|endofprompt|>",
end_key="<|endoftext|>",
en2vi=True,
):
if en2vi:
if random.random() < 0.5:
instruction = "Translate the following sentences from English into Vietnamese."
else:
instruction = "Dịch các câu sau từ tiếng Anh sang tiếng Việt."
input = sample['en'].strip()
response = sample['vi'].strip()
else:
if random.random() < 0.5:
instruction = "Translate the following sentences from Vietnamese into English."
else:
instruction = "Dịch các câu sau từ tiếng Việt sang tiếng Anh."
input = sample['vi'].strip()
response = sample['en'].strip()
return {'text': """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
{instruction_key}
{instruction}
{input_key}
{input}
{response_key}
{response}
{end_key}""".format(
instruction_key=instruction_key,
instruction=instruction,
input_key=input_key,
input=input,
response_key=response_key,
response=response,
end_key=end_key,
)}
"""
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
### Instruction:
Dịch các câu sau từ tiếng Anh sang tiếng Việt.
Input:
Line up, I say!
<|endofprompt|>
Sắp hàng, nghe chưa!
<|endoftext|>
"""
```
提供机构:
vietgpt
原始信息汇总
数据集概述
基本信息
- 名称: OpenSubtitles
- 任务类别: 翻译
- 语言: 英语 (en), 越南语 (vi)
- 标签: LM
- 大小类别: 1M<n<10M
数据结构
- 特征:
en: 字符串类型vi: 字符串类型
数据划分
- 训练集:
- 示例数量: 3,505,276
- 数据大小: 280,063,489字节
下载信息
- 下载大小: 176,803,145字节
- 数据集大小: 280,063,489字节



