wisenut-nlp-team/llama_nmt
Source: Hugging Face; updated 2024-05-02; indexed 2024-06-12
Download link:
https://hf-mirror.com/datasets/wisenut-nlp-team/llama_nmt
Description:
---
dataset_info:
- config_name: ko-jp_utterance_type
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 3166308
    num_examples: 11999
  dataset_size: 3166308
- config_name: ko-jp_technoloy_science
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 44260625
    num_examples: 123455
  dataset_size: 44260625
- config_name: ko-jp_humanities
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 242542
    num_examples: 744
  dataset_size: 242542
- config_name: ko-jp_diverse
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 487436681
    num_examples: 1200000
  dataset_size: 487436681
- config_name: ko-jp_daily_colloquial
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 107934988
    num_examples: 600000
  dataset_size: 107934988
- config_name: ko-jp_broadcast
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 82552662
    num_examples: 600000
  dataset_size: 82552662
- config_name: ko-jp_basic_science
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 166244
    num_examples: 623
  dataset_size: 166244
- config_name: ko-en_utterance_type
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 2892091
    num_examples: 12001
  dataset_size: 2892091
- config_name: ko-en_technoloy_science
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 510018661
    num_examples: 1200144
  dataset_size: 510018661
- config_name: ko-en_humanities
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 1299202
    num_examples: 2973
  dataset_size: 1299202
- config_name: ko-en_daily_colloquial
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 215413267
    num_examples: 1200000
  dataset_size: 215413267
- config_name: ko-en_broadcast
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 73391239
    num_examples: 400527
  dataset_size: 73391239
- config_name: ko-en_basic_science
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 97586
    num_examples: 358
  dataset_size: 97586
- config_name: ko-ch_utterance_type
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 2766397
    num_examples: 12000
  dataset_size: 2766397
- config_name: ko-ch_technoloy_science
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 410410429
    num_examples: 1156971
  dataset_size: 410410429
- config_name: ko-ch_social_science
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 376231429
    num_examples: 1040000
  dataset_size: 376231429
- config_name: ko-ch_humanities
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 10864676
    num_examples: 33785
  dataset_size: 10864676
- config_name: ko-ch_daily_colloquial
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 92917863
    num_examples: 600000
  dataset_size: 92917863
- config_name: ko-ch_broadcast
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 73145813
    num_examples: 600000
  dataset_size: 73145813
- config_name: ko-ch_basic_science
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 9966501
    num_examples: 38729
  dataset_size: 9966501
- config_name: jp-ko_utterance_type
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 2476448
    num_examples: 12000
  dataset_size: 2476448
- config_name: jp-ko_humanities
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 251167
    num_examples: 738
  dataset_size: 251167
- config_name: jp-ko_daily_colloquial
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 115367572
    num_examples: 600000
  dataset_size: 115367572
- config_name: jp-ko_broadcast
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 87129509
    num_examples: 360000
  dataset_size: 87129509
- config_name: jp-ko_basic_science
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 238084
    num_examples: 612
  dataset_size: 238084
- config_name: en-ko_utterance_type
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 2830268
    num_examples: 12000
  dataset_size: 2830268
- config_name: en-ko_industry_info
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 1004448000
    num_examples: 639994
  dataset_size: 1004448000
- config_name: en-ko_humanities
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 1148107
    num_examples: 2817
  dataset_size: 1148107
- config_name: en-ko_food
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 432130873
    num_examples: 1200000
  dataset_size: 432130873
- config_name: en-ko_daily_colloquial
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 224717393
    num_examples: 1200307
  dataset_size: 224717393
- config_name: en-ko_broadcast
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 30873579
    num_examples: 121124
  dataset_size: 30873579
- config_name: en-ko_basic_science
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 146957
    num_examples: 356
  dataset_size: 146957
- config_name: ch-ko_utterance_type
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 2069035
    num_examples: 12000
  dataset_size: 2069035
- config_name: ch-ko_humanities
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 10460482
    num_examples: 33432
  dataset_size: 10460482
- config_name: ch-ko_food
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 413112881
    num_examples: 1200000
  dataset_size: 413112881
- config_name: ch-ko_daily_colloquial
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 98190198
    num_examples: 600000
  dataset_size: 98190198
- config_name: ch-ko_broadcast
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 53873823
    num_examples: 361856
  dataset_size: 53873823
- config_name: ch-ko_basic_science
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 13388919
    num_examples: 37747
  dataset_size: 13388919
configs:
- config_name: ko-jp_utterance_type
  data_files:
  - split: train
    path: translation/ko-jp_utterance_type/*
- config_name: ko-jp_technoloy_science
  data_files:
  - split: train
    path: translation/ko-jp_technoloy_science/*
- config_name: ko-jp_humanities
  data_files:
  - split: train
    path: translation/ko-jp_humanities/*
- config_name: ko-jp_diverse
  data_files:
  - split: train
    path: translation/ko-jp_diverse/*
- config_name: ko-jp_daily_colloquial
  data_files:
  - split: train
    path: translation/ko-jp_daily_colloquial/*
- config_name: ko-jp_broadcast
  data_files:
  - split: train
    path: translation/ko-jp_broadcast/*
- config_name: ko-jp_basic_science
  data_files:
  - split: train
    path: translation/ko-jp_basic_science/*
- config_name: ko-en_utterance_type
  data_files:
  - split: train
    path: translation/ko-en_utterance_type/*
- config_name: ko-en_technoloy_science
  data_files:
  - split: train
    path: translation/ko-en_technoloy_science/*
- config_name: ko-en_humanities
  data_files:
  - split: train
    path: translation/ko-en_humanities/*
- config_name: ko-en_daily_colloquial
  data_files:
  - split: train
    path: translation/ko-en_daily_colloquial/*
- config_name: ko-en_broadcast
  data_files:
  - split: train
    path: translation/ko-en_broadcast/*
- config_name: ko-en_basic_science
  data_files:
  - split: train
    path: translation/ko-en_basic_science/*
- config_name: ko-ch_utterance_type
  data_files:
  - split: train
    path: translation/ko-ch_utterance_type/*
- config_name: ko-ch_technoloy_science
  data_files:
  - split: train
    path: translation/ko-ch_technoloy_science/*
- config_name: ko-ch_social_science
  data_files:
  - split: train
    path: translation/ko-ch_social_science/*
- config_name: ko-ch_humanities
  data_files:
  - split: train
    path: translation/ko-ch_humanities/*
- config_name: ko-ch_daily_colloquial
  data_files:
  - split: train
    path: translation/ko-ch_daily_colloquial/*
- config_name: ko-ch_broadcast
  data_files:
  - split: train
    path: translation/ko-ch_broadcast/*
- config_name: ko-ch_basic_science
  data_files:
  - split: train
    path: translation/ko-ch_basic_science/*
- config_name: jp-ko_utterance_type
  data_files:
  - split: train
    path: translation/jp-ko_utterance_type/*
- config_name: jp-ko_humanities
  data_files:
  - split: train
    path: translation/jp-ko_humanities/*
- config_name: jp-ko_daily_colloquial
  data_files:
  - split: train
    path: translation/jp-ko_daily_colloquial/*
- config_name: jp-ko_broadcast
  data_files:
  - split: train
    path: translation/jp-ko_broadcast/*
- config_name: jp-ko_basic_science
  data_files:
  - split: train
    path: translation/jp-ko_basic_science/*
- config_name: en-ko_utterance_type
  data_files:
  - split: train
    path: translation/en-ko_utterance_type/*
- config_name: en-ko_industry_info
  data_files:
  - split: train
    path: translation/en-ko_industry_info/*
- config_name: en-ko_humanities
  data_files:
  - split: train
    path: translation/en-ko_humanities/*
- config_name: en-ko_food
  data_files:
  - split: train
    path: translation/en-ko_food/*
- config_name: en-ko_daily_colloquial
  data_files:
  - split: train
    path: translation/en-ko_daily_colloquial/*
- config_name: en-ko_broadcast
  data_files:
  - split: train
    path: translation/en-ko_broadcast/*
- config_name: en-ko_basic_science
  data_files:
  - split: train
    path: translation/en-ko_basic_science/*
- config_name: ch-ko_utterance_type
  data_files:
  - split: train
    path: translation/ch-ko_utterance_type/*
- config_name: ch-ko_humanities
  data_files:
  - split: train
    path: translation/ch-ko_humanities/*
- config_name: ch-ko_food
  data_files:
  - split: train
    path: translation/ch-ko_food/*
- config_name: ch-ko_daily_colloquial
  data_files:
  - split: train
    path: translation/ch-ko_daily_colloquial/*
- config_name: ch-ko_broadcast
  data_files:
  - split: train
    path: translation/ch-ko_broadcast/*
- config_name: ch-ko_basic_science
  data_files:
  - split: train
    path: translation/ch-ko_basic_science/*
---
## Chinese-Korean translation
- subset: ch-ko_basic_science
  - length: 37.7k
- subset: ch-ko_broadcast
  - length: 362k
- subset: ch-ko_daily_colloquial
  - length: 600k
- subset: ch-ko_food
  - length: 1.2M
- subset: ch-ko_humanities
  - length: 33.4k
- subset: ch-ko_utterance_type
  - length: 12k
## English-Korean translation
- subset: en-ko_basic_science
  - length: 356
- subset: en-ko_broadcast
  - length: 121k
- subset: en-ko_daily_colloquial
  - length: 1.2M
- subset: en-ko_food
  - length: 1.2M
- subset: en-ko_humanities
  - length: 2.82k
- subset: en-ko_industry_info
  - length: 640k
- subset: en-ko_utterance_type
  - length: 12k
## Japanese-Korean translation
- subset: jp-ko_basic_science
  - length: 612
- subset: jp-ko_broadcast
  - length: 360k
- subset: jp-ko_daily_colloquial
  - length: 600k
- subset: jp-ko_humanities
  - length: 738
- subset: jp-ko_utterance_type
  - length: 12k
## Korean-Chinese translation
- subset: ko-ch_basic_science
  - length: 38.7k
- subset: ko-ch_broadcast
  - length: 600k
- subset: ko-ch_daily_colloquial
  - length: 600k
- subset: ko-ch_humanities
  - length: 33.8k
- subset: ko-ch_social_science
  - length: 1.04M
- subset: ko-ch_technoloy_science (spelled as in the config name)
  - length: 1.16M
- subset: ko-ch_utterance_type
  - length: 12k
## Korean-English translation
- subset: ko-en_basic_science
  - length: 358
- subset: ko-en_broadcast
  - length: 401k
- subset: ko-en_daily_colloquial
  - length: 1.2M
- subset: ko-en_humanities
  - length: 2.97k
- subset: ko-en_technoloy_science (spelled as in the config name)
  - length: 1.2M
- subset: ko-en_utterance_type
  - length: 12k
## Korean-Japanese translation
- subset: ko-jp_basic_science
  - length: 623
- subset: ko-jp_broadcast
  - length: 600k
- subset: ko-jp_daily_colloquial
  - length: 600k
- subset: ko-jp_diverse
  - length: 1.2M
- subset: ko-jp_humanities
  - length: 744
- subset: ko-jp_technoloy_science (spelled as in the config name)
  - length: 123k
- subset: ko-jp_utterance_type
  - length: 12k
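
Each subset above is exposed as a configuration of the `wisenut-nlp-team/llama_nmt` repository, so it can presumably be loaded by name with the Hugging Face `datasets` library. The following is a minimal sketch under that assumption; `validate_config` and `load_subset` are hypothetical helper names, and the card itself does not prescribe a loading recipe.

```python
# The six language-pair prefixes that appear in the config names of this card.
LANGUAGE_PAIRS = {"ch-ko", "en-ko", "jp-ko", "ko-ch", "ko-en", "ko-jp"}


def validate_config(config: str) -> str:
    """Return the config name unchanged if its language-pair prefix is known."""
    pair = config.split("_", 1)[0]
    if pair not in LANGUAGE_PAIRS:
        raise ValueError(f"unknown language pair in config name: {config!r}")
    return config


def load_subset(config: str, streaming: bool = True):
    """Load the train split of one subset (needs `datasets` and network access).

    Note: the metadata spells the technology configs "technoloy_science",
    e.g. "ko-en_technoloy_science", so that spelling must be passed here.
    """
    from datasets import load_dataset  # lazy import: validation works without it

    return load_dataset(
        "wisenut-nlp-team/llama_nmt",
        name=validate_config(config),
        split="train",
        streaming=streaming,
    )
```

With network access, `next(iter(load_subset("ko-en_basic_science")))` should yield one row; streaming is the default here because several subsets hold more than a million examples.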
Provider:
wisenut-nlp-team
Summary of original information
Dataset overview
Dataset configuration
- Config names: cover multiple language pairs, e.g. ko-jp_utterance_type, ko-en_technoloy_science.
- Features:
  - instruction: string
  - input: string
  - output: string
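
The three string fields match the common instruction-tuning layout, so a row can be flattened into a single training string. The sketch below assumes an Alpaca-style template; the template, the helper names, and the sample row are illustrative assumptions, since the card does not state which prompt format was intended.

```python
def build_prompt(example: dict) -> str:
    """Join instruction and input into a prompt (assumed Alpaca-style template)."""
    instruction = example["instruction"].strip()
    source_text = example["input"].strip()
    if source_text:
        return (
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{source_text}\n\n"
            "### Response:\n"
        )
    # Rows without an input fall back to an instruction-only prompt.
    return f"### Instruction:\n{instruction}\n\n### Response:\n"


def to_training_text(example: dict) -> str:
    """Append the reference output to the prompt."""
    return build_prompt(example) + example["output"].strip()


# Made-up row for illustration only; it is not taken from the dataset.
sample = {
    "instruction": "Translate the following Korean sentence into English.",
    "input": "안녕하세요.",
    "output": "Hello.",
}
print(to_training_text(sample))
```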
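Dataset splits
- Split type: each config has a single train split.
- Size: the number of examples and bytes in the train split differs per config, for example:
  - ko-jp_utterance_type: 11999 examples, 3166308 bytes.
  - ko-en_technoloy_science: 1200144 examples, 510018661 bytes.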
Dataset size
- Total size: each config's dataset_size equals the byte count of its train split, for example:
  - ko-jp_utterance_type: 3166308 bytes.
  - ko-en_technoloy_science: 510018661 bytes.
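
Dividing a config's byte count by its example count gives a rough average record size, which can help when estimating memory or token budgets. A small sketch using the ko-jp_utterance_type figures from the metadata; the function name is hypothetical.

```python
def avg_bytes_per_example(num_bytes: int, num_examples: int) -> float:
    """Average size of one record, in bytes."""
    if num_examples <= 0:
        raise ValueError("num_examples must be positive")
    return num_bytes / num_examples


# Figures for ko-jp_utterance_type, taken from the dataset metadata.
avg = avg_bytes_per_example(num_bytes=3_166_308, num_examples=11_999)
print(f"ko-jp_utterance_type: {avg:.1f} bytes per example")
```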
Data file paths
- Path: each config's data files follow the pattern translation/{config_name}/*, for example:
  - ko-jp_utterance_type: translation/ko-jp_utterance_type/*
  - ko-en_technoloy_science: translation/ko-en_technoloy_science/*
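
The path pattern and the `{source}-{target}_{domain}` naming of the configs can be handled with a few lines of string processing. This is a sketch only: the helper names are hypothetical, and the file name `train-00000.parquet` is an invented placeholder, since the card does not list actual file names.

```python
from fnmatch import fnmatch


def data_glob(config: str) -> str:
    """Glob pattern for a config's data files, as given in the metadata."""
    return f"translation/{config}/*"


def parse_config(config: str) -> tuple[str, str, str]:
    """Split a config name into (source language, target language, domain)."""
    pair, domain = config.split("_", 1)  # domain itself may contain underscores
    source, target = pair.split("-")
    return source, target, domain


pattern = data_glob("ko-jp_utterance_type")
# Hypothetical file name, used only to show how the glob matches.
candidate = "translation/ko-jp_utterance_type/train-00000.parquet"
print(pattern, fnmatch(candidate, pattern))
print(parse_config("ko-jp_utterance_type"))
```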
Dataset details
Korean-Japanese translation
- ko-jp_utterance_type: 11999 examples, 3166308 bytes.
- ko-jp_technoloy_science: 123455 examples, 44260625 bytes.
- ko-jp_humanities: 744 examples, 242542 bytes.
- ko-jp_diverse: 1200000 examples, 487436681 bytes.
- ko-jp_daily_colloquial: 600000 examples, 107934988 bytes.
- ko-jp_broadcast: 600000 examples, 82552662 bytes.
- ko-jp_basic_science: 623 examples, 166244 bytes.
Korean-English translation
- ko-en_utterance_type: 12001 examples, 2892091 bytes.
- ko-en_technoloy_science: 1200144 examples, 510018661 bytes.
- ko-en_humanities: 2973 examples, 1299202 bytes.
- ko-en_daily_colloquial: 1200000 examples, 215413267 bytes.
- ko-en_broadcast: 400527 examples, 73391239 bytes.
- ko-en_basic_science: 358 examples, 97586 bytes.
Korean-Chinese translation
- ko-ch_utterance_type: 12000 examples, 2766397 bytes.
- ko-ch_technoloy_science: 1156971 examples, 410410429 bytes.
- ko-ch_social_science: 1040000 examples, 376231429 bytes.
- ko-ch_humanities: 33785 examples, 10864676 bytes.
- ko-ch_daily_colloquial: 600000 examples, 92917863 bytes.
- ko-ch_broadcast: 600000 examples, 73145813 bytes.
- ko-ch_basic_science: 38729 examples, 9966501 bytes.
Japanese-Korean translation
- jp-ko_utterance_type: 12000 examples, 2476448 bytes.
- jp-ko_humanities: 738 examples, 251167 bytes.
- jp-ko_daily_colloquial: 600000 examples, 115367572 bytes.
- jp-ko_broadcast: 360000 examples, 87129509 bytes.
- jp-ko_basic_science: 612 examples, 238084 bytes.
English-Korean translation
- en-ko_utterance_type: 12000 examples, 2830268 bytes.
- en-ko_industry_info: 639994 examples, 1004448000 bytes.
- en-ko_humanities: 2817 examples, 1148107 bytes.
- en-ko_food: 1200000 examples, 432130873 bytes.
- en-ko_daily_colloquial: 1200307 examples, 224717393 bytes.
- en-ko_broadcast: 121124 examples, 30873579 bytes.
- en-ko_basic_science: 356 examples, 146957 bytes.
Chinese-Korean translation
- ch-ko_utterance_type: 12000 examples, 2069035 bytes.
- ch-ko_humanities: 33432 examples, 10460482 bytes.
- ch-ko_food: 1200000 examples, 413112881 bytes.
- ch-ko_daily_colloquial: 600000 examples, 98190198 bytes.
- ch-ko_broadcast: 361856 examples, 53873823 bytes.
- ch-ko_basic_science: 37747 examples, 13388919 bytes.
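
The per-config counts can be rolled up by language pair to see how unevenly the directions are covered. The sketch below does this for the Korean-Japanese and Japanese-Korean configs only, with counts copied from the list above; extending the dictionary to the other pairs works the same way. The function name is hypothetical.

```python
from collections import defaultdict

# Example counts copied from the metadata (ko-jp and jp-ko configs only).
EXAMPLES = {
    "ko-jp_utterance_type": 11_999,
    "ko-jp_technoloy_science": 123_455,
    "ko-jp_humanities": 744,
    "ko-jp_diverse": 1_200_000,
    "ko-jp_daily_colloquial": 600_000,
    "ko-jp_broadcast": 600_000,
    "ko-jp_basic_science": 623,
    "jp-ko_utterance_type": 12_000,
    "jp-ko_humanities": 738,
    "jp-ko_daily_colloquial": 600_000,
    "jp-ko_broadcast": 360_000,
    "jp-ko_basic_science": 612,
}


def totals_by_pair(counts: dict) -> dict:
    """Sum example counts per language pair (the part before the first underscore)."""
    totals = defaultdict(int)
    for config, n in counts.items():
        pair = config.split("_", 1)[0]
        totals[pair] += n
    return dict(totals)


totals = totals_by_pair(EXAMPLES)
for pair, n in sorted(totals.items()):
    print(f"{pair}: {n:,} examples")
```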



