wisenut-nlp-team/llama_nmt

Name: wisenut-nlp-team/llama_nmt
Creator: wisenut-nlp-team
Published: 2024-05-02 00:03:40
License: not specified

Source: Hugging Face. Updated 2024-05-02. Indexed 2024-06-12.

Download link:

https://hf-mirror.com/datasets/wisenut-nlp-team/llama_nmt

Description:

Dataset card metadata (YAML front matter, condensed). Every config declares three features (instruction, input, output), all of dtype string, and a single train split; dataset_size equals num_bytes for every config.

dataset_info:
- config_name: ko-jp_utterance_type, num_bytes: 3166308, num_examples: 11999
- config_name: ko-jp_technoloy_science, num_bytes: 44260625, num_examples: 123455
- config_name: ko-jp_humanities, num_bytes: 242542, num_examples: 744
- config_name: ko-jp_diverse, num_bytes: 487436681, num_examples: 1200000
- config_name: ko-jp_daily_colloquial, num_bytes: 107934988, num_examples: 600000
- config_name: ko-jp_broadcast, num_bytes: 82552662, num_examples: 600000
- config_name: ko-jp_basic_science, num_bytes: 166244, num_examples: 623
- config_name: ko-en_utterance_type, num_bytes: 2892091, num_examples: 12001
- config_name: ko-en_technoloy_science, num_bytes: 510018661, num_examples: 1200144
- config_name: ko-en_humanities, num_bytes: 1299202, num_examples: 2973
- config_name: ko-en_daily_colloquial, num_bytes: 215413267, num_examples: 1200000
- config_name: ko-en_broadcast, num_bytes: 73391239, num_examples: 400527
- config_name: ko-en_basic_science, num_bytes: 97586, num_examples: 358
- config_name: ko-ch_utterance_type, num_bytes: 2766397, num_examples: 12000
- config_name: ko-ch_technoloy_science, num_bytes: 410410429, num_examples: 1156971
- config_name: ko-ch_social_science, num_bytes: 376231429, num_examples: 1040000
- config_name: ko-ch_humanities, num_bytes: 10864676, num_examples: 33785
- config_name: ko-ch_daily_colloquial, num_bytes: 92917863, num_examples: 600000
- config_name: ko-ch_broadcast, num_bytes: 73145813, num_examples: 600000
- config_name: ko-ch_basic_science, num_bytes: 9966501, num_examples: 38729
- config_name: jp-ko_utterance_type, num_bytes: 2476448, num_examples: 12000
- config_name: jp-ko_humanities, num_bytes: 251167, num_examples: 738
- config_name: jp-ko_daily_colloquial, num_bytes: 115367572, num_examples: 600000
- config_name: jp-ko_broadcast, num_bytes: 87129509, num_examples: 360000
- config_name: jp-ko_basic_science, num_bytes: 238084, num_examples: 612
- config_name: en-ko_utterance_type, num_bytes: 2830268, num_examples: 12000
- config_name: en-ko_industry_info, num_bytes: 1004448000, num_examples: 639994
- config_name: en-ko_humanities, num_bytes: 1148107, num_examples: 2817
- config_name: en-ko_food, num_bytes: 432130873, num_examples: 1200000
- config_name: en-ko_daily_colloquial, num_bytes: 224717393, num_examples: 1200307
- config_name: en-ko_broadcast, num_bytes: 30873579, num_examples: 121124
- config_name: en-ko_basic_science, num_bytes: 146957, num_examples: 356
- config_name: ch-ko_utterance_type, num_bytes: 2069035, num_examples: 12000
- config_name: ch-ko_humanities, num_bytes: 10460482, num_examples: 33432
- config_name: ch-ko_food, num_bytes: 413112881, num_examples: 1200000
- config_name: ch-ko_daily_colloquial, num_bytes: 98190198, num_examples: 600000
- config_name: ch-ko_broadcast, num_bytes: 53873823, num_examples: 361856
- config_name: ch-ko_basic_science, num_bytes: 13388919, num_examples: 37747

configs: each of the 38 configs above maps its train split to the path translation/<config_name>/* (for example, translation/ko-jp_utterance_type/*).

## Chinese-to-Korean translation

- subset: ch-ko_basic_science, length: 37.7k
- subset: ch-ko_broadcast, length: 362k
- subset: ch-ko_daily_colloquial, length: 600k
- subset: ch-ko_food, length: 1.2M
- subset: ch-ko_humanities, length: 33.4k
- subset: ch-ko_utterance_type, length: 12k

## English-to-Korean translation

- subset: en-ko_basic_science, length: 356
- subset: en-ko_broadcast, length: 121k
- subset: en-ko_daily_colloquial, length: 1.2M
- subset: en-ko_food, length: 1.2M
- subset: en-ko_humanities, length: 2.82k
- subset: en-ko_industry_info, length: 640k
- subset: en-ko_utterance_type, length: 12k

## Japanese-to-Korean translation

- subset: jp-ko_basic_science, length: 612
- subset: jp-ko_broadcast, length: 360k
- subset: jp-ko_daily_colloquial, length: 600k
- subset: jp-ko_humanities, length: 738
- subset: jp-ko_utterance_type, length: 12k

## Korean-to-Chinese translation

- subset: ko-ch_basic_science, length: 38.7k
- subset: ko-ch_broadcast, length: 600k
- subset: ko-ch_daily_colloquial, length: 600k
- subset: ko-ch_humanities, length: 33.8k
- subset: ko-ch_social_science, length: 1.04M
- subset: ko-ch_technology_science, length: 1.16M
- subset: ko-ch_utterance_type, length: 12k

## Korean-to-English translation

- subset: ko-en_basic_science, length: 358
- subset: ko-en_broadcast, length: 401k
- subset: ko-en_daily_colloquial, length: 1.2M
- subset: ko-en_humanities, length: 2.97k
- subset: ko-en_technology_science, length: 1.2M
- subset: ko-en_utterance_type, length: 12k

## Korean-to-Japanese translation

- subset: ko-jp_basic_science, length: 623
- subset: ko-jp_broadcast, length: 600k
- subset: ko-jp_daily_colloquial, length: 600k
- subset: ko-jp_diverse, length: 1.2M
- subset: ko-jp_humanities, length: 744
- subset: ko-jp_technology_science, length: 123k
- subset: ko-jp_utterance_type, length: 12k

The subset lists above spell three names as *_technology_science, whereas the config names declared in the metadata are spelled *_technoloy_science (ko-jp, ko-en, ko-ch).

Provider:

wisenut-nlp-team

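Summary of original information

Dataset overview

Dataset configuration

Config names: cover several language pairs and domains, such as ko-jp_utterance_type and ko-en_technoloy_science.
Features:
- instruction: dtype string.
- input: dtype string.
- output: dtype string.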
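Since every record carries the same three string fields, a record can be turned into a prompt/target pair for instruction tuning. The following is a minimal sketch: the prompt template, the `build_prompt` helper, and the sample record are all illustrative assumptions, not taken from the dataset or its card.

```python
from typing import Dict


def build_prompt(record: Dict[str, str]) -> str:
    """Join `instruction` and `input` into one prompt string.

    The blank-line template used here is an assumption; the dataset card
    does not prescribe any prompt format.
    """
    instruction = record["instruction"].strip()
    source_text = record["input"].strip()
    if source_text:
        return f"{instruction}\n\n{source_text}"
    # Fall back to the bare instruction when there is no input text.
    return instruction


# Made-up record for illustration only; it is not a row from the dataset.
example = {
    "instruction": "Translate the following Korean sentence into English.",
    "input": "안녕하세요.",
    "output": "Hello.",
}

prompt = build_prompt(example)   # text fed to the model
target = example["output"]       # reference translation
```

The `output` field is kept separate so it can serve as the training label or the evaluation reference.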

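Dataset splits

Split type: train (the only split declared for each config).
Size: the number of examples and bytes in the train split differs by config, for example:
- ko-jp_utterance_type: 11999 examples, 3166308 bytes.
- ko-en_technoloy_science: 1200144 examples, 510018661 bytes.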

Dataset size

Total size: each config's dataset_size equals the byte count of its train split, for example:
- ko-jp_utterance_type: 3166308 bytes.
- ko-en_technoloy_science: 510018661 bytes.
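The byte and example counts together give a rough average record size, which can help when estimating memory or tokenization cost. A small sketch follows; the `bytes_per_example` helper is a hypothetical name, and the two configs are picked only as examples, with figures taken from the dataset metadata.

```python
def bytes_per_example(num_bytes: int, num_examples: int) -> float:
    """Average size of one record in bytes."""
    if num_examples <= 0:
        raise ValueError("num_examples must be positive")
    return num_bytes / num_examples


# (num_bytes, num_examples) as listed in the dataset metadata.
stats = {
    "ko-jp_utterance_type": (3166308, 11999),
    "ko-jp_diverse": (487436681, 1200000),
}

for name, (num_bytes, num_examples) in stats.items():
    avg = bytes_per_example(num_bytes, num_examples)
    print(f"{name}: {avg:.1f} bytes per example")
```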

Data file paths

Path: each config's data files follow the pattern translation/{config_name}/*, for example:
- ko-jp_utterance_type: translation/ko-jp_utterance_type/*
- ko-en_technoloy_science: translation/ko-en_technoloy_science/*
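The naming and path conventions above can be sketched in code. The helper names below are hypothetical. `load_subset` assumes the Hugging Face `datasets` package is installed and that the repository id shown on this page is reachable; it is defined but not called, so the rest of the snippet runs offline.

```python
from typing import Tuple

REPO_ID = "wisenut-nlp-team/llama_nmt"


def parse_config_name(config_name: str) -> Tuple[str, str, str]:
    """Split e.g. 'ko-jp_utterance_type' into ('ko', 'jp', 'utterance_type')."""
    pair, _, domain = config_name.partition("_")
    source, _, target = pair.partition("-")
    if not (source and target and domain):
        raise ValueError(f"unexpected config name: {config_name!r}")
    return source, target, domain


def data_files_glob(config_name: str) -> str:
    """Glob for a config's files, following translation/{config_name}/*."""
    return f"translation/{config_name}/*"


def load_subset(config_name: str, streaming: bool = False):
    """Load the train split of one config (needs `datasets` and network access)."""
    from datasets import load_dataset  # imported lazily; optional dependency

    return load_dataset(REPO_ID, config_name, split="train", streaming=streaming)


print(parse_config_name("ko-jp_utterance_type"))
print(data_files_glob("ko-en_technoloy_science"))
```

Config names must be passed with the spelling used in the metadata (for instance `technoloy_science`), since that is how they are registered.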

Dataset details

Korean-to-Japanese translation

ko-jp_utterance_type: 11999 examples, 3166308 bytes.
ko-jp_technoloy_science: 123455 examples, 44260625 bytes.
ko-jp_humanities: 744 examples, 242542 bytes.
ko-jp_diverse: 1200000 examples, 487436681 bytes.
ko-jp_daily_colloquial: 600000 examples, 107934988 bytes.
ko-jp_broadcast: 600000 examples, 82552662 bytes.
ko-jp_basic_science: 623 examples, 166244 bytes.

Korean-to-English translation

ko-en_utterance_type: 12001 examples, 2892091 bytes.
ko-en_technoloy_science: 1200144 examples, 510018661 bytes.
ko-en_humanities: 2973 examples, 1299202 bytes.
ko-en_daily_colloquial: 1200000 examples, 215413267 bytes.
ko-en_broadcast: 400527 examples, 73391239 bytes.
ko-en_basic_science: 358 examples, 97586 bytes.

Korean-to-Chinese translation

ko-ch_utterance_type: 12000 examples, 2766397 bytes.
ko-ch_technoloy_science: 1156971 examples, 410410429 bytes.
ko-ch_social_science: 1040000 examples, 376231429 bytes.
ko-ch_humanities: 33785 examples, 10864676 bytes.
ko-ch_daily_colloquial: 600000 examples, 92917863 bytes.
ko-ch_broadcast: 600000 examples, 73145813 bytes.
ko-ch_basic_science: 38729 examples, 9966501 bytes.

Japanese-to-Korean translation

jp-ko_utterance_type: 12000 examples, 2476448 bytes.
jp-ko_humanities: 738 examples, 251167 bytes.
jp-ko_daily_colloquial: 600000 examples, 115367572 bytes.
jp-ko_broadcast: 360000 examples, 87129509 bytes.
jp-ko_basic_science: 612 examples, 238084 bytes.

English-to-Korean translation

en-ko_utterance_type: 12000 examples, 2830268 bytes.
en-ko_industry_info: 639994 examples, 1004448000 bytes.
en-ko_humanities: 2817 examples, 1148107 bytes.
en-ko_food: 1200000 examples, 432130873 bytes.
en-ko_daily_colloquial: 1200307 examples, 224717393 bytes.
en-ko_broadcast: 121124 examples, 30873579 bytes.
en-ko_basic_science: 356 examples, 146957 bytes.

Chinese-to-Korean translation

ch-ko_utterance_type: 12000 examples, 2069035 bytes.
ch-ko_humanities: 33432 examples, 10460482 bytes.
ch-ko_food: 1200000 examples, 413112881 bytes.
ch-ko_daily_colloquial: 600000 examples, 98190198 bytes.
ch-ko_broadcast: 361856 examples, 53873823 bytes.
ch-ko_basic_science: 37747 examples, 13388919 bytes.
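The per-config counts listed above can be rolled up by translation direction to see how the data is distributed. This is an illustrative sketch: the counts are copied from the listing, while the `totals_by_direction` helper and the grouping by the prefix before the first underscore are assumptions made for this example.

```python
from collections import defaultdict
from typing import Dict

# Example counts per config, copied from the listing above.
EXAMPLE_COUNTS: Dict[str, int] = {
    "ko-jp_utterance_type": 11999, "ko-jp_technoloy_science": 123455,
    "ko-jp_humanities": 744, "ko-jp_diverse": 1200000,
    "ko-jp_daily_colloquial": 600000, "ko-jp_broadcast": 600000,
    "ko-jp_basic_science": 623,
    "ko-en_utterance_type": 12001, "ko-en_technoloy_science": 1200144,
    "ko-en_humanities": 2973, "ko-en_daily_colloquial": 1200000,
    "ko-en_broadcast": 400527, "ko-en_basic_science": 358,
    "ko-ch_utterance_type": 12000, "ko-ch_technoloy_science": 1156971,
    "ko-ch_social_science": 1040000, "ko-ch_humanities": 33785,
    "ko-ch_daily_colloquial": 600000, "ko-ch_broadcast": 600000,
    "ko-ch_basic_science": 38729,
    "jp-ko_utterance_type": 12000, "jp-ko_humanities": 738,
    "jp-ko_daily_colloquial": 600000, "jp-ko_broadcast": 360000,
    "jp-ko_basic_science": 612,
    "en-ko_utterance_type": 12000, "en-ko_industry_info": 639994,
    "en-ko_humanities": 2817, "en-ko_food": 1200000,
    "en-ko_daily_colloquial": 1200307, "en-ko_broadcast": 121124,
    "en-ko_basic_science": 356,
    "ch-ko_utterance_type": 12000, "ch-ko_humanities": 33432,
    "ch-ko_food": 1200000, "ch-ko_daily_colloquial": 600000,
    "ch-ko_broadcast": 361856, "ch-ko_basic_science": 37747,
}


def totals_by_direction(counts: Dict[str, int]) -> Dict[str, int]:
    """Sum example counts per direction, i.e. the prefix before the first '_'."""
    totals: Dict[str, int] = defaultdict(int)
    for config_name, num_examples in counts.items():
        direction = config_name.split("_", 1)[0]
        totals[direction] += num_examples
    return dict(totals)


totals = totals_by_direction(EXAMPLE_COUNTS)
for direction, num_examples in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{direction}: {num_examples} examples")
```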
