
wisenut-nlp-team/llama_nmt

Hugging Face | Updated 2024-05-02 | Indexed 2024-06-12
Download link:
https://hf-mirror.com/datasets/wisenut-nlp-team/llama_nmt
Description:
---
dataset_info:
- config_name: ko-jp_utterance_type
  features: [{name: instruction, dtype: string}, {name: input, dtype: string}, {name: output, dtype: string}]
  splits: [{name: train, num_bytes: 3166308, num_examples: 11999}]
  dataset_size: 3166308
- config_name: ko-jp_technoloy_science
  features: [{name: instruction, dtype: string}, {name: input, dtype: string}, {name: output, dtype: string}]
  splits: [{name: train, num_bytes: 44260625, num_examples: 123455}]
  dataset_size: 44260625
- config_name: ko-jp_humanities
  features: [{name: instruction, dtype: string}, {name: input, dtype: string}, {name: output, dtype: string}]
  splits: [{name: train, num_bytes: 242542, num_examples: 744}]
  dataset_size: 242542
- config_name: ko-jp_diverse
  features: [{name: instruction, dtype: string}, {name: input, dtype: string}, {name: output, dtype: string}]
  splits: [{name: train, num_bytes: 487436681, num_examples: 1200000}]
  dataset_size: 487436681
- config_name: ko-jp_daily_colloquial
  features: [{name: instruction, dtype: string}, {name: input, dtype: string}, {name: output, dtype: string}]
  splits: [{name: train, num_bytes: 107934988, num_examples: 600000}]
  dataset_size: 107934988
- config_name: ko-jp_broadcast
  features: [{name: instruction, dtype: string}, {name: input, dtype: string}, {name: output, dtype: string}]
  splits: [{name: train, num_bytes: 82552662, num_examples: 600000}]
  dataset_size: 82552662
- config_name: ko-jp_basic_science
  features: [{name: instruction, dtype: string}, {name: input, dtype: string}, {name: output, dtype: string}]
  splits: [{name: train, num_bytes: 166244, num_examples: 623}]
  dataset_size: 166244
- config_name: ko-en_utterance_type
  features: [{name: instruction, dtype: string}, {name: input, dtype: string}, {name: output, dtype: string}]
  splits: [{name: train, num_bytes: 2892091, num_examples: 12001}]
  dataset_size: 2892091
- config_name: ko-en_technoloy_science
  features: [{name: instruction, dtype: string}, {name: input, dtype: string}, {name: output, dtype: string}]
  splits: [{name: train, num_bytes: 510018661, num_examples: 1200144}]
  dataset_size: 510018661
- config_name: ko-en_humanities
  features: [{name: instruction, dtype: string}, {name: input, dtype: string}, {name: output, dtype: string}]
  splits: [{name: train, num_bytes: 1299202, num_examples: 2973}]
  dataset_size: 1299202
- config_name: ko-en_daily_colloquial
  features: [{name: instruction, dtype: string}, {name: input, dtype: string}, {name: output, dtype: string}]
  splits: [{name: train, num_bytes: 215413267, num_examples: 1200000}]
  dataset_size: 215413267
- config_name: ko-en_broadcast
  features: [{name: instruction, dtype: string}, {name: input, dtype: string}, {name: output, dtype: string}]
  splits: [{name: train, num_bytes: 73391239, num_examples: 400527}]
  dataset_size: 73391239
- config_name: ko-en_basic_science
  features: [{name: instruction, dtype: string}, {name: input, dtype: string}, {name: output, dtype: string}]
  splits: [{name: train, num_bytes: 97586, num_examples: 358}]
  dataset_size: 97586
- config_name: ko-ch_utterance_type
  features: [{name: instruction, dtype: string}, {name: input, dtype: string}, {name: output, dtype: string}]
  splits: [{name: train, num_bytes: 2766397, num_examples: 12000}]
  dataset_size: 2766397
- config_name: ko-ch_technoloy_science
  features: [{name: instruction, dtype: string}, {name: input, dtype: string}, {name: output, dtype: string}]
  splits: [{name: train, num_bytes: 410410429, num_examples: 1156971}]
  dataset_size: 410410429
- config_name: ko-ch_social_science
  features: [{name: instruction, dtype: string}, {name: input, dtype: string}, {name: output, dtype: string}]
  splits: [{name: train, num_bytes: 376231429, num_examples: 1040000}]
  dataset_size: 376231429
- config_name: ko-ch_humanities
  features: [{name: instruction, dtype: string}, {name: input, dtype: string}, {name: output, dtype: string}]
  splits: [{name: train, num_bytes: 10864676, num_examples: 33785}]
  dataset_size: 10864676
- config_name: ko-ch_daily_colloquial
  features: [{name: instruction, dtype: string}, {name: input, dtype: string}, {name: output, dtype: string}]
  splits: [{name: train, num_bytes: 92917863, num_examples: 600000}]
  dataset_size: 92917863
- config_name: ko-ch_broadcast
  features: [{name: instruction, dtype: string}, {name: input, dtype: string}, {name: output, dtype: string}]
  splits: [{name: train, num_bytes: 73145813, num_examples: 600000}]
  dataset_size: 73145813
- config_name: ko-ch_basic_science
  features: [{name: instruction, dtype: string}, {name: input, dtype: string}, {name: output, dtype: string}]
  splits: [{name: train, num_bytes: 9966501, num_examples: 38729}]
  dataset_size: 9966501
- config_name: jp-ko_utterance_type
  features: [{name: instruction, dtype: string}, {name: input, dtype: string}, {name: output, dtype: string}]
  splits: [{name: train, num_bytes: 2476448, num_examples: 12000}]
  dataset_size: 2476448
- config_name: jp-ko_humanities
  features: [{name: instruction, dtype: string}, {name: input, dtype: string}, {name: output, dtype: string}]
  splits: [{name: train, num_bytes: 251167, num_examples: 738}]
  dataset_size: 251167
- config_name: jp-ko_daily_colloquial
  features: [{name: instruction, dtype: string}, {name: input, dtype: string}, {name: output, dtype: string}]
  splits: [{name: train, num_bytes: 115367572, num_examples: 600000}]
  dataset_size: 115367572
- config_name: jp-ko_broadcast
  features: [{name: instruction, dtype: string}, {name: input, dtype: string}, {name: output, dtype: string}]
  splits: [{name: train, num_bytes: 87129509, num_examples: 360000}]
  dataset_size: 87129509
- config_name: jp-ko_basic_science
  features: [{name: instruction, dtype: string}, {name: input, dtype: string}, {name: output, dtype: string}]
  splits: [{name: train, num_bytes: 238084, num_examples: 612}]
  dataset_size: 238084
- config_name: en-ko_utterance_type
  features: [{name: instruction, dtype: string}, {name: input, dtype: string}, {name: output, dtype: string}]
  splits: [{name: train, num_bytes: 2830268, num_examples: 12000}]
  dataset_size: 2830268
- config_name: en-ko_industry_info
  features: [{name: instruction, dtype: string}, {name: input, dtype: string}, {name: output, dtype: string}]
  splits: [{name: train, num_bytes: 1004448000, num_examples: 639994}]
  dataset_size: 1004448000
- config_name: en-ko_humanities
  features: [{name: instruction, dtype: string}, {name: input, dtype: string}, {name: output, dtype: string}]
  splits: [{name: train, num_bytes: 1148107, num_examples: 2817}]
  dataset_size: 1148107
- config_name: en-ko_food
  features: [{name: instruction, dtype: string}, {name: input, dtype: string}, {name: output, dtype: string}]
  splits: [{name: train, num_bytes: 432130873, num_examples: 1200000}]
  dataset_size: 432130873
- config_name: en-ko_daily_colloquial
  features: [{name: instruction, dtype: string}, {name: input, dtype: string}, {name: output, dtype: string}]
  splits: [{name: train, num_bytes: 224717393, num_examples: 1200307}]
  dataset_size: 224717393
- config_name: en-ko_broadcast
  features: [{name: instruction, dtype: string}, {name: input, dtype: string}, {name: output, dtype: string}]
  splits: [{name: train, num_bytes: 30873579, num_examples: 121124}]
  dataset_size: 30873579
- config_name: en-ko_basic_science
  features: [{name: instruction, dtype: string}, {name: input, dtype: string}, {name: output, dtype: string}]
  splits: [{name: train, num_bytes: 146957, num_examples: 356}]
  dataset_size: 146957
- config_name: ch-ko_utterance_type
  features: [{name: instruction, dtype: string}, {name: input, dtype: string}, {name: output, dtype: string}]
  splits: [{name: train, num_bytes: 2069035, num_examples: 12000}]
  dataset_size: 2069035
- config_name: ch-ko_humanities
  features: [{name: instruction, dtype: string}, {name: input, dtype: string}, {name: output, dtype: string}]
  splits: [{name: train, num_bytes: 10460482, num_examples: 33432}]
  dataset_size: 10460482
- config_name: ch-ko_food
  features: [{name: instruction, dtype: string}, {name: input, dtype: string}, {name: output, dtype: string}]
  splits: [{name: train, num_bytes: 413112881, num_examples: 1200000}]
  dataset_size: 413112881
- config_name: ch-ko_daily_colloquial
  features: [{name: instruction, dtype: string}, {name: input, dtype: string}, {name: output, dtype: string}]
  splits: [{name: train, num_bytes: 98190198, num_examples: 600000}]
  dataset_size: 98190198
- config_name: ch-ko_broadcast
  features: [{name: instruction, dtype: string}, {name: input, dtype: string}, {name: output, dtype: string}]
  splits: [{name: train, num_bytes: 53873823, num_examples: 361856}]
  dataset_size: 53873823
- config_name: ch-ko_basic_science
  features: [{name: instruction, dtype: string}, {name: input, dtype: string}, {name: output, dtype: string}]
  splits: [{name: train, num_bytes: 13388919, num_examples: 37747}]
  dataset_size: 13388919
configs:
- config_name: ko-jp_utterance_type
  data_files: [{split: train, path: translation/ko-jp_utterance_type/*}]
- config_name: ko-jp_technoloy_science
  data_files: [{split: train, path: translation/ko-jp_technoloy_science/*}]
- config_name: ko-jp_humanities
  data_files: [{split: train, path: translation/ko-jp_humanities/*}]
- config_name: ko-jp_diverse
  data_files: [{split: train, path: translation/ko-jp_diverse/*}]
- config_name: ko-jp_daily_colloquial
  data_files: [{split: train, path: translation/ko-jp_daily_colloquial/*}]
- config_name: ko-jp_broadcast
  data_files: [{split: train, path: translation/ko-jp_broadcast/*}]
- config_name: ko-jp_basic_science
  data_files: [{split: train, path: translation/ko-jp_basic_science/*}]
- config_name: ko-en_utterance_type
  data_files: [{split: train, path: translation/ko-en_utterance_type/*}]
- config_name: ko-en_technoloy_science
  data_files: [{split: train, path: translation/ko-en_technoloy_science/*}]
- config_name: ko-en_humanities
  data_files: [{split: train, path: translation/ko-en_humanities/*}]
- config_name: ko-en_daily_colloquial
  data_files: [{split: train, path: translation/ko-en_daily_colloquial/*}]
- config_name: ko-en_broadcast
  data_files: [{split: train, path: translation/ko-en_broadcast/*}]
- config_name: ko-en_basic_science
  data_files: [{split: train, path: translation/ko-en_basic_science/*}]
- config_name: ko-ch_utterance_type
  data_files: [{split: train, path: translation/ko-ch_utterance_type/*}]
- config_name: ko-ch_technoloy_science
  data_files: [{split: train, path: translation/ko-ch_technoloy_science/*}]
- config_name: ko-ch_social_science
  data_files: [{split: train, path: translation/ko-ch_social_science/*}]
- config_name: ko-ch_humanities
  data_files: [{split: train, path: translation/ko-ch_humanities/*}]
- config_name: ko-ch_daily_colloquial
  data_files: [{split: train, path: translation/ko-ch_daily_colloquial/*}]
- config_name: ko-ch_broadcast
  data_files: [{split: train, path: translation/ko-ch_broadcast/*}]
- config_name: ko-ch_basic_science
  data_files: [{split: train, path: translation/ko-ch_basic_science/*}]
- config_name: jp-ko_utterance_type
  data_files: [{split: train, path: translation/jp-ko_utterance_type/*}]
- config_name: jp-ko_humanities
  data_files: [{split: train, path: translation/jp-ko_humanities/*}]
- config_name: jp-ko_daily_colloquial
  data_files: [{split: train, path: translation/jp-ko_daily_colloquial/*}]
- config_name: jp-ko_broadcast
  data_files: [{split: train, path: translation/jp-ko_broadcast/*}]
- config_name: jp-ko_basic_science
  data_files: [{split: train, path: translation/jp-ko_basic_science/*}]
- config_name: en-ko_utterance_type
  data_files: [{split: train, path: translation/en-ko_utterance_type/*}]
- config_name: en-ko_industry_info
  data_files: [{split: train, path: translation/en-ko_industry_info/*}]
- config_name: en-ko_humanities
  data_files: [{split: train, path: translation/en-ko_humanities/*}]
- config_name: en-ko_food
  data_files: [{split: train, path: translation/en-ko_food/*}]
- config_name: en-ko_daily_colloquial
  data_files: [{split: train, path: translation/en-ko_daily_colloquial/*}]
- config_name: en-ko_broadcast
  data_files: [{split: train, path: translation/en-ko_broadcast/*}]
- config_name: en-ko_basic_science
  data_files: [{split: train, path: translation/en-ko_basic_science/*}]
- config_name: ch-ko_utterance_type
  data_files: [{split: train, path: translation/ch-ko_utterance_type/*}]
- config_name: ch-ko_humanities
  data_files: [{split: train, path: translation/ch-ko_humanities/*}]
- config_name: ch-ko_food
  data_files: [{split: train, path: translation/ch-ko_food/*}]
- config_name: ch-ko_daily_colloquial
  data_files: [{split: train, path: translation/ch-ko_daily_colloquial/*}]
- config_name: ch-ko_broadcast
  data_files: [{split: train, path: translation/ch-ko_broadcast/*}]
- config_name: ch-ko_basic_science
  data_files: [{split: train, path: translation/ch-ko_basic_science/*}]
---

## Chinese-to-Korean translation
- subset: ch-ko_basic_science
  - length: 37.7k
- subset: ch-ko_broadcast
  - length: 362k
- subset: ch-ko_daily_colloquial
  - length: 600k
- subset: ch-ko_food
  - length: 1.2M
- subset: ch-ko_humanities
  - length: 33.4k
- subset: ch-ko_utterance_type
  - length: 12k

## English-to-Korean translation
- subset: en-ko_basic_science
  - length: 356
- subset: en-ko_broadcast
  - length: 121k
- subset: en-ko_daily_colloquial
  - length: 1.2M
- subset: en-ko_food
  - length: 1.2M
- subset: en-ko_humanities
  - length: 2.82k
- subset: en-ko_industry_info
  - length: 640k
- subset: en-ko_utterance_type
  - length: 12k

## Japanese-to-Korean translation
- subset: jp-ko_basic_science
  - length: 612
- subset: jp-ko_broadcast
  - length: 360k
- subset: jp-ko_daily_colloquial
  - length: 600k
- subset: jp-ko_humanities
  - length: 738
- subset: jp-ko_utterance_type
  - length: 12k

## Korean-to-Chinese translation
- subset: ko-ch_basic_science
  - length: 38.7k
- subset: ko-ch_broadcast
  - length: 600k
- subset: ko-ch_daily_colloquial
  - length: 600k
- subset: ko-ch_humanities
  - length: 33.8k
- subset: ko-ch_social_science
  - length: 1.04M
- subset: ko-ch_technology_science
  - length: 1.16M
- subset: ko-ch_utterance_type
  - length: 12k

## Korean-to-English translation
- subset: ko-en_basic_science
  - length: 358
- subset: ko-en_broadcast
  - length: 401k
- subset: ko-en_daily_colloquial
  - length: 1.2M
- subset: ko-en_humanities
  - length: 2.97k
- subset: ko-en_technology_science
  - length: 1.2M
- subset: ko-en_utterance_type
  - length: 12k

## Korean-to-Japanese translation
- subset: ko-jp_basic_science
  - length: 623
- subset: ko-jp_broadcast
  - length: 600k
- subset: ko-jp_daily_colloquial
  - length: 600k
- subset: ko-jp_diverse
  - length: 1.2M
- subset: ko-jp_humanities
  - length: 744
- subset: ko-jp_technology_science
  - length: 123k
- subset: ko-jp_utterance_type
  - length: 12k
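
Each subset above corresponds to one named configuration, so a single subset can probably be loaded by name. The following is a minimal sketch, assuming the Hugging Face `datasets` library is installed and that the repository id is the one shown in the page title; `subset_name` and `load_subset` are hypothetical helper names, not part of the dataset.

```python
REPO_ID = "wisenut-nlp-team/llama_nmt"  # repository id as shown in the page title


def subset_name(source: str, target: str, domain: str) -> str:
    """Build a subset name such as 'ch-ko_food' from its parts (hypothetical helper)."""
    return f"{source}-{target}_{domain}"


def load_subset(source: str, target: str, domain: str, streaming: bool = True):
    """Load the train split of one subset (hypothetical helper).

    Streaming is the default here because several subsets hold over a
    million examples; pass streaming=False to download the split instead.
    """
    # Imported lazily so subset_name() works without the library installed.
    from datasets import load_dataset

    return load_dataset(
        REPO_ID,
        subset_name(source, target, domain),
        split="train",
        streaming=streaming,
    )
```

For example, `load_subset("ch", "ko", "food")` would request the `ch-ko_food` subset. The name passed must match a `config_name` from the metadata exactly.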
Provider:
wisenut-nlp-team
Summary of original information

Dataset overview

Dataset configuration

  • Config names: one per language pair and domain, e.g. ko-jp_utterance_type and ko-en_technoloy_science.
  • Features:
    • instruction: data type string.
    • input: data type string.
    • output: data type string.
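
Because every record carries the same three string fields, it can be flattened into a single training string. The sketch below is illustrative only: the prompt template is an assumption (the dataset page does not prescribe one), `format_example` is a hypothetical helper, and the sample record is invented rather than taken from the dataset.

```python
def format_example(record: dict) -> str:
    """Join instruction, input and output into one training string.

    The section markers are a hypothetical template chosen for
    illustration; swap in whatever format your model expects.
    """
    return (
        f"### Instruction:\n{record['instruction']}\n\n"
        f"### Input:\n{record['input']}\n\n"
        f"### Response:\n{record['output']}"
    )


# Invented sample with the three documented fields (not real dataset content).
sample = {
    "instruction": "Translate the Korean sentence into English.",
    "input": "안녕하세요.",
    "output": "Hello.",
}

print(format_example(sample))
```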

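Dataset splits

  • Split type: every config has a single train split.
  • Data volume: the number of examples and bytes in the train split differs per config, for example:
    • ko-jp_utterance_type: 11999 examples, 3166308 bytes.
    • ko-en_technoloy_science: 1200144 examples, 510018661 bytes.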

Dataset size

  • Total size: each config's dataset_size equals the byte count of its train split, for example:
    • ko-jp_utterance_type: 3166308 bytes.
    • ko-en_technoloy_science: 510018661 bytes.
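
The relationship between split bytes, example counts, and dataset size can be checked with simple arithmetic. This sketch uses the ko-jp_utterance_type figures from the metadata; the dictionary layout and the helper name `avg_bytes_per_example` are assumptions made for illustration.

```python
# Figures for ko-jp_utterance_type, copied from the dataset metadata.
train_split = {"num_bytes": 3166308, "num_examples": 11999}
dataset_size = 3166308


def avg_bytes_per_example(split: dict) -> float:
    """Average size of one example in bytes (hypothetical helper)."""
    return split["num_bytes"] / split["num_examples"]


# With a single train split, dataset_size is just that split's byte count.
print(dataset_size == train_split["num_bytes"])
print(round(avg_bytes_per_example(train_split), 1))
```

The average gives a rough sense of how long the examples in a subset are, which can help when estimating download or tokenization cost.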

Data file paths

  • Path: each config's data files match translation/{config_name}/*, for example:
    • ko-jp_utterance_type: translation/ko-jp_utterance_type/*
    • ko-en_technoloy_science: translation/ko-en_technoloy_science/*
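
The naming scheme can be applied mechanically: a config name splits into a language pair and a domain, and the file pattern follows from the name. A minimal sketch, with `parse_config_name` and `data_glob` as hypothetical helper names; config names are used exactly as spelled in the metadata (for instance `technoloy_science`).

```python
def parse_config_name(config_name: str) -> tuple:
    """Split 'ko-en_technoloy_science' into ('ko', 'en', 'technoloy_science')."""
    # Only the first underscore separates the language pair from the domain;
    # domains such as 'daily_colloquial' contain underscores themselves.
    pair, domain = config_name.split("_", 1)
    source, target = pair.split("-")
    return source, target, domain


def data_glob(config_name: str) -> str:
    """Return the data file pattern for a config, per translation/{config_name}/*."""
    return f"translation/{config_name}/*"


print(parse_config_name("ko-jp_utterance_type"))
print(data_glob("ko-jp_utterance_type"))
```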

Dataset details

Korean-to-Japanese translation

  • ko-jp_utterance_type: 11999 examples, 3166308 bytes.
  • ko-jp_technoloy_science: 123455 examples, 44260625 bytes.
  • ko-jp_humanities: 744 examples, 242542 bytes.
  • ko-jp_diverse: 1200000 examples, 487436681 bytes.
  • ko-jp_daily_colloquial: 600000 examples, 107934988 bytes.
  • ko-jp_broadcast: 600000 examples, 82552662 bytes.
  • ko-jp_basic_science: 623 examples, 166244 bytes.

Korean-to-English translation

  • ko-en_utterance_type: 12001 examples, 2892091 bytes.
  • ko-en_technoloy_science: 1200144 examples, 510018661 bytes.
  • ko-en_humanities: 2973 examples, 1299202 bytes.
  • ko-en_daily_colloquial: 1200000 examples, 215413267 bytes.
  • ko-en_broadcast: 400527 examples, 73391239 bytes.
  • ko-en_basic_science: 358 examples, 97586 bytes.

Korean-to-Chinese translation

  • ko-ch_utterance_type: 12000 examples, 2766397 bytes.
  • ko-ch_technoloy_science: 1156971 examples, 410410429 bytes.
  • ko-ch_social_science: 1040000 examples, 376231429 bytes.
  • ko-ch_humanities: 33785 examples, 10864676 bytes.
  • ko-ch_daily_colloquial: 600000 examples, 92917863 bytes.
  • ko-ch_broadcast: 600000 examples, 73145813 bytes.
  • ko-ch_basic_science: 38729 examples, 9966501 bytes.

Japanese-to-Korean translation

  • jp-ko_utterance_type: 12000 examples, 2476448 bytes.
  • jp-ko_humanities: 738 examples, 251167 bytes.
  • jp-ko_daily_colloquial: 600000 examples, 115367572 bytes.
  • jp-ko_broadcast: 360000 examples, 87129509 bytes.
  • jp-ko_basic_science: 612 examples, 238084 bytes.

English-to-Korean translation

  • en-ko_utterance_type: 12000 examples, 2830268 bytes.
  • en-ko_industry_info: 639994 examples, 1004448000 bytes.
  • en-ko_humanities: 2817 examples, 1148107 bytes.
  • en-ko_food: 1200000 examples, 432130873 bytes.
  • en-ko_daily_colloquial: 1200307 examples, 224717393 bytes.
  • en-ko_broadcast: 121124 examples, 30873579 bytes.
  • en-ko_basic_science: 356 examples, 146957 bytes.

Chinese-to-Korean translation

  • ch-ko_utterance_type: 12000 examples, 2069035 bytes.
  • ch-ko_humanities: 33432 examples, 10460482 bytes.
  • ch-ko_food: 1200000 examples, 413112881 bytes.
  • ch-ko_daily_colloquial: 600000 examples, 98190198 bytes.
  • ch-ko_broadcast: 361856 examples, 53873823 bytes.
  • ch-ko_basic_science: 37747 examples, 13388919 bytes.
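
The per-config counts above can be rolled up by translation direction, since the direction is the part of the config name before the first underscore. The sketch below does this for the two Japanese directions only, with example counts copied from the lists above; `totals_by_direction` is a hypothetical helper, and the remaining configs could be added to the dictionary in the same way.

```python
from collections import defaultdict

# Example counts copied from the lists above (Japanese directions only).
NUM_EXAMPLES = {
    "ko-jp_utterance_type": 11999,
    "ko-jp_technoloy_science": 123455,
    "ko-jp_humanities": 744,
    "ko-jp_diverse": 1200000,
    "ko-jp_daily_colloquial": 600000,
    "ko-jp_broadcast": 600000,
    "ko-jp_basic_science": 623,
    "jp-ko_utterance_type": 12000,
    "jp-ko_humanities": 738,
    "jp-ko_daily_colloquial": 600000,
    "jp-ko_broadcast": 360000,
    "jp-ko_basic_science": 612,
}


def totals_by_direction(counts: dict) -> dict:
    """Sum example counts per language pair (hypothetical helper)."""
    totals = defaultdict(int)
    for config_name, num_examples in counts.items():
        direction = config_name.split("_", 1)[0]  # e.g. 'ko-jp'
        totals[direction] += num_examples
    return dict(totals)


print(totals_by_direction(NUM_EXAMPLES))
```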