
wisenut-nlp-team/llama_ko_smr

Hugging Face | Updated 2024-04-30 | Indexed 2024-06-12
Download link:
https://hf-mirror.com/datasets/wisenut-nlp-team/llama_ko_smr
Resource description:
---
dataset_info:
- config_name: art
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 23253173
    num_examples: 15627
  download_size: 12801716
  dataset_size: 23253173
- config_name: artifact_science
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 362643834
    num_examples: 89531
  download_size: 167429211
  dataset_size: 362643834
- config_name: beauty_and_health
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 11495982
    num_examples: 19203
  download_size: 6174548
  dataset_size: 11495982
- config_name: briefing
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  - name: tpye
    dtype: string
  splits:
  - name: train
    num_bytes: 84092000
    num_examples: 36000
  download_size: 26138279
  dataset_size: 84092000
- config_name: c_event
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  - name: tpye
    dtype: string
  splits:
  - name: train
    num_bytes: 70105743
    num_examples: 31166
  download_size: 21295859
  dataset_size: 70105743
- config_name: culture
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  - name: tpye
    dtype: string
  splits:
  - name: train
    num_bytes: 35908844
    num_examples: 23700
  download_size: 11289413
  dataset_size: 35908844
- config_name: daily_and_occupation
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 14495402
    num_examples: 22982
  download_size: 7769431
  dataset_size: 14495402
- config_name: edit
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  - name: tpye
    dtype: string
  splits:
  - name: train
    num_bytes: 41226597
    num_examples: 18000
  download_size: 13617131
  dataset_size: 41226597
- config_name: editorial
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 204950743
    num_examples: 63768
  download_size: 117562937
  dataset_size: 204950743
- config_name: education
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 8992532
    num_examples: 14759
  download_size: 4846739
  dataset_size: 8992532
- config_name: enter
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  - name: tpye
    dtype: string
  splits:
  - name: train
    num_bytes: 77007245
    num_examples: 36092
  download_size: 24622632
  dataset_size: 77007245
- config_name: etc
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 13009615
    num_examples: 7597
  download_size: 6696866
  dataset_size: 13009615
- config_name: event
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 13632825
    num_examples: 24006
  download_size: 7160232
  dataset_size: 13632825
- config_name: fm_drama
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  - name: tpye
    dtype: string
  splits:
  - name: train
    num_bytes: 65279567
    num_examples: 36000
  download_size: 20994133
  dataset_size: 65279567
- config_name: food_and_drink
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 18831258
    num_examples: 33957
  download_size: 9768013
  dataset_size: 18831258
- config_name: fs_drama
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  - name: tpye
    dtype: string
  splits:
  - name: train
    num_bytes: 62984894
    num_examples: 36004
  download_size: 20000234
  dataset_size: 62984894
- config_name: his_cul
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  - name: tpye
    dtype: string
  splits:
  - name: train
    num_bytes: 30609601
    num_examples: 18000
  download_size: 10628675
  dataset_size: 30609601
- config_name: history
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  - name: tpye
    dtype: string
  splits:
  - name: train
    num_bytes: 48220219
    num_examples: 25766
  download_size: 14665043
  dataset_size: 48220219
- config_name: housing_and_living
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 29295812
    num_examples: 50827
  download_size: 15854030
  dataset_size: 29295812
- config_name: law
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 59837947
    num_examples: 27333
  download_size: 29960383
  dataset_size: 59837947
- config_name: leisure
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 23140399
    num_examples: 39654
  download_size: 12420477
  dataset_size: 23140399
- config_name: life_science
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 35720463
    num_examples: 7802
  download_size: 17482630
  dataset_size: 35720463
- config_name: literature
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  - name: tpye
    dtype: string
  splits:
  - name: train
    num_bytes: 51905166
    num_examples: 21600
  download_size: 18123605
  dataset_size: 51905166
- config_name: minute
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  - name: tpye
    dtype: string
  splits:
  - name: train
    num_bytes: 149240389
    num_examples: 61200
  download_size: 41433544
  dataset_size: 149240389
- config_name: narration
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  - name: tpye
    dtype: string
  splits:
  - name: train
    num_bytes: 24511774
    num_examples: 18742
  download_size: 7720190
  dataset_size: 24511774
- config_name: nature_science
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 31775215
    num_examples: 10862
  download_size: 12939961
  dataset_size: 31775215
- config_name: news_r
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  - name: tpye
    dtype: string
  splits:
  - name: train
    num_bytes: 161506493
    num_examples: 48600
  download_size: 52108494
  dataset_size: 161506493
- config_name: newspaper
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 778034038
    num_examples: 274105
  download_size: 453662932
  dataset_size: 778034038
- config_name: paper
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 669171434
    num_examples: 324174
  download_size: 354490940
  dataset_size: 669171434
- config_name: paper2
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  - name: tpye
    dtype: string
  splits:
  - name: train
    num_bytes: 40000149
    num_examples: 18000
  download_size: 13367455
  dataset_size: 40000149
- config_name: patent
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 6932303601
    num_examples: 312600
  download_size: 2398178917
  dataset_size: 6932303601
- config_name: patent_section
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 499358509
    num_examples: 151000
  download_size: 239316958
  dataset_size: 499358509
- config_name: public
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  - name: tpye
    dtype: string
  splits:
  - name: train
    num_bytes: 40666888
    num_examples: 18000
  download_size: 12762114
  dataset_size: 40666888
- config_name: relationships
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 45706612
    num_examples: 80022
  download_size: 24000637
  dataset_size: 45706612
- config_name: shopping
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 17079513
    num_examples: 29586
  download_size: 9159776
  dataset_size: 17079513
- config_name: social_science
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 186311981
    num_examples: 129870
  download_size: 96285745
  dataset_size: 186311981
- config_name: speech
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  - name: tpye
    dtype: string
  splits:
  - name: train
    num_bytes: 162899290
    num_examples: 72000
  download_size: 48896868
  dataset_size: 162899290
- config_name: technology_science
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 37930287
    num_examples: 26907
  download_size: 19950147
  dataset_size: 37930287
- config_name: wisenut
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: title
    dtype: string
  - name: output
    dtype: string
  - name: lenght
    dtype: string
  splits:
  - name: train
    num_bytes: 440353415
    num_examples: 228728
  download_size: 145508702
  dataset_size: 440353415
configs:
- config_name: art
  data_files:
  - split: train
    path: art/train-*
- config_name: artifact_science
  data_files:
  - split: train
    path: artifact_science/train-*
- config_name: beauty_and_health
  data_files:
  - split: train
    path: beauty_and_health/train-*
- config_name: briefing
  data_files:
  - split: train
    path: briefing/train-*
- config_name: c_event
  data_files:
  - split: train
    path: c_event/train-*
- config_name: culture
  data_files:
  - split: train
    path: culture/train-*
- config_name: daily_and_occupation
  data_files:
  - split: train
    path: daily_and_occupation/train-*
- config_name: edit
  data_files:
  - split: train
    path: edit/train-*
- config_name: editorial
  data_files:
  - split: train
    path: editorial/train-*
- config_name: education
  data_files:
  - split: train
    path: education/train-*
- config_name: enter
  data_files:
  - split: train
    path: enter/train-*
- config_name: etc
  data_files:
  - split: train
    path: etc/train-*
- config_name: event
  data_files:
  - split: train
    path: event/train-*
- config_name: fm_drama
  data_files:
  - split: train
    path: fm_drama/train-*
- config_name: food_and_drink
  data_files:
  - split: train
    path: food_and_drink/train-*
- config_name: fs_drama
  data_files:
  - split: train
    path: fs_drama/train-*
- config_name: his_cul
  data_files:
  - split: train
    path: his_cul/train-*
- config_name: history
  data_files:
  - split: train
    path: history/train-*
- config_name: housing_and_living
  data_files:
  - split: train
    path: housing_and_living/train-*
- config_name: law
  data_files:
  - split: train
    path: law/train-*
- config_name: leisure
  data_files:
  - split: train
    path: leisure/train-*
- config_name: life_science
  data_files:
  - split: train
    path: life_science/train-*
- config_name: literature
  data_files:
  - split: train
    path: literature/train-*
- config_name: minute
  data_files:
  - split: train
    path: minute/train-*
- config_name: narration
  data_files:
  - split: train
    path: narration/train-*
- config_name: nature_science
  data_files:
  - split: train
    path: nature_science/train-*
- config_name: news_r
  data_files:
  - split: train
    path: news_r/train-*
- config_name: newspaper
  data_files:
  - split: train
    path: newspaper/train-*
- config_name: paper
  data_files:
  - split: train
    path: paper/train-*
- config_name: paper2
  data_files:
  - split: train
    path: paper2/train-*
- config_name: patent
  data_files:
  - split: train
    path: patent/train-*
- config_name: patent_section
  data_files:
  - split: train
    path: patent_section/train-*
- config_name: public
  data_files:
  - split: train
    path: public/train-*
- config_name: relationships
  data_files:
  - split: train
    path: relationships/train-*
- config_name: shopping
  data_files:
  - split: train
    path: shopping/train-*
- config_name: social_science
  data_files:
  - split: train
    path: social_science/train-*
- config_name: speech
  data_files:
  - split: train
    path: speech/train-*
- config_name: technology_science
  data_files:
  - split: train
    path: technology_science/train-*
- config_name: wisenut
  data_files:
  - split: train
    path: wisenut/train-*
---

## [문서요약 텍스트](https://www.aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&aihubDataSe=data&dataSetSn=97) (Document Summarization Text)

* subset: law
  - length: 27.3k
* subset: newspaper
  - length: 274k
* subset: editorial
  - length: 63.8k

## [도서자료 요약](https://www.aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&aihubDataSe=data&dataSetSn=93) (Book Material Summarization)

* subset: art
  - length: 15.6k
* subset: technology_science
  - length: 26.9k
* subset: social_science
  - length: 130k
* subset: etc
  - length: 7.6k

## [논문자료 요약](https://www.aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&aihubDataSe=data&dataSetSn=90) (Paper Material Summarization)

* subset: paper
  - length: 324k
* subset: patent
  - length: 313k
* subset: patent_section
  - length: 151k

## [방송 콘텐츠 대본 요약 데이터](https://www.aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&aihubDataSe=data&dataSetSn=591) (Broadcast Content Script Summarization Data)

* subset: fm_drama
  - length: 36k
* subset: fs_drama
  - length: 36k
* subset: history
  - length: 25.8k
* subset: culture
  - length: 23.7k
* subset: enter
  - length: 36k
* subset: c_event
  - length: 31.1k

## [요약문 및 레포트 생성 데이터](https://www.aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&aihubDataSe=data&dataSetSn=582) (Summary and Report Generation Data)

* subset: news_r
  - length: 48.6k
* subset: briefing
  - length: 36k
* subset: his_cul
  - length: 18k
* subset: paper2
  - length: 18k
* subset: minute
  - length: 61.2k
* subset: edit
  - length: 18k
* subset: public
  - length: 18k
* subset: speech
  - length: 72k
* subset: literature
  - length: 21.6k
* subset: narration
  - length: 18.7k

## [한국어 대화 요약](https://www.aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&aihubDataSe=data&dataSetSn=117) (Korean Dialogue Summarization)

* subset: relationships
  - length: 80k
* subset: beauty_and_health
  - length: 19.2k
* subset: shopping
  - length: 29.5k
* subset: education
  - length: 14.7k
* subset: food_and_drink
  - length: 33.9k
* subset: leisure
  - length: 39.6k
* subset: daily_and_occupation
  - length: 22.9k
* subset: housing_and_living
  - length: 50.8k
* subset: event
  - length: 24k

## [기술과학 요약 데이터](https://www.aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&aihubDataSe=data&dataSetSn=71532) (Technical Science Summarization Data)

* subset: life_science
  - length: 7.8k
* subset: artifact_science
  - length: 89.5k
* subset: nature_science
  - length: 10.8k
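
The `configs` section of the card maps each config name to a single `train` split stored under `<config_name>/train-*`. The following is a minimal sketch of loading one config, assuming the Hugging Face `datasets` library is installed and the repository is reachable. The helper names `train_glob` and `load_subset` are illustrative and do not come from the dataset card.

```python
REPO_ID = "wisenut-nlp-team/llama_ko_smr"


def train_glob(config_name: str) -> str:
    """Return the data-file pattern the card lists for a config's train split."""
    return f"{config_name}/train-*"


def load_subset(config_name: str, streaming: bool = True):
    """Load the train split of one config (hypothetical helper).

    The import is lazy so this module can be imported without `datasets`.
    streaming=True is an assumption made here to avoid downloading a whole
    config up front; pass False to materialize it locally instead.
    """
    from datasets import load_dataset

    return load_dataset(REPO_ID, config_name, split="train", streaming=streaming)


if __name__ == "__main__":
    # Only the offline helper runs here; load_subset("law") needs network access.
    print(train_glob("law"))
```

Calling `load_subset("law")` would then yield records with the `instruction`, `input`, and `output` fields declared for that config.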
Provider:
wisenut-nlp-team
Summary of original information

Dataset overview

This dataset consists of multiple subsets, each covering a different topic or domain, with its own features and size. Details for each subset follow:

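1. Art (art)

  • Features: instruction, input, output
  • Train split: 15627 examples, total size 23253173 bytes
  • Download size: 12801716 bytes

2. Artifact science (artifact_science)

  • Features: instruction, input, output
  • Train split: 89531 examples, total size 362643834 bytes
  • Download size: 167429211 bytes

3. Beauty and health (beauty_and_health)

  • Features: instruction, input, output
  • Train split: 19203 examples, total size 11495982 bytes
  • Download size: 6174548 bytes

4. Briefing (briefing)

  • Features: instruction, input, output, tpye
  • Train split: 36000 examples, total size 84092000 bytes
  • Download size: 26138279 bytes

5. C event (c_event)

  • Features: instruction, input, output, tpye
  • Train split: 31166 examples, total size 70105743 bytes
  • Download size: 21295859 bytes

6. Culture (culture)

  • Features: instruction, input, output, tpye
  • Train split: 23700 examples, total size 35908844 bytes
  • Download size: 11289413 bytes

7. Daily life and occupation (daily_and_occupation)

  • Features: instruction, input, output
  • Train split: 22982 examples, total size 14495402 bytes
  • Download size: 7769431 bytes

8. Edit (edit)

  • Features: instruction, input, output, tpye
  • Train split: 18000 examples, total size 41226597 bytes
  • Download size: 13617131 bytes

9. Editorial (editorial)

  • Features: instruction, input, output
  • Train split: 63768 examples, total size 204950743 bytes
  • Download size: 117562937 bytes

10. Education (education)

  • Features: instruction, input, output
  • Train split: 14759 examples, total size 8992532 bytes
  • Download size: 4846739 bytes

11. Enter (enter)

  • Features: instruction, input, output, tpye
  • Train split: 36092 examples, total size 77007245 bytes
  • Download size: 24622632 bytes

12. Other (etc)

  • Features: instruction, input, output
  • Train split: 7597 examples, total size 13009615 bytes
  • Download size: 6696866 bytes

13. Event (event)

  • Features: instruction, input, output
  • Train split: 24006 examples, total size 13632825 bytes
  • Download size: 7160232 bytes

14. FM drama (fm_drama)

  • Features: instruction, input, output, tpye
  • Train split: 36000 examples, total size 65279567 bytes
  • Download size: 20994133 bytes

15. Food and drink (food_and_drink)

  • Features: instruction, input, output
  • Train split: 33957 examples, total size 18831258 bytes
  • Download size: 9768013 bytes

16. FS drama (fs_drama)

  • Features: instruction, input, output, tpye
  • Train split: 36004 examples, total size 62984894 bytes
  • Download size: 20000234 bytes

17. History and culture (his_cul)

  • Features: instruction, input, output, tpye
  • Train split: 18000 examples, total size 30609601 bytes
  • Download size: 10628675 bytes

18. History (history)

  • Features: instruction, input, output, tpye
  • Train split: 25766 examples, total size 48220219 bytes
  • Download size: 14665043 bytes

19. Housing and living (housing_and_living)

  • Features: instruction, input, output
  • Train split: 50827 examples, total size 29295812 bytes
  • Download size: 15854030 bytes

20. Law (law)

  • Features: instruction, input, output
  • Train split: 27333 examples, total size 59837947 bytes
  • Download size: 29960383 bytes

21. Leisure (leisure)

  • Features: instruction, input, output
  • Train split: 39654 examples, total size 23140399 bytes
  • Download size: 12420477 bytes

22. Life science (life_science)

  • Features: instruction, input, output
  • Train split: 7802 examples, total size 35720463 bytes
  • Download size: 17482630 bytes

23. Literature (literature)

  • Features: instruction, input, output, tpye
  • Train split: 21600 examples, total size 51905166 bytes
  • Download size: 18123605 bytes

24. Minute (minute)

  • Features: instruction, input, output, tpye
  • Train split: 61200 examples, total size 149240389 bytes
  • Download size: 41433544 bytes

25. Narration (narration)

  • Features: instruction, input, output, tpye
  • Train split: 18742 examples, total size 24511774 bytes
  • Download size: 7720190 bytes

26. Natural science (nature_science)

  • Features: instruction, input, output
  • Train split: 10862 examples, total size 31775215 bytes
  • Download size: 12939961 bytes

27. News R (news_r)

  • Features: instruction, input, output, tpye
  • Train split: 48600 examples, total size 161506493 bytes
  • Download size: 52108494 bytes

28. Newspaper (newspaper)

  • Features: instruction, input, output
  • Train split: 274105 examples, total size 778034038 bytes
  • Download size: 453662932 bytes

29. Paper (paper)

  • Features: instruction, input, output
  • Train split: 324174 examples, total size 669171434 bytes
  • Download size: 354490940 bytes

30. Paper 2 (paper2)

  • Features: instruction, input, output, tpye
  • Train split: 18000 examples, total size 40000149 bytes
  • Download size: 13367455 bytes

31. Patent (patent)

  • Features: instruction, input, output
  • Train split: 312600 examples, total size 6932303601 bytes
  • Download size: 2398178917 bytes

32. Patent section (patent_section)

  • Features: instruction, input, output
  • Train split: 151000 examples, total size 499358509 bytes
  • Download size: 239316958 bytes

33. Public (public)

  • Features: instruction, input, output, tpye
  • Train split: 18000 examples, total size 40666888 bytes
  • Download size: 12762114 bytes

34. Relationships (relationships)

  • Features: instruction, input, output
  • Train split: 80022 examples, total size 45706612 bytes
  • Download size: 24000637 bytes

35. Shopping (shopping)

  • Features: instruction, input, output
  • Train split: 29586 examples, total size 17079513 bytes
  • Download size: 9159776 bytes

36. Social science (social_science)

  • Features: instruction, input, output
  • Train split: 129870 examples, total size 186311981 bytes
  • Download size: 96285745 bytes

37. Speech (speech)

  • Features: instruction, input, output, tpye
  • Train split: 72000 examples, total size 162899290 bytes
  • Download size: 48896868 bytes

38. Technology science (technology_science)

  • Features: instruction, input, output
  • Train split: 26907 examples, total size 37930287 bytes
  • Download size: 19950147 bytes

39. Wisenut (wisenut)

  • Features: instruction, input, title, output, lenght
  • Train split: 228728 examples, total size 440353415 bytes
  • Download size: 145508702 bytes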
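
The per-subset figures above can be used to budget a download before fetching anything. The sketch below copies three rows from the listing into a small table and derives the total download size, the ratio of download size to dataset size, and the average bytes per example. The `SIZES` table and the function names are illustrative only, and only three subsets are included to keep the example short.

```python
# (num_examples, dataset_size_bytes, download_size_bytes), copied from the listing.
SIZES = {
    "art": (15627, 23253173, 12801716),
    "law": (27333, 59837947, 29960383),
    "patent": (312600, 6932303601, 2398178917),
}


def download_budget(configs):
    """Total download size in bytes for the named configs."""
    return sum(SIZES[name][2] for name in configs)


def compression_ratio(name):
    """Download size divided by dataset size for one config."""
    _, dataset_size, download_size = SIZES[name]
    return download_size / dataset_size


def avg_bytes_per_example(name):
    """Average dataset bytes per training example for one config."""
    num_examples, dataset_size, _ = SIZES[name]
    return dataset_size / num_examples


if __name__ == "__main__":
    print(download_budget(["art", "law"]))
    print(round(compression_ratio("patent"), 3))
    print(round(avg_bytes_per_example("patent")))
```

On these numbers, `patent` stands out: it has fewer examples than `paper` but roughly ten times the dataset size, so its average record is much longer.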

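Taken together, these 39 subsets supply instruction/input/output text records drawn from news, book, academic paper, patent, broadcast script, report, and dialogue sources, and are suitable for a range of research and application work.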
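
Because the feature sets differ between subsets (some add `tpye`, and `wisenut` adds `title` and `lenght`), downstream code may want to treat those columns as optional. The sketch below shows one way to do that. The prompt layout and the `Title:` label are assumptions made for illustration, not a format documented by the dataset, and the misspelled column names are kept exactly as the card declares them.

```python
from typing import Dict, Mapping

# Optional columns, spelled exactly as in the dataset card.
OPTIONAL_FIELDS = ("tpye", "title", "lenght")


def to_prompt(record: Mapping[str, str]) -> str:
    """Join instruction, optional title, and input into one prompt string.

    The layout (blank-line separators, "Title:" label) is a hypothetical choice.
    """
    parts = [record["instruction"].strip()]
    title = record.get("title")
    if title:
        parts.append(f"Title: {title.strip()}")
    parts.append(record["input"].strip())
    return "\n\n".join(parts)


def extra_fields(record: Mapping[str, str]) -> Dict[str, str]:
    """Return whichever optional columns this record carries."""
    return {key: record[key] for key in OPTIONAL_FIELDS if key in record}


if __name__ == "__main__":
    # A made-up record for illustration; not taken from the dataset.
    sample = {"instruction": "Summarize.", "input": "Some text.", "output": "Text.", "tpye": "x"}
    print(to_prompt(sample))
    print(extra_fields(sample))
```

The `output` field is left untouched so it can serve as the target during fine-tuning or evaluation.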
