wisenut-nlp-team/llama_ko_smr

Name: wisenut-nlp-team/llama_ko_smr
Creator: wisenut-nlp-team
Published: 2024-04-30 07:31:40
License: No description available

Source: Hugging Face (updated 2024-04-30, indexed 2024-06-12)

Download link:

https://hf-mirror.com/datasets/wisenut-nlp-team/llama_ko_smr
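As an alternative to a manual download, one subset can be loaded programmatically. The following is a minimal sketch, assuming the Hugging Face `datasets` library is installed and the repository is reachable; the helper name `load_subset` and the abbreviated `KNOWN_CONFIGS` set are illustrative and not part of the dataset card.

```python
from typing import Any

REPO_ID = "wisenut-nlp-team/llama_ko_smr"

# Abbreviated, illustrative sample of the config names listed in the dataset card.
KNOWN_CONFIGS = {"law", "newspaper", "editorial", "paper", "patent", "wisenut"}


def load_subset(config_name: str, split: str = "train") -> Any:
    """Load one config of the dataset; the card declares only a train split per config."""
    if config_name not in KNOWN_CONFIGS:
        raise ValueError(f"unknown config: {config_name!r}")
    # Imported lazily so this module can be imported without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, config_name, split=split)


# Example call (requires network access):
# law = load_subset("law")
# print(law.column_names)
```

Validating the config name before the import keeps a typo from triggering a network request.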


Resource description:

---
dataset_info:
- config_name: art
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 23253173
    num_examples: 15627
  download_size: 12801716
  dataset_size: 23253173
- config_name: artifact_science
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 362643834
    num_examples: 89531
  download_size: 167429211
  dataset_size: 362643834
- config_name: beauty_and_health
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 11495982
    num_examples: 19203
  download_size: 6174548
  dataset_size: 11495982
- config_name: briefing
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  - name: tpye
    dtype: string
  splits:
  - name: train
    num_bytes: 84092000
    num_examples: 36000
  download_size: 26138279
  dataset_size: 84092000
- config_name: c_event
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  - name: tpye
    dtype: string
  splits:
  - name: train
    num_bytes: 70105743
    num_examples: 31166
  download_size: 21295859
  dataset_size: 70105743
- config_name: culture
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  - name: tpye
    dtype: string
  splits:
  - name: train
    num_bytes: 35908844
    num_examples: 23700
  download_size: 11289413
  dataset_size: 35908844
- config_name: daily_and_occupation
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 14495402
    num_examples: 22982
  download_size: 7769431
  dataset_size: 14495402
- config_name: edit
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  - name: tpye
    dtype: string
  splits:
  - name: train
    num_bytes: 41226597
    num_examples: 18000
  download_size: 13617131
  dataset_size: 41226597
- config_name: editorial
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 204950743
    num_examples: 63768
  download_size: 117562937
  dataset_size: 204950743
- config_name: education
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 8992532
    num_examples: 14759
  download_size: 4846739
  dataset_size: 8992532
- config_name: enter
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  - name: tpye
    dtype: string
  splits:
  - name: train
    num_bytes: 77007245
    num_examples: 36092
  download_size: 24622632
  dataset_size: 77007245
- config_name: etc
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 13009615
    num_examples: 7597
  download_size: 6696866
  dataset_size: 13009615
- config_name: event
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 13632825
    num_examples: 24006
  download_size: 7160232
  dataset_size: 13632825
- config_name: fm_drama
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  - name: tpye
    dtype: string
  splits:
  - name: train
    num_bytes: 65279567
    num_examples: 36000
  download_size: 20994133
  dataset_size: 65279567
- config_name: food_and_drink
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 18831258
    num_examples: 33957
  download_size: 9768013
  dataset_size: 18831258
- config_name: fs_drama
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  - name: tpye
    dtype: string
  splits:
  - name: train
    num_bytes: 62984894
    num_examples: 36004
  download_size: 20000234
  dataset_size: 62984894
- config_name: his_cul
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  - name: tpye
    dtype: string
  splits:
  - name: train
    num_bytes: 30609601
    num_examples: 18000
  download_size: 10628675
  dataset_size: 30609601
- config_name: history
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  - name: tpye
    dtype: string
  splits:
  - name: train
    num_bytes: 48220219
    num_examples: 25766
  download_size: 14665043
  dataset_size: 48220219
- config_name: housing_and_living
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 29295812
    num_examples: 50827
  download_size: 15854030
  dataset_size: 29295812
- config_name: law
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 59837947
    num_examples: 27333
  download_size: 29960383
  dataset_size: 59837947
- config_name: leisure
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 23140399
    num_examples: 39654
  download_size: 12420477
  dataset_size: 23140399
- config_name: life_science
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 35720463
    num_examples: 7802
  download_size: 17482630
  dataset_size: 35720463
- config_name: literature
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  - name: tpye
    dtype: string
  splits:
  - name: train
    num_bytes: 51905166
    num_examples: 21600
  download_size: 18123605
  dataset_size: 51905166
- config_name: minute
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  - name: tpye
    dtype: string
  splits:
  - name: train
    num_bytes: 149240389
    num_examples: 61200
  download_size: 41433544
  dataset_size: 149240389
- config_name: narration
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  - name: tpye
    dtype: string
  splits:
  - name: train
    num_bytes: 24511774
    num_examples: 18742
  download_size: 7720190
  dataset_size: 24511774
- config_name: nature_science
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 31775215
    num_examples: 10862
  download_size: 12939961
  dataset_size: 31775215
- config_name: news_r
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  - name: tpye
    dtype: string
  splits:
  - name: train
    num_bytes: 161506493
    num_examples: 48600
  download_size: 52108494
  dataset_size: 161506493
- config_name: newspaper
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 778034038
    num_examples: 274105
  download_size: 453662932
  dataset_size: 778034038
- config_name: paper
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 669171434
    num_examples: 324174
  download_size: 354490940
  dataset_size: 669171434
- config_name: paper2
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  - name: tpye
    dtype: string
  splits:
  - name: train
    num_bytes: 40000149
    num_examples: 18000
  download_size: 13367455
  dataset_size: 40000149
- config_name: patent
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 6932303601
    num_examples: 312600
  download_size: 2398178917
  dataset_size: 6932303601
- config_name: patent_section
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 499358509
    num_examples: 151000
  download_size: 239316958
  dataset_size: 499358509
- config_name: public
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  - name: tpye
    dtype: string
  splits:
  - name: train
    num_bytes: 40666888
    num_examples: 18000
  download_size: 12762114
  dataset_size: 40666888
- config_name: relationships
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 45706612
    num_examples: 80022
  download_size: 24000637
  dataset_size: 45706612
- config_name: shopping
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 17079513
    num_examples: 29586
  download_size: 9159776
  dataset_size: 17079513
- config_name: social_science
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 186311981
    num_examples: 129870
  download_size: 96285745
  dataset_size: 186311981
- config_name: speech
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  - name: tpye
    dtype: string
  splits:
  - name: train
    num_bytes: 162899290
    num_examples: 72000
  download_size: 48896868
  dataset_size: 162899290
- config_name: technology_science
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 37930287
    num_examples: 26907
  download_size: 19950147
  dataset_size: 37930287
- config_name: wisenut
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: title
    dtype: string
  - name: output
    dtype: string
  - name: lenght
    dtype: string
  splits:
  - name: train
    num_bytes: 440353415
    num_examples: 228728
  download_size: 145508702
  dataset_size: 440353415
configs:
- config_name: art
  data_files:
  - split: train
    path: art/train-*
- config_name: artifact_science
  data_files:
  - split: train
    path: artifact_science/train-*
- config_name: beauty_and_health
  data_files:
  - split: train
    path: beauty_and_health/train-*
- config_name: briefing
  data_files:
  - split: train
    path: briefing/train-*
- config_name: c_event
  data_files:
  - split: train
    path: c_event/train-*
- config_name: culture
  data_files:
  - split: train
    path: culture/train-*
- config_name: daily_and_occupation
  data_files:
  - split: train
    path: daily_and_occupation/train-*
- config_name: edit
  data_files:
  - split: train
    path: edit/train-*
- config_name: editorial
  data_files:
  - split: train
    path: editorial/train-*
- config_name: education
  data_files:
  - split: train
    path: education/train-*
- config_name: enter
  data_files:
  - split: train
    path: enter/train-*
- config_name: etc
  data_files:
  - split: train
    path: etc/train-*
- config_name: event
  data_files:
  - split: train
    path: event/train-*
- config_name: fm_drama
  data_files:
  - split: train
    path: fm_drama/train-*
- config_name: food_and_drink
  data_files:
  - split: train
    path: food_and_drink/train-*
- config_name: fs_drama
  data_files:
  - split: train
    path: fs_drama/train-*
- config_name: his_cul
  data_files:
  - split: train
    path: his_cul/train-*
- config_name: history
  data_files:
  - split: train
    path: history/train-*
- config_name: housing_and_living
  data_files:
  - split: train
    path: housing_and_living/train-*
- config_name: law
  data_files:
  - split: train
    path: law/train-*
- config_name: leisure
  data_files:
  - split: train
    path: leisure/train-*
- config_name: life_science
  data_files:
  - split: train
    path: life_science/train-*
- config_name: literature
  data_files:
  - split: train
    path: literature/train-*
- config_name: minute
  data_files:
  - split: train
    path: minute/train-*
- config_name: narration
  data_files:
  - split: train
    path: narration/train-*
- config_name: nature_science
  data_files:
  - split: train
    path: nature_science/train-*
- config_name: news_r
  data_files:
  - split: train
    path: news_r/train-*
- config_name: newspaper
  data_files:
  - split: train
    path: newspaper/train-*
- config_name: paper
  data_files:
  - split: train
    path: paper/train-*
- config_name: paper2
  data_files:
  - split: train
    path: paper2/train-*
- config_name: patent
  data_files:
  - split: train
    path: patent/train-*
- config_name: patent_section
  data_files:
  - split: train
    path: patent_section/train-*
- config_name: public
  data_files:
  - split: train
    path: public/train-*
- config_name: relationships
  data_files:
  - split: train
    path: relationships/train-*
- config_name: shopping
  data_files:
  - split: train
    path: shopping/train-*
- config_name: social_science
  data_files:
  - split: train
    path: social_science/train-*
- config_name: speech
  data_files:
  - split: train
    path: speech/train-*
- config_name: technology_science
  data_files:
  - split: train
    path: technology_science/train-*
- config_name: wisenut
  data_files:
  - split: train
    path: wisenut/train-*
---

## [문서요약 텍스트 (Document Summarization Text)](https://www.aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&aihubDataSe=data&dataSetSn=97)

- subset: law
  - length: 27.3k
- subset: newspaper
  - length: 274k
- subset: editorial
  - length: 63.8k

## [도서자료 요약 (Book Material Summarization)](https://www.aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&aihubDataSe=data&dataSetSn=93)

* subset: art
  - length: 15.6k
* subset: technology_science
  - length: 26.9k
* subset: social_science
  - length: 130k
* subset: etc
  - length: 7.6k

## [논문자료 요약 (Paper Material Summarization)](https://www.aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&aihubDataSe=data&dataSetSn=90)

* subset: paper
  - length: 324k
* subset: patent
  - length: 313k
* subset: patent_section
  - length: 151k

## [방송 콘텐츠 대본 요약 데이터 (Broadcast Content Script Summarization Data)](https://www.aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&aihubDataSe=data&dataSetSn=591)

* subset: fm_drama
  - length: 36k
* subset: fs_drama
  - length: 36k
* subset: history
  - length: 25.8k
* subset: culture
  - length: 23.7k
* subset: enter
  - length: 36k
* subset: c_event
  - length: 31.1k

## [요약문 및 레포트 생성 데이터 (Summary and Report Generation Data)](https://www.aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&aihubDataSe=data&dataSetSn=582)

* subset: news_r
  - length: 48.6k
* subset: briefing
  - length: 36k
* subset: his_cul
  - length: 18k
* subset: paper2
  - length: 18k
* subset: minute
  - length: 61.2k
* subset: edit
  - length: 18k
* subset: public
  - length: 18k
* subset: speech
  - length: 72k
* subset: literature
  - length: 21.6k
* subset: narration
  - length: 18.7k

## [한국어 대화 요약 (Korean Dialogue Summarization)](https://www.aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&aihubDataSe=data&dataSetSn=117)

* subset: relationships
  - length: 80k
* subset: beauty_and_health
  - length: 19.2k
* subset: shopping
  - length: 29.5k
* subset: education
  - length: 14.7k
* subset: food_and_drink
  - length: 33.9k
* subset: leisure
  - length: 39.6k
* subset: daily_and_occupation
  - length: 22.9k
* subset: housing_and_living
  - length: 50.8k
* subset: event
  - length: 24k

## [기술과학 요약 데이터 (Technical Science Summarization Data)](https://www.aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&aihubDataSe=data&dataSetSn=71532)

* subset: life_science
  - length: 7.8k
* subset: artifact_science
  - length: 89.5k
* subset: nature_science
  - length: 10.8k
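The configs in the card do not share one schema: some add a `tpye` column, and `wisenut` adds `title` and `lenght`. The sketch below shows one possible way to map rows from any subset onto a common set of keys before mixing subsets. The function name `normalize_record` and the target keys `type` and `length` are assumptions made for illustration; the card itself defines only the column names as spelled.

```python
from typing import Dict, Mapping, Optional

# Columns present in every config of the card.
CORE_FIELDS = ("instruction", "input", "output")
# Optional columns as spelled in the card, mapped to clearer (hypothetical) keys.
OPTIONAL_FIELDS = {"tpye": "type", "title": "title", "lenght": "length"}


def normalize_record(record: Mapping[str, Optional[str]]) -> Dict[str, str]:
    """Return a row with a fixed key set; missing or None values become ''."""
    normalized = {name: record.get(name) or "" for name in CORE_FIELDS}
    for source_name, target_name in OPTIONAL_FIELDS.items():
        normalized[target_name] = record.get(source_name) or ""
    return normalized
```

With a fixed key set, rows from `law` (three columns) and `wisenut` (five columns) can be concatenated without per-subset special cases.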

Provider:

wisenut-nlp-team

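Summary of original information

Dataset overview

The dataset consists of 39 subsets, each covering a different topic or domain, with its own feature set and size. Details for each subset follow: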

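1. Art (`art`)

Features: instruction, input, output
Training set: 15627 examples, 23253173 bytes in total
Download size: 12801716 bytes

2. Artifact science (`artifact_science`)

Features: instruction, input, output
Training set: 89531 examples, 362643834 bytes in total
Download size: 167429211 bytes

3. Beauty and health (`beauty_and_health`)

Features: instruction, input, output
Training set: 19203 examples, 11495982 bytes in total
Download size: 6174548 bytes

4. Briefing (`briefing`)

Features: instruction, input, output, tpye
Training set: 36000 examples, 84092000 bytes in total
Download size: 26138279 bytes

5. C event (`c_event`)

Features: instruction, input, output, tpye
Training set: 31166 examples, 70105743 bytes in total
Download size: 21295859 bytes

6. Culture (`culture`)

Features: instruction, input, output, tpye
Training set: 23700 examples, 35908844 bytes in total
Download size: 11289413 bytes

7. Daily life and occupation (`daily_and_occupation`)

Features: instruction, input, output
Training set: 22982 examples, 14495402 bytes in total
Download size: 7769431 bytes

8. Edit (`edit`)

Features: instruction, input, output, tpye
Training set: 18000 examples, 41226597 bytes in total
Download size: 13617131 bytes

9. Editorial (`editorial`)

Features: instruction, input, output
Training set: 63768 examples, 204950743 bytes in total
Download size: 117562937 bytes

10. Education (`education`)

Features: instruction, input, output
Training set: 14759 examples, 8992532 bytes in total
Download size: 4846739 bytes

11. Enter (`enter`)

Features: instruction, input, output, tpye
Training set: 36092 examples, 77007245 bytes in total
Download size: 24622632 bytes

12. Other (`etc`)

Features: instruction, input, output
Training set: 7597 examples, 13009615 bytes in total
Download size: 6696866 bytes

13. Event (`event`)

Features: instruction, input, output
Training set: 24006 examples, 13632825 bytes in total
Download size: 7160232 bytes

14. FM drama (`fm_drama`)

Features: instruction, input, output, tpye
Training set: 36000 examples, 65279567 bytes in total
Download size: 20994133 bytes

15. Food and drink (`food_and_drink`)

Features: instruction, input, output
Training set: 33957 examples, 18831258 bytes in total
Download size: 9768013 bytes

16. FS drama (`fs_drama`)

Features: instruction, input, output, tpye
Training set: 36004 examples, 62984894 bytes in total
Download size: 20000234 bytes

17. History and culture (`his_cul`)

Features: instruction, input, output, tpye
Training set: 18000 examples, 30609601 bytes in total
Download size: 10628675 bytes

18. History (`history`)

Features: instruction, input, output, tpye
Training set: 25766 examples, 48220219 bytes in total
Download size: 14665043 bytes

19. Housing and living (`housing_and_living`)

Features: instruction, input, output
Training set: 50827 examples, 29295812 bytes in total
Download size: 15854030 bytes

20. Law (`law`)

Features: instruction, input, output
Training set: 27333 examples, 59837947 bytes in total
Download size: 29960383 bytes

21. Leisure (`leisure`)

Features: instruction, input, output
Training set: 39654 examples, 23140399 bytes in total
Download size: 12420477 bytes

22. Life science (`life_science`)

Features: instruction, input, output
Training set: 7802 examples, 35720463 bytes in total
Download size: 17482630 bytes

23. Literature (`literature`)

Features: instruction, input, output, tpye
Training set: 21600 examples, 51905166 bytes in total
Download size: 18123605 bytes

24. Minute (`minute`)

Features: instruction, input, output, tpye
Training set: 61200 examples, 149240389 bytes in total
Download size: 41433544 bytes

25. Narration (`narration`)

Features: instruction, input, output, tpye
Training set: 18742 examples, 24511774 bytes in total
Download size: 7720190 bytes

26. Natural science (`nature_science`)

Features: instruction, input, output
Training set: 10862 examples, 31775215 bytes in total
Download size: 12939961 bytes

27. News R (`news_r`)

Features: instruction, input, output, tpye
Training set: 48600 examples, 161506493 bytes in total
Download size: 52108494 bytes

28. Newspaper (`newspaper`)

Features: instruction, input, output
Training set: 274105 examples, 778034038 bytes in total
Download size: 453662932 bytes

29. Paper (`paper`)

Features: instruction, input, output
Training set: 324174 examples, 669171434 bytes in total
Download size: 354490940 bytes

30. Paper 2 (`paper2`)

Features: instruction, input, output, tpye
Training set: 18000 examples, 40000149 bytes in total
Download size: 13367455 bytes

31. Patent (`patent`)

Features: instruction, input, output
Training set: 312600 examples, 6932303601 bytes in total
Download size: 2398178917 bytes

32. Patent section (`patent_section`)

Features: instruction, input, output
Training set: 151000 examples, 499358509 bytes in total
Download size: 239316958 bytes

33. Public (`public`)

Features: instruction, input, output, tpye
Training set: 18000 examples, 40666888 bytes in total
Download size: 12762114 bytes

34. Relationships (`relationships`)

Features: instruction, input, output
Training set: 80022 examples, 45706612 bytes in total
Download size: 24000637 bytes

35. Shopping (`shopping`)

Features: instruction, input, output
Training set: 29586 examples, 17079513 bytes in total
Download size: 9159776 bytes

36. Social science (`social_science`)

Features: instruction, input, output
Training set: 129870 examples, 186311981 bytes in total
Download size: 96285745 bytes

37. Speech (`speech`)

Features: instruction, input, output, tpye
Training set: 72000 examples, 162899290 bytes in total
Download size: 48896868 bytes

38. Technology science (`technology_science`)

Features: instruction, input, output
Training set: 26907 examples, 37930287 bytes in total
Download size: 19950147 bytes

39. Wisenut (`wisenut`)

Features: instruction, input, title, output, lenght
Training set: 228728 examples, 440353415 bytes in total
Download size: 145508702 bytes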

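Together, these subsets provide Korean instruction/input/output text across news, books, papers, patents, broadcast scripts, reports, and dialogue, all drawn from summarization corpora.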
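The raw byte counts in the listing are hard to compare at a glance. As a rough sketch, the figures for three subsets (copied from the listing above) can be aggregated and converted to readable units as follows; the helper names `total_examples`, `human_size`, and `compression_ratio` are illustrative, and decimal units (1 MB = 10^6 bytes) are an assumption.

```python
from typing import Dict, Iterable, Tuple

# (num_examples, num_bytes, download_size) for three subsets, copied from the listing.
SUBSETS: Dict[str, Tuple[int, int, int]] = {
    "law": (27333, 59837947, 29960383),
    "newspaper": (274105, 778034038, 453662932),
    "editorial": (63768, 204950743, 117562937),
}


def total_examples(names: Iterable[str]) -> int:
    """Sum the number of training examples over the named subsets."""
    return sum(SUBSETS[name][0] for name in names)


def human_size(num_bytes: int) -> str:
    """Format a byte count using decimal units (1 kB = 1000 B)."""
    size = float(num_bytes)
    for unit in ("B", "kB", "MB", "GB"):
        if size < 1000:
            return f"{size:.2f} {unit}"
        size /= 1000
    return f"{size:.2f} TB"


def compression_ratio(name: str) -> float:
    """Ratio of download size to in-memory dataset size for one subset."""
    _, num_bytes, download_size = SUBSETS[name]
    return download_size / num_bytes


print(total_examples(SUBSETS))       # 365206
print(human_size(SUBSETS["law"][1]))  # 59.84 MB
```

The same table can be extended with the remaining subsets to estimate disk and bandwidth needs before downloading.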
