baoanhtran/guanaco-llama2-200
收藏Hugging Face2023-09-24 更新2024-03-04 收录
下载链接:
https://hf-mirror.com/datasets/baoanhtran/guanaco-llama2-200
下载链接
链接失效反馈官方服务:
资源简介:
CulturaX 是一个多语言数据集,包含 167 种语言的 6.3 万亿个标记,专为大规模语言模型(LLM)开发而设计。该数据集经过严格的清洗和去重处理,包括语言识别、基于 URL 的过滤、基于指标的清洗、文档精炼和数据去重等多个阶段。数据集结合了 mC4(版本 3.1.0)和所有可访问的 OSCAR 语料库,经过深度清洗和去重后,包含 16TB 的数据(解压后为 27TB)。超过一半的数据集用于非英语语言,以显著增加数据量并增强多语言场景下模型训练的可行性。CulturaX 已在 HuggingFace 上公开发布,以促进多语言 LLM 的研究和进展。
CulturaX 是一个多语言数据集,包含 167 种语言的 6.3 万亿个标记,专为大规模语言模型(LLM)开发而设计。该数据集经过严格的清洗和去重处理,包括语言识别、基于 URL 的过滤、基于指标的清洗、文档精炼和数据去重等多个阶段。数据集结合了 mC4(版本 3.1.0)和所有可访问的 OSCAR 语料库,经过深度清洗和去重后,包含 16TB 的数据(解压后为 27TB)。超过一半的数据集用于非英语语言,以显著增加数据量并增强多语言场景下模型训练的可行性。CulturaX 已在 HuggingFace 上公开发布,以促进多语言 LLM 的研究和进展。
提供机构:
baoanhtran
原始信息汇总
CulturaX 数据集概述
数据集描述
基本信息
- 名称: CulturaX
- 语言数量: 167种语言
- 总令牌数: 6.3万亿
- 格式: parquet
- 大小: 16TB(解压后27TB)
语言和统计
| 序号 | 代码 | 语言 | 文档数量 | 令牌数量 | 令牌占比 |
|---|---|---|---|---|---|
| 0 | en | English | 3,241,065,682 | 2,846,970,578,793 | 45.13 |
| 1 | ru | Russian | 799,310,908 | 737,201,800,363 | 11.69 |
| 2 | es | Spanish | 450,937,645 | 373,845,662,394 | 5.93 |
| 3 | de | German | 420,017,484 | 357,030,348,021 | 5.66 |
| 4 | fr | French | 363,754,348 | 319,332,674,695 | 5.06 |
| 5 | zh | Chinese | 218,624,604 | 227,055,380,882 | 3.60 |
| 6 | it | Italian | 211,309,922 | 165,446,410,843 | 2.62 |
| 7 | pt | Portuguese | 190,289,658 | 136,941,763,923 | 2.17 |
| 8 | pl | Polish | 142,167,217 | 117,269,087,143 | 1.86 |
| 9 | ja | Japanese | 111,188,475 | 107,873,841,351 | 1.71 |
| 10 | vi | Vietnamese | 102,411,180 | 98,453,464,077 | 1.56 |
| 11 | nl | Dutch | 117,392,666 | 80,032,209,900 | 1.27 |
| 12 | ar | Arabic | 74,027,952 | 69,354,335,076 | 1.10 |
| 13 | tr | Turkish | 94,207,460 | 64,292,787,164 | 1.02 |
| 14 | cs | Czech | 65,350,564 | 56,910,486,745 | 0.90 |
| 15 | fa | Persian | 59,531,144 | 45,947,657,495 | 0.73 |
| 16 | hu | Hungarian | 44,132,152 | 43,417,981,714 | 0.69 |
| 17 | el | Greek | 51,430,226 | 43,147,590,757 | 0.68 |
| 18 | ro | Romanian | 40,325,424 | 39,647,954,768 | 0.63 |
| 19 | sv | Swedish | 49,709,189 | 38,486,181,494 | 0.61 |
| 20 | uk | Ukrainian | 44,740,545 | 38,226,128,686 | 0.61 |
| 21 | fi | Finnish | 30,467,667 | 28,925,009,180 | 0.46 |
| 22 | ko | Korean | 20,557,310 | 24,765,448,392 | 0.39 |
| 23 | da | Danish | 25,429,808 | 22,921,651,314 | 0.36 |
| 24 | bg | Bulgarian | 24,131,819 | 22,917,954,776 | 0.36 |
| 25 | no | Norwegian | 18,907,310 | 18,426,628,868 | 0.29 |
| 26 | hi | Hindi | 19,665,355 | 16,791,362,871 | 0.27 |
| 27 | sk | Slovak | 18,582,517 | 16,442,669,076 | 0.26 |
| 28 | th | Thai | 20,960,550 | 15,717,374,014 | 0.25 |
| 29 | lt | Lithuanian | 13,339,785 | 14,247,110,836 | 0.23 |
| 30 | ca | Catalan | 15,531,777 | 12,530,288,006 | 0.20 |
| 31 | id | Indonesian | 23,251,368 | 12,062,966,061 | 0.19 |
| 32 | bn | Bangla | 12,436,596 | 9,572,929,804 | 0.15 |
| 33 | et | Estonian | 8,004,753 | 8,805,656,165 | 0.14 |
| 34 | sl | Slovenian | 7,335,378 | 8,007,587,522 | 0.13 |
| 35 | lv | Latvian | 7,136,587 | 7,845,180,319 | 0.12 |
| 36 | he | Hebrew | 4,653,979 | 4,937,152,096 | 0.08 |
| 37 | sr | Serbian | 4,053,166 | 4,619,482,725 | 0.07 |
| 38 | ta | Tamil | 4,728,460 | 4,378,078,610 | 0.07 |
| 39 | sq | Albanian | 5,205,579 | 3,648,893,215 | 0.06 |
| 40 | az | Azerbaijani | 5,084,505 | 3,513,351,967 | 0.06 |
| 41 | kk | Kazakh | 2,733,982 | 2,802,485,195 | 0.04 |
| 42 | ur | Urdu | 2,757,279 | 2,703,052,627 | 0.04 |
| 43 | ka | Georgian | 3,120,321 | 2,617,625,564 | 0.04 |
| 44 | hy | Armenian | 2,964,488 | 2,395,179,284 | 0.04 |
| 45 | is | Icelandic | 2,373,560 | 2,350,592,857 | 0.04 |
| 46 | ml | Malayalam | 2,693,052 | 2,100,556,809 | 0.03 |
| 47 | ne | Nepali | 3,124,040 | 2,061,601,961 | 0.03 |
| 48 | mk | Macedonian | 2,762,807 | 2,003,302,006 | 0.03 |
| 49 | mr | Marathi | 2,266,588 | 1,955,227,796 | 0.03 |
| 50 | mn | Mongolian | 1,928,828 | 1,850,667,656 | 0.03 |
| 51 | be | Belarusian | 1,643,486 | 1,791,473,041 | 0.03 |
| 52 | te | Telugu | 1,822,865 | 1,566,972,146 | 0.02 |
| 53 | gl | Galician | 1,785,963 | 1,382,539,693 | 0.02 |
| 54 | eu | Basque | 1,598,822 | 1,262,066,759 | 0.02 |
| 55 | kn | Kannada | 1,352,142 | 1,242,285,201 | 0.02 |
| 56 | gu | Gujarati | 1,162,878 | 1,131,730,537 | 0.02 |
| 57 | af | Afrikaans | 826,519 | 1,119,009,767 | 0.02 |
| 58 | my | Burmese | 865,575 | 882,606,546 | 0.01 |
| 59 | si | Sinhala | 753,655 | 880,289,097 | 0.01 |
| 60 | eo | Esperanto | 460,088 | 803,948,528 | 0.01 |
| 61 | km | Khmer | 1,013,181 | 746,664,132 | 0.01 |
| 62 | pa | Punjabi | 646,987 | 727,546,145 | 0.01 |
| 63 | cy | Welsh | 549,955 | 576,743,162 | 0.01 |
| 64 | ky | Kyrgyz | 570,922 | 501,442,620 | 0.01 |
| 65 | ga | Irish | 304,251 | 376,947,935 | 0.01 |
| 66 | ps | Pashto | 376,914 | 363,007,770 | 0.01 |
| 67 | am | Amharic | 243,349 | 358,206,762 | 0.01 |
| 68 | ku | Kurdish | 295,314 | 302,990,910 | 0.00 |
| 69 | tl | Filipino | 348,453 | 242,086,456 | 0.00 |
| 70 | yi | Yiddish | 141,156 | 217,584,643 | 0.00 |
| 71 | lo | Lao | 217,842 | 168,256,876 | 0.00 |
| 72 | fy | Western Frisian | 223,268 | 167,193,111 | 0.00 |
| 73 | sd | Sindhi | 109,162 | 147,487,058 | 0.00 |
| 74 | mg | Malagasy | 115,910 | 142,685,412 | 0.00 |
| 75 | or | Odia | 153,461 | 100,323,213 | 0.00 |
| 76 | as | Assamese | 52,627 | 83,787,896 | 0.00 |
| 77 | ug | Uyghur | 47,035 | 77,677,306 | 0.00 |
| 78 | uz | Uzbek | 87,219 | 75,250,787 | 0.00 |
| 79 | la | Latin | 48,968 | 44,176,580 | 0.00 |
| 80 | hr | Croatian | 460,690 | 40,796,811 | 0.00 |
| 81 | sw | Swahili | 66,506 | 30,708,309 | 0.00 |
| 82 | ms | Malay | 238,151 | 19,375,976 | 0.00 |
| 83 | br | Breton | 43,765 | 1 |



