Download link:

https://modelscope.cn/datasets/ontocord/CulturaY

Resource description:

## CulturaY: A Large Cleaned Multilingual Dataset of 75 Languages

### Dataset Summary

From the team that brought you [CulturaX](https://huggingface.co/datasets/uonlp/CulturaX), we present CulturaY, another substantial multilingual dataset of 15TB (uncompressed)/3TB (zstd-compressed) that applies the same dataset cleaning methodology to the [HPLT v1.1](https://hplt-project.org/datasets/v1.1) dataset. Please note that [HPLT v1.2](https://hplt-project.org/datasets/v1.2) has also been released and is an alternative version with different cleaning methodologies.

This data was used in part to train our SOTA Vietnamese model: [Vistral-7B-Chat](https://huggingface.co/Viet-Mistral/Vistral-7B-Chat).

Our annotations and arrangements are licensed under CC-BY-4.0, and we make the data available for fair use machine learning research, but we make no claims as to the underlying copyrights of the work. This data was copied from the HPLT project, which in turn used the data from Common Crawl and the Internet Archive.

### Acknowledgement

We thank our collaborators at [UONLP - The Natural Language Processing Group at the University of Oregon](http://nlp.uoregon.edu/), and the managers of the Karolina Supercomputers for providing computing resources. We also thank our friends at [TurkuNLP](https://turkunlp.org) for their support.

### Data Breakdown

There are 75 languages, with the following breakdown:

|    | Code | Language | # Documents | # Documents (%) | Size (GB) |
|----:|:------|:-------------------|:-------------|:-------|:---------|
| 0 | en | English | 523,235,685 | 43.84 | 1244.39 |
| 1 | zh | Chinese | 172,023,436 | 14.41 | 290.91 |
| 2 | ru | Russian | 59,185,035 | 4.96 | 424.55 |
| 3 | es | Spanish | 49,193,764 | 4.12 | 116.20 |
| 4 | de | German | 35,204,652 | 2.95 | 78.32 |
| 5 | fr | French | 33,063,792 | 2.77 | 69.66 |
| 6 | ja | Japanese | 27,641,765 | 2.32 | 74.71 |
| 7 | ko | Korean | 26,925,013 | 2.26 | 25.50 |
| 8 | it | Italian | 22,396,067 | 1.88 | 48.30 |
| 9 | pt | Portuguese | 18,367,640 | 1.54 | 39.09 |
| 10 | th | Thai | 16,330,227 | 1.37 | 32.09 |
| 11 | da | Danish | 13,547,169 | 1.13 | 18.40 |
| 12 | sv | Swedish | 13,049,359 | 1.09 | 19.29 |
| 13 | tr | Turkish | 12,659,104 | 1.06 | 29.14 |
| 14 | nl | Dutch | 12,454,669 | 1.04 | 22.58 |
| 15 | pl | Polish | 12,054,997 | 1.01 | 27.09 |
| 16 | hu | Hungarian | 11,939,984 | 1.00 | 17.63 |
| 17 | ro | Romanian | 11,578,945 | 0.97 | 18.57 |
| 18 | hbs | Serbo-Croatian | 8,880,450 | 0.74 | 14.65 |
| 19 | id | Indonesian | 8,473,141 | 0.71 | 16.23 |
| 20 | bg | Bulgarian | 6,698,866 | 0.56 | 18.63 |
| 21 | el | Greek | 6,674,496 | 0.56 | 29.61 |
| 22 | ar | Arabic | 6,427,386 | 0.54 | 28.04 |
| 23 | nb | Norwegian Bokmål | 5,925,942 | 0.50 | 10.14 |
| 24 | fi | Finnish | 5,379,100 | 0.45 | 10.08 |
| 25 | he | Hebrew | 5,320,279 | 0.45 | 12.06 |
| 26 | uk | Ukrainian | 5,311,749 | 0.45 | 31.55 |
| 27 | cs | Czech | 5,248,678 | 0.44 | 12.83 |
| 28 | fa | Persian | 5,111,868 | 0.43 | 26.23 |
| 29 | ms | Malay | 4,888,894 | 0.41 | 9.09 |
| 30 | sk | Slovak | 4,758,917 | 0.40 | 5.50 |
| 31 | ca | Catalan | 4,552,579 | 0.38 | 7.96 |
| 32 | vi | Vietnamese | 4,493,567 | 0.38 | 16.95 |
| 33 | hi | Hindi | 4,200,330 | 0.35 | 11.56 |
| 34 | bn | Bangla | 2,785,980 | 0.23 | 4.76 |
| 35 | lt | Lithuanian | 2,509,788 | 0.21 | 3.83 |
| 36 | sl | Slovenian | 2,252,359 | 0.19 | 3.21 |
| 37 | la | Latin | 2,147,688 | 0.18 | 1.42 |
| 38 | et | Estonian | 1,754,719 | 0.15 | 2.88 |
| 39 | az | Azerbaijani | 1,554,357 | 0.13 | 1.95 |
| 40 | lv | Latvian | 1,469,245 | 0.12 | 2.19 |
| 41 | ur | Urdu | 1,251,414 | 0.10 | 2.84 |
| 42 | ta | Tamil | 1,128,321 | 0.09 | 7.21 |
| 43 | gl | Galician | 1,101,337 | 0.09 | 1.31 |
| 44 | sq | Albanian | 1,081,763 | 0.09 | 1.73 |
| 45 | ne | Nepali | 860,657 | 0.07 | 1.91 |
| 46 | mk | Macedonian | 641,111 | 0.05 | 1.61 |
| 47 | af | Afrikaans | 636,976 | 0.05 | 0.77 |
| 48 | tl | Filipino | 575,221 | 0.05 | 1.09 |
| 49 | sw | Swahili | 571,247 | 0.05 | 0.60 |
| 50 | eu | Basque | 559,194 | 0.05 | 0.67 |
| 51 | is | Icelandic | 529,777 | 0.04 | 0.81 |
| 52 | ka | Georgian | 524,645 | 0.04 | 1.48 |
| 53 | hy | Armenian | 519,060 | 0.04 | 1.46 |
| 54 | my | Burmese | 513,729 | 0.04 | 1.91 |
| 55 | nn | Norwegian Nynorsk | 509,287 | 0.04 | 0.49 |
| 56 | ml | Malayalam | 487,912 | 0.04 | 2.02 |
| 57 | mn | Mongolian | 448,211 | 0.04 | 1.79 |
| 58 | be | Belarusian | 426,194 | 0.04 | 1.48 |
| 59 | uz | Uzbek | 423,865 | 0.04 | 1.19 |
| 60 | mr | Marathi | 398,138 | 0.03 | 1.28 |
| 61 | si | Sinhala | 337,785 | 0.03 | 1.55 |
| 62 | te | Telugu | 279,240 | 0.02 | 1.00 |
| 63 | kk | Kazakh | 274,770 | 0.02 | 1.07 |
| 64 | mt | Maltese | 265,605 | 0.02 | 0.90 |
| 65 | so | Somali | 261,100 | 0.02 | 0.24 |
| 66 | gu | Gujarati | 242,074 | 0.02 | 0.74 |
| 67 | kn | Kannada | 231,260 | 0.02 | 0.71 |
| 68 | cy | Welsh | 179,157 | 0.02 | 0.20 |
| 69 | ga | Irish | 134,796 | 0.01 | 0.15 |
| 70 | tt | Tatar | 131,731 | 0.01 | 0.41 |
| 71 | pa | Punjabi | 119,686 | 0.01 | 0.29 |
| 72 | eo | Esperanto | 114,598 | 0.01 | 0.17 |
| 73 | ps | Pashto | 99,783 | 0.01 | 0.23 |
| 74 | ky | Kyrgyz | 86,551 | 0.01 | 0.31 |

### Dataset structure

The dataset has a total of 6 columns:

- `text` and `url` are the two main columns of the dataset.
- The remaining columns `id`, `document_lang`, `scores`, and `langs` come from the original documents in the HPLT v1.1 dataset. They are retained for debugging purposes and will be removed in the future, so please use only the `text` and `url` columns.

### Process for Creating CulturaY

To create CulturaY, we started from the HPLT dataset (version 1.1). This is a notable difference between CulturaX and CulturaY: CulturaX was generated by cleaning data from Common Crawl (mC4, OSCAR), whereas CulturaY was generated by cleaning raw data from the Internet Archive (HPLT). Common Crawl is quite popular, while data from the Internet Archive is less well known and less exploited, even though the data from both sources are similar. HPLT and CulturaY could be considered the first publicly released datasets originating from the Internet Archive. Using CulturaX and CulturaY together will give your model a more diverse set of data sources.

Our pipeline is based on BLOOM's data cleaning pipeline: each document in the dataset is evaluated against criteria such as document length, perplexity, and bad-words ratio, and documents that perform poorly on any of these criteria are removed. See our [Blog](https://www.ontocord.ai/blog/cultura-y) for more details.

### Citation

To cite CulturaY, please use:

```
@misc{nguyen2024culturay,
  title={CulturaY: A Large Cleaned Multilingual Dataset of 75 Languages},
  author={Thuat Nguyen and Huu Nguyen and Thien Nguyen},
  year={2024},
}
```
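
The column guidance and the "fail any criterion, drop the document" rule described above can be illustrated with a small sketch. This is not the CulturaY pipeline itself: the threshold values, the `BAD_WORDS` list, and the caller-supplied `perplexity_of` callback are all hypothetical placeholders chosen for illustration, and the actual criteria and cut-offs are described in the blog post.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator

# Per the dataset card, only these two columns should be relied upon.
MAIN_COLUMNS = ("text", "url")

# Placeholder word list (assumption); a real pipeline would use per-language lists.
BAD_WORDS = frozenset({"badword"})


@dataclass(frozen=True)
class Thresholds:
    """Hypothetical cut-offs, not the values used to build CulturaY."""
    min_chars: int = 50
    max_perplexity: float = 1000.0
    max_bad_word_ratio: float = 0.05


def keep_main_columns(record: dict) -> dict:
    """Drop the debugging columns (id, document_lang, scores, langs)."""
    return {name: record[name] for name in MAIN_COLUMNS}


def bad_word_ratio(text: str) -> float:
    """Fraction of whitespace-separated tokens found in BAD_WORDS."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(token in BAD_WORDS for token in tokens) / len(tokens)


def passes_all_criteria(text: str, perplexity: float, t: Thresholds) -> bool:
    """A document is kept only if it satisfies every criterion."""
    return (
        len(text) >= t.min_chars
        and perplexity <= t.max_perplexity
        and bad_word_ratio(text) <= t.max_bad_word_ratio
    )


def clean(
    records: Iterable[dict],
    perplexity_of: Callable[[str], float],
    thresholds: Thresholds = Thresholds(),
) -> Iterator[dict]:
    """Yield text/url pairs for documents that pass all criteria."""
    for record in records:
        doc = keep_main_columns(record)
        if passes_all_criteria(doc["text"], perplexity_of(doc["text"]), thresholds):
            yield doc
```

In practice `perplexity_of` would wrap a language model scorer; here it is left as a parameter so the sketch stays self-contained and the filtering logic can be checked with any stand-in function.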

Use cases: