
40u1d10t/to-train-my-model

Hugging Face · Updated 2026-04-13 · Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/40u1d10t/to-train-my-model
Resource description:
---
configs:
- config_name: af
  data_files: 'af/*.jsonl.zst'
- config_name: ar
  data_files: 'ar/*.jsonl.zst'
- config_name: az
  data_files: 'az/*.jsonl.zst'
- config_name: be
  data_files: 'be/*.jsonl.zst'
- config_name: bg
  data_files: 'bg/*.jsonl.zst'
- config_name: bn
  data_files: 'bn/*.jsonl.zst'
- config_name: ca
  data_files: 'ca/*.jsonl.zst'
- config_name: cs
  data_files: 'cs/*.jsonl.zst'
- config_name: cy
  data_files: 'cy/*.jsonl.zst'
- config_name: da
  data_files: 'da/*.jsonl.zst'
- config_name: de
  data_files: 'de/*.jsonl.zst'
- config_name: el
  data_files: 'el/*.jsonl.zst'
- config_name: en
  data_files: 'en/*.jsonl.zst'
- config_name: eo
  data_files: 'eo/*.jsonl.zst'
- config_name: es
  data_files: 'es/*.jsonl.zst'
- config_name: et
  data_files: 'et/*.jsonl.zst'
- config_name: eu
  data_files: 'eu/*.jsonl.zst'
- config_name: fa
  data_files: 'fa/*.jsonl.zst'
- config_name: fi
  data_files: 'fi/*.jsonl.zst'
- config_name: fr
  data_files: 'fr/*.jsonl.zst'
- config_name: ga
  data_files: 'ga/*.jsonl.zst'
- config_name: gl
  data_files: 'gl/*.jsonl.zst'
- config_name: gu
  data_files: 'gu/*.jsonl.zst'
- config_name: hbs
  data_files: 'hbs/*.jsonl.zst'
- config_name: he
  data_files: 'he/*.jsonl.zst'
- config_name: hi
  data_files: 'hi/*.jsonl.zst'
- config_name: hu
  data_files: 'hu/*.jsonl.zst'
- config_name: hy
  data_files: 'hy/*.jsonl.zst'
- config_name: id
  data_files: 'id/*.jsonl.zst'
- config_name: is
  data_files: 'is/*.jsonl.zst'
- config_name: it
  data_files: 'it/*.jsonl.zst'
- config_name: ja
  data_files: 'ja/*.jsonl.zst'
- config_name: ka
  data_files: 'ka/*.jsonl.zst'
- config_name: kk
  data_files: 'kk/*.jsonl.zst'
- config_name: kn
  data_files: 'kn/*.jsonl.zst'
- config_name: ko
  data_files: 'ko/*.jsonl.zst'
- config_name: ky
  data_files: 'ky/*.jsonl.zst'
- config_name: la
  data_files: 'la/*.jsonl.zst'
- config_name: lt
  data_files: 'lt/*.jsonl.zst'
- config_name: lv
  data_files: 'lv/*.jsonl.zst'
- config_name: mk
  data_files: 'mk/*.jsonl.zst'
- config_name: ml
  data_files: 'ml/*.jsonl.zst'
- config_name: mn
  data_files: 'mn/*.jsonl.zst'
- config_name: mr
  data_files: 'mr/*.jsonl.zst'
- config_name: ms
  data_files: 'ms/*.jsonl.zst'
- config_name: mt
  data_files: 'mt/*.jsonl.zst'
- config_name: my
  data_files: 'my/*.jsonl.zst'
- config_name: nb
  data_files: 'nb/*.jsonl.zst'
- config_name: ne
  data_files: 'ne/*.jsonl.zst'
- config_name: nl
  data_files: 'nl/*.jsonl.zst'
- config_name: nn
  data_files: 'nn/*.jsonl.zst'
- config_name: pa
  data_files: 'pa/*.jsonl.zst'
- config_name: pl
  data_files: 'pl/*.jsonl.zst'
- config_name: ps
  data_files: 'ps/*.jsonl.zst'
- config_name: pt
  data_files: 'pt/*.jsonl.zst'
- config_name: ro
  data_files: 'ro/*.jsonl.zst'
- config_name: ru
  data_files: 'ru/*.jsonl.zst'
- config_name: si
  data_files: 'si/*.jsonl.zst'
- config_name: sk
  data_files: 'sk/*.jsonl.zst'
- config_name: sl
  data_files: 'sl/*.jsonl.zst'
- config_name: so
  data_files: 'so/*.jsonl.zst'
- config_name: sq
  data_files: 'sq/*.jsonl.zst'
- config_name: sv
  data_files: 'sv/*.jsonl.zst'
- config_name: sw
  data_files: 'sw/*.jsonl.zst'
- config_name: ta
  data_files: 'ta/*.jsonl.zst'
- config_name: te
  data_files: 'te/*.jsonl.zst'
- config_name: th
  data_files: 'th/*.jsonl.zst'
- config_name: tl
  data_files: 'tl/*.jsonl.zst'
- config_name: tr
  data_files: 'tr/*.jsonl.zst'
- config_name: tt
  data_files: 'tt/*.jsonl.zst'
- config_name: uk
  data_files: 'uk/*.jsonl.zst'
- config_name: ur
  data_files: 'ur/*.jsonl.zst'
- config_name: uz
  data_files: 'uz/*.jsonl.zst'
- config_name: vi
  data_files: 'vi/*.jsonl.zst'
- config_name: zh
  data_files: 'zh/*.jsonl.zst'
pretty_name: CulturaY
annotations_creators:
- no-annotation
language_creators:
- found
language:
- af
- ar
- az
- be
- bg
- bn
- ca
- cs
- cy
- da
- de
- el
- en
- eo
- es
- et
- eu
- fa
- fi
- fr
- ga
- gl
- gu
- hbs
- he
- hi
- hu
- hy
- id
- is
- it
- ja
- ka
- kk
- kn
- ko
- ky
- la
- lt
- lv
- mk
- ml
- mn
- mr
- ms
- mt
- my
- nb
- ne
- nl
- nn
- pa
- pl
- ps
- pt
- ro
- ru
- si
- sk
- sl
- so
- sq
- sv
- sw
- ta
- te
- th
- tl
- tr
- tt
- uk
- ur
- uz
- vi
- zh
multilinguality:
- multilingual
size_categories:
- n<1K
- 1K<n<10K
- 10K<n<100K
- 100K<n<1M
- 1M<n<10M
- 10M<n<100M
- 100M<n<1B
- 1B<n<10B
source_datasets:
- original
task_categories:
- text-generation
- fill-mask
task_ids:
- language-modeling
- masked-language-modeling
license: cc-by-4.0
extra_gated_prompt: "By completing the form below, you acknowledge that the provided data is offered as is. Although we anticipate no problems, you accept full responsibility for any repercussions resulting from the use of this data. Furthermore, you agree that the data must not be utilized for malicious or harmful purposes towards humanity."
extra_gated_fields:
  Name: text
  Email: text
  Affiliation: text
  Country: text
  Usecase: text
  I have explicitly check with my jurisdiction and I confirm that downloading CulturaY is legal in the country/region where I am located right now, and for the use case that I have described above: checkbox
  You agree to not attempt to determine the identity of individuals in this dataset: checkbox
---

## CulturaY: A Large Cleaned Multilingual Dataset of 75 Languages

### Dataset Summary

From the team that brought you [CulturaX](https://huggingface.co/datasets/uonlp/CulturaX), we present CulturaY, another substantial multilingual dataset of 15TB (uncompressed)/3TB (zstd-compressed) that applies the same dataset cleaning methodology to the [HPLT v1.1](https://hplt-project.org/datasets/v1.1) dataset. Please note that [HPLT v1.2](https://hplt-project.org/datasets/v1.2) has also been released and is an alternative version with different cleaning methodologies. This data was used in part to train our SOTA Vietnamese model: [Vistral-7B-Chat](https://huggingface.co/Viet-Mistral/Vistral-7B-Chat).

Our annotations and arrangements are licensed under CC-BY-4.0, and we make the data available for fair use machine learning research. But we make no claims as to the underlying copyrights of the work.
This data was copied from the HPLT project, which in turn used the data from Common Crawl and the Internet Archive.

### Acknowledgement

We thank our collaborators at [UONLP - The Natural Language Processing Group at the University of Oregon](http://nlp.uoregon.edu/), and the managers of the Karolina Supercomputers for their computing resources. We also thank our friends at [TurkuNLP](https://turkunlp.org) for their support.

### Data Breakdown:

There are 75 languages, with the following breakdown:

| | Code | Language | # Documents | # Documents (%) | Size (GB) |
|----:|:------|:-------------------|:-------------|:-------|:---------|
| 0 | en | English | 523,235,685 | 43.84 | 1244.39 |
| 1 | zh | Chinese | 172,023,436 | 14.41 | 290.91 |
| 2 | ru | Russian | 59,185,035 | 4.96 | 424.55 |
| 3 | es | Spanish | 49,193,764 | 4.12 | 116.20 |
| 4 | de | German | 35,204,652 | 2.95 | 78.32 |
| 5 | fr | French | 33,063,792 | 2.77 | 69.66 |
| 6 | ja | Japanese | 27,641,765 | 2.32 | 74.71 |
| 7 | ko | Korean | 26,925,013 | 2.26 | 25.50 |
| 8 | it | Italian | 22,396,067 | 1.88 | 48.30 |
| 9 | pt | Portuguese | 18,367,640 | 1.54 | 39.09 |
| 10 | th | Thai | 16,330,227 | 1.37 | 32.09 |
| 11 | da | Danish | 13,547,169 | 1.13 | 18.40 |
| 12 | sv | Swedish | 13,049,359 | 1.09 | 19.29 |
| 13 | tr | Turkish | 12,659,104 | 1.06 | 29.14 |
| 14 | nl | Dutch | 12,454,669 | 1.04 | 22.58 |
| 15 | pl | Polish | 12,054,997 | 1.01 | 27.09 |
| 16 | hu | Hungarian | 11,939,984 | 1.00 | 17.63 |
| 17 | ro | Romanian | 11,578,945 | 0.97 | 18.57 |
| 18 | hbs | Serbo-Croatian | 8,880,450 | 0.74 | 14.65 |
| 19 | id | Indonesian | 8,473,141 | 0.71 | 16.23 |
| 20 | bg | Bulgarian | 6,698,866 | 0.56 | 18.63 |
| 21 | el | Greek | 6,674,496 | 0.56 | 29.61 |
| 22 | ar | Arabic | 6,427,386 | 0.54 | 28.04 |
| 23 | nb | Norwegian Bokmål | 5,925,942 | 0.50 | 10.14 |
| 24 | fi | Finnish | 5,379,100 | 0.45 | 10.08 |
| 25 | he | Hebrew | 5,320,279 | 0.45 | 12.06 |
| 26 | uk | Ukrainian | 5,311,749 | 0.45 | 31.55 |
| 27 | cs | Czech | 5,248,678 | 0.44 | 12.83 |
| 28 | fa | Persian | 5,111,868 | 0.43 | 26.23 |
| 29 | ms | Malay | 4,888,894 | 0.41 | 9.09 |
| 30 | sk | Slovak | 4,758,917 | 0.40 | 5.50 |
| 31 | ca | Catalan | 4,552,579 | 0.38 | 7.96 |
| 32 | vi | Vietnamese | 4,493,567 | 0.38 | 16.95 |
| 33 | hi | Hindi | 4,200,330 | 0.35 | 11.56 |
| 34 | bn | Bangla | 2,785,980 | 0.23 | 4.76 |
| 35 | lt | Lithuanian | 2,509,788 | 0.21 | 3.83 |
| 36 | sl | Slovenian | 2,252,359 | 0.19 | 3.21 |
| 37 | la | Latin | 2,147,688 | 0.18 | 1.42 |
| 38 | et | Estonian | 1,754,719 | 0.15 | 2.88 |
| 39 | az | Azerbaijani | 1,554,357 | 0.13 | 1.95 |
| 40 | lv | Latvian | 1,469,245 | 0.12 | 2.19 |
| 41 | ur | Urdu | 1,251,414 | 0.10 | 2.84 |
| 42 | ta | Tamil | 1,128,321 | 0.09 | 7.21 |
| 43 | gl | Galician | 1,101,337 | 0.09 | 1.31 |
| 44 | sq | Albanian | 1,081,763 | 0.09 | 1.73 |
| 45 | ne | Nepali | 860,657 | 0.07 | 1.91 |
| 46 | mk | Macedonian | 641,111 | 0.05 | 1.61 |
| 47 | af | Afrikaans | 636,976 | 0.05 | 0.77 |
| 48 | tl | Filipino | 575,221 | 0.05 | 1.09 |
| 49 | sw | Swahili | 571,247 | 0.05 | 0.60 |
| 50 | eu | Basque | 559,194 | 0.05 | 0.67 |
| 51 | is | Icelandic | 529,777 | 0.04 | 0.81 |
| 52 | ka | Georgian | 524,645 | 0.04 | 1.48 |
| 53 | hy | Armenian | 519,060 | 0.04 | 1.46 |
| 54 | my | Burmese | 513,729 | 0.04 | 1.91 |
| 55 | nn | Norwegian Nynorsk | 509,287 | 0.04 | 0.49 |
| 56 | ml | Malayalam | 487,912 | 0.04 | 2.02 |
| 57 | mn | Mongolian | 448,211 | 0.04 | 1.79 |
| 58 | be | Belarusian | 426,194 | 0.04 | 1.48 |
| 59 | uz | Uzbek | 423,865 | 0.04 | 1.19 |
| 60 | mr | Marathi | 398,138 | 0.03 | 1.28 |
| 61 | si | Sinhala | 337,785 | 0.03 | 1.55 |
| 62 | te | Telugu | 279,240 | 0.02 | 1.00 |
| 63 | kk | Kazakh | 274,770 | 0.02 | 1.07 |
| 64 | mt | Maltese | 265,605 | 0.02 | 0.90 |
| 65 | so | Somali | 261,100 | 0.02 | 0.24 |
| 66 | gu | Gujarati | 242,074 | 0.02 | 0.74 |
| 67 | kn | Kannada | 231,260 | 0.02 | 0.71 |
| 68 | cy | Welsh | 179,157 | 0.02 | 0.20 |
| 69 | ga | Irish | 134,796 | 0.01 | 0.15 |
| 70 | tt | Tatar | 131,731 | 0.01 | 0.41 |
| 71 | pa | Punjabi | 119,686 | 0.01 | 0.29 |
| 72 | eo | Esperanto | 114,598 | 0.01 | 0.17 |
| 73 | ps | Pashto | 99,783 | 0.01 | 0.23 |
| 74 | ky | Kyrgyz | 86,551 | 0.01 | 0.31 |

### Dataset structure

The dataset has a total of 6 columns:

- The 2 columns `text` and `url` are the two main columns in this dataset.
- The remaining columns `id`, `document_lang`, `scores`, and `langs` belong to the original document in the HPLT v1.1 dataset. They are retained for debugging purposes and will be removed in the future.

Therefore, please use only the `text` and `url` columns.

### Process for Creating CulturaY

To create CulturaY, we began with the HPLT dataset (version 1.1). This is also a notable difference between CulturaX and CulturaY. While CulturaX was generated by cleaning data from Common Crawl (mC4, OSCAR), CulturaY was generated by cleaning raw data from the Internet Archive (HPLT). While Common Crawl is quite popular, data from the Internet Archive is less well known and less exploited, even though the data from both sources are similar. HPLT and CulturaY could be considered the first publicly released datasets originating from the Internet Archive. Using CulturaX and CulturaY together will give your model a more diverse source of data.

Our pipeline is based on BLOOM's data cleaning pipeline: it evaluates each document in the dataset against criteria such as document length, perplexity, and bad-word ratio, and removes documents that score poorly on any of them. See our [Blog](https://www.ontocord.ai/blog/cultura-y) for more details.

### Citation

To cite CulturaY, please use:

```
@misc{nguyen2024culturay,
  title={CulturaY: A Large Cleaned Multilingual Dataset of 75 Languages},
  author={Thuat Nguyen and Huu Nguyen and Thien Nguyen},
  year={2024},
}
```
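
As a rough illustration of the cleaning idea described under "Process for Creating CulturaY", together with the advice to use only the `text` and `url` columns, the sketch below filters a few JSONL records with two simple criteria. It is not the actual CulturaY pipeline: the thresholds, the bad-word list, the helper names, and the sample records are all hypothetical, and the perplexity criterion is left out because it needs a language model.

```python
import json

# Hypothetical thresholds and word list, chosen only for illustration;
# the values actually used for CulturaY are not stated in this card.
MIN_WORDS = 5
MAX_BAD_WORD_RATIO = 0.2
BAD_WORDS = {"spamword", "badword"}


def bad_word_ratio(text: str) -> float:
    """Fraction of whitespace-separated tokens that appear in BAD_WORDS."""
    tokens = text.lower().split()
    if not tokens:
        return 1.0
    return sum(tok in BAD_WORDS for tok in tokens) / len(tokens)


def keep_document(text: str) -> bool:
    """Reject a document as soon as it fails any single criterion."""
    if len(text.split()) < MIN_WORDS:
        return False
    if bad_word_ratio(text) > MAX_BAD_WORD_RATIO:
        return False
    return True


def clean_jsonl(lines):
    """Yield {'text', 'url'} records for documents that pass every check."""
    for line in lines:
        record = json.loads(line)
        if keep_document(record["text"]):
            # Keep only the two main columns; the debugging columns are dropped.
            yield {"text": record["text"], "url": record["url"]}


# Made-up records that only mimic the column names listed in the card.
sample = [
    json.dumps({"id": 1, "text": "A short but perfectly clean example document.",
                "url": "https://example.com/a", "document_lang": "en"}),
    json.dumps({"id": 2, "text": "too short", "url": "https://example.com/b"}),
    json.dumps({"id": 3, "text": "spamword badword spamword and more spamword here",
                "url": "https://example.com/c"}),
]
cleaned = list(clean_jsonl(sample))
print(cleaned)
```

Only the first record survives here: the second is below the word-count threshold and the third exceeds the bad-word ratio. In a real run the input lines would come from the decompressed `*.jsonl.zst` files rather than an in-memory list.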
Provider:
40u1d10t