40u1d10t/to-train-my-model
Hugging Face · Updated 2026-04-13 · Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/40u1d10t/to-train-my-model
Description:
---
configs:
- config_name: af
  data_files: 'af/*.jsonl.zst'
- config_name: ar
  data_files: 'ar/*.jsonl.zst'
- config_name: az
  data_files: 'az/*.jsonl.zst'
- config_name: be
  data_files: 'be/*.jsonl.zst'
- config_name: bg
  data_files: 'bg/*.jsonl.zst'
- config_name: bn
  data_files: 'bn/*.jsonl.zst'
- config_name: ca
  data_files: 'ca/*.jsonl.zst'
- config_name: cs
  data_files: 'cs/*.jsonl.zst'
- config_name: cy
  data_files: 'cy/*.jsonl.zst'
- config_name: da
  data_files: 'da/*.jsonl.zst'
- config_name: de
  data_files: 'de/*.jsonl.zst'
- config_name: el
  data_files: 'el/*.jsonl.zst'
- config_name: en
  data_files: 'en/*.jsonl.zst'
- config_name: eo
  data_files: 'eo/*.jsonl.zst'
- config_name: es
  data_files: 'es/*.jsonl.zst'
- config_name: et
  data_files: 'et/*.jsonl.zst'
- config_name: eu
  data_files: 'eu/*.jsonl.zst'
- config_name: fa
  data_files: 'fa/*.jsonl.zst'
- config_name: fi
  data_files: 'fi/*.jsonl.zst'
- config_name: fr
  data_files: 'fr/*.jsonl.zst'
- config_name: ga
  data_files: 'ga/*.jsonl.zst'
- config_name: gl
  data_files: 'gl/*.jsonl.zst'
- config_name: gu
  data_files: 'gu/*.jsonl.zst'
- config_name: hbs
  data_files: 'hbs/*.jsonl.zst'
- config_name: he
  data_files: 'he/*.jsonl.zst'
- config_name: hi
  data_files: 'hi/*.jsonl.zst'
- config_name: hu
  data_files: 'hu/*.jsonl.zst'
- config_name: hy
  data_files: 'hy/*.jsonl.zst'
- config_name: id
  data_files: 'id/*.jsonl.zst'
- config_name: is
  data_files: 'is/*.jsonl.zst'
- config_name: it
  data_files: 'it/*.jsonl.zst'
- config_name: ja
  data_files: 'ja/*.jsonl.zst'
- config_name: ka
  data_files: 'ka/*.jsonl.zst'
- config_name: kk
  data_files: 'kk/*.jsonl.zst'
- config_name: kn
  data_files: 'kn/*.jsonl.zst'
- config_name: ko
  data_files: 'ko/*.jsonl.zst'
- config_name: ky
  data_files: 'ky/*.jsonl.zst'
- config_name: la
  data_files: 'la/*.jsonl.zst'
- config_name: lt
  data_files: 'lt/*.jsonl.zst'
- config_name: lv
  data_files: 'lv/*.jsonl.zst'
- config_name: mk
  data_files: 'mk/*.jsonl.zst'
- config_name: ml
  data_files: 'ml/*.jsonl.zst'
- config_name: mn
  data_files: 'mn/*.jsonl.zst'
- config_name: mr
  data_files: 'mr/*.jsonl.zst'
- config_name: ms
  data_files: 'ms/*.jsonl.zst'
- config_name: mt
  data_files: 'mt/*.jsonl.zst'
- config_name: my
  data_files: 'my/*.jsonl.zst'
- config_name: nb
  data_files: 'nb/*.jsonl.zst'
- config_name: ne
  data_files: 'ne/*.jsonl.zst'
- config_name: nl
  data_files: 'nl/*.jsonl.zst'
- config_name: nn
  data_files: 'nn/*.jsonl.zst'
- config_name: pa
  data_files: 'pa/*.jsonl.zst'
- config_name: pl
  data_files: 'pl/*.jsonl.zst'
- config_name: ps
  data_files: 'ps/*.jsonl.zst'
- config_name: pt
  data_files: 'pt/*.jsonl.zst'
- config_name: ro
  data_files: 'ro/*.jsonl.zst'
- config_name: ru
  data_files: 'ru/*.jsonl.zst'
- config_name: si
  data_files: 'si/*.jsonl.zst'
- config_name: sk
  data_files: 'sk/*.jsonl.zst'
- config_name: sl
  data_files: 'sl/*.jsonl.zst'
- config_name: so
  data_files: 'so/*.jsonl.zst'
- config_name: sq
  data_files: 'sq/*.jsonl.zst'
- config_name: sv
  data_files: 'sv/*.jsonl.zst'
- config_name: sw
  data_files: 'sw/*.jsonl.zst'
- config_name: ta
  data_files: 'ta/*.jsonl.zst'
- config_name: te
  data_files: 'te/*.jsonl.zst'
- config_name: th
  data_files: 'th/*.jsonl.zst'
- config_name: tl
  data_files: 'tl/*.jsonl.zst'
- config_name: tr
  data_files: 'tr/*.jsonl.zst'
- config_name: tt
  data_files: 'tt/*.jsonl.zst'
- config_name: uk
  data_files: 'uk/*.jsonl.zst'
- config_name: ur
  data_files: 'ur/*.jsonl.zst'
- config_name: uz
  data_files: 'uz/*.jsonl.zst'
- config_name: vi
  data_files: 'vi/*.jsonl.zst'
- config_name: zh
  data_files: 'zh/*.jsonl.zst'
pretty_name: CulturaY
annotations_creators:
- no-annotation
language_creators:
- found
language:
- af
- ar
- az
- be
- bg
- bn
- ca
- cs
- cy
- da
- de
- el
- en
- eo
- es
- et
- eu
- fa
- fi
- fr
- ga
- gl
- gu
- hbs
- he
- hi
- hu
- hy
- id
- is
- it
- ja
- ka
- kk
- kn
- ko
- ky
- la
- lt
- lv
- mk
- ml
- mn
- mr
- ms
- mt
- my
- nb
- ne
- nl
- nn
- pa
- pl
- ps
- pt
- ro
- ru
- si
- sk
- sl
- so
- sq
- sv
- sw
- ta
- te
- th
- tl
- tr
- tt
- uk
- ur
- uz
- vi
- zh
multilinguality:
- multilingual
size_categories:
- n<1K
- 1K<n<10K
- 10K<n<100K
- 100K<n<1M
- 1M<n<10M
- 10M<n<100M
- 100M<n<1B
- 1B<n<10B
source_datasets:
- original
task_categories:
- text-generation
- fill-mask
task_ids:
- language-modeling
- masked-language-modeling
license: cc-by-4.0
extra_gated_prompt: "By completing the form below, you acknowledge that the provided data is offered as is. Although we anticipate no problems, you accept full responsibility for any repercussions resulting from the use of this data. Furthermore, you agree that the data must not be utilized for malicious or harmful purposes towards humanity."
extra_gated_fields:
  Name: text
  Email: text
  Affiliation: text
  Country: text
  Usecase: text
  I have explicitly checked with my jurisdiction and I confirm that downloading CulturaY is legal in the country/region where I am located right now, and for the use case that I have described above: checkbox
  You agree to not attempt to determine the identity of individuals in this dataset: checkbox
---
## CulturaY: A Large Cleaned Multilingual Dataset of 75 Languages
### Dataset Summary
From the team that brought you [CulturaX](https://huggingface.co/datasets/uonlp/CulturaX), we present CulturaY, another substantial multilingual dataset of 15TB (uncompressed)/3TB (zstd-compressed) that applies the same dataset cleaning methodology to the [HPLT v1.1](https://hplt-project.org/datasets/v1.1) dataset.
Please note that [HPLT v1.2](https://hplt-project.org/datasets/v1.2) has also been released and is an alternative version with different cleaning methodologies.
This data was used in part to train our SOTA Vietnamese model: [Vistral-7B-Chat](https://huggingface.co/Viet-Mistral/Vistral-7B-Chat).
Our annotations and arrangements are licensed under CC-BY-4.0, and we make the data available for fair-use machine learning research.
However, we make no claims as to the underlying copyrights of the work. This data was copied from the HPLT project, which in turn used data from Common Crawl and the Internet Archive.
### Acknowledgement
We thank our collaborators at [UONLP - The Natural Language Processing Group at the University of Oregon](http://nlp.uoregon.edu/), and the managers of the Karolina Supercomputers for the computing resources.
We also thank our friends at [TurkuNLP](https://turkunlp.org) for their support.
### Data Breakdown:
There are 75 languages, with the following breakdown:
| | Code | Language | # Documents | # Documents (%) | Size (GB) |
|----:|:------|:-------------------|:-------------|:-------|:---------|
| 0 | en | English | 523,235,685 | 43.84 | 1244.39 |
| 1 | zh | Chinese | 172,023,436 | 14.41 | 290.91 |
| 2 | ru | Russian | 59,185,035 | 4.96 | 424.55 |
| 3 | es | Spanish | 49,193,764 | 4.12 | 116.20 |
| 4 | de | German | 35,204,652 | 2.95 | 78.32 |
| 5 | fr | French | 33,063,792 | 2.77 | 69.66 |
| 6 | ja | Japanese | 27,641,765 | 2.32 | 74.71 |
| 7 | ko | Korean | 26,925,013 | 2.26 | 25.50 |
| 8 | it | Italian | 22,396,067 | 1.88 | 48.30 |
| 9 | pt | Portuguese | 18,367,640 | 1.54 | 39.09 |
| 10 | th | Thai | 16,330,227 | 1.37 | 32.09 |
| 11 | da | Danish | 13,547,169 | 1.13 | 18.40 |
| 12 | sv | Swedish | 13,049,359 | 1.09 | 19.29 |
| 13 | tr | Turkish | 12,659,104 | 1.06 | 29.14 |
| 14 | nl | Dutch | 12,454,669 | 1.04 | 22.58 |
| 15 | pl | Polish | 12,054,997 | 1.01 | 27.09 |
| 16 | hu | Hungarian | 11,939,984 | 1.00 | 17.63 |
| 17 | ro | Romanian | 11,578,945 | 0.97 | 18.57 |
| 18 | hbs | Serbo-Croatian | 8,880,450 | 0.74 | 14.65 |
| 19 | id | Indonesian | 8,473,141 | 0.71 | 16.23 |
| 20 | bg | Bulgarian | 6,698,866 | 0.56 | 18.63 |
| 21 | el | Greek | 6,674,496 | 0.56 | 29.61 |
| 22 | ar | Arabic | 6,427,386 | 0.54 | 28.04 |
| 23 | nb | Norwegian Bokmål | 5,925,942 | 0.50 | 10.14 |
| 24 | fi | Finnish | 5,379,100 | 0.45 | 10.08 |
| 25 | he | Hebrew | 5,320,279 | 0.45 | 12.06 |
| 26 | uk | Ukrainian | 5,311,749 | 0.45 | 31.55 |
| 27 | cs | Czech | 5,248,678 | 0.44 | 12.83 |
| 28 | fa | Persian | 5,111,868 | 0.43 | 26.23 |
| 29 | ms | Malay | 4,888,894 | 0.41 | 9.09 |
| 30 | sk | Slovak | 4,758,917 | 0.40 | 5.50 |
| 31 | ca | Catalan | 4,552,579 | 0.38 | 7.96 |
| 32 | vi | Vietnamese | 4,493,567 | 0.38 | 16.95 |
| 33 | hi | Hindi | 4,200,330 | 0.35 | 11.56 |
| 34 | bn | Bangla | 2,785,980 | 0.23 | 4.76 |
| 35 | lt | Lithuanian | 2,509,788 | 0.21 | 3.83 |
| 36 | sl | Slovenian | 2,252,359 | 0.19 | 3.21 |
| 37 | la | Latin | 2,147,688 | 0.18 | 1.42 |
| 38 | et | Estonian | 1,754,719 | 0.15 | 2.88 |
| 39 | az | Azerbaijani | 1,554,357 | 0.13 | 1.95 |
| 40 | lv | Latvian | 1,469,245 | 0.12 | 2.19 |
| 41 | ur | Urdu | 1,251,414 | 0.10 | 2.84 |
| 42 | ta | Tamil | 1,128,321 | 0.09 | 7.21 |
| 43 | gl | Galician | 1,101,337 | 0.09 | 1.31 |
| 44 | sq | Albanian | 1,081,763 | 0.09 | 1.73 |
| 45 | ne | Nepali | 860,657 | 0.07 | 1.91 |
| 46 | mk | Macedonian | 641,111 | 0.05 | 1.61 |
| 47 | af | Afrikaans | 636,976 | 0.05 | 0.77 |
| 48 | tl | Filipino | 575,221 | 0.05 | 1.09 |
| 49 | sw | Swahili | 571,247 | 0.05 | 0.60 |
| 50 | eu | Basque | 559,194 | 0.05 | 0.67 |
| 51 | is | Icelandic | 529,777 | 0.04 | 0.81 |
| 52 | ka | Georgian | 524,645 | 0.04 | 1.48 |
| 53 | hy | Armenian | 519,060 | 0.04 | 1.46 |
| 54 | my | Burmese | 513,729 | 0.04 | 1.91 |
| 55 | nn | Norwegian Nynorsk | 509,287 | 0.04 | 0.49 |
| 56 | ml | Malayalam | 487,912 | 0.04 | 2.02 |
| 57 | mn | Mongolian | 448,211 | 0.04 | 1.79 |
| 58 | be | Belarusian | 426,194 | 0.04 | 1.48 |
| 59 | uz | Uzbek | 423,865 | 0.04 | 1.19 |
| 60 | mr | Marathi | 398,138 | 0.03 | 1.28 |
| 61 | si | Sinhala | 337,785 | 0.03 | 1.55 |
| 62 | te | Telugu | 279,240 | 0.02 | 1.00 |
| 63 | kk | Kazakh | 274,770 | 0.02 | 1.07 |
| 64 | mt | Maltese | 265,605 | 0.02 | 0.90 |
| 65 | so | Somali | 261,100 | 0.02 | 0.24 |
| 66 | gu | Gujarati | 242,074 | 0.02 | 0.74 |
| 67 | kn | Kannada | 231,260 | 0.02 | 0.71 |
| 68 | cy | Welsh | 179,157 | 0.02 | 0.20 |
| 69 | ga | Irish | 134,796 | 0.01 | 0.15 |
| 70 | tt | Tatar | 131,731 | 0.01 | 0.41 |
| 71 | pa | Punjabi | 119,686 | 0.01 | 0.29 |
| 72 | eo | Esperanto | 114,598 | 0.01 | 0.17 |
| 73 | ps | Pashto | 99,783 | 0.01 | 0.23 |
| 74 | ky | Kyrgyz | 86,551 | 0.01 | 0.31 |
### Dataset structure
The dataset has 6 columns:
- `text` and `url` are the two main columns.
- `id`, `document_lang`, `scores`, and `langs` are carried over from the original documents in the HPLT v1.1 dataset. They are retained for debugging purposes and will be removed in the future.

Therefore, please use only the `text` and `url` columns.
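The following is a minimal sketch of keeping only the two main columns, not an official loader. `keep_main_columns`, `iter_main_columns`, and `stream_language` are hypothetical helper names, and the field values in the sample record are made up for illustration. The `stream_language` helper additionally assumes that the `datasets` library (with zstd support) is installed, that you have been granted access to the gated repository, that the repository id matches the download link above, and that the default split is named `train`.

```python
import json
from typing import Iterable, Iterator

# The two columns the dataset card recommends using.
MAIN_COLUMNS = ("text", "url")


def keep_main_columns(record: dict) -> dict:
    """Project one record onto the two main columns."""
    return {name: record[name] for name in MAIN_COLUMNS}


def iter_main_columns(lines: Iterable[str]) -> Iterator[dict]:
    """Parse JSON Lines text and yield only `text` and `url` per document."""
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            yield keep_main_columns(json.loads(line))


def stream_language(lang: str, repo_id: str = "40u1d10t/to-train-my-model") -> Iterator[dict]:
    """Stream one language config from the Hub (assumptions listed above)."""
    from datasets import load_dataset  # third-party, imported lazily

    dataset = load_dataset(repo_id, lang, split="train", streaming=True)
    for record in dataset:
        yield keep_main_columns(record)


# Illustrative record only: the values of the debugging columns are invented.
sample_lines = [
    '{"id": 1, "document_lang": "vi", "scores": [0.9], "langs": ["vi"], '
    '"text": "Xin chào", "url": "https://example.com/a"}',
]
documents = list(iter_main_columns(sample_lines))
```

Dropping the four debugging columns at read time means downstream code keeps working if they are removed from the dataset later.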
### Process for Creating CulturaY
To create CulturaY, we started from the HPLT dataset (version 1.1). This is a notable difference between CulturaX and CulturaY: CulturaX was produced by cleaning data derived from Common Crawl (mC4, OSCAR), whereas CulturaY was produced by cleaning raw data from the Internet Archive (HPLT). Common Crawl is widely used, but data from the Internet Archive is less well known and less exploited, even though the two sources are similar. HPLT and CulturaY could be considered the first publicly released datasets originating from the Internet Archive. Using CulturaX and CulturaY together gives your model a more diverse set of data sources.
Our pipeline is based on BLOOM's data cleaning pipeline: each document in the dataset is evaluated against criteria such as document length, perplexity, and bad-word ratio, and documents that score poorly on any of these criteria are removed.
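As a rough illustration of this kind of per-document filtering, the sketch below rejects a document as soon as it fails one criterion. It is not the actual CulturaY pipeline: the `Thresholds` values, the helper names, and the whitespace tokenization are all assumptions made for the example, and the perplexity is taken as a precomputed number because computing it requires a language model.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Thresholds:
    # Hypothetical cut-offs for illustration; real pipelines tune these per language.
    min_words: int = 20
    max_bad_word_ratio: float = 0.05
    max_perplexity: float = 1000.0


def bad_word_ratio(text: str, bad_words: FrozenSet[str]) -> float:
    """Fraction of whitespace-separated tokens found in the bad-word list."""
    # Whitespace splitting is a simplification; it does not suit languages
    # written without spaces, such as Chinese, Japanese, or Thai.
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(word in bad_words for word in words) / len(words)


def keep_document(
    text: str,
    bad_words: FrozenSet[str],
    thresholds: Thresholds = Thresholds(),
    perplexity: Optional[float] = None,
) -> bool:
    """Return True only if the document passes every criterion."""
    if len(text.split()) < thresholds.min_words:
        return False  # too short
    if bad_word_ratio(text, bad_words) > thresholds.max_bad_word_ratio:
        return False  # too many flagged words
    if perplexity is not None and perplexity > thresholds.max_perplexity:
        return False  # language model finds the text too unlikely
    return True


flagged = frozenset({"spam"})
clean_text = "word " * 30
kept = keep_document(clean_text, flagged, perplexity=120.0)
```

Failing fast on the cheapest checks first (length, then word lists) avoids spending perplexity computation on documents that would be discarded anyway.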
See our [Blog](https://www.ontocord.ai/blog/cultura-y) for more details.
### Citation
To cite CulturaY, please use:
```
@misc{nguyen2024culturay,
  title={CulturaY: A Large Cleaned Multilingual Dataset of 75 Languages},
  author={Thuat Nguyen and Huu Nguyen and Thien Nguyen},
  year={2024},
}
```
Provider:
40u1d10t



