
BramVanroy/wikipedia_culturax_dutch

Hugging Face | Updated 2024-12-23 | Indexed 2024-06-11
Download link:
https://hf-mirror.com/datasets/BramVanroy/wikipedia_culturax_dutch
Description:
---
language:
- nl
size_categories:
- 10B<n<100B
task_categories:
- text-generation
- text2text-generation
pretty_name: Filtered CulturaX + Wikipedia for Dutch
dataset_info:
- config_name: 100M
  features:
  - name: text
    dtype: string
  - name: url
    dtype: string
  - name: source
    dtype: string
  splits:
  - name: train
    num_bytes: 738455828.5851797
    num_examples: 1018200
  - name: test
    num_bytes: 7458534.414820259
    num_examples: 10284
  download_size: 411183119
  dataset_size: 745914363.0
- config_name: 100k
  features:
  - name: text
    dtype: string
  - name: url
    dtype: string
  - name: source
    dtype: string
  splits:
  - name: train
    num_bytes: 745955.3074739829
    num_examples: 1047
  - name: test
    num_bytes: 7124.692526017029
    num_examples: 10
  download_size: 366788
  dataset_size: 753080.0
- config_name: 10B
  features:
  - name: text
    dtype: string
  - name: url
    dtype: string
  - name: source
    dtype: string
  splits:
  - name: train
    num_bytes: 66539945646.34457
    num_examples: 40176566
  - name: test
    num_bytes: 105996030.65543362
    num_examples: 64000
  download_size: 42132184504
  dataset_size: 66645941677.0
- config_name: 10M
  features:
  - name: text
    dtype: string
  - name: url
    dtype: string
  - name: source
    dtype: string
  splits:
  - name: train
    num_bytes: 76734151.72157606
    num_examples: 139851
  - name: test
    num_bytes: 774743.2784239326
    num_examples: 1412
  download_size: 37995388
  dataset_size: 77508895.0
- config_name: 10k
  features:
  - name: text
    dtype: string
  - name: url
    dtype: string
  - name: source
    dtype: string
  splits:
  - name: train
    num_bytes: 72048.30379746835
    num_examples: 78
  - name: test
    num_bytes: 5896
    num_examples: 1
  download_size: 47197
  dataset_size: 77944.30379746835
- config_name: 15B
  features:
  - name: text
    dtype: string
  - name: url
    dtype: string
  - name: source
    dtype: string
  splits:
  - name: train
    num_bytes: 99730049355.25276
    num_examples: 59584123
  - name: test
    num_bytes: 107121206.74724333
    num_examples: 64000
  download_size: 63139415312
  dataset_size: 99837170562.0
- config_name: 1B
  features:
  - name: text
    dtype: string
  - name: url
    dtype: string
  - name: source
    dtype: string
  splits:
  - name: train
    num_bytes: 6797502496.392602
    num_examples: 5102360
  - name: test
    num_bytes: 68660322.60739774
    num_examples: 51538
  download_size: 4260450464
  dataset_size: 6866162819.0
- config_name: 1M
  features:
  - name: text
    dtype: string
  - name: url
    dtype: string
  - name: source
    dtype: string
  splits:
  - name: train
    num_bytes: 7442665.619329753
    num_examples: 10694
  - name: test
    num_bytes: 75164.38067024625
    num_examples: 108
  download_size: 3845466
  dataset_size: 7517830.0
- config_name: 20B
  features:
  - name: text
    dtype: string
  - name: url
    dtype: string
  - name: source
    dtype: string
  splits:
  - name: train
    num_bytes: 132920704365.75093
    num_examples: 78991679
  - name: test
    num_bytes: 107693939.24907027
    num_examples: 64000
  download_size: 84141456153
  dataset_size: 133028398305.0
- config_name: 25B
  features:
  - name: text
    dtype: string
  - name: url
    dtype: string
  - name: source
    dtype: string
  splits:
  - name: train
    num_bytes: 166111586295.01904
    num_examples: 98399236
  - name: test
    num_bytes: 108040894.98094498
    num_examples: 64000
  download_size: 105147418131
  dataset_size: 166219627190.0
- config_name: 30B
  features:
  - name: text
    dtype: string
  - name: url
    dtype: string
  - name: source
    dtype: string
  splits:
  - name: train
    num_bytes: 199302582477.5805
    num_examples: 117806793
  - name: test
    num_bytes: 108273597.41950662
    num_examples: 64000
  download_size: 126152714564
  dataset_size: 199410856075.0
- config_name: 35B
  features:
  - name: text
    dtype: string
  - name: url
    dtype: string
  - name: source
    dtype: string
  splits:
  - name: train
    num_bytes: 232493644456.181
    num_examples: 137214350
  - name: test
    num_bytes: 108440503.81899258
    num_examples: 64000
  download_size: 147149925109
  dataset_size: 232602084960.0
- config_name: 40B
  features:
  - name: text
    dtype: string
  - name: url
    dtype: string
  - name: source
    dtype: string
  splits:
  - name: train
    num_bytes: 265684747781.7734
    num_examples: 156621907
  - name: test
    num_bytes: 108566063.22660531
    num_examples: 64000
  download_size: 168152290262
  dataset_size: 265793313845.0
- config_name: 45B
  features:
  - name: text
    dtype: string
  - name: url
    dtype: string
  - name: source
    dtype: string
  splits:
  - name: train
    num_bytes: 298875877641.391
    num_examples: 176029463
  - name: test
    num_bytes: 108663946.60903454
    num_examples: 64000
  download_size: 189159571162
  dataset_size: 298984541588.0
- config_name: 50B
  features:
  - name: text
    dtype: string
  - name: url
    dtype: string
  - name: source
    dtype: string
  splits:
  - name: train
    num_bytes: 332067028077.12775
    num_examples: 195437020
  - name: test
    num_bytes: 108742395.87226707
    num_examples: 64000
  download_size: 210160621183
  dataset_size: 332175770473.0
- config_name: 55B
  features:
  - name: text
    dtype: string
  - name: url
    dtype: string
  - name: source
    dtype: string
  splits:
  - name: train
    num_bytes: 365258192681.75964
    num_examples: 214844577
  - name: test
    num_bytes: 108806676.24034382
    num_examples: 64000
  download_size: 231164757019
  dataset_size: 365366999358.0
- config_name: 5B
  features:
  - name: text
    dtype: string
  - name: url
    dtype: string
  - name: source
    dtype: string
  splits:
  - name: train
    num_bytes: 33351938314.309906
    num_examples: 20769009
  - name: test
    num_bytes: 102774477.69009268
    num_examples: 64000
  download_size: 21119808690
  dataset_size: 33454712792.0
configs:
- config_name: 100M
  data_files:
  - split: train
    path: 100M/train-*
  - split: test
    path: 100M/test-*
- config_name: 100k
  data_files:
  - split: train
    path: 100k/train-*
  - split: test
    path: 100k/test-*
- config_name: 10B
  data_files:
  - split: train
    path: 10B/train-*
  - split: test
    path: 10B/test-*
- config_name: 10M
  data_files:
  - split: train
    path: 10M/train-*
  - split: test
    path: 10M/test-*
- config_name: 10k
  data_files:
  - split: train
    path: 10k/train-*
  - split: test
    path: 10k/test-*
- config_name: 15B
  data_files:
  - split: train
    path: 15B/train-*
  - split: test
    path: 15B/test-*
- config_name: 1B
  data_files:
  - split: train
    path: 1B/train-*
  - split: test
    path: 1B/test-*
- config_name: 1M
  data_files:
  - split: train
    path: 1M/train-*
  - split: test
    path: 1M/test-*
- config_name: 20B
  data_files:
  - split: train
    path: 20B/train-*
  - split: test
    path: 20B/test-*
- config_name: 25B
  data_files:
  - split: train
    path: 25B/train-*
  - split: test
    path: 25B/test-*
- config_name: 30B
  data_files:
  - split: train
    path: 30B/train-*
  - split: test
    path: 30B/test-*
- config_name: 35B
  data_files:
  - split: train
    path: 35B/train-*
  - split: test
    path: 35B/test-*
- config_name: 40B
  data_files:
  - split: train
    path: 40B/train-*
  - split: test
    path: 40B/test-*
- config_name: 45B
  data_files:
  - split: train
    path: 45B/train-*
  - split: test
    path: 45B/test-*
- config_name: 50B
  data_files:
  - split: train
    path: 50B/train-*
  - split: test
    path: 50B/test-*
- config_name: 55B
  data_files:
  - split: train
    path: 55B/train-*
  - split: test
    path: 55B/test-*
- config_name: 5B
  data_files:
  - split: train
    path: 5B/train-*
  - split: test
    path: 5B/test-*
---

# Filtered CulturaX + Wikipedia for Dutch

This is a combined and filtered version of [CulturaX](https://huggingface.co/datasets/uonlp/CulturaX) and [Wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia), only including Dutch. It is intended for the training of LLMs.

Different configs are available based on the number of tokens (see the Configs section below for an overview). This can be useful if you want to know exactly how many tokens you have. It also works well as a streaming dataset. Tokens are counted as white-space tokens, so depending on your tokenizer, you'll likely end up with more tokens than indicated here.

Every config also has a test set (for validation) of 1% of the total size of the dataset, with a minimum of 1 and a maximum of 64k samples (~16M tokens).

Wikipedia and CulturaX were shuffled before merging, and the data was shuffled again when creating the test set. Priority is given to Wikipedia to prioritize knowledge and cultural content, so the smaller configs consist exclusively of Wikipedia, and the larger configs are augmented with CulturaX. Every config builds on the previous one, so every config contains the same data as the smaller ones, plus more. However, their train/test splits are not the same, so the test set of one config may overlap with the training set of another. This is usually not a problem, but take care not to train on one config's training set and test on another config's test set.

## Citation

If you use [Fietje](https://huggingface.co/BramVanroy/fietje-2) or the [CulturaX + Wikipedia filtered subset](https://huggingface.co/datasets/BramVanroy/wikipedia_culturax_dutch) in your work, please cite the following paper:

```bibtex
@misc{vanroy2024fietjeopenefficientllm,
      title={Fietje: An open, efficient LLM for Dutch},
      author={Bram Vanroy},
      year={2024},
      eprint={2412.15450},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.15450},
}
```

## Configs

### `10k` -- 79 samples -- 10,087 tokens
- ratio_wikipedia: 100.00%
- total_num_tokens: 10,087
- train_num_tokens: 9,205
- test_num_tokens: 882
- total_num_samples: 79
- train_num_samples: 78
- test_num_samples: 1

### `100k` -- 1,057 samples -- 100,075 tokens
- ratio_wikipedia: 100.00%
- total_num_tokens: 100,075
- train_num_tokens: 98,044
- test_num_tokens: 2,031
- total_num_samples: 1,057
- train_num_samples: 1,047
- test_num_samples: 10

### `1M` -- 10,802 samples -- 1,000,239 tokens
- ratio_wikipedia: 100.00%
- total_num_tokens: 1,000,239
- train_num_tokens: 991,119
- test_num_tokens: 9,120
- total_num_samples: 10,802
- train_num_samples: 10,694
- test_num_samples: 108

### `10M` -- 141,263 samples -- 10,000,022 tokens
- ratio_wikipedia: 100.00%
- total_num_tokens: 10,000,022
- train_num_tokens: 9,874,772
- test_num_tokens: 125,250
- total_num_samples: 141,263
- train_num_samples: 139,851
- test_num_samples: 1,412

### `100M` -- 1,028,484 samples -- 100,000,047 tokens
- ratio_wikipedia: 100.00%
- total_num_tokens: 100,000,047
- train_num_tokens: 99,013,372
- test_num_tokens: 986,675
- total_num_samples: 1,028,484
- train_num_samples: 1,018,200
- test_num_samples: 10,284

### `1B` -- 5,153,898 samples -- 1,000,000,187 tokens
- ratio_wikipedia: 61.21%
- total_num_tokens: 1,000,000,187
- train_num_tokens: 989,990,190
- test_num_tokens: 10,009,997
- total_num_samples: 5,153,898
- train_num_samples: 5,102,360
- test_num_samples: 51,538

### `5B` -- 20,833,009 samples -- 5,000,000,076 tokens
- ratio_wikipedia: 25.35%
- total_num_tokens: 5,000,000,076
- train_num_tokens: 4,984,493,654
- test_num_tokens: 15,506,422
- total_num_samples: 20,833,009
- train_num_samples: 20,769,009
- test_num_samples: 64,000

### `10B` -- 40,240,566 samples -- 10,000,000,115 tokens
- ratio_wikipedia: 18.41%
- total_num_tokens: 10,000,000,115
- train_num_tokens: 9,984,156,828
- test_num_tokens: 15,843,287
- total_num_samples: 40,240,566
- train_num_samples: 40,176,566
- test_num_samples: 64,000

### `15B` -- 59,648,123 samples -- 15,000,000,154 tokens
- ratio_wikipedia: 15.98%
- total_num_tokens: 15,000,000,154
- train_num_tokens: 14,983,970,518
- test_num_tokens: 16,029,636
- total_num_samples: 59,648,123
- train_num_samples: 59,584,123
- test_num_samples: 64,000

### `20B` -- 79,055,679 samples -- 20,000,000,009 tokens
- ratio_wikipedia: 14.75%
- total_num_tokens: 20,000,000,009
- train_num_tokens: 19,983,799,357
- test_num_tokens: 16,200,652
- total_num_samples: 79,055,679
- train_num_samples: 78,991,679
- test_num_samples: 64,000

### `25B` -- 98,463,236 samples -- 25,000,000,048 tokens
- ratio_wikipedia: 14.00%
- total_num_tokens: 25,000,000,048
- train_num_tokens: 24,983,765,326
- test_num_tokens: 16,234,722
- total_num_samples: 98,463,236
- train_num_samples: 98,399,236
- test_num_samples: 64,000

### `30B` -- 117,870,793 samples -- 30,000,000,087 tokens
- ratio_wikipedia: 13.50%
- total_num_tokens: 30,000,000,087
- train_num_tokens: 29,983,707,932
- test_num_tokens: 16,292,155
- total_num_samples: 117,870,793
- train_num_samples: 117,806,793
- test_num_samples: 64,000

### `35B` -- 137,278,350 samples -- 35,000,000,126 tokens
- ratio_wikipedia: 13.14%
- total_num_tokens: 35,000,000,126
- train_num_tokens: 34,983,914,739
- test_num_tokens: 16,085,387
- total_num_samples: 137,278,350
- train_num_samples: 137,214,350
- test_num_samples: 64,000

### `40B` -- 156,685,907 samples -- 40,000,000,165 tokens
- ratio_wikipedia: 12.87%
- total_num_tokens: 40,000,000,165
- train_num_tokens: 39,983,508,625
- test_num_tokens: 16,491,540
- total_num_samples: 156,685,907
- train_num_samples: 156,621,907
- test_num_samples: 64,000

### `45B` -- 176,093,463 samples -- 45,000,000,020 tokens
- ratio_wikipedia: 12.66%
- total_num_tokens: 45,000,000,020
- train_num_tokens: 44,983,608,118
- test_num_tokens: 16,391,902
- total_num_samples: 176,093,463
- train_num_samples: 176,029,463
- test_num_samples: 64,000

### `50B` -- 195,501,020 samples -- 50,000,000,059 tokens
- ratio_wikipedia: 12.49%
- total_num_tokens: 50,000,000,059
- train_num_tokens: 49,983,567,461
- test_num_tokens: 16,432,598
- total_num_samples: 195,501,020
- train_num_samples: 195,437,020
- test_num_samples: 64,000

### `55B` -- 214,908,577 samples -- 55,000,000,098 tokens
- ratio_wikipedia: 12.35%
- total_num_tokens: 55,000,000,098
- train_num_tokens: 54,983,723,278
- test_num_tokens: 16,276,820
- total_num_samples: 214,908,577
- train_num_samples: 214,844,577
- test_num_samples: 64,000

## Filtering

While CulturaX has already been filtered extensively, some additional filtering can improve the quality of the corpus. These filters are described below.

The baseline ratios (punctuation, uppercase, digits) were calculated on the SONAR-500 corpus (excluding WRPEA WRPED WRUEA WRUED WRUEB).

**CulturaX**:

- removed documents that contain the text "rechten voorbehouden" or "rights reserved"
- removed documents whose URL contains "wikipedia.org" (because we include a cleaned version of Wikipedia ourselves)
- removed documents that contain a "bad word" (see the section below)
- removed documents that contain any non-Latin characters. The idea is that "knowledge"-based information (e.g. the original writing of a name) is allowed when the data comes from Wikipedia, but not from any other web crawl, to avoid unsolicited noise.

**CulturaX + Wikipedia**:

- removed documents where the ratio of punctuation marks vs. non-whitespace characters is higher than 0.2
- removed documents where the ratio of uppercase vs. non-whitespace characters is higher than 0.22
- removed documents where the ratio of digits vs. non-whitespace characters is higher than 0.16
- removed documents where the average token length is < 2 or > 20

## Bad words

```python
BAD_PHRASES_DOC_LEVEL = {
    # https://en.wikipedia.org/wiki/Dutch_profanity
    "achterlijk", "debiel", "downie", "idioot", "kankerlijer", "klere", "kolere",
    "minkukel", "pestkop", "pleuris", "pleuritis", "teringlijer", "tyfuslijer",
    "gadver", "getver", "godver", "godskolere", "godverork", "graftak", "kopvod",
    "verdomme", "anaalgeneraal", "bitch", "dikzak", "flikker", "fok", "fuck",
    "hoer", "klootzak", "klote", "kreng", "kringspiermusketier", "kut", "lamzak",
    "lul", "manwijf", "matennaai", "neuken", "neuker", "ouwehoer", "reet",
    "reetkever", "reetridder", "rotzak", "schijt", "shit", "slet", "slijmbal",
    "slons", "sodemieter", "stoephoer", "swaffel", "teef", "trut", "tut", "zak",
    "uilskuiken", "zeik", "bamivreter", "bosneger", "neger", "fransoos",
    "geitenneuker", "kaaskop", "kakker", "koelie", "lijp", "medelander", "mocro",
    "mof", "nikker", "poepchinees", "roetmop", "spaghettivreter", "loempiavouwer",
    "spanjool", "spleetoog", "tatta", "tokkie", "zandneger", "zwartzak",
    "halvezool", "kenau", "klootviool", "knuppel", "koekert", "koekwaus",
    "oelewapper", "smeerlap", "sukkel", "sul", "wappie", "wijf", "zooi",
    # xxx (a.o. https://gitlab.com/yhavinga/c4nlpreproc/-/blob/master/clean/badwords_ennl.py?ref_type=heads)
    "xxx", "anal", "blowjob", "buttplug", "cock", "cunt", "geil",
    "sex",  # Standaardnederlands = seks, maybe we catch some porn or socialmedia sites with this misspelling
    "porn",
    # extra
    "nigger", "nigga", "hoerig", "klojo",
}
```

## Config details

## License information

For CulturaX: https://huggingface.co/datasets/uonlp/CulturaX#license-information

For Wikipedia: https://huggingface.co/datasets/wikimedia/wikipedia#licensing-information
Provider:
BramVanroy
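Summary of original information

Filtered CulturaX + Wikipedia for Dutch

Dataset overview

This dataset is a combined and filtered version of CulturaX and Wikipedia that contains only Dutch content. It is intended for training large language models (LLMs).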
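Because each config is named after its approximate token count, picking one for a given token budget and streaming it can be sketched as follows. This is a minimal sketch, not the author's code: the helper names `smallest_config_for` and `stream_texts` are hypothetical, `CONFIG_TOKENS` lists only a subset of the configs (totals copied from the dataset card), and the streaming call assumes the Hugging Face `datasets` library is installed and the Hub is reachable.

```python
from itertools import islice

REPO_ID = "BramVanroy/wikipedia_culturax_dutch"

# Whitespace-token totals for a subset of configs, copied from the dataset card.
CONFIG_TOKENS = {
    "10k": 10_087,
    "100k": 100_075,
    "1M": 1_000_239,
    "10M": 10_000_022,
    "100M": 100_000_047,
    "1B": 1_000_000_187,
    "5B": 5_000_000_076,
    "10B": 10_000_000_115,
}


def smallest_config_for(token_budget: int) -> str:
    """Return the smallest listed config with at least `token_budget` tokens."""
    for name, total in sorted(CONFIG_TOKENS.items(), key=lambda item: item[1]):
        if total >= token_budget:
            return name
    raise ValueError(f"No listed config has {token_budget:,} tokens or more")


def stream_texts(config_name: str, split: str = "train", limit: int = 3) -> list[str]:
    """Stream the first `limit` documents of a config without a full download."""
    # Imported lazily so the helper above works without `datasets` installed.
    from datasets import load_dataset

    dataset = load_dataset(REPO_ID, config_name, split=split, streaming=True)
    return [row["text"] for row in islice(dataset, limit)]
```

For example, `stream_texts(smallest_config_for(50_000))` would stream from the `100k` config. Remember that the totals are whitespace-token counts, so a subword tokenizer will usually produce more tokens.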

Config details

The dataset provides multiple configs, split by number of tokens, as follows:

10k

  • Total samples: 79
  • Total tokens: 10,087
  • Training samples: 78
  • Training tokens: 9,205
  • Test samples: 1
  • Test tokens: 882

100k

  • Total samples: 1,057
  • Total tokens: 100,075
  • Training samples: 1,047
  • Training tokens: 98,044
  • Test samples: 10
  • Test tokens: 2,031

1M

  • Total samples: 10,802
  • Total tokens: 1,000,239
  • Training samples: 10,694
  • Training tokens: 991,119
  • Test samples: 108
  • Test tokens: 9,120

10M

  • Total samples: 141,263
  • Total tokens: 10,000,022
  • Training samples: 139,851
  • Training tokens: 9,874,772
  • Test samples: 1,412
  • Test tokens: 125,250

100M

  • Total samples: 1,028,484
  • Total tokens: 100,000,047
  • Training samples: 1,018,200
  • Training tokens: 99,013,372
  • Test samples: 10,284
  • Test tokens: 986,675

1B

  • Total samples: 5,153,898
  • Total tokens: 1,000,000,187
  • Training samples: 5,102,360
  • Training tokens: 989,990,190
  • Test samples: 51,538
  • Test tokens: 10,009,997

5B

  • Total samples: 20,833,009
  • Total tokens: 5,000,000,076
  • Training samples: 20,769,009
  • Training tokens: 4,984,493,654
  • Test samples: 64,000
  • Test tokens: 15,506,422

10B

  • Total samples: 40,240,566
  • Total tokens: 10,000,000,115
  • Training samples: 40,176,566
  • Training tokens: 9,984,156,828
  • Test samples: 64,000
  • Test tokens: 15,843,287

15B

  • Total samples: 59,648,123
  • Total tokens: 15,000,000,154
  • Training samples: 59,584,123
  • Training tokens: 14,983,970,518
  • Test samples: 64,000
  • Test tokens: 16,029,636

20B

  • Total samples: 79,055,679
  • Total tokens: 20,000,000,009
  • Training samples: 78,991,679
  • Training tokens: 19,983,799,357
  • Test samples: 64,000
  • Test tokens: 16,200,652

25B

  • Total samples: 98,463,236
  • Total tokens: 25,000,000,048
  • Training samples: 98,399,236
  • Training tokens: 24,983,765,326
  • Test samples: 64,000
  • Test tokens: 16,234,722

30B

  • Total samples: 117,870,793
  • Total tokens: 30,000,000,087
  • Training samples: 117,806,793
  • Training tokens: 29,983,707,932
  • Test samples: 64,000
  • Test tokens: 16,292,155

35B

  • Total samples: 137,278,350
  • Total tokens: 35,000,000,126
  • Training samples: 137,214,350
  • Training tokens: 34,983,914,739
  • Test samples: 64,000
  • Test tokens: 16,085,387

40B

  • Total samples: 156,685,907
  • Total tokens: 40,000,000,165
  • Training samples: 156,621,907
  • Training tokens: 39,983,508,625
  • Test samples: 64,000
  • Test tokens: 16,491,540

45B

  • Total samples: 176,093,463
  • Total tokens: 45,000,000,020
  • Training samples: 176,029,463
  • Training tokens: 44,983,608,118
  • Test samples: 64,000
  • Test tokens: 16,391,902

50B

  • Total samples: 195,501,020
  • Total tokens: 50,000,000,059
  • Training samples: 195,437,020
  • Training tokens: 49,983,567,461
  • Test samples: 64,000
  • Test tokens: 16,432,598

55B

  • Total samples: 214,908,577
  • Total tokens: 55,000,000,098
  • Training samples: 214,844,577
  • Training tokens: 54,983,723,278
  • Test samples: 64,000
  • Test tokens: 16,276,820
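The test-set sizes listed for each config are consistent with the rule in the dataset card (1% of the total, at least 1 and at most 64k samples) when the 1% is rounded down. The sketch below reproduces the listed sizes from the totals. The rounding direction is an inference from the listed numbers rather than something the author states, and `expected_test_size` is a hypothetical helper name.

```python
MAX_TEST_SAMPLES = 64_000  # upper bound stated in the dataset card
MIN_TEST_SAMPLES = 1       # lower bound stated in the dataset card


def expected_test_size(total_samples: int) -> int:
    """1% of the samples, rounded down (assumption), clamped to [1, 64,000]."""
    one_percent = total_samples // 100
    return max(MIN_TEST_SAMPLES, min(MAX_TEST_SAMPLES, one_percent))


# (total samples, listed test samples) for a few configs from the listing.
LISTED = {
    "10k": (79, 1),
    "100k": (1_057, 10),
    "10M": (141_263, 1_412),
    "1B": (5_153_898, 51_538),
    "5B": (20_833_009, 64_000),
}

for name, (total, listed_test) in LISTED.items():
    print(name, expected_test_size(total), listed_test)
```

The cap explains why every config from `5B` upwards has exactly 64,000 test samples.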

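Filtering rules

The dataset was filtered further to improve corpus quality. The rules are as follows:

CulturaX filtering rules

  • Removed documents that contain "rechten voorbehouden" or "rights reserved"
  • Removed documents whose URL contains "wikipedia.org"
  • Removed documents that contain a "bad word"
  • Removed documents that contain non-Latin characters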
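Three of the CulturaX rules above (copyright phrases, Wikipedia URLs, non-Latin characters) can be sketched as a single predicate. This is a minimal sketch under stated assumptions, not the author's implementation: the copyright phrases are matched case-insensitively, "non-Latin" is interpreted as any alphabetic character whose Unicode name does not contain `LATIN`, and the function names are hypothetical. The bad-word rule is left out here.

```python
import unicodedata

COPYRIGHT_PHRASES = ("rechten voorbehouden", "rights reserved")


def has_non_latin_letters(text: str) -> bool:
    """True if any alphabetic character is outside the Latin script.

    Assumption: a letter counts as Latin when its Unicode name contains
    "LATIN", so accented letters such as "é" or "ï" are still accepted.
    """
    for char in text:
        if char.isalpha() and "LATIN" not in unicodedata.name(char, ""):
            return True
    return False


def keep_culturax_document(text: str, url: str) -> bool:
    """Return True if a CulturaX document passes the three rules sketched here."""
    lowered = text.lower()  # case-insensitive matching is an assumption
    if any(phrase in lowered for phrase in COPYRIGHT_PHRASES):
        return False
    if "wikipedia.org" in url:
        return False
    if has_non_latin_letters(text):
        return False
    return True
```

According to the dataset card, these rules apply only to CulturaX documents, so Wikipedia articles may still contain names in their original script.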

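CulturaX + Wikipedia filtering rules

  • Removed documents where the ratio of punctuation marks to non-whitespace characters is higher than 0.2
  • Removed documents where the ratio of uppercase letters to non-whitespace characters is higher than 0.22
  • Removed documents where the ratio of digits to non-whitespace characters is higher than 0.16
  • Removed documents where the average token length is less than 2 or greater than 20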
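The four ratio rules above can be sketched as one function. The thresholds come from the rules themselves; everything else is an assumption for illustration: punctuation is taken to be ASCII `string.punctuation`, tokens are whitespace-separated (matching how the dataset card counts tokens), empty documents are dropped, and `keep_document` is a hypothetical name.

```python
import string

MAX_PUNCT_RATIO = 0.2
MAX_UPPER_RATIO = 0.22
MAX_DIGIT_RATIO = 0.16
MIN_AVG_TOKEN_LEN = 2
MAX_AVG_TOKEN_LEN = 20

# Assumption: only ASCII punctuation counts as a punctuation mark.
PUNCTUATION = set(string.punctuation)


def keep_document(text: str) -> bool:
    """Return True if the document passes all four ratio-based rules."""
    tokens = text.split()  # whitespace tokens
    if not tokens:
        return False  # assumption: empty documents are dropped

    chars = "".join(tokens)  # every non-whitespace character
    total = len(chars)

    punct_ratio = sum(c in PUNCTUATION for c in chars) / total
    upper_ratio = sum(c.isupper() for c in chars) / total
    digit_ratio = sum(c.isdigit() for c in chars) / total
    avg_token_len = total / len(tokens)

    return (
        punct_ratio <= MAX_PUNCT_RATIO
        and upper_ratio <= MAX_UPPER_RATIO
        and digit_ratio <= MAX_DIGIT_RATIO
        and MIN_AVG_TOKEN_LEN <= avg_token_len <= MAX_AVG_TOKEN_LEN
    )
```

With these definitions, an ordinary Dutch sentence passes, while text that is all capitals, mostly digits, or made of single-character tokens is rejected.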

Bad words

Documents that contain any of the following bad words were removed from the dataset:

```python
BAD_PHRASES_DOC_LEVEL = {
    "achterlijk", "debiel", "downie", "idioot", "kankerlijer", "klere", "kolere",
    "minkukel", "pestkop", "pleuris", "pleuritis", "teringlijer", "tyfuslijer",
    "gadver", "getver", "godver", "godskolere", "godverork", "graftak", "kopvod",
    "verdomme", "anaalgeneraal", "bitch", "dikzak", "flikker", "fok", "fuck",
    "hoer", "klootzak", "klote", "kreng", "kringspiermusketier", "kut", "lamzak",
    "lul", "manwijf", "matennaai", "neuken", "neuker", "ouwehoer", "reet",
    "reetkever", "reetridder", "rotzak", "schijt", "shit", "slet", "slijmbal",
    "slons", "sodemieter", "stoephoer", "swaffel", "teef", "trut", "tut", "zak",
    "uilskuiken", "zeik", "bamivreter", "bosneger", "neger", "fransoos",
    "geitenneuker", "kaaskop", "kakker", "koelie", "lijp", "medelander", "mocro",
    "mof", "nikker", "poepchinees", "roetmop", "spaghettivreter", "loempiavouwer",
    "spanjool", "spleetoog", "tatta", "tokkie", "zandneger", "zwartzak",
    "halvezool", "kenau", "klootviool", "knuppel", "koekert", "koekwaus",
    "oelewapper", "smeerlap", "sukkel", "sul", "wappie", "wijf", "zooi",
    "xxx", "anal", "blowjob", "buttplug", "cock", "cunt", "geil", "sex", "porn",
    "nigger", "nigga", "hoerig", "klojo",
}
```
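The source does not say how the word list is matched against a document. The sketch below assumes case-insensitive whole-word matching, which keeps short entries such as "zak" or "anal" from matching inside ordinary words like "zakelijk" or "analyse". Plain substring matching would remove far more documents. `BAD_WORDS_SAMPLE` is a small illustrative subset of the list, and both function names are hypothetical.

```python
import re

# Small illustrative subset of the full list; not the complete set.
BAD_WORDS_SAMPLE = {"klootzak", "zak", "anal", "xxx", "porn"}


def build_bad_word_pattern(words: set[str]) -> re.Pattern[str]:
    """Compile one case-insensitive regex that matches any word as a whole word."""
    # Longest entries first, so longer words are tried before their prefixes.
    ordered = sorted(words, key=len, reverse=True)
    alternatives = "|".join(re.escape(word) for word in ordered)
    return re.compile(rf"\b(?:{alternatives})\b", flags=re.IGNORECASE)


BAD_WORD_PATTERN = build_bad_word_pattern(BAD_WORDS_SAMPLE)


def contains_bad_word(text: str) -> bool:
    """True if the text contains a listed word (whole-word match is an assumption)."""
    return BAD_WORD_PATTERN.search(text) is not None
```

Under this assumption, `contains_bad_word("Een zakelijke analyse")` is false, whereas a substring check would flag the same sentence twice.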

License information
