dododo1234/fineweb
Source: Hugging Face. Updated 2024-05-24; indexed 2024-06-12.
Download link:
https://hf-mirror.com/datasets/dododo1234/fineweb
Resource description:
---
language:
- en
dataset_info:
- config_name: sample10bt-0
features:
- name: text
dtype: string
- name: id
dtype: string
- name: dump
dtype: string
- name: url
dtype: string
- name: date
dtype: string
- name: file_path
dtype: string
- name: language
dtype: string
- name: language_score
dtype: float64
- name: token_count
dtype: int64
splits:
- name: train
num_bytes: 979947523
num_examples: 290008
download_size: 590161503
dataset_size: 979947523
- config_name: sample10bt-10150280
features:
- name: text
dtype: string
- name: id
dtype: string
- name: dump
dtype: string
- name: url
dtype: string
- name: date
dtype: string
- name: file_path
dtype: string
- name: language
dtype: string
- name: language_score
dtype: float64
- name: token_count
dtype: int64
splits:
- name: train
num_bytes: 993074330
num_examples: 290008
download_size: 596140547
dataset_size: 993074330
- config_name: sample10bt-10440288
features:
- name: text
dtype: string
- name: id
dtype: string
- name: dump
dtype: string
- name: url
dtype: string
- name: date
dtype: string
- name: file_path
dtype: string
- name: language
dtype: string
- name: language_score
dtype: float64
- name: token_count
dtype: int64
splits:
- name: train
num_bytes: 988913778
num_examples: 290008
download_size: 594008653
dataset_size: 988913778
- config_name: sample10bt-10730296
features:
- name: text
dtype: string
- name: id
dtype: string
- name: dump
dtype: string
- name: url
dtype: string
- name: date
dtype: string
- name: file_path
dtype: string
- name: language
dtype: string
- name: language_score
dtype: float64
- name: token_count
dtype: int64
splits:
- name: train
num_bytes: 992590689
num_examples: 290008
download_size: 596255798
dataset_size: 992590689
- config_name: sample10bt-11020304
features:
- name: text
dtype: string
- name: id
dtype: string
- name: dump
dtype: string
- name: url
dtype: string
- name: date
dtype: string
- name: file_path
dtype: string
- name: language
dtype: string
- name: language_score
dtype: float64
- name: token_count
dtype: int64
splits:
- name: train
num_bytes: 1000994216
num_examples: 290008
download_size: 600446360
dataset_size: 1000994216
- config_name: sample10bt-11310312
features:
- name: text
dtype: string
- name: id
dtype: string
- name: dump
dtype: string
- name: url
dtype: string
- name: date
dtype: string
- name: file_path
dtype: string
- name: language
dtype: string
- name: language_score
dtype: float64
- name: token_count
dtype: int64
splits:
- name: train
num_bytes: 1003260462
num_examples: 290008
download_size: 602303904
dataset_size: 1003260462
- config_name: sample10bt-1160032
features:
- name: text
dtype: string
- name: id
dtype: string
- name: dump
dtype: string
- name: url
dtype: string
- name: date
dtype: string
- name: file_path
dtype: string
- name: language
dtype: string
- name: language_score
dtype: float64
- name: token_count
dtype: int64
splits:
- name: train
num_bytes: 984660083
num_examples: 290008
download_size: 593684839
dataset_size: 984660083
- config_name: sample10bt-11600320
features:
- name: text
dtype: string
- name: id
dtype: string
- name: dump
dtype: string
- name: url
dtype: string
- name: date
dtype: string
- name: file_path
dtype: string
- name: language
dtype: string
- name: language_score
dtype: float64
- name: token_count
dtype: int64
splits:
- name: train
num_bytes: 1003701977
num_examples: 290008
download_size: 602486295
dataset_size: 1003701977
- config_name: sample10bt-11890328
features:
- name: text
dtype: string
- name: id
dtype: string
- name: dump
dtype: string
- name: url
dtype: string
- name: date
dtype: string
- name: file_path
dtype: string
- name: language
dtype: string
- name: language_score
dtype: float64
- name: token_count
dtype: int64
splits:
- name: train
num_bytes: 1005174566
num_examples: 290008
download_size: 603592910
dataset_size: 1005174566
- config_name: sample10bt-12180336
features:
- name: text
dtype: string
- name: id
dtype: string
- name: dump
dtype: string
- name: url
dtype: string
- name: date
dtype: string
- name: file_path
dtype: string
- name: language
dtype: string
- name: language_score
dtype: float64
- name: token_count
dtype: int64
splits:
- name: train
num_bytes: 1001479206
num_examples: 290008
download_size: 601526731
dataset_size: 1001479206
- config_name: sample10bt-12470344
features:
- name: text
dtype: string
- name: id
dtype: string
- name: dump
dtype: string
- name: url
dtype: string
- name: date
dtype: string
- name: file_path
dtype: string
- name: language
dtype: string
- name: language_score
dtype: float64
- name: token_count
dtype: int64
splits:
- name: train
num_bytes: 997656177
num_examples: 290008
download_size: 599795270
dataset_size: 997656177
- config_name: sample10bt-12760352
features:
- name: text
dtype: string
- name: id
dtype: string
- name: dump
dtype: string
- name: url
dtype: string
- name: date
dtype: string
- name: file_path
dtype: string
- name: language
dtype: string
- name: language_score
dtype: float64
- name: token_count
dtype: int64
splits:
- name: train
num_bytes: 995369007
num_examples: 290008
download_size: 598308150
dataset_size: 995369007
- config_name: sample10bt-13050360
features:
- name: text
dtype: string
- name: id
dtype: string
- name: dump
dtype: string
- name: url
dtype: string
- name: date
dtype: string
- name: file_path
dtype: string
- name: language
dtype: string
- name: language_score
dtype: float64
- name: token_count
dtype: int64
splits:
- name: train
num_bytes: 995696120
num_examples: 290008
download_size: 598678925
dataset_size: 995696120
- config_name: sample10bt-13340368
features:
- name: text
dtype: string
- name: id
dtype: string
- name: dump
dtype: string
- name: url
dtype: string
- name: date
dtype: string
- name: file_path
dtype: string
- name: language
dtype: string
- name: language_score
dtype: float64
- name: token_count
dtype: int64
splits:
- name: train
num_bytes: 985424533
num_examples: 290008
download_size: 593220864
dataset_size: 985424533
- config_name: sample10bt-13630376
features:
- name: text
dtype: string
- name: id
dtype: string
- name: dump
dtype: string
- name: url
dtype: string
- name: date
dtype: string
- name: file_path
dtype: string
- name: language
dtype: string
- name: language_score
dtype: float64
- name: token_count
dtype: int64
splits:
- name: train
num_bytes: 986689008
num_examples: 290008
download_size: 594972485
dataset_size: 986689008
- config_name: sample10bt-13920384
features:
- name: text
dtype: string
- name: id
dtype: string
- name: dump
dtype: string
- name: url
dtype: string
- name: date
dtype: string
- name: file_path
dtype: string
- name: language
dtype: string
- name: language_score
dtype: float64
- name: token_count
dtype: int64
splits:
- name: train
num_bytes: 983042536
num_examples: 290008
download_size: 592682782
dataset_size: 983042536
- config_name: sample10bt-14210392
features:
- name: text
dtype: string
- name: id
dtype: string
- name: dump
dtype: string
- name: url
dtype: string
- name: date
dtype: string
- name: file_path
dtype: string
- name: language
dtype: string
- name: language_score
dtype: float64
- name: token_count
dtype: int64
splits:
- name: train
num_bytes: 974595216
num_examples: 290008
download_size: 587924013
dataset_size: 974595216
- config_name: sample10bt-1450040
features:
- name: text
dtype: string
- name: id
dtype: string
- name: dump
dtype: string
- name: url
dtype: string
- name: date
dtype: string
- name: file_path
dtype: string
- name: language
dtype: string
- name: language_score
dtype: float64
- name: token_count
dtype: int64
splits:
- name: train
num_bytes: 979367659
num_examples: 290008
download_size: 590238554
dataset_size: 979367659
- config_name: sample10bt-1740048
features:
- name: text
dtype: string
- name: id
dtype: string
- name: dump
dtype: string
- name: url
dtype: string
- name: date
dtype: string
- name: file_path
dtype: string
- name: language
dtype: string
- name: language_score
dtype: float64
- name: token_count
dtype: int64
splits:
- name: train
num_bytes: 980964151
num_examples: 290008
download_size: 591922978
dataset_size: 980964151
- config_name: sample10bt-2030056
features:
- name: text
dtype: string
- name: id
dtype: string
- name: dump
dtype: string
- name: url
dtype: string
- name: date
dtype: string
- name: file_path
dtype: string
- name: language
dtype: string
- name: language_score
dtype: float64
- name: token_count
dtype: int64
splits:
- name: train
num_bytes: 979597732
num_examples: 290008
download_size: 591009136
dataset_size: 979597732
- config_name: sample10bt-2320064
features:
- name: text
dtype: string
- name: id
dtype: string
- name: dump
dtype: string
- name: url
dtype: string
- name: date
dtype: string
- name: file_path
dtype: string
- name: language
dtype: string
- name: language_score
dtype: float64
- name: token_count
dtype: int64
splits:
- name: train
num_bytes: 977164918
num_examples: 290008
download_size: 589114003
dataset_size: 977164918
- config_name: sample10bt-2610072
features:
- name: text
dtype: string
- name: id
dtype: string
- name: dump
dtype: string
- name: url
dtype: string
- name: date
dtype: string
- name: file_path
dtype: string
- name: language
dtype: string
- name: language_score
dtype: float64
- name: token_count
dtype: int64
splits:
- name: train
num_bytes: 979829645
num_examples: 290008
download_size: 590546192
dataset_size: 979829645
- config_name: sample10bt-290008
features:
- name: text
dtype: string
- name: id
dtype: string
- name: dump
dtype: string
- name: url
dtype: string
- name: date
dtype: string
- name: file_path
dtype: string
- name: language
dtype: string
- name: language_score
dtype: float64
- name: token_count
dtype: int64
splits:
- name: train
num_bytes: 977396405
num_examples: 290008
download_size: 589367664
dataset_size: 977396405
- config_name: sample10bt-2900080
features:
- name: text
dtype: string
- name: id
dtype: string
- name: dump
dtype: string
- name: url
dtype: string
- name: date
dtype: string
- name: file_path
dtype: string
- name: language
dtype: string
- name: language_score
dtype: float64
- name: token_count
dtype: int64
splits:
- name: train
num_bytes: 980144418
num_examples: 290008
download_size: 591075940
dataset_size: 980144418
- config_name: sample10bt-3190088
features:
- name: text
dtype: string
- name: id
dtype: string
- name: dump
dtype: string
- name: url
dtype: string
- name: date
dtype: string
- name: file_path
dtype: string
- name: language
dtype: string
- name: language_score
dtype: float64
- name: token_count
dtype: int64
splits:
- name: train
num_bytes: 981462351
num_examples: 290008
download_size: 592293366
dataset_size: 981462351
- config_name: sample10bt-3480096
features:
- name: text
dtype: string
- name: id
dtype: string
- name: dump
dtype: string
- name: url
dtype: string
- name: date
dtype: string
- name: file_path
dtype: string
- name: language
dtype: string
- name: language_score
dtype: float64
- name: token_count
dtype: int64
splits:
- name: train
num_bytes: 981505785
num_examples: 290008
download_size: 591388473
dataset_size: 981505785
- config_name: sample10bt-3770104
features:
- name: text
dtype: string
- name: id
dtype: string
- name: dump
dtype: string
- name: url
dtype: string
- name: date
dtype: string
- name: file_path
dtype: string
- name: language
dtype: string
- name: language_score
dtype: float64
- name: token_count
dtype: int64
splits:
- name: train
num_bytes: 990661398
num_examples: 290008
download_size: 596809014
dataset_size: 990661398
- config_name: sample10bt-4060112
features:
- name: text
dtype: string
- name: id
dtype: string
- name: dump
dtype: string
- name: url
dtype: string
- name: date
dtype: string
- name: file_path
dtype: string
- name: language
dtype: string
- name: language_score
dtype: float64
- name: token_count
dtype: int64
splits:
- name: train
num_bytes: 986106212
num_examples: 290008
download_size: 594370141
dataset_size: 986106212
- config_name: sample10bt-4350120
features:
- name: text
dtype: string
- name: id
dtype: string
- name: dump
dtype: string
- name: url
dtype: string
- name: date
dtype: string
- name: file_path
dtype: string
- name: language
dtype: string
- name: language_score
dtype: float64
- name: token_count
dtype: int64
splits:
- name: train
num_bytes: 980994956
num_examples: 290008
download_size: 591229442
dataset_size: 980994956
- config_name: sample10bt-4640128
features:
- name: text
dtype: string
- name: id
dtype: string
- name: dump
dtype: string
- name: url
dtype: string
- name: date
dtype: string
- name: file_path
dtype: string
- name: language
dtype: string
- name: language_score
dtype: float64
- name: token_count
dtype: int64
splits:
- name: train
num_bytes: 986543200
num_examples: 290008
download_size: 593865155
dataset_size: 986543200
- config_name: sample10bt-4930136
features:
- name: text
dtype: string
- name: id
dtype: string
- name: dump
dtype: string
- name: url
dtype: string
- name: date
dtype: string
- name: file_path
dtype: string
- name: language
dtype: string
- name: language_score
dtype: float64
- name: token_count
dtype: int64
splits:
- name: train
num_bytes: 979453845
num_examples: 290008
download_size: 589568442
dataset_size: 979453845
- config_name: sample10bt-5220144
features:
- name: text
dtype: string
- name: id
dtype: string
- name: dump
dtype: string
- name: url
dtype: string
- name: date
dtype: string
- name: file_path
dtype: string
- name: language
dtype: string
- name: language_score
dtype: float64
- name: token_count
dtype: int64
splits:
- name: train
num_bytes: 981062571
num_examples: 290008
download_size: 590870288
dataset_size: 981062571
- config_name: sample10bt-5510152
features:
- name: text
dtype: string
- name: id
dtype: string
- name: dump
dtype: string
- name: url
dtype: string
- name: date
dtype: string
- name: file_path
dtype: string
- name: language
dtype: string
- name: language_score
dtype: float64
- name: token_count
dtype: int64
splits:
- name: train
num_bytes: 981563924
num_examples: 290008
download_size: 591012138
dataset_size: 981563924
- config_name: sample10bt-580016
features:
- name: text
dtype: string
- name: id
dtype: string
- name: dump
dtype: string
- name: url
dtype: string
- name: date
dtype: string
- name: file_path
dtype: string
- name: language
dtype: string
- name: language_score
dtype: float64
- name: token_count
dtype: int64
splits:
- name: train
num_bytes: 973592252
num_examples: 290008
download_size: 587456628
dataset_size: 973592252
- config_name: sample10bt-5800160
features:
- name: text
dtype: string
- name: id
dtype: string
- name: dump
dtype: string
- name: url
dtype: string
- name: date
dtype: string
- name: file_path
dtype: string
- name: language
dtype: string
- name: language_score
dtype: float64
- name: token_count
dtype: int64
splits:
- name: train
num_bytes: 979367823
num_examples: 290008
download_size: 589524522
dataset_size: 979367823
- config_name: sample10bt-6090168
features:
- name: text
dtype: string
- name: id
dtype: string
- name: dump
dtype: string
- name: url
dtype: string
- name: date
dtype: string
- name: file_path
dtype: string
- name: language
dtype: string
- name: language_score
dtype: float64
- name: token_count
dtype: int64
splits:
- name: train
num_bytes: 986494575
num_examples: 290008
download_size: 593063121
dataset_size: 986494575
- config_name: sample10bt-6380176
features:
- name: text
dtype: string
- name: id
dtype: string
- name: dump
dtype: string
- name: url
dtype: string
- name: date
dtype: string
- name: file_path
dtype: string
- name: language
dtype: string
- name: language_score
dtype: float64
- name: token_count
dtype: int64
splits:
- name: train
num_bytes: 985063594
num_examples: 290008
download_size: 592892070
dataset_size: 985063594
- config_name: sample10bt-6670184
features:
- name: text
dtype: string
- name: id
dtype: string
- name: dump
dtype: string
- name: url
dtype: string
- name: date
dtype: string
- name: file_path
dtype: string
- name: language
dtype: string
- name: language_score
dtype: float64
- name: token_count
dtype: int64
splits:
- name: train
num_bytes: 984476301
num_examples: 290008
download_size: 592780129
dataset_size: 984476301
- config_name: sample10bt-6960192
features:
- name: text
dtype: string
- name: id
dtype: string
- name: dump
dtype: string
- name: url
dtype: string
- name: date
dtype: string
- name: file_path
dtype: string
- name: language
dtype: string
- name: language_score
dtype: float64
- name: token_count
dtype: int64
splits:
- name: train
num_bytes: 980895711
num_examples: 290008
download_size: 590544189
dataset_size: 980895711
- config_name: sample10bt-7250200
features:
- name: text
dtype: string
- name: id
dtype: string
- name: dump
dtype: string
- name: url
dtype: string
- name: date
dtype: string
- name: file_path
dtype: string
- name: language
dtype: string
- name: language_score
dtype: float64
- name: token_count
dtype: int64
splits:
- name: train
num_bytes: 987200399
num_examples: 290008
download_size: 593909623
dataset_size: 987200399
- config_name: sample10bt-7540208
features:
- name: text
dtype: string
- name: id
dtype: string
- name: dump
dtype: string
- name: url
dtype: string
- name: date
dtype: string
- name: file_path
dtype: string
- name: language
dtype: string
- name: language_score
dtype: float64
- name: token_count
dtype: int64
splits:
- name: train
num_bytes: 984340891
num_examples: 290008
download_size: 592165422
dataset_size: 984340891
- config_name: sample10bt-7830216
features:
- name: text
dtype: string
- name: id
dtype: string
- name: dump
dtype: string
- name: url
dtype: string
- name: date
dtype: string
- name: file_path
dtype: string
- name: language
dtype: string
- name: language_score
dtype: float64
- name: token_count
dtype: int64
splits:
- name: train
num_bytes: 982026628
num_examples: 290008
download_size: 591036590
dataset_size: 982026628
- config_name: sample10bt-8120224
features:
- name: text
dtype: string
- name: id
dtype: string
- name: dump
dtype: string
- name: url
dtype: string
- name: date
dtype: string
- name: file_path
dtype: string
- name: language
dtype: string
- name: language_score
dtype: float64
- name: token_count
dtype: int64
splits:
- name: train
num_bytes: 985505505
num_examples: 290008
download_size: 592868517
dataset_size: 985505505
- config_name: sample10bt-8410232
features:
- name: text
dtype: string
- name: id
dtype: string
- name: dump
dtype: string
- name: url
dtype: string
- name: date
dtype: string
- name: file_path
dtype: string
- name: language
dtype: string
- name: language_score
dtype: float64
- name: token_count
dtype: int64
splits:
- name: train
num_bytes: 994961977
num_examples: 290008
download_size: 597775294
dataset_size: 994961977
- config_name: sample10bt-870024
features:
- name: text
dtype: string
- name: id
dtype: string
- name: dump
dtype: string
- name: url
dtype: string
- name: date
dtype: string
- name: file_path
dtype: string
- name: language
dtype: string
- name: language_score
dtype: float64
- name: token_count
dtype: int64
splits:
- name: train
num_bytes: 969199330
num_examples: 290008
download_size: 584786536
dataset_size: 969199330
- config_name: sample10bt-8700240
features:
- name: text
dtype: string
- name: id
dtype: string
- name: dump
dtype: string
- name: url
dtype: string
- name: date
dtype: string
- name: file_path
dtype: string
- name: language
dtype: string
- name: language_score
dtype: float64
- name: token_count
dtype: int64
splits:
- name: train
num_bytes: 996668327
num_examples: 290008
download_size: 598511850
dataset_size: 996668327
- config_name: sample10bt-8990248
features:
- name: text
dtype: string
- name: id
dtype: string
- name: dump
dtype: string
- name: url
dtype: string
- name: date
dtype: string
- name: file_path
dtype: string
- name: language
dtype: string
- name: language_score
dtype: float64
- name: token_count
dtype: int64
splits:
- name: train
num_bytes: 994382609
num_examples: 290008
download_size: 597511094
dataset_size: 994382609
- config_name: sample10bt-9280256
features:
- name: text
dtype: string
- name: id
dtype: string
- name: dump
dtype: string
- name: url
dtype: string
- name: date
dtype: string
- name: file_path
dtype: string
- name: language
dtype: string
- name: language_score
dtype: float64
- name: token_count
dtype: int64
splits:
- name: train
num_bytes: 993179135
num_examples: 290008
download_size: 597223946
dataset_size: 993179135
- config_name: sample10bt-9570264
features:
- name: text
dtype: string
- name: id
dtype: string
- name: dump
dtype: string
- name: url
dtype: string
- name: date
dtype: string
- name: file_path
dtype: string
- name: language
dtype: string
- name: language_score
dtype: float64
- name: token_count
dtype: int64
splits:
- name: train
num_bytes: 990235851
num_examples: 290008
download_size: 595184122
dataset_size: 990235851
- config_name: sample10bt-9860272
features:
- name: text
dtype: string
- name: id
dtype: string
- name: dump
dtype: string
- name: url
dtype: string
- name: date
dtype: string
- name: file_path
dtype: string
- name: language
dtype: string
- name: language_score
dtype: float64
- name: token_count
dtype: int64
splits:
- name: train
num_bytes: 991242320
num_examples: 290008
download_size: 594897638
dataset_size: 991242320
configs:
- config_name: sample10bt-0
data_files:
- split: train
path: sample10bt-0/train-*
- config_name: sample10bt-10150280
data_files:
- split: train
path: sample10bt-10150280/train-*
- config_name: sample10bt-10440288
data_files:
- split: train
path: sample10bt-10440288/train-*
- config_name: sample10bt-10730296
data_files:
- split: train
path: sample10bt-10730296/train-*
- config_name: sample10bt-11020304
data_files:
- split: train
path: sample10bt-11020304/train-*
- config_name: sample10bt-11310312
data_files:
- split: train
path: sample10bt-11310312/train-*
- config_name: sample10bt-1160032
data_files:
- split: train
path: sample10bt-1160032/train-*
- config_name: sample10bt-11600320
data_files:
- split: train
path: sample10bt-11600320/train-*
- config_name: sample10bt-11890328
data_files:
- split: train
path: sample10bt-11890328/train-*
- config_name: sample10bt-12180336
data_files:
- split: train
path: sample10bt-12180336/train-*
- config_name: sample10bt-12470344
data_files:
- split: train
path: sample10bt-12470344/train-*
- config_name: sample10bt-12760352
data_files:
- split: train
path: sample10bt-12760352/train-*
- config_name: sample10bt-13050360
data_files:
- split: train
path: sample10bt-13050360/train-*
- config_name: sample10bt-13340368
data_files:
- split: train
path: sample10bt-13340368/train-*
- config_name: sample10bt-13630376
data_files:
- split: train
path: sample10bt-13630376/train-*
- config_name: sample10bt-13920384
data_files:
- split: train
path: sample10bt-13920384/train-*
- config_name: sample10bt-14210392
data_files:
- split: train
path: sample10bt-14210392/train-*
- config_name: sample10bt-1450040
data_files:
- split: train
path: sample10bt-1450040/train-*
- config_name: sample10bt-1740048
data_files:
- split: train
path: sample10bt-1740048/train-*
- config_name: sample10bt-2030056
data_files:
- split: train
path: sample10bt-2030056/train-*
- config_name: sample10bt-2320064
data_files:
- split: train
path: sample10bt-2320064/train-*
- config_name: sample10bt-2610072
data_files:
- split: train
path: sample10bt-2610072/train-*
- config_name: sample10bt-290008
data_files:
- split: train
path: sample10bt-290008/train-*
- config_name: sample10bt-2900080
data_files:
- split: train
path: sample10bt-2900080/train-*
- config_name: sample10bt-3190088
data_files:
- split: train
path: sample10bt-3190088/train-*
- config_name: sample10bt-3480096
data_files:
- split: train
path: sample10bt-3480096/train-*
- config_name: sample10bt-3770104
data_files:
- split: train
path: sample10bt-3770104/train-*
- config_name: sample10bt-4060112
data_files:
- split: train
path: sample10bt-4060112/train-*
- config_name: sample10bt-4350120
data_files:
- split: train
path: sample10bt-4350120/train-*
- config_name: sample10bt-4640128
data_files:
- split: train
path: sample10bt-4640128/train-*
- config_name: sample10bt-4930136
data_files:
- split: train
path: sample10bt-4930136/train-*
- config_name: sample10bt-5220144
data_files:
- split: train
path: sample10bt-5220144/train-*
- config_name: sample10bt-5510152
data_files:
- split: train
path: sample10bt-5510152/train-*
- config_name: sample10bt-580016
data_files:
- split: train
path: sample10bt-580016/train-*
- config_name: sample10bt-5800160
data_files:
- split: train
path: sample10bt-5800160/train-*
- config_name: sample10bt-6090168
data_files:
- split: train
path: sample10bt-6090168/train-*
- config_name: sample10bt-6380176
data_files:
- split: train
path: sample10bt-6380176/train-*
- config_name: sample10bt-6670184
data_files:
- split: train
path: sample10bt-6670184/train-*
- config_name: sample10bt-6960192
data_files:
- split: train
path: sample10bt-6960192/train-*
- config_name: sample10bt-7250200
data_files:
- split: train
path: sample10bt-7250200/train-*
- config_name: sample10bt-7540208
data_files:
- split: train
path: sample10bt-7540208/train-*
- config_name: sample10bt-7830216
data_files:
- split: train
path: sample10bt-7830216/train-*
- config_name: sample10bt-8120224
data_files:
- split: train
path: sample10bt-8120224/train-*
- config_name: sample10bt-8410232
data_files:
- split: train
path: sample10bt-8410232/train-*
- config_name: sample10bt-870024
data_files:
- split: train
path: sample10bt-870024/train-*
- config_name: sample10bt-8700240
data_files:
- split: train
path: sample10bt-8700240/train-*
- config_name: sample10bt-8990248
data_files:
- split: train
path: sample10bt-8990248/train-*
- config_name: sample10bt-9280256
data_files:
- split: train
path: sample10bt-9280256/train-*
- config_name: sample10bt-9570264
data_files:
- split: train
path: sample10bt-9570264/train-*
- config_name: sample10bt-9860272
data_files:
- split: train
path: sample10bt-9860272/train-*
---
The dataset comprises 50 configurations, all with the same nine features: text, id, dump, url, date, file_path, language, language_score, and token_count. Each configuration has a single train split containing 290008 examples. The dataset is primarily used for English text processing.
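As a minimal sketch of how one configuration might be loaded, the snippet below uses `load_dataset` from the Hugging Face `datasets` library. The repository ID comes from the page title and the config name from the card metadata. The helper names `build_load_args` and `load_config` are hypothetical, and it is an assumption that the repository is still reachable on the Hub.

```python
from typing import Any, Dict

REPO_ID = "dododo1234/fineweb"  # repository name taken from the page title


def build_load_args(config_name: str, streaming: bool = True) -> Dict[str, Any]:
    """Assemble keyword arguments for datasets.load_dataset (hypothetical helper)."""
    return {
        "path": REPO_ID,
        "name": config_name,     # e.g. "sample10bt-0"
        "split": "train",        # the only split listed in the card
        "streaming": streaming,  # stream instead of downloading ~590 MB up front
    }


def load_config(config_name: str = "sample10bt-0", streaming: bool = True) -> Any:
    """Load the train split of one configuration. Requires network access."""
    # Imported lazily so the module can be imported without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(**build_load_args(config_name, streaming))
```

Streaming is chosen as the default here only because each configuration is close to 1 GB uncompressed; pass `streaming=False` to materialize the split locally.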
Provider:
dododo1234
Summary of original information
Dataset overview
Dataset configuration information
- config_name: multiple configuration names, such as sample10bt-0, sample10bt-10150280, and sample10bt-10440288.
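Every numeric suffix in the card metadata is a multiple of 290008, running from 0 to 14210392 (49 × 290008). The sketch below regenerates the 50 names on that basis. Reading the suffix as a starting row offset is an assumption, not something the card states, and the function name `config_names` is hypothetical.

```python
from typing import List

ROWS_PER_CONFIG = 290008  # num_examples of every train split in the card
NUM_CONFIGS = 50          # number of config_name entries in the card


def config_names(rows: int = ROWS_PER_CONFIG, count: int = NUM_CONFIGS) -> List[str]:
    """Return the configuration names in numeric order of their suffix."""
    return [f"sample10bt-{i * rows}" for i in range(count)]


names = config_names()
# The card lists the names in string-sorted order, not numeric order.
card_order = sorted(names)
```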
Dataset features
- Names: text, id, dump, url, date, file_path, language, language_score, token_count.
- Data types: string, float64, int64.
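The feature list above can be turned into a lightweight record check. This is a sketch under the assumption that string maps to Python `str`, float64 to `float`, and int64 to `int`. The function `validate_record` and the sample record values are hypothetical and not taken from the dataset.

```python
from typing import Any, Dict, List

# Feature names and dtypes from the card, mapped to Python types (assumed mapping).
EXPECTED_FEATURES: Dict[str, type] = {
    "text": str,
    "id": str,
    "dump": str,
    "url": str,
    "date": str,
    "file_path": str,
    "language": str,
    "language_score": float,
    "token_count": int,
}


def validate_record(record: Dict[str, Any]) -> List[str]:
    """Return a list of schema problems; an empty list means the record conforms."""
    problems: List[str] = []
    for name, py_type in EXPECTED_FEATURES.items():
        if name not in record:
            problems.append(f"missing field: {name}")
            continue
        value = record[name]
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, py_type):
            problems.append(f"wrong type for {name}: {type(value).__name__}")
    for name in record:
        if name not in EXPECTED_FEATURES:
            problems.append(f"unexpected field: {name}")
    return problems


# Made-up record used only to exercise the check.
example = {
    "text": "hello world", "id": "doc-0", "dump": "dump-0",
    "url": "https://example.com", "date": "2024-01-01", "file_path": "path/0",
    "language": "en", "language_score": 0.99, "token_count": 2,
}
```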
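Dataset splits
- Split name: train in every configuration.
- Size information: in every configuration the train split has num_examples of 290008 and num_bytes of roughly 980 million bytes.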
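Dataset size
- Download size: around 590 million bytes per configuration (between about 585 and 604 million).
- Dataset size: around 980 million bytes per configuration (between about 0.97 and 1.01 billion).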
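To make the orders of magnitude concrete, the sketch below derives a few ratios from the sample10bt-0 entry in the card metadata. The function name `summarize` and the choice of derived quantities are illustrative assumptions. The card does not say how the download is compressed, so the ratio is just download_size divided by dataset_size.

```python
from typing import Dict

# Values copied from the sample10bt-0 entry in the card metadata.
NUM_BYTES = 979947523
NUM_EXAMPLES = 290008
DOWNLOAD_SIZE = 590161503


def summarize(num_bytes: int, num_examples: int, download_size: int) -> Dict[str, float]:
    """Derive per-example and ratio statistics from the card's size fields."""
    return {
        "bytes_per_example": num_bytes / num_examples,
        "download_ratio": download_size / num_bytes,
        "dataset_gb": num_bytes / 1e9,  # decimal gigabytes
    }


stats = summarize(NUM_BYTES, NUM_EXAMPLES, DOWNLOAD_SIZE)
```

For this configuration the download is roughly 60% of the in-memory dataset size, and each example averages a little under 3.4 KB.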
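Dataset details
Configuration sample10bt-0
- Features: same as above.
- Splits: the train split has num_bytes of 979947523 bytes and num_examples of 290008.
- Size: download_size is 590161503 bytes; dataset_size is 979947523 bytes.
Configuration sample10bt-10150280
- Features: same as above.
- Splits: the train split has num_bytes of 993074330 bytes and num_examples of 290008.
- Size: download_size is 596140547 bytes; dataset_size is 993074330 bytes.
Configuration sample10bt-10440288
- Features: same as above.
- Splits: the train split has num_bytes of 988913778 bytes and num_examples of 290008.
- Size: download_size is 594008653 bytes; dataset_size is 988913778 bytes.
Configuration sample10bt-10730296
- Features: same as above.
- Splits: the train split has num_bytes of 992590689 bytes and num_examples of 290008.
- Size: download_size is 596255798 bytes; dataset_size is 992590689 bytes.
Configuration sample10bt-11020304
- Features: same as above.
- Splits: the train split has num_bytes of 1000994216 bytes and num_examples of 290008.
- Size: download_size is 600446360 bytes; dataset_size is 1000994216 bytes.
Configuration sample10bt-11310312
- Features: same as above.
- Splits: the train split has num_bytes of 1003260462 bytes and num_examples of 290008.
- Size: download_size is 602303904 bytes; dataset_size is 1003260462 bytes.
Configuration sample10bt-1160032
- Features: same as above.
- Splits: the train split has num_bytes of 984660083 bytes and num_examples of 290008.
- Size: download_size is 593684839 bytes; dataset_size is 984660083 bytes.
Configuration sample10bt-11600320
- Features: same as above.
- Splits: the train split has num_bytes of 1003701977 bytes and num_examples of 290008.
- Size: download_size is 602486295 bytes; dataset_size is 1003701977 bytes.
Configuration sample10bt-11890328
- Features: same as above.
- Splits: the train split has num_bytes of 1005174566 bytes and num_examples of 290008.
- Size: download_size is 603592910 bytes; dataset_size is 1005174566 bytes.
Configuration sample10bt-12180336
- Features: same as above.
- Splits: the train split has num_bytes of 1001479206 bytes and num_examples of 290008.
- Size: download_size is 601526731 bytes; dataset_size is 1001479206 bytes.
Configuration sample10bt-12470344
- Features: same as above.
- Splits: the train split has num_bytes of 997656177 bytes and num_examples of 290008.
- Size: download_size is 599795270 bytes; dataset_size is 997656177 bytes.
Configuration sample10bt-12760352
- Features: same as above.
- Splits: the train split has num_bytes of 995369007 bytes and num_examples of 290008.
- Size: download_size is 598308150 bytes; dataset_size is 995369007 bytes.
Configuration sample10bt-13050360
- Features: same as above.
- Splits: the train split has num_bytes of 995696120 bytes and num_examples of 290008.
- Size: download_size is 598678925 bytes; dataset_size is 995696120 bytes.
Configuration sample10bt-13340368
- Features: same as above.
- Splits: the train split has num_bytes of 985424533 bytes and num_examples of 290008.
- Size: download_size is 593220864 bytes; dataset_size is 985424533 bytes.
Configuration sample10bt-13630376
- Features: same as above.
- Splits: the train split has num_bytes of 986689008 bytes and num_examples of 290008.
- Size: download_size is 594972485 bytes; dataset_size is 986689008 bytes.
Configuration sample10bt-13920384
- Features: same as above.
- Splits: the train split has num_bytes of 983042536 bytes and num_examples of 290008.
- Size: download_size is 592682782 bytes; dataset_size is 983042536 bytes.
Configuration sample10bt-14210392
- Features: same as above.
- Splits: the train split has num_bytes of 974595216 bytes and num_examples of 290008.
- Size: download_size is 587924013 bytes; dataset_size is 974595216 bytes.
Configuration sample10bt-1450040
- Features: same as above.
- Splits: the train split has num_bytes of 979367659 bytes and num_examples of 290008.
- Size: download_size is 590238554 bytes; dataset_size is 979367659 bytes.
Configuration sample10bt-1740048
- Features: same as above.
- Splits: the train split has num_bytes of 980964151 bytes and num_examples of 290008.
- Size: download_size is 591922978 bytes; dataset_size is 980964151 bytes.
Configuration sample10bt-2030056
- Features: same as above.
- Splits: the train split has num_bytes of 979597732 bytes and num_examples of 290008.
- Size: download_size is 591009136 bytes; dataset_size is 979597732 bytes.
Configuration sample10bt-2320064
- Features: same as above.
- Splits: the train split has num_bytes of 977164918 bytes and num_examples of 290008.
- Size: download_size is 589114003 bytes; dataset_size is 977164918 bytes.
Configuration sample10bt-2610072
- Features: same as above.
- Splits: the train split has num_bytes of 979829645 bytes and num_examples of 290008.
- Size: download_size is 590546192 bytes; dataset_size is 979829645 bytes.
Configuration sample10bt-290008
- Features: same as above.
- Splits: the train split has num_bytes of 977396405 bytes and num_examples of 290008.
- Size: download_size is 589367664 bytes; dataset_size is 977396405 bytes.
Configuration sample10bt-2900080
- Features: same as above.
- Splits: the train split has num_bytes of 980144418 bytes and num_examples of 290008.
- Size: download_size is 591075940 bytes; dataset_size is 980144418 bytes.
Configuration sample10bt-3190088
- Features: same as above.
- Splits: the train split has num_bytes of 981462351 bytes and num_examples of 290008.
- Size: download_size is 592293366 bytes; dataset_size is 981462351 bytes.
Configuration sample10bt-3480096
- Features: same as above.
- Splits: the train split has num_bytes of 981505785 bytes and num_examples of 290008.
- Size: download_size is 591388473 bytes; dataset_size is 981505785 bytes.
Configuration sample10bt-3770104
- Features: same as above.
- Splits: the train split has num_bytes of 990661398 bytes and num_examples of 290008.
- Size: download_size is 596809014 bytes; dataset_size is 990661398 bytes.
Configuration sample10bt-4060112
- Features: same as above.
- Splits: the train split has num_bytes of 986106212 bytes and num_examples of 290008.
- Size: download_size is 594370141 bytes; dataset_size is 986106212 bytes.
Configuration sample10bt-4350120
- Features: same as above.
- Splits: the train split has num_bytes of 980994956 bytes and num_examples of 290008.
- Size: download_size is 591229442 bytes; dataset_size is 980994956 bytes.
Configuration sample10bt-4640128
- Features: same as above.
- Splits: the train split has num_bytes of 986543200 bytes and num_examples of 290008.
- Size: download_size is 593865155 bytes; dataset_size is 986543200 bytes.
Configuration sample10bt-4930136
- Features: same as above.
- Splits: the train split has num_bytes of 979453845 bytes and num_examples of 290008.
- Size: download_size is 589568442 bytes; dataset_size is 979453845 bytes.
Configuration sample10bt-5220144
- Features: same as above.
- Splits: the train split has num_bytes of 981062571 bytes and num_examples of 290008.
- Size: download_size is 590870288 bytes; dataset_size is 981062571 bytes.
Configuration sample10bt-5510152
- Features: same as above.
- Splits: the train split has num_bytes of 981563924 bytes and num_examples of 290008.
- Size: download_size is 591012138 bytes; dataset_size is 981563924 bytes.
Configuration sample10bt-580016
- Features: same as above.
- Splits: the train split has num_bytes of 973592252 bytes and num_examples of 290008.
- Size: download_size is 587456628 bytes; dataset_size is 973592252 bytes.
Configuration sample10bt-5800160
- Features: same as above.
- Splits: the train split has num_bytes of 979367823 bytes and num_examples of 290008.
- Size: download_size is 58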



