OmBayus/turkgpt_dataset
Source: Hugging Face. Updated 2024-05-31; indexed 2024-06-12.
Download link:
https://hf-mirror.com/datasets/OmBayus/turkgpt_dataset
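The card metadata on this page describes configs named `data1`, `data2`, and so on, each with a single `train` split and the same nested record layout (`id`, `text`, and a `meta` struct holding `identification` and `warc_headers`). The sketch below is one possible way to load a config with the Hugging Face `datasets` library and flatten a record. The helper names `config_names`, `load_config`, and `summarize` are hypothetical, the sample record is made up to match the listed field names, and the full set of config names should be checked against the repository, since the listing here may be incomplete.

```python
from typing import Any, Dict, List

REPO_ID = "OmBayus/turkgpt_dataset"  # repository id from the page title


def config_names(first: int, last: int) -> List[str]:
    """Build config names following the 'data<N>' pattern seen in the card."""
    return [f"data{i}" for i in range(first, last + 1)]


def load_config(name: str):
    """Load the train split of one config.

    Needs the `datasets` package and network access, so the import is
    done lazily and this function is not called in the demo below.
    """
    from datasets import load_dataset

    return load_dataset(REPO_ID, name, split="train")


def summarize(record: Dict[str, Any]) -> Dict[str, Any]:
    """Flatten the nested fields that the card lists under `meta`."""
    meta = record["meta"]
    ident = meta["identification"]
    headers = meta["warc_headers"]
    return {
        "id": record["id"],
        "language": ident["label"],
        "prob": ident["prob"],
        "uri": headers["warc-target-uri"],
        "n_chars": len(record["text"]),
    }


# Made-up record that only mirrors the field names and dtypes in the card;
# the values are illustrative, not taken from the dataset.
sample = {
    "id": 0,
    "text": "Merhaba dünya",
    "meta": {
        "annotations": [],
        "identification": {"label": "tr", "prob": 0.99},
        "line_identifications": [{"label": "tr", "prob": 0.99}],
        "warc_headers": {
            "content-length": 13,
            "content-type": "text/plain",
            "warc-block-digest": "sha1:EXAMPLE",
            "warc-date": "2024-01-01T00:00:00Z",
            "warc-identified-content-language": "tur",
            "warc-record-id": "<urn:uuid:example>",
            "warc-refers-to": "<urn:uuid:example-ref>",
            "warc-target-uri": "https://example.com/",
            "warc-type": "conversion",
        },
    },
}

print(config_names(1, 3))
print(summarize(sample))
```

With network access, `load_config("data1")` would return the `train` split of the first config, and `summarize` could then be applied to each row. Fields such as `annotations` may be empty or null in real rows, so production code would likely need extra checks.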
Resource description:
---
dataset_info:
- config_name: data1
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 61550128
num_examples: 10001
download_size: 32735291
dataset_size: 61550128
- config_name: data10
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 59929776
num_examples: 10001
download_size: 31958468
dataset_size: 59929776
- config_name: data11
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 61993909
num_examples: 10001
download_size: 32701409
dataset_size: 61993909
- config_name: data12
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 60284789
num_examples: 10001
download_size: 32220771
dataset_size: 60284789
- config_name: data13
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 61878232
num_examples: 10001
download_size: 32749735
dataset_size: 61878232
- config_name: data14
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 66594693
num_examples: 10001
download_size: 34993030
dataset_size: 66594693
- config_name: data15
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 63150025
num_examples: 10001
download_size: 32953507
dataset_size: 63150025
- config_name: data16
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 60484472
num_examples: 10001
download_size: 32513368
dataset_size: 60484472
- config_name: data17
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 67494389
num_examples: 10001
download_size: 34793278
dataset_size: 67494389
- config_name: data18
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 62339874
num_examples: 10001
download_size: 33009623
dataset_size: 62339874
- config_name: data19
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 60629042
num_examples: 10001
download_size: 32090817
dataset_size: 60629042
- config_name: data2
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 61030628
num_examples: 10001
download_size: 32281304
dataset_size: 61030628
- config_name: data20
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 63924588
num_examples: 10001
download_size: 34116787
dataset_size: 63924588
- config_name: data21
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 63125007
num_examples: 10001
download_size: 32800366
dataset_size: 63125007
- config_name: data22
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 60635052
num_examples: 10001
download_size: 32331430
dataset_size: 60635052
- config_name: data23
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 63414668
num_examples: 10001
download_size: 33301249
dataset_size: 63414668
- config_name: data24
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 59973694
num_examples: 10001
download_size: 31980512
dataset_size: 59973694
- config_name: data25
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 60713828
num_examples: 10001
download_size: 31770660
dataset_size: 60713828
- config_name: data26
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 60186412
num_examples: 10001
download_size: 32128174
dataset_size: 60186412
- config_name: data27
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 62824691
num_examples: 10001
download_size: 33474263
dataset_size: 62824691
- config_name: data28
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 60984865
num_examples: 10001
download_size: 32510815
dataset_size: 60984865
- config_name: data29
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 64690071
num_examples: 10001
download_size: 33753848
dataset_size: 64690071
- config_name: data3
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 63682549
num_examples: 10001
download_size: 33779451
dataset_size: 63682549
- config_name: data30
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 62665600
num_examples: 10001
download_size: 33206968
dataset_size: 62665600
- config_name: data31
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 59044250
num_examples: 10001
download_size: 31340118
dataset_size: 59044250
- config_name: data32
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 61916362
num_examples: 10001
download_size: 32820540
dataset_size: 61916362
- config_name: data33
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 59540747
num_examples: 10001
download_size: 31852440
dataset_size: 59540747
- config_name: data34
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 64466602
num_examples: 10001
download_size: 33664217
dataset_size: 64466602
- config_name: data35
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 61348344
num_examples: 10001
download_size: 32692090
dataset_size: 61348344
- config_name: data36
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 59400095
num_examples: 10001
download_size: 31855963
dataset_size: 59400095
- config_name: data37
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 61937184
num_examples: 10001
download_size: 32976451
dataset_size: 61937184
- config_name: data38
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 62482735
num_examples: 10001
download_size: 33099220
dataset_size: 62482735
- config_name: data39
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 63780004
num_examples: 10001
download_size: 33800479
dataset_size: 63780004
- config_name: data4
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 62482740
num_examples: 10001
download_size: 33033871
dataset_size: 62482740
- config_name: data40
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 64953475
num_examples: 10001
download_size: 33798062
dataset_size: 64953475
- config_name: data41
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 62831211
num_examples: 10001
download_size: 33416355
dataset_size: 62831211
- config_name: data42
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 63514725
num_examples: 10001
download_size: 33605969
dataset_size: 63514725
- config_name: data43
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 61121820
num_examples: 10001
download_size: 32418829
dataset_size: 61121820
- config_name: data44
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 62437178
num_examples: 10001
download_size: 33275805
dataset_size: 62437178
- config_name: data45
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 63552953
num_examples: 10001
download_size: 33277579
dataset_size: 63552953
- config_name: data46
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 61589087
num_examples: 10001
download_size: 32810886
dataset_size: 61589087
- config_name: data47
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 59646094
num_examples: 10001
download_size: 31840286
dataset_size: 59646094
- config_name: data48
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 62723780
num_examples: 10001
download_size: 32758173
dataset_size: 62723780
- config_name: data49
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 61147516
num_examples: 10001
download_size: 32702926
dataset_size: 61147516
- config_name: data5
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 63689965
num_examples: 10001
download_size: 33871247
dataset_size: 63689965
- config_name: data50
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 61512893
num_examples: 10001
download_size: 32611271
dataset_size: 61512893
- config_name: data51
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 61705464
num_examples: 10001
download_size: 32858117
dataset_size: 61705464
- config_name: data52
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 61555065
num_examples: 10001
download_size: 32860283
dataset_size: 61555065
- config_name: data53
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 62444873
num_examples: 10001
download_size: 33148348
dataset_size: 62444873
- config_name: data54
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 65564968
num_examples: 10001
download_size: 34296006
dataset_size: 65564968
- config_name: data55
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 64146778
num_examples: 10001
download_size: 34473347
dataset_size: 64146778
- config_name: data56
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 61163364
num_examples: 10001
download_size: 32439613
dataset_size: 61163364
- config_name: data57
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 63140014
num_examples: 10001
download_size: 33855901
dataset_size: 63140014
- config_name: data58
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 61741802
num_examples: 10001
download_size: 32621415
dataset_size: 61741802
- config_name: data59
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 62879029
num_examples: 10001
download_size: 33018221
dataset_size: 62879029
- config_name: data6
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 63910578
num_examples: 10001
download_size: 33841560
dataset_size: 63910578
- config_name: data60
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 63569320
num_examples: 10001
download_size: 33332176
dataset_size: 63569320
- config_name: data61
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 64146476
num_examples: 10001
download_size: 34275410
dataset_size: 64146476
- config_name: data62
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 63551621
num_examples: 10001
download_size: 34185955
dataset_size: 63551621
- config_name: data63
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 59413794
num_examples: 10001
download_size: 31803865
dataset_size: 59413794
- config_name: data64
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 62791937
num_examples: 10001
download_size: 33288978
dataset_size: 62791937
- config_name: data65
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 62531587
num_examples: 10001
download_size: 33080464
dataset_size: 62531587
- config_name: data66
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 61203587
num_examples: 10001
download_size: 32510423
dataset_size: 61203587
- config_name: data67
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 65998149
num_examples: 10001
download_size: 34812969
dataset_size: 65998149
- config_name: data68
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 62369191
num_examples: 10001
download_size: 32947002
dataset_size: 62369191
- config_name: data69
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 62481955
num_examples: 10001
download_size: 33348487
dataset_size: 62481955
- config_name: data7
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 60395612
num_examples: 10001
download_size: 32226488
dataset_size: 60395612
- config_name: data70
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 63485907
num_examples: 10001
download_size: 33493531
dataset_size: 63485907
- config_name: data71
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 61239866
num_examples: 10001
download_size: 32520049
dataset_size: 61239866
- config_name: data72
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 61981518
num_examples: 10001
download_size: 32740696
dataset_size: 61981518
- config_name: data73
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 64391172
num_examples: 10001
download_size: 34132048
dataset_size: 64391172
- config_name: data74
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 62481521
num_examples: 10001
download_size: 33398263
dataset_size: 62481521
- config_name: data75
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 62211691
num_examples: 10001
download_size: 33504718
dataset_size: 62211691
- config_name: data76
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 64363166
num_examples: 10001
download_size: 33635281
dataset_size: 64363166
- config_name: data77
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 61680297
num_examples: 10001
download_size: 32940699
dataset_size: 61680297
- config_name: data78
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 62822874
num_examples: 10001
download_size: 33235449
dataset_size: 62822874
- config_name: data79
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 60695144
num_examples: 10001
download_size: 32330859
dataset_size: 60695144
- config_name: data8
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 64333820
num_examples: 10001
download_size: 34259347
dataset_size: 64333820
- config_name: data80
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 62396175
num_examples: 10001
download_size: 33456534
dataset_size: 62396175
- config_name: data81
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 62304438
num_examples: 10001
download_size: 33072014
dataset_size: 62304438
- config_name: data82
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 60840189
num_examples: 10001
download_size: 32170622
dataset_size: 60840189
- config_name: data83
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 59641230
num_examples: 10001
download_size: 31887045
dataset_size: 59641230
- config_name: data84
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 61332005
num_examples: 10001
download_size: 32556575
dataset_size: 61332005
- config_name: data85
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 61513004
num_examples: 10001
download_size: 32734754
dataset_size: 61513004
- config_name: data86
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 64232108
num_examples: 10001
download_size: 34250871
dataset_size: 64232108
- config_name: data87
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 61397066
num_examples: 10001
download_size: 32724271
dataset_size: 61397066
- config_name: data88
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 64129874
num_examples: 10001
download_size: 33668112
dataset_size: 64129874
- config_name: data89
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 60652111
num_examples: 10001
download_size: 32384035
dataset_size: 60652111
- config_name: data9
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 60110137
num_examples: 10001
download_size: 31944196
dataset_size: 60110137
- config_name: data90
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 61524961
num_examples: 10001
download_size: 32654036
dataset_size: 61524961
- config_name: data91
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 63387196
num_examples: 10001
download_size: 33537871
dataset_size: 63387196
- config_name: data92
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 62139108
num_examples: 10001
download_size: 33038580
dataset_size: 62139108
- config_name: data93
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 59949112
num_examples: 10001
download_size: 32211071
dataset_size: 59949112
- config_name: data94
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 64914469
num_examples: 10001
download_size: 33815861
dataset_size: 64914469
- config_name: data95
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 64031504
num_examples: 10001
download_size: 33939704
dataset_size: 64031504
- config_name: data96
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 62561518
num_examples: 10001
download_size: 33220152
dataset_size: 62561518
- config_name: data97
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 62487975
num_examples: 10001
download_size: 33073081
dataset_size: 62487975
- config_name: data98
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 62743213
num_examples: 10001
download_size: 33174861
dataset_size: 62743213
- config_name: data99
features:
- name: id
dtype: int64
- name: text
dtype: string
- name: meta
struct:
- name: annotations
sequence: string
- name: identification
struct:
- name: label
dtype: string
- name: prob
dtype: float64
- name: line_identifications
list:
- name: label
dtype: string
- name: prob
dtype: float64
- name: warc_headers
struct:
- name: content-length
dtype: int64
- name: content-type
dtype: string
- name: warc-block-digest
dtype: string
- name: warc-date
dtype: string
- name: warc-identified-content-language
dtype: string
- name: warc-record-id
dtype: string
- name: warc-refers-to
dtype: string
- name: warc-target-uri
dtype: string
- name: warc-type
dtype: string
splits:
- name: train
num_bytes: 61145550
num_examples: 10001
download_size: 31945385
dataset_size: 61145550
configs:
- config_name: data1
data_files:
- split: train
path: data1/train-*
- config_name: data10
data_files:
- split: train
path: data10/train-*
- config_name: data11
data_files:
- split: train
path: data11/train-*
- config_name: data12
data_files:
- split: train
path: data12/train-*
- config_name: data13
data_files:
- split: train
path: data13/train-*
- config_name: data14
data_files:
- split: train
path: data14/train-*
- config_name: data15
data_files:
- split: train
path: data15/train-*
- config_name: data16
data_files:
- split: train
path: data16/train-*
- config_name: data17
data_files:
- split: train
path: data17/train-*
- config_name: data18
data_files:
- split: train
path: data18/train-*
- config_name: data19
data_files:
- split: train
path: data19/train-*
- config_name: data2
data_files:
- split: train
path: data2/train-*
- config_name: data20
data_files:
- split: train
path: data20/train-*
- config_name: data21
data_files:
- split: train
path: data21/train-*
- config_name: data22
data_files:
- split: train
path: data22/train-*
- config_name: data23
data_files:
- split: train
path: data23/train-*
- config_name: data24
data_files:
- split: train
path: data24/train-*
- config_name: data25
data_files:
- split: train
path: data25/train-*
- config_name: data26
data_files:
- split: train
path: data26/train-*
- config_name: data27
data_files:
- split: train
path: data27/train-*
- config_name: data28
data_files:
- split: train
path: data28/train-*
- config_name: data29
data_files:
- split: train
path: data29/train-*
- config_name: data3
data_files:
- split: train
path: data3/train-*
- config_name: data30
data_files:
- split: train
path: data30/train-*
- config_name: data31
data_files:
- split: train
path: data31/train-*
- config_name: data32
data_files:
- split: train
path: data32/train-*
- config_name: data33
data_files:
- split: train
path: data33/train-*
- config_name: data34
data_files:
- split: train
path: data34/train-*
- config_name: data35
data_files:
- split: train
path: data35/train-*
- config_name: data36
data_files:
- split: train
path: data36/train-*
- config_name: data37
data_files:
- split: train
path: data37/train-*
- config_name: data38
data_files:
- split: train
path: data38/train-*
- config_name: data39
data_files:
- split: train
path: data39/train-*
- config_name: data4
data_files:
- split: train
path: data4/train-*
- config_name: data40
data_files:
- split: train
path: data40/train-*
- config_name: data41
data_files:
- split: train
path: data41/train-*
- config_name: data42
data_files:
- split: train
path: data42/train-*
- config_name: data43
data_files:
- split: train
path: data43/train-*
- config_name: data44
data_files:
- split: train
path: data44/train-*
- config_name: data45
data_files:
- split: train
path: data45/train-*
- config_name: data46
data_files:
- split: train
path: data46/train-*
- config_name: data47
data_files:
- split: train
path: data47/train-*
- config_name: data48
data_files:
- split: train
path: data48/train-*
- config_name: data49
data_files:
- split: train
path: data49/train-*
- config_name: data5
data_files:
- split: train
path: data5/train-*
- config_name: data50
data_files:
- split: train
path: data50/train-*
- config_name: data51
data_files:
- split: train
path: data51/train-*
- config_name: data52
data_files:
- split: train
path: data52/train-*
- config_name: data53
data_files:
- split: train
path: data53/train-*
- config_name: data54
data_files:
- split: train
path: data54/train-*
- config_name: data55
data_files:
- split: train
path: data55/train-*
- config_name: data56
data_files:
- split: train
path: data56/train-*
- config_name: data57
data_files:
- split: train
path: data57/train-*
- config_name: data58
data_files:
- split: train
path: data58/train-*
- config_name: data59
data_files:
- split: train
path: data59/train-*
- config_name: data6
data_files:
- split: train
path: data6/train-*
- config_name: data60
data_files:
- split: train
path: data60/train-*
- config_name: data61
data_files:
- split: train
path: data61/train-*
- config_name: data62
data_files:
- split: train
path: data62/train-*
- config_name: data63
data_files:
- split: train
path: data63/train-*
- config_name: data64
data_files:
- split: train
path: data64/train-*
- config_name: data65
data_files:
- split: train
path: data65/train-*
- config_name: data66
data_files:
- split: train
path: data66/train-*
- config_name: data67
data_files:
- split: train
path: data67/train-*
- config_name: data68
data_files:
- split: train
path: data68/train-*
- config_name: data69
data_files:
- split: train
path: data69/train-*
- config_name: data7
data_files:
- split: train
path: data7/train-*
- config_name: data70
data_files:
- split: train
path: data70/train-*
- config_name: data71
data_files:
- split: train
path: data71/train-*
- config_name: data72
data_files:
- split: train
path: data72/train-*
- config_name: data73
data_files:
- split: train
path: data73/train-*
- config_name: data74
data_files:
- split: train
path: data74/train-*
- config_name: data75
data_files:
- split: train
path: data75/train-*
- config_name: data76
data_files:
- split: train
path: data76/train-*
- config_name: data77
data_files:
- split: train
path: data77/train-*
- config_name: data78
data_files:
- split: train
path: data78/train-*
- config_name: data79
data_files:
- split: train
path: data79/train-*
- config_name: data8
data_files:
- split: train
path: data8/train-*
- config_name: data80
data_files:
- split: train
path: data80/train-*
- config_name: data81
data_files:
- split: train
path: data81/train-*
- config_name: data82
data_files:
- split: train
path: data82/train-*
- config_name: data83
data_files:
- split: train
path: data83/train-*
- config_name: data84
data_files:
- split: train
path: data84/train-*
- config_name: data85
data_files:
- split: train
path: data85/train-*
- config_name: data86
data_files:
- split: train
path: data86/train-*
- config_name: data87
data_files:
- split: train
path: data87/train-*
- config_name: data88
data_files:
- split: train
path: data88/train-*
- config_name: data89
data_files:
- split: train
path: data89/train-*
- config_name: data9
data_files:
- split: train
path: data9/train-*
- config_name: data90
data_files:
- split: train
path: data90/train-*
- config_name: data91
data_files:
- split: train
path: data91/train-*
- config_name: data92
data_files:
- split: train
path: data92/train-*
- config_name: data93
data_files:
- split: train
path: data93/train-*
- config_name: data94
data_files:
- split: train
path: data94/train-*
- config_name: data95
data_files:
- split: train
path: data95/train-*
- config_name: data96
data_files:
- split: train
path: data96/train-*
- config_name: data97
data_files:
- split: train
path: data97/train-*
- config_name: data98
data_files:
- split: train
path: data98/train-*
- config_name: data99
data_files:
- split: train
path: data99/train-*
language:
- tr
- en
tags:
- turkgpt
---
This dataset is split into 99 configurations (data1 through data99), each with the same three features: id, text, and meta. The meta feature is a struct with four nested fields: annotations, identification, line_identifications, and warc_headers. Each configuration defines a single train split of 10001 examples, and the metadata records the split's size in bytes along with the download size and dataset size of that configuration.
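A minimal sketch of loading one configuration, assuming the Hugging Face `datasets` library is installed and the repository `OmBayus/turkgpt_dataset` is reachable. The helper names `config_names` and `load_config` are hypothetical and introduced only for illustration; the range data1 to data99 is taken from the `configs` list in the card metadata.

```python
REPO_ID = "OmBayus/turkgpt_dataset"


def config_names(first=1, last=99):
    """Return config names data<first>..data<last> in numeric order."""
    return [f"data{i}" for i in range(first, last + 1)]


def load_config(name, split="train"):
    """Download and load one configuration (needs network access)."""
    # Imported lazily so config_names() works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, name, split=split)
```

Calling `load_config("data1")` should return the train split of that configuration. Loading the configurations one at a time keeps each download to roughly 32 to 35 MB, going by the `download_size` values in the metadata.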
Provider:
OmBayus
Summary of original information
Dataset overview
Dataset configurations
- config_name: data1 through data99 (99 configurations, all with the same features)
Dataset features
- id: int64
- text: string
- meta: struct with the following sub-features:
  - annotations: sequence of string
  - identification: struct containing:
    - label: string
    - prob: float64
  - line_identifications: list of structs, each containing:
    - label: string
    - prob: float64
  - warc_headers: struct containing:
    - content-length: int64
    - content-type: string
    - warc-block-digest: string
    - warc-date: string
    - warc-identified-content-language: string
    - warc-record-id: string
    - warc-refers-to: string
    - warc-target-uri: string
    - warc-type: string
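The nested schema above can be navigated as plain dictionaries once a record is loaded. The following is a sketch under stated assumptions: the helper names `is_confident` and `line_labels` are hypothetical, the `example` record is made up to match the schema, and the label value `"tr"` and the possibility of missing (`None`) entries are assumptions, not facts confirmed by the dataset card.

```python
def is_confident(record, label, min_prob=0.8):
    """True if the document-level identification is `label` with prob >= min_prob."""
    ident = (record.get("meta") or {}).get("identification") or {}
    return ident.get("label") == label and (ident.get("prob") or 0.0) >= min_prob


def line_labels(record):
    """Collect the labels of per-line identifications, skipping missing entries."""
    lines = (record.get("meta") or {}).get("line_identifications") or []
    return [item["label"] for item in lines if item is not None]


# Hypothetical record shaped like the schema; all values are illustrative.
example = {
    "id": 0,
    "text": "Merhaba",
    "meta": {
        "annotations": [],
        "identification": {"label": "tr", "prob": 0.95},
        "line_identifications": [{"label": "tr", "prob": 0.95}, None],
        "warc_headers": {"content-length": 7, "content-type": "text/plain"},
    },
}
```

A predicate like `is_confident` could be passed to a filtering step to keep only records whose identified language meets a chosen threshold; the 0.8 default is arbitrary.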
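Dataset splits
- train:
  - num_bytes: varies by configuration. The summarized range is 59929776 to 67494389 bytes, though some configurations fall outside it (data83 lists 59641230 bytes).
  - num_examples: the train split of each configuration contains 10001 examples.
Dataset size
- download_size: varies by configuration. The summarized range is 31958468 to 34993030 bytes, though some configurations fall outside it (data83 lists 31887045 bytes).
- dataset_size: equal to num_bytes of the train split for each configuration, so it varies over the same range.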
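The size figures lend themselves to a few quick calculations. The sketch below copies the `num_bytes` and `download_size` values of three configurations (data80, data83, data99) from the card metadata; the helper names are hypothetical, and the three configurations are only a sample, so the minimum found here is not necessarily the minimum over all 99.

```python
# (num_bytes, download_size) for a sample of configurations from the metadata.
SIZES = {
    "data80": (62396175, 33456534),
    "data83": (59641230, 31887045),
    "data99": (61145550, 31945385),
}
NUM_CONFIGS = 99             # data1 through data99
EXAMPLES_PER_CONFIG = 10001  # num_examples of every train split


def total_examples():
    """Total number of examples across all configurations."""
    return NUM_CONFIGS * EXAMPLES_PER_CONFIG


def download_ratio(name):
    """Compressed download size as a fraction of the in-memory dataset size."""
    num_bytes, download_size = SIZES[name]
    return download_size / num_bytes


def smallest(sizes=SIZES):
    """Name of the sampled configuration with the fewest bytes."""
    return min(sizes, key=lambda name: sizes[name][0])
```

With 99 configurations of 10001 examples each, `total_examples()` gives 990099. For the sampled configurations the download is a little over half the dataset size.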



