OmBayus/turkgpt_dataset

Source: Hugging Face. Updated: 2024-05-31. Indexed: 2024-06-12.
Download link:
https://hf-mirror.com/datasets/OmBayus/turkgpt_dataset
Resource description:
dataset_info: every config listed below has the same features and a single split named train; dataset_size equals num_bytes in each complete entry.

features:
- name: id
  dtype: int64
- name: text
  dtype: string
- name: meta
  struct:
  - name: annotations
    sequence: string
  - name: identification
    struct:
    - name: label
      dtype: string
    - name: prob
      dtype: float64
  - name: line_identifications
    list:
    - name: label
      dtype: string
    - name: prob
      dtype: float64
  - name: warc_headers
    struct:
    - name: content-length
      dtype: int64
    - name: content-type
      dtype: string
    - name: warc-block-digest
      dtype: string
    - name: warc-date
      dtype: string
    - name: warc-identified-content-language
      dtype: string
    - name: warc-record-id
      dtype: string
    - name: warc-refers-to
      dtype: string
    - name: warc-target-uri
      dtype: string
    - name: warc-type
      dtype: string

splits (train), in the order given by the source:

config_name  num_bytes  num_examples  download_size  dataset_size
data1        61550128   10001         32735291       61550128
data10       59929776   10001         31958468       59929776
data11       61993909   10001         32701409       61993909
data12       60284789   10001         32220771       60284789
data13       61878232   10001         32749735       61878232
data14       66594693   10001         34993030       66594693
data15       63150025   10001         32953507       63150025
data16       60484472   10001         32513368       60484472
data17       67494389   10001         34793278       67494389
data18       62339874   10001         33009623       62339874
data19       60629042   10001         32090817       60629042
data2        61030628   10001         32281304       61030628
data20       63924588   10001         34116787       63924588
data21       63125007   10001         32800366       63125007
data22       60635052   10001         32331430       60635052
data23       63414668   10001         33301249       63414668
data24       59973694   10001         31980512       59973694
data25       60713828   10001         31770660       60713828
data26       60186412   10001         32128174       60186412
data27       62824691   10001         33474263       62824691
data28       60984865   10001         32510815       60984865
data29       64690071   10001         33753848       64690071
data3        63682549   10001         33779451       63682549
data30       62665600   10001         33206968       62665600
data31       59044250   10001         31340118       59044250
data32       61916362   10001         32820540       61916362
data33       59540747   10001         31852440       59540747
data34       64466602   10001         33664217       64466602
data35       61348344   10001         32692090       61348344
data36       59400095   10001         31855963       59400095
data37       61937184   10001         32976451       61937184
data38       62482735   10001         33099220       62482735
data39       63780004   10001         33800479       63780004
data4        62482740   10001         33033871       62482740
data40       64953475   10001         33798062       64953475
data41       62831211   10001         33416355       62831211
data42       63514725   10001         33605969       63514725
data43       61121820   10001         32418829       61121820
data44       62437178   10001         33275805       62437178
data45       63552953   10001         33277579       63552953
data46       61589087   10001         32810886       61589087
data47       59646094   10001         31840286       59646094
data48       (entry cut off in the source after "content-length dtype:")
int64 - name: content-type dtype: string - name: warc-block-digest dtype: string - name: warc-date dtype: string - name: warc-identified-content-language dtype: string - name: warc-record-id dtype: string - name: warc-refers-to dtype: string - name: warc-target-uri dtype: string - name: warc-type dtype: string splits: - name: train num_bytes: 62723780 num_examples: 10001 download_size: 32758173 dataset_size: 62723780 - config_name: data49 features: - name: id dtype: int64 - name: text dtype: string - name: meta struct: - name: annotations sequence: string - name: identification struct: - name: label dtype: string - name: prob dtype: float64 - name: line_identifications list: - name: label dtype: string - name: prob dtype: float64 - name: warc_headers struct: - name: content-length dtype: int64 - name: content-type dtype: string - name: warc-block-digest dtype: string - name: warc-date dtype: string - name: warc-identified-content-language dtype: string - name: warc-record-id dtype: string - name: warc-refers-to dtype: string - name: warc-target-uri dtype: string - name: warc-type dtype: string splits: - name: train num_bytes: 61147516 num_examples: 10001 download_size: 32702926 dataset_size: 61147516 - config_name: data5 features: - name: id dtype: int64 - name: text dtype: string - name: meta struct: - name: annotations sequence: string - name: identification struct: - name: label dtype: string - name: prob dtype: float64 - name: line_identifications list: - name: label dtype: string - name: prob dtype: float64 - name: warc_headers struct: - name: content-length dtype: int64 - name: content-type dtype: string - name: warc-block-digest dtype: string - name: warc-date dtype: string - name: warc-identified-content-language dtype: string - name: warc-record-id dtype: string - name: warc-refers-to dtype: string - name: warc-target-uri dtype: string - name: warc-type dtype: string splits: - name: train num_bytes: 63689965 num_examples: 10001 download_size: 33871247 
dataset_size: 63689965 - config_name: data50 features: - name: id dtype: int64 - name: text dtype: string - name: meta struct: - name: annotations sequence: string - name: identification struct: - name: label dtype: string - name: prob dtype: float64 - name: line_identifications list: - name: label dtype: string - name: prob dtype: float64 - name: warc_headers struct: - name: content-length dtype: int64 - name: content-type dtype: string - name: warc-block-digest dtype: string - name: warc-date dtype: string - name: warc-identified-content-language dtype: string - name: warc-record-id dtype: string - name: warc-refers-to dtype: string - name: warc-target-uri dtype: string - name: warc-type dtype: string splits: - name: train num_bytes: 61512893 num_examples: 10001 download_size: 32611271 dataset_size: 61512893 - config_name: data51 features: - name: id dtype: int64 - name: text dtype: string - name: meta struct: - name: annotations sequence: string - name: identification struct: - name: label dtype: string - name: prob dtype: float64 - name: line_identifications list: - name: label dtype: string - name: prob dtype: float64 - name: warc_headers struct: - name: content-length dtype: int64 - name: content-type dtype: string - name: warc-block-digest dtype: string - name: warc-date dtype: string - name: warc-identified-content-language dtype: string - name: warc-record-id dtype: string - name: warc-refers-to dtype: string - name: warc-target-uri dtype: string - name: warc-type dtype: string splits: - name: train num_bytes: 61705464 num_examples: 10001 download_size: 32858117 dataset_size: 61705464 - config_name: data52 features: - name: id dtype: int64 - name: text dtype: string - name: meta struct: - name: annotations sequence: string - name: identification struct: - name: label dtype: string - name: prob dtype: float64 - name: line_identifications list: - name: label dtype: string - name: prob dtype: float64 - name: warc_headers struct: - name: content-length dtype: 
int64 - name: content-type dtype: string - name: warc-block-digest dtype: string - name: warc-date dtype: string - name: warc-identified-content-language dtype: string - name: warc-record-id dtype: string - name: warc-refers-to dtype: string - name: warc-target-uri dtype: string - name: warc-type dtype: string splits: - name: train num_bytes: 61555065 num_examples: 10001 download_size: 32860283 dataset_size: 61555065 - config_name: data53 features: - name: id dtype: int64 - name: text dtype: string - name: meta struct: - name: annotations sequence: string - name: identification struct: - name: label dtype: string - name: prob dtype: float64 - name: line_identifications list: - name: label dtype: string - name: prob dtype: float64 - name: warc_headers struct: - name: content-length dtype: int64 - name: content-type dtype: string - name: warc-block-digest dtype: string - name: warc-date dtype: string - name: warc-identified-content-language dtype: string - name: warc-record-id dtype: string - name: warc-refers-to dtype: string - name: warc-target-uri dtype: string - name: warc-type dtype: string splits: - name: train num_bytes: 62444873 num_examples: 10001 download_size: 33148348 dataset_size: 62444873 - config_name: data54 features: - name: id dtype: int64 - name: text dtype: string - name: meta struct: - name: annotations sequence: string - name: identification struct: - name: label dtype: string - name: prob dtype: float64 - name: line_identifications list: - name: label dtype: string - name: prob dtype: float64 - name: warc_headers struct: - name: content-length dtype: int64 - name: content-type dtype: string - name: warc-block-digest dtype: string - name: warc-date dtype: string - name: warc-identified-content-language dtype: string - name: warc-record-id dtype: string - name: warc-refers-to dtype: string - name: warc-target-uri dtype: string - name: warc-type dtype: string splits: - name: train num_bytes: 65564968 num_examples: 10001 download_size: 34296006 
dataset_size: 65564968 - config_name: data55 features: - name: id dtype: int64 - name: text dtype: string - name: meta struct: - name: annotations sequence: string - name: identification struct: - name: label dtype: string - name: prob dtype: float64 - name: line_identifications list: - name: label dtype: string - name: prob dtype: float64 - name: warc_headers struct: - name: content-length dtype: int64 - name: content-type dtype: string - name: warc-block-digest dtype: string - name: warc-date dtype: string - name: warc-identified-content-language dtype: string - name: warc-record-id dtype: string - name: warc-refers-to dtype: string - name: warc-target-uri dtype: string - name: warc-type dtype: string splits: - name: train num_bytes: 64146778 num_examples: 10001 download_size: 34473347 dataset_size: 64146778 - config_name: data56 features: - name: id dtype: int64 - name: text dtype: string - name: meta struct: - name: annotations sequence: string - name: identification struct: - name: label dtype: string - name: prob dtype: float64 - name: line_identifications list: - name: label dtype: string - name: prob dtype: float64 - name: warc_headers struct: - name: content-length dtype: int64 - name: content-type dtype: string - name: warc-block-digest dtype: string - name: warc-date dtype: string - name: warc-identified-content-language dtype: string - name: warc-record-id dtype: string - name: warc-refers-to dtype: string - name: warc-target-uri dtype: string - name: warc-type dtype: string splits: - name: train num_bytes: 61163364 num_examples: 10001 download_size: 32439613 dataset_size: 61163364 - config_name: data57 features: - name: id dtype: int64 - name: text dtype: string - name: meta struct: - name: annotations sequence: string - name: identification struct: - name: label dtype: string - name: prob dtype: float64 - name: line_identifications list: - name: label dtype: string - name: prob dtype: float64 - name: warc_headers struct: - name: content-length dtype: 
int64 - name: content-type dtype: string - name: warc-block-digest dtype: string - name: warc-date dtype: string - name: warc-identified-content-language dtype: string - name: warc-record-id dtype: string - name: warc-refers-to dtype: string - name: warc-target-uri dtype: string - name: warc-type dtype: string splits: - name: train num_bytes: 63140014 num_examples: 10001 download_size: 33855901 dataset_size: 63140014 - config_name: data58 features: - name: id dtype: int64 - name: text dtype: string - name: meta struct: - name: annotations sequence: string - name: identification struct: - name: label dtype: string - name: prob dtype: float64 - name: line_identifications list: - name: label dtype: string - name: prob dtype: float64 - name: warc_headers struct: - name: content-length dtype: int64 - name: content-type dtype: string - name: warc-block-digest dtype: string - name: warc-date dtype: string - name: warc-identified-content-language dtype: string - name: warc-record-id dtype: string - name: warc-refers-to dtype: string - name: warc-target-uri dtype: string - name: warc-type dtype: string splits: - name: train num_bytes: 61741802 num_examples: 10001 download_size: 32621415 dataset_size: 61741802 - config_name: data59 features: - name: id dtype: int64 - name: text dtype: string - name: meta struct: - name: annotations sequence: string - name: identification struct: - name: label dtype: string - name: prob dtype: float64 - name: line_identifications list: - name: label dtype: string - name: prob dtype: float64 - name: warc_headers struct: - name: content-length dtype: int64 - name: content-type dtype: string - name: warc-block-digest dtype: string - name: warc-date dtype: string - name: warc-identified-content-language dtype: string - name: warc-record-id dtype: string - name: warc-refers-to dtype: string - name: warc-target-uri dtype: string - name: warc-type dtype: string splits: - name: train num_bytes: 62879029 num_examples: 10001 download_size: 33018221 
dataset_size: 62879029 - config_name: data6 features: - name: id dtype: int64 - name: text dtype: string - name: meta struct: - name: annotations sequence: string - name: identification struct: - name: label dtype: string - name: prob dtype: float64 - name: line_identifications list: - name: label dtype: string - name: prob dtype: float64 - name: warc_headers struct: - name: content-length dtype: int64 - name: content-type dtype: string - name: warc-block-digest dtype: string - name: warc-date dtype: string - name: warc-identified-content-language dtype: string - name: warc-record-id dtype: string - name: warc-refers-to dtype: string - name: warc-target-uri dtype: string - name: warc-type dtype: string splits: - name: train num_bytes: 63910578 num_examples: 10001 download_size: 33841560 dataset_size: 63910578 - config_name: data60 features: - name: id dtype: int64 - name: text dtype: string - name: meta struct: - name: annotations sequence: string - name: identification struct: - name: label dtype: string - name: prob dtype: float64 - name: line_identifications list: - name: label dtype: string - name: prob dtype: float64 - name: warc_headers struct: - name: content-length dtype: int64 - name: content-type dtype: string - name: warc-block-digest dtype: string - name: warc-date dtype: string - name: warc-identified-content-language dtype: string - name: warc-record-id dtype: string - name: warc-refers-to dtype: string - name: warc-target-uri dtype: string - name: warc-type dtype: string splits: - name: train num_bytes: 63569320 num_examples: 10001 download_size: 33332176 dataset_size: 63569320 - config_name: data61 features: - name: id dtype: int64 - name: text dtype: string - name: meta struct: - name: annotations sequence: string - name: identification struct: - name: label dtype: string - name: prob dtype: float64 - name: line_identifications list: - name: label dtype: string - name: prob dtype: float64 - name: warc_headers struct: - name: content-length dtype: 
int64 - name: content-type dtype: string - name: warc-block-digest dtype: string - name: warc-date dtype: string - name: warc-identified-content-language dtype: string - name: warc-record-id dtype: string - name: warc-refers-to dtype: string - name: warc-target-uri dtype: string - name: warc-type dtype: string splits: - name: train num_bytes: 64146476 num_examples: 10001 download_size: 34275410 dataset_size: 64146476 - config_name: data62 features: - name: id dtype: int64 - name: text dtype: string - name: meta struct: - name: annotations sequence: string - name: identification struct: - name: label dtype: string - name: prob dtype: float64 - name: line_identifications list: - name: label dtype: string - name: prob dtype: float64 - name: warc_headers struct: - name: content-length dtype: int64 - name: content-type dtype: string - name: warc-block-digest dtype: string - name: warc-date dtype: string - name: warc-identified-content-language dtype: string - name: warc-record-id dtype: string - name: warc-refers-to dtype: string - name: warc-target-uri dtype: string - name: warc-type dtype: string splits: - name: train num_bytes: 63551621 num_examples: 10001 download_size: 34185955 dataset_size: 63551621 - config_name: data63 features: - name: id dtype: int64 - name: text dtype: string - name: meta struct: - name: annotations sequence: string - name: identification struct: - name: label dtype: string - name: prob dtype: float64 - name: line_identifications list: - name: label dtype: string - name: prob dtype: float64 - name: warc_headers struct: - name: content-length dtype: int64 - name: content-type dtype: string - name: warc-block-digest dtype: string - name: warc-date dtype: string - name: warc-identified-content-language dtype: string - name: warc-record-id dtype: string - name: warc-refers-to dtype: string - name: warc-target-uri dtype: string - name: warc-type dtype: string splits: - name: train num_bytes: 59413794 num_examples: 10001 download_size: 31803865 
dataset_size: 59413794 - config_name: data64 features: - name: id dtype: int64 - name: text dtype: string - name: meta struct: - name: annotations sequence: string - name: identification struct: - name: label dtype: string - name: prob dtype: float64 - name: line_identifications list: - name: label dtype: string - name: prob dtype: float64 - name: warc_headers struct: - name: content-length dtype: int64 - name: content-type dtype: string - name: warc-block-digest dtype: string - name: warc-date dtype: string - name: warc-identified-content-language dtype: string - name: warc-record-id dtype: string - name: warc-refers-to dtype: string - name: warc-target-uri dtype: string - name: warc-type dtype: string splits: - name: train num_bytes: 62791937 num_examples: 10001 download_size: 33288978 dataset_size: 62791937 - config_name: data65 features: - name: id dtype: int64 - name: text dtype: string - name: meta struct: - name: annotations sequence: string - name: identification struct: - name: label dtype: string - name: prob dtype: float64 - name: line_identifications list: - name: label dtype: string - name: prob dtype: float64 - name: warc_headers struct: - name: content-length dtype: int64 - name: content-type dtype: string - name: warc-block-digest dtype: string - name: warc-date dtype: string - name: warc-identified-content-language dtype: string - name: warc-record-id dtype: string - name: warc-refers-to dtype: string - name: warc-target-uri dtype: string - name: warc-type dtype: string splits: - name: train num_bytes: 62531587 num_examples: 10001 download_size: 33080464 dataset_size: 62531587 - config_name: data66 features: - name: id dtype: int64 - name: text dtype: string - name: meta struct: - name: annotations sequence: string - name: identification struct: - name: label dtype: string - name: prob dtype: float64 - name: line_identifications list: - name: label dtype: string - name: prob dtype: float64 - name: warc_headers struct: - name: content-length dtype: 
int64 - name: content-type dtype: string - name: warc-block-digest dtype: string - name: warc-date dtype: string - name: warc-identified-content-language dtype: string - name: warc-record-id dtype: string - name: warc-refers-to dtype: string - name: warc-target-uri dtype: string - name: warc-type dtype: string splits: - name: train num_bytes: 61203587 num_examples: 10001 download_size: 32510423 dataset_size: 61203587 - config_name: data67 features: - name: id dtype: int64 - name: text dtype: string - name: meta struct: - name: annotations sequence: string - name: identification struct: - name: label dtype: string - name: prob dtype: float64 - name: line_identifications list: - name: label dtype: string - name: prob dtype: float64 - name: warc_headers struct: - name: content-length dtype: int64 - name: content-type dtype: string - name: warc-block-digest dtype: string - name: warc-date dtype: string - name: warc-identified-content-language dtype: string - name: warc-record-id dtype: string - name: warc-refers-to dtype: string - name: warc-target-uri dtype: string - name: warc-type dtype: string splits: - name: train num_bytes: 65998149 num_examples: 10001 download_size: 34812969 dataset_size: 65998149 - config_name: data68 features: - name: id dtype: int64 - name: text dtype: string - name: meta struct: - name: annotations sequence: string - name: identification struct: - name: label dtype: string - name: prob dtype: float64 - name: line_identifications list: - name: label dtype: string - name: prob dtype: float64 - name: warc_headers struct: - name: content-length dtype: int64 - name: content-type dtype: string - name: warc-block-digest dtype: string - name: warc-date dtype: string - name: warc-identified-content-language dtype: string - name: warc-record-id dtype: string - name: warc-refers-to dtype: string - name: warc-target-uri dtype: string - name: warc-type dtype: string splits: - name: train num_bytes: 62369191 num_examples: 10001 download_size: 32947002 
dataset_size: 62369191 - config_name: data69 features: - name: id dtype: int64 - name: text dtype: string - name: meta struct: - name: annotations sequence: string - name: identification struct: - name: label dtype: string - name: prob dtype: float64 - name: line_identifications list: - name: label dtype: string - name: prob dtype: float64 - name: warc_headers struct: - name: content-length dtype: int64 - name: content-type dtype: string - name: warc-block-digest dtype: string - name: warc-date dtype: string - name: warc-identified-content-language dtype: string - name: warc-record-id dtype: string - name: warc-refers-to dtype: string - name: warc-target-uri dtype: string - name: warc-type dtype: string splits: - name: train num_bytes: 62481955 num_examples: 10001 download_size: 33348487 dataset_size: 62481955 - config_name: data7 features: - name: id dtype: int64 - name: text dtype: string - name: meta struct: - name: annotations sequence: string - name: identification struct: - name: label dtype: string - name: prob dtype: float64 - name: line_identifications list: - name: label dtype: string - name: prob dtype: float64 - name: warc_headers struct: - name: content-length dtype: int64 - name: content-type dtype: string - name: warc-block-digest dtype: string - name: warc-date dtype: string - name: warc-identified-content-language dtype: string - name: warc-record-id dtype: string - name: warc-refers-to dtype: string - name: warc-target-uri dtype: string - name: warc-type dtype: string splits: - name: train num_bytes: 60395612 num_examples: 10001 download_size: 32226488 dataset_size: 60395612 - config_name: data70 features: - name: id dtype: int64 - name: text dtype: string - name: meta struct: - name: annotations sequence: string - name: identification struct: - name: label dtype: string - name: prob dtype: float64 - name: line_identifications list: - name: label dtype: string - name: prob dtype: float64 - name: warc_headers struct: - name: content-length dtype: 
int64 - name: content-type dtype: string - name: warc-block-digest dtype: string - name: warc-date dtype: string - name: warc-identified-content-language dtype: string - name: warc-record-id dtype: string - name: warc-refers-to dtype: string - name: warc-target-uri dtype: string - name: warc-type dtype: string splits: - name: train num_bytes: 63485907 num_examples: 10001 download_size: 33493531 dataset_size: 63485907 - config_name: data71 features: - name: id dtype: int64 - name: text dtype: string - name: meta struct: - name: annotations sequence: string - name: identification struct: - name: label dtype: string - name: prob dtype: float64 - name: line_identifications list: - name: label dtype: string - name: prob dtype: float64 - name: warc_headers struct: - name: content-length dtype: int64 - name: content-type dtype: string - name: warc-block-digest dtype: string - name: warc-date dtype: string - name: warc-identified-content-language dtype: string - name: warc-record-id dtype: string - name: warc-refers-to dtype: string - name: warc-target-uri dtype: string - name: warc-type dtype: string splits: - name: train num_bytes: 61239866 num_examples: 10001 download_size: 32520049 dataset_size: 61239866 - config_name: data72 features: - name: id dtype: int64 - name: text dtype: string - name: meta struct: - name: annotations sequence: string - name: identification struct: - name: label dtype: string - name: prob dtype: float64 - name: line_identifications list: - name: label dtype: string - name: prob dtype: float64 - name: warc_headers struct: - name: content-length dtype: int64 - name: content-type dtype: string - name: warc-block-digest dtype: string - name: warc-date dtype: string - name: warc-identified-content-language dtype: string - name: warc-record-id dtype: string - name: warc-refers-to dtype: string - name: warc-target-uri dtype: string - name: warc-type dtype: string splits: - name: train num_bytes: 61981518 num_examples: 10001 download_size: 32740696 
dataset_size: 61981518 - config_name: data73 features: - name: id dtype: int64 - name: text dtype: string - name: meta struct: - name: annotations sequence: string - name: identification struct: - name: label dtype: string - name: prob dtype: float64 - name: line_identifications list: - name: label dtype: string - name: prob dtype: float64 - name: warc_headers struct: - name: content-length dtype: int64 - name: content-type dtype: string - name: warc-block-digest dtype: string - name: warc-date dtype: string - name: warc-identified-content-language dtype: string - name: warc-record-id dtype: string - name: warc-refers-to dtype: string - name: warc-target-uri dtype: string - name: warc-type dtype: string splits: - name: train num_bytes: 64391172 num_examples: 10001 download_size: 34132048 dataset_size: 64391172 - config_name: data74 features: - name: id dtype: int64 - name: text dtype: string - name: meta struct: - name: annotations sequence: string - name: identification struct: - name: label dtype: string - name: prob dtype: float64 - name: line_identifications list: - name: label dtype: string - name: prob dtype: float64 - name: warc_headers struct: - name: content-length dtype: int64 - name: content-type dtype: string - name: warc-block-digest dtype: string - name: warc-date dtype: string - name: warc-identified-content-language dtype: string - name: warc-record-id dtype: string - name: warc-refers-to dtype: string - name: warc-target-uri dtype: string - name: warc-type dtype: string splits: - name: train num_bytes: 62481521 num_examples: 10001 download_size: 33398263 dataset_size: 62481521 - config_name: data75 features: - name: id dtype: int64 - name: text dtype: string - name: meta struct: - name: annotations sequence: string - name: identification struct: - name: label dtype: string - name: prob dtype: float64 - name: line_identifications list: - name: label dtype: string - name: prob dtype: float64 - name: warc_headers struct: - name: content-length dtype: 
int64 - name: content-type dtype: string - name: warc-block-digest dtype: string - name: warc-date dtype: string - name: warc-identified-content-language dtype: string - name: warc-record-id dtype: string - name: warc-refers-to dtype: string - name: warc-target-uri dtype: string - name: warc-type dtype: string splits: - name: train num_bytes: 62211691 num_examples: 10001 download_size: 33504718 dataset_size: 62211691 - config_name: data76 features: - name: id dtype: int64 - name: text dtype: string - name: meta struct: - name: annotations sequence: string - name: identification struct: - name: label dtype: string - name: prob dtype: float64 - name: line_identifications list: - name: label dtype: string - name: prob dtype: float64 - name: warc_headers struct: - name: content-length dtype: int64 - name: content-type dtype: string - name: warc-block-digest dtype: string - name: warc-date dtype: string - name: warc-identified-content-language dtype: string - name: warc-record-id dtype: string - name: warc-refers-to dtype: string - name: warc-target-uri dtype: string - name: warc-type dtype: string splits: - name: train num_bytes: 64363166 num_examples: 10001 download_size: 33635281 dataset_size: 64363166 - config_name: data77 features: - name: id dtype: int64 - name: text dtype: string - name: meta struct: - name: annotations sequence: string - name: identification struct: - name: label dtype: string - name: prob dtype: float64 - name: line_identifications list: - name: label dtype: string - name: prob dtype: float64 - name: warc_headers struct: - name: content-length dtype: int64 - name: content-type dtype: string - name: warc-block-digest dtype: string - name: warc-date dtype: string - name: warc-identified-content-language dtype: string - name: warc-record-id dtype: string - name: warc-refers-to dtype: string - name: warc-target-uri dtype: string - name: warc-type dtype: string splits: - name: train num_bytes: 61680297 num_examples: 10001 download_size: 32940699 
dataset_size: 61680297 - config_name: data78 features: - name: id dtype: int64 - name: text dtype: string - name: meta struct: - name: annotations sequence: string - name: identification struct: - name: label dtype: string - name: prob dtype: float64 - name: line_identifications list: - name: label dtype: string - name: prob dtype: float64 - name: warc_headers struct: - name: content-length dtype: int64 - name: content-type dtype: string - name: warc-block-digest dtype: string - name: warc-date dtype: string - name: warc-identified-content-language dtype: string - name: warc-record-id dtype: string - name: warc-refers-to dtype: string - name: warc-target-uri dtype: string - name: warc-type dtype: string splits: - name: train num_bytes: 62822874 num_examples: 10001 download_size: 33235449 dataset_size: 62822874 - config_name: data79 features: - name: id dtype: int64 - name: text dtype: string - name: meta struct: - name: annotations sequence: string - name: identification struct: - name: label dtype: string - name: prob dtype: float64 - name: line_identifications list: - name: label dtype: string - name: prob dtype: float64 - name: warc_headers struct: - name: content-length dtype: int64 - name: content-type dtype: string - name: warc-block-digest dtype: string - name: warc-date dtype: string - name: warc-identified-content-language dtype: string - name: warc-record-id dtype: string - name: warc-refers-to dtype: string - name: warc-target-uri dtype: string - name: warc-type dtype: string splits: - name: train num_bytes: 60695144 num_examples: 10001 download_size: 32330859 dataset_size: 60695144 - config_name: data8 features: - name: id dtype: int64 - name: text dtype: string - name: meta struct: - name: annotations sequence: string - name: identification struct: - name: label dtype: string - name: prob dtype: float64 - name: line_identifications list: - name: label dtype: string - name: prob dtype: float64 - name: warc_headers struct: - name: content-length dtype: 
int64 - name: content-type dtype: string - name: warc-block-digest dtype: string - name: warc-date dtype: string - name: warc-identified-content-language dtype: string - name: warc-record-id dtype: string - name: warc-refers-to dtype: string - name: warc-target-uri dtype: string - name: warc-type dtype: string splits: - name: train num_bytes: 64333820 num_examples: 10001 download_size: 34259347 dataset_size: 64333820 - config_name: data80 features: - name: id dtype: int64 - name: text dtype: string - name: meta struct: - name: annotations sequence: string - name: identification struct: - name: label dtype: string - name: prob dtype: float64 - name: line_identifications list: - name: label dtype: string - name: prob dtype: float64 - name: warc_headers struct: - name: content-length dtype: int64 - name: content-type dtype: string - name: warc-block-digest dtype: string - name: warc-date dtype: string - name: warc-identified-content-language dtype: string - name: warc-record-id dtype: string - name: warc-refers-to dtype: string - name: warc-target-uri dtype: string - name: warc-type dtype: string splits: - name: train num_bytes: 62396175 num_examples: 10001 download_size: 33456534 dataset_size: 62396175 - config_name: data81 features: - name: id dtype: int64 - name: text dtype: string - name: meta struct: - name: annotations sequence: string - name: identification struct: - name: label dtype: string - name: prob dtype: float64 - name: line_identifications list: - name: label dtype: string - name: prob dtype: float64 - name: warc_headers struct: - name: content-length dtype: int64 - name: content-type dtype: string - name: warc-block-digest dtype: string - name: warc-date dtype: string - name: warc-identified-content-language dtype: string - name: warc-record-id dtype: string - name: warc-refers-to dtype: string - name: warc-target-uri dtype: string - name: warc-type dtype: string splits: - name: train num_bytes: 62304438 num_examples: 10001 download_size: 33072014 
dataset_size: 62304438 - config_name: data82 features: - name: id dtype: int64 - name: text dtype: string - name: meta struct: - name: annotations sequence: string - name: identification struct: - name: label dtype: string - name: prob dtype: float64 - name: line_identifications list: - name: label dtype: string - name: prob dtype: float64 - name: warc_headers struct: - name: content-length dtype: int64 - name: content-type dtype: string - name: warc-block-digest dtype: string - name: warc-date dtype: string - name: warc-identified-content-language dtype: string - name: warc-record-id dtype: string - name: warc-refers-to dtype: string - name: warc-target-uri dtype: string - name: warc-type dtype: string splits: - name: train num_bytes: 60840189 num_examples: 10001 download_size: 32170622 dataset_size: 60840189 - config_name: data83 features: - name: id dtype: int64 - name: text dtype: string - name: meta struct: - name: annotations sequence: string - name: identification struct: - name: label dtype: string - name: prob dtype: float64 - name: line_identifications list: - name: label dtype: string - name: prob dtype: float64 - name: warc_headers struct: - name: content-length dtype: int64 - name: content-type dtype: string - name: warc-block-digest dtype: string - name: warc-date dtype: string - name: warc-identified-content-language dtype: string - name: warc-record-id dtype: string - name: warc-refers-to dtype: string - name: warc-target-uri dtype: string - name: warc-type dtype: string splits: - name: train num_bytes: 59641230 num_examples: 10001 download_size: 31887045 dataset_size: 59641230 - config_name: data84 features: - name: id dtype: int64 - name: text dtype: string - name: meta struct: - name: annotations sequence: string - name: identification struct: - name: label dtype: string - name: prob dtype: float64 - name: line_identifications list: - name: label dtype: string - name: prob dtype: float64 - name: warc_headers struct: - name: content-length dtype: 
int64 - name: content-type dtype: string - name: warc-block-digest dtype: string - name: warc-date dtype: string - name: warc-identified-content-language dtype: string - name: warc-record-id dtype: string - name: warc-refers-to dtype: string - name: warc-target-uri dtype: string - name: warc-type dtype: string splits: - name: train num_bytes: 61332005 num_examples: 10001 download_size: 32556575 dataset_size: 61332005 - config_name: data85 features: - name: id dtype: int64 - name: text dtype: string - name: meta struct: - name: annotations sequence: string - name: identification struct: - name: label dtype: string - name: prob dtype: float64 - name: line_identifications list: - name: label dtype: string - name: prob dtype: float64 - name: warc_headers struct: - name: content-length dtype: int64 - name: content-type dtype: string - name: warc-block-digest dtype: string - name: warc-date dtype: string - name: warc-identified-content-language dtype: string - name: warc-record-id dtype: string - name: warc-refers-to dtype: string - name: warc-target-uri dtype: string - name: warc-type dtype: string splits: - name: train num_bytes: 61513004 num_examples: 10001 download_size: 32734754 dataset_size: 61513004 - config_name: data86 features: - name: id dtype: int64 - name: text dtype: string - name: meta struct: - name: annotations sequence: string - name: identification struct: - name: label dtype: string - name: prob dtype: float64 - name: line_identifications list: - name: label dtype: string - name: prob dtype: float64 - name: warc_headers struct: - name: content-length dtype: int64 - name: content-type dtype: string - name: warc-block-digest dtype: string - name: warc-date dtype: string - name: warc-identified-content-language dtype: string - name: warc-record-id dtype: string - name: warc-refers-to dtype: string - name: warc-target-uri dtype: string - name: warc-type dtype: string splits: - name: train num_bytes: 64232108 num_examples: 10001 download_size: 34250871 
dataset_size: 64232108 - config_name: data87 features: - name: id dtype: int64 - name: text dtype: string - name: meta struct: - name: annotations sequence: string - name: identification struct: - name: label dtype: string - name: prob dtype: float64 - name: line_identifications list: - name: label dtype: string - name: prob dtype: float64 - name: warc_headers struct: - name: content-length dtype: int64 - name: content-type dtype: string - name: warc-block-digest dtype: string - name: warc-date dtype: string - name: warc-identified-content-language dtype: string - name: warc-record-id dtype: string - name: warc-refers-to dtype: string - name: warc-target-uri dtype: string - name: warc-type dtype: string splits: - name: train num_bytes: 61397066 num_examples: 10001 download_size: 32724271 dataset_size: 61397066 - config_name: data88 features: - name: id dtype: int64 - name: text dtype: string - name: meta struct: - name: annotations sequence: string - name: identification struct: - name: label dtype: string - name: prob dtype: float64 - name: line_identifications list: - name: label dtype: string - name: prob dtype: float64 - name: warc_headers struct: - name: content-length dtype: int64 - name: content-type dtype: string - name: warc-block-digest dtype: string - name: warc-date dtype: string - name: warc-identified-content-language dtype: string - name: warc-record-id dtype: string - name: warc-refers-to dtype: string - name: warc-target-uri dtype: string - name: warc-type dtype: string splits: - name: train num_bytes: 64129874 num_examples: 10001 download_size: 33668112 dataset_size: 64129874 - config_name: data89 features: - name: id dtype: int64 - name: text dtype: string - name: meta struct: - name: annotations sequence: string - name: identification struct: - name: label dtype: string - name: prob dtype: float64 - name: line_identifications list: - name: label dtype: string - name: prob dtype: float64 - name: warc_headers struct: - name: content-length dtype: 
int64 - name: content-type dtype: string - name: warc-block-digest dtype: string - name: warc-date dtype: string - name: warc-identified-content-language dtype: string - name: warc-record-id dtype: string - name: warc-refers-to dtype: string - name: warc-target-uri dtype: string - name: warc-type dtype: string splits: - name: train num_bytes: 60652111 num_examples: 10001 download_size: 32384035 dataset_size: 60652111 - config_name: data9 features: - name: id dtype: int64 - name: text dtype: string - name: meta struct: - name: annotations sequence: string - name: identification struct: - name: label dtype: string - name: prob dtype: float64 - name: line_identifications list: - name: label dtype: string - name: prob dtype: float64 - name: warc_headers struct: - name: content-length dtype: int64 - name: content-type dtype: string - name: warc-block-digest dtype: string - name: warc-date dtype: string - name: warc-identified-content-language dtype: string - name: warc-record-id dtype: string - name: warc-refers-to dtype: string - name: warc-target-uri dtype: string - name: warc-type dtype: string splits: - name: train num_bytes: 60110137 num_examples: 10001 download_size: 31944196 dataset_size: 60110137 - config_name: data90 features: - name: id dtype: int64 - name: text dtype: string - name: meta struct: - name: annotations sequence: string - name: identification struct: - name: label dtype: string - name: prob dtype: float64 - name: line_identifications list: - name: label dtype: string - name: prob dtype: float64 - name: warc_headers struct: - name: content-length dtype: int64 - name: content-type dtype: string - name: warc-block-digest dtype: string - name: warc-date dtype: string - name: warc-identified-content-language dtype: string - name: warc-record-id dtype: string - name: warc-refers-to dtype: string - name: warc-target-uri dtype: string - name: warc-type dtype: string splits: - name: train num_bytes: 61524961 num_examples: 10001 download_size: 32654036 
dataset_size: 61524961 - config_name: data91 features: - name: id dtype: int64 - name: text dtype: string - name: meta struct: - name: annotations sequence: string - name: identification struct: - name: label dtype: string - name: prob dtype: float64 - name: line_identifications list: - name: label dtype: string - name: prob dtype: float64 - name: warc_headers struct: - name: content-length dtype: int64 - name: content-type dtype: string - name: warc-block-digest dtype: string - name: warc-date dtype: string - name: warc-identified-content-language dtype: string - name: warc-record-id dtype: string - name: warc-refers-to dtype: string - name: warc-target-uri dtype: string - name: warc-type dtype: string splits: - name: train num_bytes: 63387196 num_examples: 10001 download_size: 33537871 dataset_size: 63387196 - config_name: data92 features: - name: id dtype: int64 - name: text dtype: string - name: meta struct: - name: annotations sequence: string - name: identification struct: - name: label dtype: string - name: prob dtype: float64 - name: line_identifications list: - name: label dtype: string - name: prob dtype: float64 - name: warc_headers struct: - name: content-length dtype: int64 - name: content-type dtype: string - name: warc-block-digest dtype: string - name: warc-date dtype: string - name: warc-identified-content-language dtype: string - name: warc-record-id dtype: string - name: warc-refers-to dtype: string - name: warc-target-uri dtype: string - name: warc-type dtype: string splits: - name: train num_bytes: 62139108 num_examples: 10001 download_size: 33038580 dataset_size: 62139108 - config_name: data93 features: - name: id dtype: int64 - name: text dtype: string - name: meta struct: - name: annotations sequence: string - name: identification struct: - name: label dtype: string - name: prob dtype: float64 - name: line_identifications list: - name: label dtype: string - name: prob dtype: float64 - name: warc_headers struct: - name: content-length dtype: 
int64 - name: content-type dtype: string - name: warc-block-digest dtype: string - name: warc-date dtype: string - name: warc-identified-content-language dtype: string - name: warc-record-id dtype: string - name: warc-refers-to dtype: string - name: warc-target-uri dtype: string - name: warc-type dtype: string splits: - name: train num_bytes: 59949112 num_examples: 10001 download_size: 32211071 dataset_size: 59949112 - config_name: data94 features: - name: id dtype: int64 - name: text dtype: string - name: meta struct: - name: annotations sequence: string - name: identification struct: - name: label dtype: string - name: prob dtype: float64 - name: line_identifications list: - name: label dtype: string - name: prob dtype: float64 - name: warc_headers struct: - name: content-length dtype: int64 - name: content-type dtype: string - name: warc-block-digest dtype: string - name: warc-date dtype: string - name: warc-identified-content-language dtype: string - name: warc-record-id dtype: string - name: warc-refers-to dtype: string - name: warc-target-uri dtype: string - name: warc-type dtype: string splits: - name: train num_bytes: 64914469 num_examples: 10001 download_size: 33815861 dataset_size: 64914469 - config_name: data95 features: - name: id dtype: int64 - name: text dtype: string - name: meta struct: - name: annotations sequence: string - name: identification struct: - name: label dtype: string - name: prob dtype: float64 - name: line_identifications list: - name: label dtype: string - name: prob dtype: float64 - name: warc_headers struct: - name: content-length dtype: int64 - name: content-type dtype: string - name: warc-block-digest dtype: string - name: warc-date dtype: string - name: warc-identified-content-language dtype: string - name: warc-record-id dtype: string - name: warc-refers-to dtype: string - name: warc-target-uri dtype: string - name: warc-type dtype: string splits: - name: train num_bytes: 64031504 num_examples: 10001 download_size: 33939704 
dataset_size: 64031504 - config_name: data96 features: - name: id dtype: int64 - name: text dtype: string - name: meta struct: - name: annotations sequence: string - name: identification struct: - name: label dtype: string - name: prob dtype: float64 - name: line_identifications list: - name: label dtype: string - name: prob dtype: float64 - name: warc_headers struct: - name: content-length dtype: int64 - name: content-type dtype: string - name: warc-block-digest dtype: string - name: warc-date dtype: string - name: warc-identified-content-language dtype: string - name: warc-record-id dtype: string - name: warc-refers-to dtype: string - name: warc-target-uri dtype: string - name: warc-type dtype: string splits: - name: train num_bytes: 62561518 num_examples: 10001 download_size: 33220152 dataset_size: 62561518 - config_name: data97 features: - name: id dtype: int64 - name: text dtype: string - name: meta struct: - name: annotations sequence: string - name: identification struct: - name: label dtype: string - name: prob dtype: float64 - name: line_identifications list: - name: label dtype: string - name: prob dtype: float64 - name: warc_headers struct: - name: content-length dtype: int64 - name: content-type dtype: string - name: warc-block-digest dtype: string - name: warc-date dtype: string - name: warc-identified-content-language dtype: string - name: warc-record-id dtype: string - name: warc-refers-to dtype: string - name: warc-target-uri dtype: string - name: warc-type dtype: string splits: - name: train num_bytes: 62487975 num_examples: 10001 download_size: 33073081 dataset_size: 62487975 - config_name: data98 features: - name: id dtype: int64 - name: text dtype: string - name: meta struct: - name: annotations sequence: string - name: identification struct: - name: label dtype: string - name: prob dtype: float64 - name: line_identifications list: - name: label dtype: string - name: prob dtype: float64 - name: warc_headers struct: - name: content-length dtype: 
int64 - name: content-type dtype: string - name: warc-block-digest dtype: string - name: warc-date dtype: string - name: warc-identified-content-language dtype: string - name: warc-record-id dtype: string - name: warc-refers-to dtype: string - name: warc-target-uri dtype: string - name: warc-type dtype: string splits: - name: train num_bytes: 62743213 num_examples: 10001 download_size: 33174861 dataset_size: 62743213 - config_name: data99 features: - name: id dtype: int64 - name: text dtype: string - name: meta struct: - name: annotations sequence: string - name: identification struct: - name: label dtype: string - name: prob dtype: float64 - name: line_identifications list: - name: label dtype: string - name: prob dtype: float64 - name: warc_headers struct: - name: content-length dtype: int64 - name: content-type dtype: string - name: warc-block-digest dtype: string - name: warc-date dtype: string - name: warc-identified-content-language dtype: string - name: warc-record-id dtype: string - name: warc-refers-to dtype: string - name: warc-target-uri dtype: string - name: warc-type dtype: string splits: - name: train num_bytes: 61145550 num_examples: 10001 download_size: 31945385 dataset_size: 61145550 configs: - config_name: data1 data_files: - split: train path: data1/train-* - config_name: data10 data_files: - split: train path: data10/train-* - config_name: data11 data_files: - split: train path: data11/train-* - config_name: data12 data_files: - split: train path: data12/train-* - config_name: data13 data_files: - split: train path: data13/train-* - config_name: data14 data_files: - split: train path: data14/train-* - config_name: data15 data_files: - split: train path: data15/train-* - config_name: data16 data_files: - split: train path: data16/train-* - config_name: data17 data_files: - split: train path: data17/train-* - config_name: data18 data_files: - split: train path: data18/train-* - config_name: data19 data_files: - split: train path: data19/train-* - 
config_name: data2 data_files: - split: train path: data2/train-* - config_name: data20 data_files: - split: train path: data20/train-* - config_name: data21 data_files: - split: train path: data21/train-* - config_name: data22 data_files: - split: train path: data22/train-* - config_name: data23 data_files: - split: train path: data23/train-* - config_name: data24 data_files: - split: train path: data24/train-* - config_name: data25 data_files: - split: train path: data25/train-* - config_name: data26 data_files: - split: train path: data26/train-* - config_name: data27 data_files: - split: train path: data27/train-* - config_name: data28 data_files: - split: train path: data28/train-* - config_name: data29 data_files: - split: train path: data29/train-* - config_name: data3 data_files: - split: train path: data3/train-* - config_name: data30 data_files: - split: train path: data30/train-* - config_name: data31 data_files: - split: train path: data31/train-* - config_name: data32 data_files: - split: train path: data32/train-* - config_name: data33 data_files: - split: train path: data33/train-* - config_name: data34 data_files: - split: train path: data34/train-* - config_name: data35 data_files: - split: train path: data35/train-* - config_name: data36 data_files: - split: train path: data36/train-* - config_name: data37 data_files: - split: train path: data37/train-* - config_name: data38 data_files: - split: train path: data38/train-* - config_name: data39 data_files: - split: train path: data39/train-* - config_name: data4 data_files: - split: train path: data4/train-* - config_name: data40 data_files: - split: train path: data40/train-* - config_name: data41 data_files: - split: train path: data41/train-* - config_name: data42 data_files: - split: train path: data42/train-* - config_name: data43 data_files: - split: train path: data43/train-* - config_name: data44 data_files: - split: train path: data44/train-* - config_name: data45 data_files: - split: 
train path: data45/train-* - config_name: data46 data_files: - split: train path: data46/train-* - config_name: data47 data_files: - split: train path: data47/train-* - config_name: data48 data_files: - split: train path: data48/train-* - config_name: data49 data_files: - split: train path: data49/train-* - config_name: data5 data_files: - split: train path: data5/train-* - config_name: data50 data_files: - split: train path: data50/train-* - config_name: data51 data_files: - split: train path: data51/train-* - config_name: data52 data_files: - split: train path: data52/train-* - config_name: data53 data_files: - split: train path: data53/train-* - config_name: data54 data_files: - split: train path: data54/train-* - config_name: data55 data_files: - split: train path: data55/train-* - config_name: data56 data_files: - split: train path: data56/train-* - config_name: data57 data_files: - split: train path: data57/train-* - config_name: data58 data_files: - split: train path: data58/train-* - config_name: data59 data_files: - split: train path: data59/train-* - config_name: data6 data_files: - split: train path: data6/train-* - config_name: data60 data_files: - split: train path: data60/train-* - config_name: data61 data_files: - split: train path: data61/train-* - config_name: data62 data_files: - split: train path: data62/train-* - config_name: data63 data_files: - split: train path: data63/train-* - config_name: data64 data_files: - split: train path: data64/train-* - config_name: data65 data_files: - split: train path: data65/train-* - config_name: data66 data_files: - split: train path: data66/train-* - config_name: data67 data_files: - split: train path: data67/train-* - config_name: data68 data_files: - split: train path: data68/train-* - config_name: data69 data_files: - split: train path: data69/train-* - config_name: data7 data_files: - split: train path: data7/train-* - config_name: data70 data_files: - split: train path: data70/train-* - config_name: 
data71 data_files: - split: train path: data71/train-* - config_name: data72 data_files: - split: train path: data72/train-* - config_name: data73 data_files: - split: train path: data73/train-* - config_name: data74 data_files: - split: train path: data74/train-* - config_name: data75 data_files: - split: train path: data75/train-* - config_name: data76 data_files: - split: train path: data76/train-* - config_name: data77 data_files: - split: train path: data77/train-* - config_name: data78 data_files: - split: train path: data78/train-* - config_name: data79 data_files: - split: train path: data79/train-* - config_name: data8 data_files: - split: train path: data8/train-* - config_name: data80 data_files: - split: train path: data80/train-* - config_name: data81 data_files: - split: train path: data81/train-* - config_name: data82 data_files: - split: train path: data82/train-* - config_name: data83 data_files: - split: train path: data83/train-* - config_name: data84 data_files: - split: train path: data84/train-* - config_name: data85 data_files: - split: train path: data85/train-* - config_name: data86 data_files: - split: train path: data86/train-* - config_name: data87 data_files: - split: train path: data87/train-* - config_name: data88 data_files: - split: train path: data88/train-* - config_name: data89 data_files: - split: train path: data89/train-* - config_name: data9 data_files: - split: train path: data9/train-* - config_name: data90 data_files: - split: train path: data90/train-* - config_name: data91 data_files: - split: train path: data91/train-* - config_name: data92 data_files: - split: train path: data92/train-* - config_name: data93 data_files: - split: train path: data93/train-* - config_name: data94 data_files: - split: train path: data94/train-* - config_name: data95 data_files: - split: train path: data95/train-* - config_name: data96 data_files: - split: train path: data96/train-* - config_name: data97 data_files: - split: train path: 
data97/train-* - config_name: data98 data_files: - split: train path: data98/train-* - config_name: data99 data_files: - split: train path: data99/train-* language: - tr - en tags: - turkgpt ---

This dataset defines 99 configurations, data1 through data99, each with the features id, text, and meta. The meta feature is a struct with the nested fields annotations, identification, line_identifications, and warc_headers. Each configuration has a single train split, and the card records num_bytes, num_examples, download_size, and dataset_size for it.
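
A minimal sketch of loading one configuration, assuming the Hugging Face `datasets` library is installed and the repository is reachable under the ID `OmBayus/turkgpt_dataset` (taken from the download link). The helper names `config_names` and `load_config` are hypothetical and are not part of the dataset or the library.

```python
def config_names(first: int = 1, last: int = 99) -> list[str]:
    """Return the configuration names data1 .. data99 listed in the card."""
    return [f"data{i}" for i in range(first, last + 1)]


def load_config(name: str):
    """Load the train split of one configuration (needs network access)."""
    # Imported lazily so config_names() works without the library installed.
    from datasets import load_dataset

    # The second positional argument selects the configuration (config_name).
    return load_dataset("OmBayus/turkgpt_dataset", name, split="train")
```

Under these assumptions, `load_config("data1")` should return a `Dataset` whose rows expose the `id`, `text`, and `meta` columns described in the card.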
Provider:
OmBayus
Summary of original information

Dataset overview

Dataset configurations

  • config_name: data1, data10, data11, data12, data13, data14, data15, data16, data17, data18, data19, data2, data20, data21, data22, data23, data24, data25, data26, and so on through data99 (99 configurations in total)
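
The configuration names appear in lexicographic order (data1, data10, data11, ..., data19, data2), which is what plain string sorting produces. A small sketch of ordering them by their numeric suffix instead; `natural_key` is a hypothetical helper, and the short `names` list is only a sample.

```python
import re


def natural_key(name: str) -> int:
    """Extract the numeric suffix of a name such as 'data10'."""
    match = re.fullmatch(r"data(\d+)", name)
    if match is None:
        raise ValueError(f"unexpected configuration name: {name!r}")
    return int(match.group(1))


names = ["data1", "data10", "data11", "data19", "data2", "data20"]
print(sorted(names))                   # lexicographic order, as in the card
print(sorted(names, key=natural_key))  # numeric order
```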

Dataset features

  • id: int64
  • text: string
  • meta: struct with the following sub-features:
    • annotations: sequence of string
    • identification: struct containing:
      • label: string
      • prob: float64
    • line_identifications: list of structs, each containing:
      • label: string
      • prob: float64
    • warc_headers: struct containing:
      • content-length: int64
      • content-type: string
      • warc-block-digest: string
      • warc-date: string
      • warc-identified-content-language: string
      • warc-record-id: string
      • warc-refers-to: string
      • warc-target-uri: string
      • warc-type: string
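
The nested schema above can be illustrated with a hand-built record. This is a sketch only: every value below is a placeholder invented for illustration, not data taken from the dataset, and `summarize` is a hypothetical helper. Because the `warc_headers` keys contain hyphens, they have to be read with bracket access.

```python
# Hypothetical record following the schema; all values are placeholders.
record = {
    "id": 0,
    "text": "Merhaba dünya",
    "meta": {
        "annotations": [],
        "identification": {"label": "tr", "prob": 0.99},
        "line_identifications": [{"label": "tr", "prob": 0.99}],
        "warc_headers": {
            "content-length": 13,
            "content-type": "text/plain",
            "warc-block-digest": "sha1:PLACEHOLDER",
            "warc-date": "2000-01-01T00:00:00Z",
            "warc-identified-content-language": "tur",
            "warc-record-id": "<urn:uuid:placeholder>",
            "warc-refers-to": "<urn:uuid:placeholder>",
            "warc-target-uri": "https://example.com/",
            "warc-type": "conversion",
        },
    },
}


def summarize(rec: dict) -> tuple[str, str, float]:
    """Return (target URI, document language label, label probability)."""
    meta = rec["meta"]
    ident = meta["identification"]
    return meta["warc_headers"]["warc-target-uri"], ident["label"], ident["prob"]


print(summarize(record))
```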

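Dataset splits

  • train:
    • num_bytes: varies by configuration; for example 59641230 bytes for data83 and 59929776 bytes for data10, up to a maximum of 67494389 bytes.
    • num_examples: the train split of every configuration contains 10001 examples.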

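Dataset size

  • download_size: varies by configuration; for example 31887045 bytes for data83 and 31958468 bytes for data10, up to a maximum of 34993030 bytes.
  • dataset_size: equal to num_bytes for each configuration; for example 59641230 bytes for data83 and 59929776 bytes for data10, up to a maximum of 67494389 bytes.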
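
The byte counts can be compared directly. A small sketch using three (dataset_size, download_size) pairs copied from the data82, data83, and data99 entries of the front matter; it computes how large each download is relative to the loaded data. The variable names are illustrative.

```python
# (dataset_size, download_size) in bytes, copied from the card's front matter.
sizes = {
    "data82": (60840189, 32170622),
    "data83": (59641230, 31887045),
    "data99": (61145550, 31945385),
}

# Fraction of dataset_size that has to be downloaded for each configuration.
ratios = {name: download / dataset for name, (dataset, download) in sizes.items()}

# Configuration with the smallest dataset_size within this three-entry sample.
smallest = min(sizes, key=lambda name: sizes[name][0])

for name, ratio in ratios.items():
    print(f"{name}: download is {ratio:.1%} of dataset_size")
print("smallest dataset_size in this sample:", smallest)
```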