
BEE-spoke-data/TxT360-500k-sample-no_cc

Source: Hugging Face · Updated 2024-10-11 · Indexed 2025-04-12
Download link:
https://hf-mirror.com/datasets/BEE-spoke-data/TxT360-500k-sample-no_cc
Description:
---
language:
- en
- de
- ja
- fr
- es
- it
- cs
- ar
- pl
- ru
license: odc-by
size_categories:
- 100K<n<1M
task_categories:
- text-generation
- feature-extraction
dataset_info:
  features:
  - name: text
    dtype: string
  - name: meta
    struct:
    - name: corpusid
      dtype: int64
    - name: dup_signals
      struct:
      - name: dup_details
        struct:
        - name: 2013-20
          dtype: int64
        - name: 2013-48
          dtype: int64
        - name: 2014-10
          dtype: int64
        - name: 2014-15
          dtype: int64
        - name: 2014-23
          dtype: int64
        - name: 2014-35
          dtype: int64
        - name: 2014-41
          dtype: int64
        - name: 2014-42
          dtype: int64
        - name: 2014-49
          dtype: int64
        - name: 2014-52
          dtype: int64
        - name: 2015-06
          dtype: int64
        - name: 2015-11
          dtype: int64
        - name: 2015-14
          dtype: int64
        - name: 2015-18
          dtype: int64
        - name: 2015-22
          dtype: int64
        - name: 2015-27
          dtype: int64
        - name: 2015-32
          dtype: int64
        - name: 2015-35
          dtype: int64
        - name: 2015-40
          dtype: int64
        - name: 2015-48
          dtype: int64
        - name: 2016-07
          dtype: int64
        - name: 2016-18
          dtype: int64
        - name: 2016-22
          dtype: int64
        - name: 2016-26
          dtype: int64
        - name: 2016-30
          dtype: int64
        - name: 2016-36
          dtype: int64
        - name: 2016-40
          dtype: int64
        - name: 2016-44
          dtype: int64
        - name: 2016-50
          dtype: int64
        - name: 2017-04
          dtype: int64
        - name: 2017-09
          dtype: int64
        - name: 2017-13
          dtype: int64
        - name: 2017-17
          dtype: int64
        - name: 2017-22
          dtype: int64
        - name: 2017-26
          dtype: int64
        - name: 2017-30
          dtype: int64
        - name: 2017-34
          dtype: int64
        - name: 2017-39
          dtype: int64
        - name: 2017-43
          dtype: int64
        - name: 2017-47
          dtype: int64
        - name: 2017-51
          dtype: int64
        - name: 2018-05
          dtype: int64
        - name: 2018-09
          dtype: int64
        - name: 2018-13
          dtype: int64
        - name: 2018-17
          dtype: int64
        - name: 2018-22
          dtype: int64
        - name: 2018-26
          dtype: int64
        - name: 2018-30
          dtype: int64
        - name: 2018-34
          dtype: int64
        - name: 2018-39
          dtype: int64
        - name: 2018-43
          dtype: int64
        - name: 2018-47
          dtype: int64
        - name: 2018-51
          dtype: int64
        - name: 2019-04
          dtype: int64
        - name: 2019-09
          dtype: int64
        - name: 2019-13
          dtype: int64
        - name: 2019-18
          dtype: int64
        - name: 2019-22
          dtype: int64
        - name: 2019-26
          dtype: int64
        - name: 2019-30
          dtype: int64
        - name: 2019-35
          dtype: int64
        - name: 2019-39
          dtype: int64
        - name: 2019-43
          dtype: int64
        - name: 2019-47
          dtype: int64
        - name: 2019-51
          dtype: int64
        - name: 2020-05
          dtype: int64
        - name: 2020-10
          dtype: int64
        - name: 2020-16
          dtype: int64
        - name: 2020-24
          dtype: int64
        - name: 2020-29
          dtype: int64
        - name: 2020-34
          dtype: int64
        - name: 2020-40
          dtype: int64
        - name: 2020-45
          dtype: int64
        - name: 2020-50
          dtype: int64
        - name: 2021-04
          dtype: int64
        - name: 2021-10
          dtype: int64
        - name: 2021-17
          dtype: int64
        - name: 2021-21
          dtype: int64
        - name: 2021-25
          dtype: int64
        - name: 2021-31
          dtype: int64
        - name: 2021-39
          dtype: int64
        - name: 2021-43
          dtype: int64
        - name: 2021-49
          dtype: int64
        - name: 2022-05
          dtype: int64
        - name: 2022-21
          dtype: int64
        - name: 2022-27
          dtype: int64
        - name: 2022-33
          dtype: int64
        - name: 2022-40
          dtype: int64
        - name: 2022-49
          dtype: int64
        - name: 2023-06
          dtype: int64
        - name: 2023-14
          dtype: int64
        - name: 2023-23
          dtype: int64
        - name: 2023-40
          dtype: int64
        - name: 2023-50
          dtype: int64
        - name: 2024-10
          dtype: int64
        - name: 2024-18
          dtype: int64
        - name: 2024-22
          dtype: int64
        - name: 2024-26
          dtype: int64
        - name: 2024-30
          dtype: int64
        - name: curated_sources
          dtype: int64
        - name: unknown
          dtype: int64
      - name: dup_doc_count
        dtype: int64
      - name: dup_dump_count
        dtype: int64
    - name: file
      dtype: string
    - name: id
      dtype: int64
    - name: language
      dtype: string
    - name: openaccessinfo
      struct:
      - name: externalids
        struct:
        - name: ACL
          dtype: string
        - name: ArXiv
          dtype: string
        - name: DOI
          dtype: string
        - name: MAG
          dtype: string
        - name: PubMedCentral
          dtype: string
      - name: license
        dtype: string
      - name: status
        dtype: string
      - name: url
        dtype: string
    - name: pmid
      dtype: int64
    - name: source
      struct:
      - name: oainfo
        struct:
        - name: license
          dtype: string
        - name: openaccessurl
          dtype: string
        - name: status
          dtype: string
      - name: pdfsha
        dtype: string
      - name: pdfurls
        sequence: string
    - name: title
      dtype: string
    - name: url
      dtype: string
  - name: subset
    dtype: string
  - name: lang
    dtype: string
  splits:
  - name: train
    num_bytes: 4076258147
    num_examples: 500000
  download_size: 1977526595
  dataset_size: 4076258147
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---

# BEE-spoke-data/TxT360-500k-sample-no_cc

no common crawl
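
The `meta.dup_signals.dup_details` struct in the schema above has one `int64` field per dump label (for example `2013-20` or `2024-30`). A minimal sketch of reading it from a single record follows. It assumes that fields which do not apply to a record arrive as `None`, which the card does not state; the helper name `dup_dump_hits` and the sample record are hypothetical and the values are made up for illustration.

```python
from typing import Any, Dict, Mapping


def dup_dump_hits(record: Mapping[str, Any]) -> Dict[str, int]:
    """Return the dup_details entries of a record that hold a non-zero count.

    Assumption: structs or fields that do not apply to a record are None,
    so every level is read defensively.
    """
    meta = record.get("meta") or {}
    signals = meta.get("dup_signals") or {}
    details = signals.get("dup_details") or {}
    return {label: count for label, count in details.items() if count}


# Hypothetical record shaped like the schema; not taken from the dataset.
example = {
    "text": "placeholder text",
    "meta": {
        "dup_signals": {
            "dup_details": {"2013-20": None, "2019-22": 1, "2024-30": 2},
            "dup_doc_count": 3,
            "dup_dump_count": 2,
        },
    },
    "subset": "placeholder-subset",
    "lang": "en",
}

hits = dup_dump_hits(example)
print(hits)
```

With the `datasets` library installed, real records could be obtained with `load_dataset("BEE-spoke-data/TxT360-500k-sample-no_cc", split="train", streaming=True)` and passed to the same helper; streaming avoids fetching the full download (about 2 GB according to `download_size`) up front.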
Provider:
BEE-spoke-data