
olm/olm-CC-MAIN-2022-49-sampling-ratio-olm-0.15114822547

Source: Hugging Face. Updated: 2023-02-05. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/olm/olm-CC-MAIN-2022-49-sampling-ratio-olm-0.15114822547
Resource description:
---
annotations_creators:
- no-annotation
language:
- en
language_creators:
- found
license: []
multilinguality:
- monolingual
pretty_name: OLM November/December 2022 Common Crawl
size_categories:
- 10M<n<100M
source_datasets: []
tags:
- pretraining
- language modelling
- common crawl
- web
task_categories:
- text-generation
- fill-mask
task_ids:
- language-modeling
- masked-language-modeling
---

# Dataset Card for OLM November/December 2022 Common Crawl

Cleaned and deduplicated pretraining dataset, created with the OLM repo [here](https://github.com/huggingface/olm-datasets) from 15% of the November/December 2022 Common Crawl snapshot.

Note: `last_modified_timestamp` was parsed from whatever a website returned in its `Last-Modified` header; there are likely a small number of outliers that are incorrect, so we recommend removing the outliers before doing statistics with `last_modified_timestamp`.
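The card recommends removing outliers but does not say how. One possible approach is sketched below: keep only rows whose `last_modified_timestamp` falls inside a plausibility window. This is a minimal sketch under stated assumptions, not the OLM authors' method. It assumes the field holds Unix epoch seconds, and the `EARLIEST` and `LATEST` bounds, as well as the helper names `is_plausible` and `drop_timestamp_outliers`, are hypothetical choices made for illustration.

```python
from datetime import datetime, timezone

# Assumed plausibility window (hypothetical bounds, not taken from the card):
# nothing older than 1995, nothing newer than shortly after the crawl period.
EARLIEST = datetime(1995, 1, 1, tzinfo=timezone.utc).timestamp()
LATEST = datetime(2023, 1, 1, tzinfo=timezone.utc).timestamp()


def is_plausible(ts):
    """Return True if ts is a number inside the assumed window.

    None is rejected explicitly; NaN is rejected because every
    comparison involving NaN evaluates to False.
    """
    return ts is not None and EARLIEST <= ts <= LATEST


def drop_timestamp_outliers(rows, key="last_modified_timestamp"):
    """Keep only rows whose timestamp (assumed epoch seconds) is plausible."""
    return [row for row in rows if is_plausible(row.get(key))]


# Toy rows standing in for dataset records; the URLs are placeholders.
rows = [
    {"url": "https://example.com/a", "last_modified_timestamp": 1668000000.0},  # November 2022
    {"url": "https://example.com/b", "last_modified_timestamp": 0.0},           # epoch zero
    {"url": "https://example.com/c", "last_modified_timestamp": 4102444800.0},  # year 2100
    {"url": "https://example.com/d", "last_modified_timestamp": None},          # header missing
]
kept = drop_timestamp_outliers(rows)
print([row["url"] for row in kept])
```

If the data is loaded with the Hugging Face `datasets` library, the same predicate could be passed to `Dataset.filter`; the window bounds should be adjusted to whatever range suits the analysis at hand.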
Provider:
olm
Summary of original information

Dataset overview

Basic information

  • Name: OLM November/December 2022 Common Crawl
  • Language: English
  • Multilinguality: monolingual
  • License: not specified
  • Size: 10M<n<100M

Creation details

  • Language creators: found
  • Annotation creators: no annotation
  • Source datasets: none

Tasks and tags

  • Task categories:
    • Text generation
    • Fill-mask
  • Task IDs:
    • Language modeling
    • Masked language modeling

Dataset uses

  • Uses:
    • Pretraining
    • Language modelling
    • Common Crawl
    • Web data

Dataset description
