olm/olm-CC-MAIN-2022-49-sampling-ratio-olm-0.15114822547
Source: Hugging Face. Updated 2023-02-05; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/olm/olm-CC-MAIN-2022-49-sampling-ratio-olm-0.15114822547
Resource description:
---
annotations_creators:
- no-annotation
language:
- en
language_creators:
- found
license: []
multilinguality:
- monolingual
pretty_name: OLM November/December 2022 Common Crawl
size_categories:
- 10M<n<100M
source_datasets: []
tags:
- pretraining
- language modelling
- common crawl
- web
task_categories:
- text-generation
- fill-mask
task_ids:
- language-modeling
- masked-language-modeling
---
# Dataset Card for OLM November/December 2022 Common Crawl
A cleaned and deduplicated pretraining dataset, created with the [OLM repo](https://github.com/huggingface/olm-datasets) from 15% of the November/December 2022 Common Crawl snapshot.
Note: `last_modified_timestamp` was parsed from whatever a website returned in its `Last-Modified` header. A small number of these values are likely incorrect outliers, so we recommend removing the outliers before computing statistics on `last_modified_timestamp`.
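One way to remove such outliers is a simple plausibility window, sketched below. This is a minimal illustration, not the OLM authors' method. It assumes that `last_modified_timestamp` is stored as a number of seconds since the Unix epoch, and the window bounds (`EARLIEST`, `LATEST`) and the helper names are hypothetical choices made for this example; adjust them to your own analysis.

```python
from datetime import datetime, timezone

# Hypothetical bounds (assumptions, not from the dataset card):
# nothing before 1995, nothing after the end of the crawl period.
EARLIEST = datetime(1995, 1, 1, tzinfo=timezone.utc).timestamp()
LATEST = datetime(2022, 12, 31, 23, 59, 59, tzinfo=timezone.utc).timestamp()


def is_plausible_timestamp(value, earliest=EARLIEST, latest=LATEST):
    """Return True if value is a number inside [earliest, latest].

    Assumes value is seconds since the Unix epoch. Missing values,
    non-numeric values, and NaN are treated as implausible.
    """
    if value is None or isinstance(value, bool):
        return False
    try:
        ts = float(value)
    except (TypeError, ValueError):
        return False
    # NaN fails both comparisons, so it is rejected here as well.
    return earliest <= ts <= latest


def drop_timestamp_outliers(rows, field="last_modified_timestamp"):
    """Keep only the rows whose timestamp field passes the window check."""
    return [row for row in rows if is_plausible_timestamp(row.get(field))]


# Toy rows standing in for dataset records.
rows = [
    {"text": "a", "last_modified_timestamp": 1669852800.0},  # 2022-12-01 UTC
    {"text": "b", "last_modified_timestamp": 0.0},           # 1970-01-01 UTC
    {"text": "c", "last_modified_timestamp": 4102444800.0},  # 2100-01-01 UTC
    {"text": "d", "last_modified_timestamp": None},          # header missing
]
kept = drop_timestamp_outliers(rows)
print([row["text"] for row in kept])
```

If the dataset is loaded with the Hugging Face `datasets` library, the same predicate can be passed to `Dataset.filter`, for example `dataset.filter(lambda row: is_plausible_timestamp(row["last_modified_timestamp"]))`. A fixed window is the simplest choice; a percentile-based cutoff is an alternative when no natural bounds are known.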
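Provided by:
olm
Summary of original information
Dataset overview
Basic information
- Name: OLM November/December 2022 Common Crawl
- Language: English
- Multilinguality: monolingual
- License: not specified
- Size: 10M<n<100M
Creation details
- Language creators: found
- Annotation creators: no annotation
- Source datasets: none
Tasks and tags
- Task categories:
  - text generation
  - fill-mask
- Task IDs:
  - language modeling
  - masked language modeling
Dataset usage
- Tags:
  - pretraining
  - language modelling
  - common crawl
  - web
Dataset description
- Description: A cleaned and deduplicated pretraining dataset, created with the OLM repo (https://github.com/huggingface/olm-datasets) from 15% of the November/December 2022 Common Crawl snapshot.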



