olm/olm-CC-MAIN-2022-21-sampling-ratio-0.14775510204
Source: Hugging Face. Updated: 2022-11-04. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/olm/olm-CC-MAIN-2022-21-sampling-ratio-0.14775510204
Resource description:
---
annotations_creators:
- no-annotation
language:
- en
language_creators:
- found
license: []
multilinguality:
- monolingual
pretty_name: OLM May 2022 Common Crawl
size_categories:
- 10M<n<100M
source_datasets: []
tags:
- pretraining
- language modelling
- common crawl
- web
task_categories: []
task_ids: []
---
# Dataset Card for OLM May 2022 Common Crawl
Cleaned and deduplicated pretraining dataset, created with the [OLM repo](https://github.com/huggingface/olm-datasets) from 15% of the May 2022 Common Crawl snapshot.
Note: `last_modified_timestamp` was parsed from whatever a website returned in its `Last-Modified` header. A small number of values are likely incorrect outliers, so we recommend removing them before computing statistics on `last_modified_timestamp`.
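One possible way to remove such outliers is to keep only timestamps inside a plausible window. The sketch below is illustrative, not the OLM authors' method. It assumes `last_modified_timestamp` is stored as Unix epoch seconds and may be missing; the bounds (1990-01-01 and 2022-06-01, just after the May 2022 crawl) and the helper names `has_plausible_timestamp` and `drop_timestamp_outliers` are hypothetical choices, not part of the dataset or the OLM repo.

```python
from datetime import datetime, timezone

# Assumed bounds, not taken from the dataset card: anything dated before 1990
# or after the May 2022 crawl is treated as an outlier.
LOWER = datetime(1990, 1, 1, tzinfo=timezone.utc).timestamp()
UPPER = datetime(2022, 6, 1, tzinfo=timezone.utc).timestamp()


def has_plausible_timestamp(example, lower=LOWER, upper=UPPER):
    """Return True if the example's timestamp lies within [lower, upper].

    Assumes `last_modified_timestamp` holds Unix epoch seconds or None.
    """
    ts = example.get("last_modified_timestamp")
    if ts is None:
        return False
    return lower <= ts <= upper


def drop_timestamp_outliers(examples, lower=LOWER, upper=UPPER):
    """Keep only the examples whose timestamp passes the plausibility check."""
    return [ex for ex in examples if has_plausible_timestamp(ex, lower, upper)]
```

If the data is loaded with the Hugging Face `datasets` library, the same predicate could be passed to `Dataset.filter`. A fixed window is a deliberately simple choice; a percentile-based cutoff is an alternative if the distribution has been inspected first.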
Provider:
olm
Summary of original information
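Dataset card: OLM May 2022 Common Crawl
Basic information
- Name: OLM May 2022 Common Crawl
- Language: English
- Language creators: found
- Multilinguality: monolingual
- Size category: 10M<n<100M
- Tags: pretraining, language modelling, common crawl, web
Description
- Source: built with the OLM repo from 15% of the May 2022 Common Crawl snapshot.
- Processing: the data was cleaned and deduplicated.
Notes
- Timestamps:
`last_modified_timestamp` is parsed from each website's `Last-Modified` header, so a small number of values are likely incorrect outliers; removing them before computing statistics is recommended.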



