
olm/olm-CC-MAIN-2022-21-sampling-ratio-0.14775510204

Source: Hugging Face. Updated 2022-11-04; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/olm/olm-CC-MAIN-2022-21-sampling-ratio-0.14775510204
Resource description:
---
annotations_creators:
- no-annotation
language:
- en
language_creators:
- found
license: []
multilinguality:
- monolingual
pretty_name: OLM May 2022 Common Crawl
size_categories:
- 10M<n<100M
source_datasets: []
tags:
- pretraining
- language modelling
- common crawl
- web
task_categories: []
task_ids: []
---

# Dataset Card for OLM May 2022 Common Crawl

Cleaned and deduplicated pretraining dataset, created with the OLM repo [here](https://github.com/huggingface/olm-datasets) from 15% of the May 2022 Common Crawl snapshot.

Note: `last_modified_timestamp` was parsed from whatever a website returned in its `Last-Modified` header; there are likely a small number of outliers that are incorrect, so we recommend removing the outliers before doing statistics with `last_modified_timestamp`.
Provider:
olm
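Summary of original information

Dataset card: OLM May 2022 Common Crawl

Basic information

  • Name: OLM May 2022 Common Crawl
  • Language: English
  • Language creators: found
  • Multilinguality: monolingual
  • Size category: 10M<n<100M
  • Tags: pretraining, language modelling, common crawl, web

Description

  • Source: 15% of the May 2022 Common Crawl snapshot, created with the OLM repo.
  • Processing: the data has been cleaned and deduplicated.

Notes

  • Timestamps: last_modified_timestamp was parsed from each website's Last-Modified header. A small number of values are likely incorrect outliers, so removing them before computing statistics is recommended.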
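
The card recommends removing outliers but does not say how. One possible approach, shown here as a minimal sketch only, is to keep values that fall inside a plausible date window. It assumes `last_modified_timestamp` holds Unix epoch seconds (the card does not state the format), and the helper name `drop_timestamp_outliers` and the window bounds are hypothetical choices for illustration.

```python
from datetime import datetime, timezone
from statistics import median

# Hypothetical plausibility window (an assumption, not taken from the card):
# nothing before 1995, nothing after the end of the May 2022 crawl month.
EARLIEST = datetime(1995, 1, 1, tzinfo=timezone.utc).timestamp()
LATEST = datetime(2022, 6, 1, tzinfo=timezone.utc).timestamp()


def drop_timestamp_outliers(timestamps, earliest=EARLIEST, latest=LATEST):
    """Keep only timestamps that are present and inside [earliest, latest]."""
    return [t for t in timestamps if t is not None and earliest <= t <= latest]


# Made-up sample values, assumed to be Unix epoch seconds.
sample = [
    datetime(2021, 3, 14, tzinfo=timezone.utc).timestamp(),
    datetime(2022, 5, 20, tzinfo=timezone.utc).timestamp(),
    0.0,                                                     # epoch zero, likely a server default
    datetime(2099, 1, 1, tzinfo=timezone.utc).timestamp(),   # far in the future
    None,                                                    # header missing
]

cleaned = drop_timestamp_outliers(sample)
print(len(cleaned), "of", len(sample), "values kept")
print("median:", datetime.fromtimestamp(median(cleaned), tz=timezone.utc))
```

A fixed window is easy to reason about, but the bounds are a judgment call; a quantile-based cut would be another option if the true date range is unknown.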