JorgeeGF/CCNet
Hugging Face · Updated 2024-04-18 · Indexed 2024-04-19
Download link:
https://hf-mirror.com/datasets/JorgeeGF/CCNet
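Resource summary:
This dataset is a subset of the CCNet dataset, intended for researchers who need high-quality web-crawled text. It is derived from the Common Crawl project and processed to retain high-quality text content together with valuable metadata. It contains 4 million datapoints, each stored as a compressed JSON object in JSONL format. It is suitable for pre-training language models, studying internet text, and other NLP tasks that require diverse text input.
Provided by: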
JorgeeGF
Original information summary
CCNet Reproduced Split (4M rows, 3.7B Tokens)
Overview
- Source: Common Crawl
- Purpose: Facilitate easier access and processing of high-quality, web-crawled text data for natural language processing tasks.
- Size: 4 million datapoints
Dataset Description
Data Collection
- Origin: Collected from web pages across diverse domains.
- Processing: Retains high-quality text contents with valuable metadata.
- Token Count: 3,679,227,613 tokens (Mistral tokenizer)
Data Format
- Format: JSON Lines (JSONL), one JSON object per line
- Efficiency: Memory efficient for large datasets, allowing lazy parsing of data.
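The lazy parsing mentioned above can be sketched with a generator that decodes one line at a time. This is a minimal illustration, not part of the dataset's tooling: the function name `iter_jsonl` is hypothetical, and the `.gz` handling assumes that locally downloaded files may be gzip-compressed.

```python
import gzip
import json
from typing import Iterator


def iter_jsonl(path: str) -> Iterator[dict]:
    """Yield one parsed record per non-empty line without loading the whole file."""
    # Assumption: compressed files end in ".gz"; anything else is read as plain text.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)
```

Because the function is a generator, memory use stays roughly proportional to the largest single record rather than to the file size.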
Fields
- url
- date_download
- digest
- length
- nlines
- source_domain
- title
- raw_content
- original_nlines
- original_length
- language
- language_score
- perplexity
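As an illustration of how the metadata fields might be used, the sketch below filters records on `language`, `language_score`, and `perplexity`. The function name `keep_record` and the threshold values are hypothetical and chosen only for the example. It also assumes, as in the CCNet pipeline, that a lower perplexity indicates text closer to the reference corpus.

```python
from typing import Iterable, Iterator


def keep_record(
    record: dict,
    lang: str = "en",
    min_language_score: float = 0.9,  # hypothetical threshold
    max_perplexity: float = 500.0,    # hypothetical threshold
) -> bool:
    """Return True if the record matches the language and quality thresholds."""
    return (
        record.get("language") == lang
        and float(record.get("language_score", 0.0)) >= min_language_score
        and float(record.get("perplexity", float("inf"))) <= max_perplexity
    )


def filter_records(records: Iterable[dict], **thresholds) -> Iterator[dict]:
    """Lazily yield only the records accepted by keep_record."""
    return (r for r in records if keep_record(r, **thresholds))
```

Records missing a score or perplexity value are rejected by default, which errs on the side of dropping data rather than keeping unverified text.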
Usage
- Suitable For: Pre-training language models, studying internet-based text, and other NLP tasks requiring diverse text inputs.
- Access: Load via Hugging Face Datasets library using the following Python code:
    from datasets import load_dataset
    dataset = load_dataset("JorgeeGF/CCNet")
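With 4 million rows, it can be convenient to inspect a few records before downloading everything. The sketch below uses the `streaming=True` option of `load_dataset` for that purpose. The helper names `take` and `stream_ccnet` are hypothetical, and the split name `"train"` is an assumption that should be checked against the dataset page.

```python
from itertools import islice
from typing import Iterable, Iterator


def take(records: Iterable[dict], n: int) -> Iterator[dict]:
    """Return an iterator over at most the first n records."""
    return islice(records, n)


def stream_ccnet(split: str = "train"):
    """Open the dataset in streaming mode (assumes a split named 'train' exists)."""
    # Imported lazily so the helpers above work without the datasets package installed.
    from datasets import load_dataset

    return load_dataset("JorgeeGF/CCNet", split=split, streaming=True)


if __name__ == "__main__":
    # Requires network access and the datasets package.
    for record in take(stream_ccnet(), 3):
        print(record["url"], record["language"], record["perplexity"])
```

Streaming returns an iterable that fetches records on demand, so only the rows actually consumed are transferred.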



