JorgeeGF/CCNet

Name: JorgeeGF/CCNet
Creator: JorgeeGF
Published: 2024-04-18 23:23:51
License: Not specified

Hugging Face | Updated 2024-04-18 | Indexed 2024-04-19

Download link:

https://hf-mirror.com/datasets/JorgeeGF/CCNet

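Resource summary:

This dataset is a subset of the CCNet dataset, intended for researchers who need high-quality web-crawled text. It is derived from the Common Crawl project and processed to retain high-quality text content along with useful metadata. It contains 4 million datapoints, each stored as a compressed JSON object in JSONL format. The dataset is suitable for pre-training language models, studying internet text, and other NLP tasks that require diverse text inputs.

Provided by: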

JorgeeGF

Original information summary

CCNet Reproduced Split (4M rows, 3.7B Tokens)

Overview

Source: Common Crawl
Purpose: Facilitate easier access and processing of high-quality, web-crawled text data for natural language processing tasks.
Size: 4 million datapoints

Dataset Description

Data Collection

Origin: Collected from web pages across diverse domains.
Processing: Retains high-quality text content together with its metadata.
Token Count: 3679227613 tokens (Mistral tokenizer)
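
As a rough sanity check, the token count and row count above imply an average document length. The sketch below assumes the dataset has exactly 4,000,000 rows, which the card states only as a round figure, so the result is an approximation.

```python
# Figures quoted in the card. The row count is a round number,
# so treat the result as an estimate rather than an exact statistic.
TOTAL_TOKENS = 3_679_227_613  # counted with the Mistral tokenizer
TOTAL_ROWS = 4_000_000        # assumption: exactly 4 million rows

avg_tokens_per_row = TOTAL_TOKENS / TOTAL_ROWS
print(f"~{avg_tokens_per_row:.0f} tokens per row")  # ~920 tokens per row
```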

Data Format

Format: Newline-delimited JSON (JSON Lines, JSONL)
Efficiency: Memory-efficient for large datasets, because records can be parsed lazily, one line at a time.
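
The lazy parsing mentioned above can be sketched with the standard library alone. This is a minimal illustration, not the dataset author's tooling: the helper name `iter_jsonl` and the assumption that a shard may be gzip-compressed (detected by a `.gz` suffix) are hypothetical.

```python
import gzip
import json
from typing import Iterator


def iter_jsonl(path: str) -> Iterator[dict]:
    """Yield one record at a time from a JSONL file (plain or gzip-compressed).

    Only the current line is held in memory, so the file size does not
    bound memory use. The ".gz" suffix check is an assumption about how
    a local shard might be named.
    """
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

Because `iter_jsonl` is a generator, a loop such as `for record in iter_jsonl("shard.jsonl.gz"): ...` (with a hypothetical file name) can stop early without reading the rest of the file.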

Fields

url
date_download
digest
length
nlines
source_domain
title
raw_content
original_nlines
original_length
language
language_score
perplexity
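
The metadata fields listed above lend themselves to simple quality filtering. The sketch below is illustrative only: it assumes `language` is a short language code and that `language_score` and `perplexity` are numeric, with lower perplexity indicating cleaner text. The function name and the threshold values are hypothetical, not recommendations from the dataset card.

```python
from typing import Iterable, Iterator


def filter_records(
    records: Iterable[dict],
    lang: str = "en",                 # hypothetical default
    min_language_score: float = 0.9,  # hypothetical threshold
    max_perplexity: float = 500.0,    # hypothetical threshold
) -> Iterator[dict]:
    """Keep records in the target language that pass both score thresholds."""
    for rec in records:
        if rec.get("language") != lang:
            continue
        # Records missing a score are treated as failing the check.
        if rec.get("language_score", 0.0) < min_language_score:
            continue
        if rec.get("perplexity", float("inf")) > max_perplexity:
            continue
        yield rec


# Made-up records for illustration; real rows carry all the fields listed above.
sample = [
    {"url": "u1", "language": "en", "language_score": 0.98, "perplexity": 210.0},
    {"url": "u2", "language": "en", "language_score": 0.60, "perplexity": 180.0},
    {"url": "u3", "language": "fr", "language_score": 0.99, "perplexity": 150.0},
    {"url": "u4", "language": "en", "language_score": 0.95, "perplexity": 900.0},
]
print([rec["url"] for rec in filter_records(sample)])  # ['u1']
```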

Usage

Suitable For: Pre-training language models, studying internet-based text, and other NLP tasks requiring diverse text inputs.
Access: Load via the Hugging Face Datasets library using the following Python code:

from datasets import load_dataset

dataset = load_dataset("JorgeeGF/CCNet")
