open-index/fineweb-nlp
Source: Hugging Face. Updated 2026-04-21; indexed 2026-04-26.
Download link: https://hf-mirror.com/datasets/open-index/fineweb-nlp
Description:
---
license: odc-by
task_categories:
- text-generation
- feature-extraction
- text-classification
language:
- en
pretty_name: "FineWeb NLP"
size_categories:
- 10B<n<100B
tags:
- parquet
- fineweb
- nlp
- sentences
- paragraphs
- words
- ngrams
- english
configs:
- config_name: sentences
  data_files:
  - split: train
    path: data/sentences/**/*.parquet
- config_name: paragraphs
  data_files:
  - split: train
    path: data/paragraphs/**/*.parquet
- config_name: words
  data_files:
  - split: train
    path: data/words/**/*.parquet
- config_name: ngrams
  data_files:
  - split: train
    path: data/ngrams/**/*.parquet
- config_name: sentences-CC-MAIN-2016-18
  data_files:
  - split: train
    path: data/sentences/CC-MAIN-2016-18/*.parquet
- config_name: paragraphs-CC-MAIN-2016-18
  data_files:
  - split: train
    path: data/paragraphs/CC-MAIN-2016-18/*.parquet
- config_name: words-CC-MAIN-2016-18
  data_files:
  - split: train
    path: data/words/CC-MAIN-2016-18/*.parquet
- config_name: ngrams-CC-MAIN-2016-18
  data_files:
  - split: train
    path: data/ngrams/CC-MAIN-2016-18/*.parquet
- config_name: sentences-CC-MAIN-2015-40
  data_files:
  - split: train
    path: data/sentences/CC-MAIN-2015-40/*.parquet
- config_name: paragraphs-CC-MAIN-2015-40
  data_files:
  - split: train
    path: data/paragraphs/CC-MAIN-2015-40/*.parquet
- config_name: words-CC-MAIN-2015-40
  data_files:
  - split: train
    path: data/words/CC-MAIN-2015-40/*.parquet
- config_name: ngrams-CC-MAIN-2015-40
  data_files:
  - split: train
    path: data/ngrams/CC-MAIN-2015-40/*.parquet
- config_name: sentences-CC-MAIN-2016-26
  data_files:
  - split: train
    path: data/sentences/CC-MAIN-2016-26/*.parquet
- config_name: paragraphs-CC-MAIN-2016-26
  data_files:
  - split: train
    path: data/paragraphs/CC-MAIN-2016-26/*.parquet
- config_name: words-CC-MAIN-2016-26
  data_files:
  - split: train
    path: data/words/CC-MAIN-2016-26/*.parquet
- config_name: ngrams-CC-MAIN-2016-26
  data_files:
  - split: train
    path: data/ngrams/CC-MAIN-2016-26/*.parquet
- config_name: sentences-CC-MAIN-2015-18
  data_files:
  - split: train
    path: data/sentences/CC-MAIN-2015-18/*.parquet
- config_name: paragraphs-CC-MAIN-2015-18
  data_files:
  - split: train
    path: data/paragraphs/CC-MAIN-2015-18/*.parquet
- config_name: words-CC-MAIN-2015-18
  data_files:
  - split: train
    path: data/words/CC-MAIN-2015-18/*.parquet
- config_name: ngrams-CC-MAIN-2015-18
  data_files:
  - split: train
    path: data/ngrams/CC-MAIN-2015-18/*.parquet
---
# FineWeb NLP
**14,465,384,769 sentences** and **7,251,855,857 paragraphs** from **444,665,356 English documents** (848.6 GB source data) in [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb). The dataset provides every sentence, paragraph, word frequency, and n-gram frequency from these documents, split with language-aware segmentation and updated continuously as more crawls are processed.
## Table of Contents
- [What is this?](#what-is-this)
- [What is being released?](#what-is-being-released)
- [Data organization](#data-organization)
- [Sentence distribution by crawl](#sentence-distribution-by-crawl)
- [Paragraph distribution by crawl](#paragraph-distribution-by-crawl)
- [Splitting quality overview](#splitting-quality-overview)
- [How to download and use this dataset](#how-to-download-and-use-this-dataset)
- [Dataset statistics](#dataset-statistics)
- [How it works](#how-it-works)
- [Splitting methodology](#splitting-methodology)
- [Dataset card](#dataset-card)
---
## What is this?
[FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) is HuggingFace's
curated English web text corpus. It contains approximately 25.9 billion documents
totaling 48.6 TB of text and 18.5 trillion tokens, drawn from 110 Common Crawl
snapshots spanning 2013 to 2025. The text has been filtered for quality using
Gopher filters, C4 filters, and FineWeb-specific heuristics, then deduplicated
per crawl using MinHash.
Working directly with FineWeb requires downloading and processing tens of terabytes
of parquet files. Most researchers need just the sentences, or just the word
frequencies, or just a specific crawl period. They should not have to process
the entire corpus to get there.
**FineWeb NLP** solves this by pre-segmenting FineWeb documents into four
linguistically useful units (row counts cover the crawls processed so far):
| Type | Rows | What you get |
|------|------|-------------|
| **sentences** | 14,465,384,769 | One row per sentence, with source document ID, URL, and position index |
| **paragraphs** | 7,251,855,857 | One row per paragraph, with sentence count per paragraph |
| **words** | 974,348,267 | Per-shard word frequency and document frequency tables |
| **ngrams** | 51,023,752,730 | Per-shard bigram through 5-gram frequency tables |
Every row traces back to its source document through `doc_id` and `doc_url` fields.
The `dump` field identifies which Common Crawl snapshot the document came from,
allowing temporal analysis of language use across a decade of web content.
### Why per-shard frequency tables?
Words and n-grams are computed **per source shard** rather than aggregated into a
single global table. FineWeb has 27,468 source shards, and building a single global
frequency table would require holding billions of unique entries in memory
simultaneously. By keeping frequencies per-shard, each output file stays small and
self-contained.
Aggregation is straightforward. A single DuckDB query can combine all shards in
seconds:
```sql
SELECT word, sum(frequency) as total_freq, sum(doc_frequency) as total_doc_freq
FROM 'hf://datasets/open-index/fineweb-nlp/data/words/**/*.parquet'
GROUP BY word ORDER BY total_freq DESC LIMIT 100;
```
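The same aggregation can be sketched in plain Python for rows that have already been loaded. This is a minimal illustration, assuming each row is a dict with the `word`, `frequency`, and `doc_frequency` columns of the words config; the helper name `merge_shards` and the toy numbers are hypothetical.

```python
from collections import Counter

def merge_shards(shards):
    """Sum per-shard word counts into corpus-level totals.

    `shards` is an iterable of row lists; each row is a dict with the
    `word`, `frequency`, and `doc_frequency` columns of the words config.
    """
    total_freq = Counter()
    total_doc_freq = Counter()
    for rows in shards:
        for row in rows:
            total_freq[row["word"]] += row["frequency"]
            total_doc_freq[row["word"]] += row["doc_frequency"]
    return total_freq, total_doc_freq

# Two toy shards (made-up numbers, not real dataset values).
shard_a = [
    {"word": "the", "frequency": 10, "doc_frequency": 4},
    {"word": "fox", "frequency": 2, "doc_frequency": 1},
]
shard_b = [{"word": "the", "frequency": 7, "doc_frequency": 3}]

freq, doc_freq = merge_shards([shard_a, shard_b])
print(freq.most_common(2))
```

Because rare entries can be pruned within a shard, summed totals for low-frequency words should be read as lower bounds.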
## What is being released?
Four dataset configs, all stored as Snappy-compressed Parquet files:
### 1. Sentences (`config_name: sentences`)
| Column | Type | Description |
|--------|------|-------------|
| `sentence` | string | The extracted sentence |
| `doc_id` | string | Source document UUID from FineWeb |
| `doc_url` | string | Original web page URL |
| `position` | int32 | 0-based sentence index within the document |
| `length` | int32 | Sentence length in UTF-8 bytes (equal to `LENGTH(sentence)`) |
| `dump` | string | Common Crawl dump (e.g. `CC-MAIN-2024-10`) |
### 2. Paragraphs (`config_name: paragraphs`)
| Column | Type | Description |
|--------|------|-------------|
| `paragraph` | string | The paragraph text |
| `doc_id` | string | Source document UUID |
| `doc_url` | string | Original web page URL |
| `position` | int32 | 0-based paragraph index within the document |
| `length` | int32 | Paragraph length in UTF-8 bytes (equal to `LENGTH(paragraph)`) |
| `dump` | string | Common Crawl dump |
| `sentence_count` | int32 | Number of sentences detected in this paragraph |
### 3. Words (`config_name: words`)
| Column | Type | Description |
|--------|------|-------------|
| `word` | string | Lowercased, NFC-normalized word |
| `frequency` | int64 | Occurrence count within this shard |
| `doc_frequency` | int64 | Documents containing this word (within shard) |
| `dump` | string | Common Crawl dump |
### 4. N-grams (`config_name: ngrams`)
| Column | Type | Description |
|--------|------|-------------|
| `ngram` | string | Space-joined n-gram (e.g. "of the", "in the world") |
| `n` | int32 | N-gram size: 2 (bigram), 3 (trigram), 4, or 5 |
| `frequency` | int64 | Occurrence count within this shard |
| `dump` | string | Common Crawl dump |
## Data organization
```
open-index/fineweb-nlp/
├── README.md
├── stats.csv
└── data/
    ├── sentences/
    │   ├── CC-MAIN-2024-10/
    │   │   ├── 000_00000.parquet
    │   │   └── ...
    │   └── {dump}/{shard}.parquet
    ├── paragraphs/
    │   └── {dump}/{shard}.parquet
    ├── words/
    │   └── {dump}/{shard}.parquet
    └── ngrams/
        └── {dump}/{shard}.parquet
```
Each source FineWeb shard maps to exactly one output file per type per dump.
Shard names match the source file names (e.g. `000_00000`, `005_00049`).
## Sentence distribution by crawl
```
CC-MAIN-2016-18 ████████████████████████████████████████ 5,051,263,821
CC-MAIN-2015-40 ██████████████████████████████████████ 4,832,015,199
CC-MAIN-2016-26 ██████████████████████████████████ 4,383,722,027
CC-MAIN-2015-18 █ 198,383,722
```
<details>
<summary>SQL to reproduce this chart</summary>
```sql
SELECT dump, count(*) as sentences
FROM 'hf://datasets/open-index/fineweb-nlp/data/sentences/**/*.parquet'
GROUP BY dump ORDER BY sentences DESC LIMIT 30;
```
</details>
## Paragraph distribution by crawl
```
CC-MAIN-2016-18 ████████████████████████████████████████ 2,542,010,464
CC-MAIN-2015-40 █████████████████████████████████████ 2,410,792,807
CC-MAIN-2016-26 ██████████████████████████████████ 2,198,498,865
CC-MAIN-2015-18 █ 100,553,721
```
<details>
<summary>SQL to reproduce this chart</summary>
```sql
SELECT dump, count(*) as paragraphs
FROM 'hf://datasets/open-index/fineweb-nlp/data/paragraphs/**/*.parquet'
GROUP BY dump ORDER BY paragraphs DESC LIMIT 20;
```
</details>
## Splitting quality overview
```
CC-MAIN-2016-26 ████████████████████████████████████████ 33.3
CC-MAIN-2015-40 ██████████████████████████████████████ 32.2
CC-MAIN-2016-18 ██████████████████████████████████████ 32.2
CC-MAIN-2015-18 ████████████████████████████████████ 30.7
```
The chart above shows the average number of sentences extracted per source document
for each crawl snapshot. This metric serves as a rough proxy for content quality and
structural richness. Crawls where the average is high tend to contain longer,
well-structured articles with clear paragraph and sentence boundaries. Crawls with
lower averages typically have shorter source documents or contain more boilerplate
content that was not fully filtered during FineWeb's quality filtering stage.
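As a quick check, the averages can be recomputed from the sentence and document counts in the per-crawl breakdown table. The sketch below only does that arithmetic; the dictionary layout is an illustrative choice.

```python
# (sentences, documents) per crawl, copied from the per-crawl breakdown table.
crawls = {
    "CC-MAIN-2016-18": (5_051_263_821, 156_845_706),
    "CC-MAIN-2015-40": (4_832_015_199, 149_851_113),
    "CC-MAIN-2016-26": (4_383_722_027, 131_505_027),
    "CC-MAIN-2015-18": (198_383_722, 6_463_510),
}

# Average sentences per source document for each crawl.
avg_sentences = {dump: s / d for dump, (s, d) in crawls.items()}

for dump, avg in sorted(avg_sentences.items(), key=lambda kv: -kv[1]):
    print(f"{dump}  {avg:.1f}")
```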
## How to download and use this dataset
### 1. DuckDB (recommended for exploration)
DuckDB can query HuggingFace parquet files directly over HTTP without downloading
anything to disk. This makes it the fastest way to explore the dataset.
```sql
-- Count sentences per crawl dump
SELECT dump, count(*) as sentences
FROM 'hf://datasets/open-index/fineweb-nlp/data/sentences/**/*.parquet'
GROUP BY dump ORDER BY sentences DESC;
-- Read sentences from a specific crawl
SELECT sentence, doc_url
FROM 'hf://datasets/open-index/fineweb-nlp/data/sentences/CC-MAIN-2016-18/*.parquet'
LIMIT 20;
-- Top 100 most frequent English words
SELECT word, sum(frequency) as total_freq
FROM 'hf://datasets/open-index/fineweb-nlp/data/words/**/*.parquet'
GROUP BY word ORDER BY total_freq DESC LIMIT 100;
-- Most common bigrams
SELECT ngram, sum(frequency) as total_freq
FROM 'hf://datasets/open-index/fineweb-nlp/data/ngrams/**/*.parquet'
WHERE n = 2
GROUP BY ngram ORDER BY total_freq DESC LIMIT 50;
-- Average sentences per document per crawl
SELECT dump,
count(DISTINCT doc_id) as docs,
count(*) as sentences,
round(count(*) * 1.0 / count(DISTINCT doc_id), 1) as avg_sent_per_doc
FROM 'hf://datasets/open-index/fineweb-nlp/data/sentences/**/*.parquet'
GROUP BY dump ORDER BY sentences DESC LIMIT 20;
-- Find sentences containing a specific phrase
SELECT sentence, doc_url, dump
FROM 'hf://datasets/open-index/fineweb-nlp/data/sentences/CC-MAIN-2016-18/*.parquet'
WHERE sentence ILIKE '%artificial intelligence%'
LIMIT 20;
-- Word frequency trends across crawls
SELECT dump, sum(frequency) as freq
FROM 'hf://datasets/open-index/fineweb-nlp/data/words/**/*.parquet'
WHERE word = 'ai'
GROUP BY dump ORDER BY dump;
```
### 2. Python (datasets library)
```python
from datasets import load_dataset

# Stream all sentences (no full download needed)
ds = load_dataset("open-index/fineweb-nlp", "sentences", split="train", streaming=True)
for row in ds.take(10):
    print(f"[{row['dump']}] {row['sentence'][:100]}")

# Load paragraphs for a specific crawl (use one of the per-crawl configs listed in the card metadata)
ds = load_dataset("open-index/fineweb-nlp", "paragraphs-CC-MAIN-2016-18", split="train", streaming=True)

# Word frequencies
ds = load_dataset("open-index/fineweb-nlp", "words", split="train", streaming=True)
for row in ds.take(20):
    print(f"{row['word']:20s} freq={row['frequency']:>12,} doc_freq={row['doc_frequency']:>8,}")

# N-gram analysis: lazily keep only bigrams
ds = load_dataset("open-index/fineweb-nlp", "ngrams", split="train", streaming=True)
bigrams = (row for row in ds if row["n"] == 2)
```
### 3. huggingface_hub CLI
```bash
# Download sentences from one crawl
huggingface-cli download open-index/fineweb-nlp --include "data/sentences/CC-MAIN-2016-18/*" --repo-type dataset
# Download all word frequencies
huggingface-cli download open-index/fineweb-nlp --include "data/words/**/*" --repo-type dataset
# Download everything for one crawl
huggingface-cli download open-index/fineweb-nlp --include "data/*/CC-MAIN-2016-18/*" --repo-type dataset
```
### 4. pandas + DuckDB
```python
import duckdb
conn = duckdb.connect()
# Sentences as DataFrame
df = conn.sql("""
SELECT sentence, doc_url, position
FROM 'hf://datasets/open-index/fineweb-nlp/data/sentences/CC-MAIN-2016-18/*.parquet'
LIMIT 1000
""").df()
print(f"Loaded {len(df):,} sentences")
print(df.head(10))
# Word frequency analysis
words_df = conn.sql("""
SELECT word, sum(frequency) as total_freq
FROM 'hf://datasets/open-index/fineweb-nlp/data/words/**/*.parquet'
GROUP BY word ORDER BY total_freq DESC LIMIT 200
""").df()
print(words_df)
```
## Dataset statistics
| Metric | Value |
|--------|-------|
| **Total sentences** | **14,465,384,769** |
| **Total paragraphs** | **7,251,855,857** |
| **Unique word entries** (per-shard) | 974,348,267 |
| **Total n-gram entries** (per-shard) | 51,023,752,730 |
| **Crawl dumps processed** | **4** |
| **Source documents** | **444,665,356** |
| **Source data processed** | **848.6 GB** |
| **Output parquet size** | **2.3 TB** |
| Avg sentence length | 94.9 chars |
| Avg paragraph length | 189.5 chars |
| Avg sentences per document | 32.5 |
| Avg paragraphs per document | 16.3 |
| Avg sentences per paragraph | 2.0 |
### Per-crawl breakdown
| # | Crawl Dump | Sentences | Paragraphs | Words | Avg sentence length (chars) | Avg paragraph length (chars) | Docs | Shards | Source size | Output size |
|---|------------|-----------|------------|-------|----------|----------|------|--------|--------|--------|
| 1 | `CC-MAIN-2016-18` | 5,051,263,821 | 2,542,010,464 | 11,019,704,068 | 95.2 | 189.3 | 156,845,706 | 162 | 297.4 GB | 1.1 TB |
| 2 | `CC-MAIN-2015-40` | 4,832,015,199 | 2,410,792,807 | 0 | 94.9 | 190.2 | 149,851,113 | 150 | 283.3 GB | 542.5 GB |
| 3 | `CC-MAIN-2016-26` | 4,383,722,027 | 2,198,498,865 | 7,155,644,839 | 94.6 | 188.7 | 131,505,027 | 150 | 256.0 GB | 483.1 GB |
| 4 | `CC-MAIN-2015-18` | 198,383,722 | 100,553,721 | 3,289,732,703 | 97.2 | 192.7 | 6,463,510 | 6 | 12.0 GB | 181.1 GB |
## How it works
The pipeline is a single Go binary that walks every shard FineWeb has ever published,
splits the documents inside, and commits the results back to HuggingFace one shard at a
time. The scale is what makes the design interesting: FineWeb is 48.6 TB spread across
roughly 27,500 parquet shards from 110 Common Crawl snapshots, and any stage that tries
to hold more than one shard's worth of data in memory or on disk will eventually exhaust
the machine it runs on.
The core design choice is *process one shard end-to-end, persist nothing worth losing*.
A shard is small enough to decompress into working memory, large enough to amortize the
fixed cost of a HuggingFace commit, and self-contained enough that a crash mid-flight
costs minutes rather than hours. Every other decision in the pipeline — the sequential
download strategy, the lack of an external database, the refusal to batch commits across
shards — flows from that principle.
### The stages
**Download.** Source shards are pulled sequentially from HuggingFace over plain HTTP.
We do not fan out parallel downloads: the split stage keeps the CPU saturated on its
own, and parallel downloads would only invite rate limits without meaningfully
shortening wall-clock time. Downloads are idempotent by file size — a restart silently
skips shards that are already fully on disk and re-fetches anything that was cut off
mid-transfer.
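The size-based idempotence check can be sketched as follows. The pipeline itself is a Go binary, so this Python version is only an illustration; the function name `needs_download` is hypothetical, and it assumes the expected remote size is already known.

```python
import os

def needs_download(path, expected_size):
    """Return True if the local file is missing or its size differs.

    A file whose size matches `expected_size` is treated as complete;
    anything else (absent or truncated) is fetched again.
    """
    try:
        return os.path.getsize(path) != expected_size
    except OSError:
        # The file does not exist or cannot be read: download it.
        return True
```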
**Read.** Shards are streamed row-by-row via `parquet-go`, in batches of 10,000 rows
when the words and n-grams configs are enabled and up to 50,000 rows when only sentences
and paragraphs are being extracted. The batch size is not arbitrary: per-worker
frequency maps scale roughly linearly with the batch size, and 10K rows × 6 workers ×
~500 words per document × four n-gram sizes is already enough data to push a naive
implementation past a sensible memory ceiling. Reads are pipelined — the next batch is
prefetched while the current one is being split — so there is no I/O stall between
batches.
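The read-ahead idea can be illustrated with a small Python analogue. This is a sketch of the general technique (a bounded queue filled by a background thread), not the Go implementation; the name `prefetch` and the `depth` parameter are assumptions.

```python
import queue
import threading

def prefetch(batches, depth=1):
    """Yield items from `batches` while a background thread reads ahead.

    `depth` bounds how many batches may be buffered, so memory stays
    proportional to the batch size rather than the shard size.
    """
    buffer = queue.Queue(maxsize=depth)
    done = object()  # sentinel marking the end of the stream

    def producer():
        for batch in batches:
            buffer.put(batch)  # blocks while the buffer is full
        buffer.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()
        if item is done:
            return
        yield item

# The consumer processes batch k while the producer loads batch k + 1.
result = list(prefetch(iter([[1, 2], [3, 4], [5]])))
print(result)
```

A production version would also need to forward exceptions raised in the producer thread; that is omitted here for brevity.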
**Split.** Each batch is sharded across worker goroutines (one per CPU) that
independently run the segmentation logic described in the next section. Workers keep
thread-local frequency maps for words and n-grams to avoid lock contention in the hot
loop, and the per-worker maps are merged into the per-shard totals only at batch
boundaries. That merge is the only synchronization point during processing.
Frequency maps are pruned when they cross one million unique entries: rows with a count
of one are evicted first, and if that is not enough, the next-lowest-frequency rows
follow. Zipf's law makes this almost free in practice: the words and n-grams that are
queried most often sit in the head of the distribution, while the discarded long tail is
dominated by typos, OCR artifacts, and one-off URL fragments that carry little useful
signal.
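The pruning rule described above can be sketched in Python. This is one possible reading, assuming whole frequency levels are evicted at a time (so the map may shrink somewhat below the limit); the function name `prune` is hypothetical and the Go implementation may differ in detail.

```python
def prune(freq, max_entries=1_000_000):
    """Shrink `freq` (a dict of item -> count) to at most `max_entries` items.

    Singletons are evicted first; if the map is still too large, the
    next-lowest counts are evicted one level at a time.
    """
    if len(freq) <= max_entries:
        return freq
    threshold = 1
    while len(freq) > max_entries:
        # Drop every entry at or below the current threshold.
        freq = {key: count for key, count in freq.items() if count > threshold}
        if not freq:
            break
        threshold = min(freq.values())
    return freq

# Toy example with a tiny limit instead of one million.
counts = {"the": 50, "fox": 3, "teh": 1, "xqzv": 1, "lazy": 2}
print(prune(counts, max_entries=3))
```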
**Write.** Sentences, paragraphs, words, and n-grams are written to four separate
Snappy-compressed parquet files with 50,000 rows per row group. Snappy compresses web
text to roughly half its raw size and decompresses fast enough that DuckDB can scan the
dataset at full HTTP bandwidth without the CPU becoming the bottleneck. We deliberately
chose Snappy over Zstandard after benchmarking both: Zstandard produced noticeably
smaller files but was significantly slower on the read path, and read throughput is
what matters for a dataset meant to be queried over `hf://` URLs.
Row groups of 50,000 rows keep metadata overhead low while remaining small enough for
DuckDB's predicate pushdown to skip irrelevant groups when users filter by `dump` or
`doc_url`.
**Publish.** The four output files, a refreshed `stats.csv`, and a newly rendered
`README.md` are committed to HuggingFace as a single LFS-aware commit. Either every
file in the commit lands or none of them do, so a partial upload never leaves the
dataset in a half-written state.
HuggingFace rate limits are treated as first-class operational events. A 429 response
honors the `Retry-After` header when present and falls back to a two-minute wait when
it is not; other transient errors are retried with a linear backoff (30, 60, 90, 120,
150 seconds) up to five attempts. Beyond that, the shard is skipped for this run and
will be retried on the next pipeline invocation — a consequence of keeping `stats.csv`
as the only state of record.
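The retry policy can be sketched as below. The exception classes and function names are hypothetical stand-ins for whatever the Go pipeline uses, and the sketch assumes one initial attempt followed by up to five retries, with 429 responses counted against the same retry budget; the source does not spell out either detail.

```python
import time

class RateLimited(Exception):
    """Hypothetical error for an HTTP 429; carries Retry-After seconds if known."""
    def __init__(self, retry_after=None):
        super().__init__("rate limited")
        self.retry_after = retry_after

class TransientError(Exception):
    """Hypothetical error for other retryable failures, such as 5xx responses."""

def wait_seconds(error, retry_number):
    """Return the wait before retry `retry_number` (1-based)."""
    if isinstance(error, RateLimited):
        # Honor Retry-After when present, otherwise wait two minutes.
        return error.retry_after if error.retry_after is not None else 120
    return 30 * retry_number  # linear backoff: 30, 60, 90, 120, 150

def publish_with_retry(commit, max_retries=5, sleep=time.sleep):
    """Call `commit()` until it succeeds; return False if the shard is skipped."""
    for attempt in range(max_retries + 1):
        try:
            commit()
            return True
        except (RateLimited, TransientError) as err:
            if attempt == max_retries:
                return False  # skipped now, picked up again on the next run
            sleep(wait_seconds(err, attempt + 1))
    return False
```

Passing `sleep` as a parameter keeps the function testable without real waiting.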
**Clean up.** After a successful publish, the source shard and the four output files
are deleted. This is what lets the pipeline run indefinitely on a VM with 40–80 GB of
free disk while processing tens of terabytes over the course of days. It also means
`stats.csv` is the only signal that a shard has been completed — an absent output file
is indistinguishable from one that never existed, and the stats file carries the full
history.
### Resumability and state
The pipeline keeps exactly one piece of durable state: `stats.csv`, which records every
completed (dump, shard) pair along with its counts and byte totals. On startup it reads
the file, diffs the finished set against the list of source shards that still exist on
HuggingFace, and starts working on the remainder. There is no database, no queue, no
lock file, and no distributed coordination — just a flat CSV that happens to also be
human-readable and checked into the published dataset.
An earlier iteration used DuckDB for state tracking, which worked but added operational
overhead: backups, schema migrations, the occasional recovery from a partially written
database file. Falling back to CSV removed an entire category of failures and costs
almost nothing in performance. Even with 27,000+ rows, parsing the file at startup
takes well under a second, and append-only writes are atomic at the OS level for small
buffers.
The same `stats.csv` is committed to the HuggingFace repo on every shard publish, which
means the dataset itself is its own ledger. A fresh machine with no local state can
clone the repo, read the CSV, and pick up exactly where the last machine left off.
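The startup diff can be sketched in a few lines of Python. The column names `dump` and `shard` are assumptions about the layout of `stats.csv`, and `remaining_shards` is a hypothetical helper; the actual pipeline is written in Go.

```python
import csv
import io

def remaining_shards(stats_csv_text, source_shards):
    """Return the source (dump, shard) pairs not yet recorded in stats.csv."""
    reader = csv.DictReader(io.StringIO(stats_csv_text))
    done = {(row["dump"], row["shard"]) for row in reader}
    # Preserve the source order so work resumes deterministically.
    return [pair for pair in source_shards if pair not in done]

# Toy ledger with one completed shard (made-up counts).
stats = "dump,shard,sentences\nCC-MAIN-2016-18,000_00000,123\n"
source = [("CC-MAIN-2016-18", "000_00000"), ("CC-MAIN-2016-18", "000_00001")]

todo = remaining_shards(stats, source)
print(todo)
```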
### Resource budgets
The pipeline runs comfortably inside these ceilings on a 4-core VM with 8 GB of RAM:
| Resource | Budget | How |
|----------|--------|-----|
| **Memory** | ~200 MB resident | 10K-row parquet batches, frequency maps pruned at 1M entries |
| **Disk** | ~10 GB peak | One shard in flight, deleted after successful publish |
| **Network** | Sequential | One download and one commit at a time; backoff on 429 and 5xx |
These budgets are intentionally conservative. When the pipeline falls over, it is almost
always because of something external — a HuggingFace Hub incident, a transient DNS
failure, an OOM from some other process on the same VM — and the design means those
failures cost minutes of lost work rather than hours.
## Splitting methodology
### Sentence splitting
Sentence segmentation uses punctuation and casing heuristics tuned for English web
text. The rules are designed to be conservative, preferring to keep text together
rather than over-splitting. For short texts (under 500 characters), we use
[sentencex](https://github.com/wikimedia/sentencex), a Wikimedia project that provides
language-specific sentence boundary detection with knowledge of English abbreviation
patterns and punctuation norms.
| Rule | Example | Behavior |
|------|---------|----------|
| Period + space + uppercase | `world. The` | Split |
| Abbreviation + period | `Mr. Smith` | No split |
| Decimal number | `3.14 is` | No split |
| Single-letter initial | `J. K. Rowling` | No split |
| Exclamation/question | `really! What` | Split |
| Newline after 10+ chars | `long text\nNext` | Split |
| No space after period | `end.Next` | No split |
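The rules in the table can be approximated with a short Python sketch. It is a simplified illustration, not the pipeline's splitter: the abbreviation list is a small made-up sample, the newline rule is interpreted as "break at a newline once at least 10 characters have accumulated", and the sentencex path for short texts is not modeled.

```python
import re

# Small illustrative sample; a real list would be much longer.
ABBREVIATIONS = {"mr", "mrs", "ms", "dr", "prof", "st", "vs", "etc"}

# Candidate boundary: ., ! or ? followed by whitespace and an uppercase letter.
# "3.14" and "end.Next" never match because no whitespace follows the period.
BOUNDARY = re.compile(r"([.!?])\s+(?=[A-Z])")

def _blocks_split(before):
    """True if the word right before a period is an abbreviation or an initial."""
    match = re.search(r"([A-Za-z]+)$", before)
    if not match:
        return False
    word = match.group(1)
    return word.lower() in ABBREVIATIONS or (len(word) == 1 and word.isupper())

def split_sentences(text):
    # Newline rule: break at a newline only after 10+ accumulated characters.
    chunks, current = [], ""
    for line in text.split("\n"):
        current = f"{current} {line.strip()}".strip()
        if len(current) >= 10:
            chunks.append(current)
            current = ""
    if current:
        chunks.append(current)

    sentences = []
    for chunk in chunks:
        start = 0
        for m in BOUNDARY.finditer(chunk):
            if m.group(1) == "." and _blocks_split(chunk[start:m.start()]):
                continue  # abbreviation or single-letter initial: keep together
            sentences.append(chunk[start:m.end(1)])
            start = m.end()
        sentences.append(chunk[start:])
    return [s for s in sentences if s]

print(split_sentences("Mr. Smith met J. K. Rowling. She smiled!"))
```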
### Word splitting
Word extraction follows a straightforward pipeline designed to produce clean,
normalized tokens suitable for frequency analysis:
1. NFC normalization (Unicode canonical composition) to ensure that equivalent
character sequences are represented identically
2. Lowercase conversion for case-insensitive frequency counting
3. Splitting on non-letter, non-digit boundaries, while preserving apostrophes
and hyphens that appear mid-word (e.g. "don't", "well-known")
4. Stripping of leading and trailing punctuation
5. Filtering of empty strings and pure-punctuation tokens
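The steps above can be sketched in Python with the standard library. The regular expression is an assumption about how "letters and digits with mid-word apostrophes and hyphens" might be expressed; the Go pipeline may treat edge cases (for example curly apostrophes or underscores) differently.

```python
import re
import unicodedata

# A token is a run of letters/digits, optionally continued by a single
# apostrophe or hyphen that is followed by another letter/digit run.
# Leading and trailing punctuation never matches, so it is dropped.
TOKEN = re.compile(r"[^\W_]+(?:['-][^\W_]+)*")

def extract_words(text):
    """Return lowercased, NFC-normalized word tokens from `text`."""
    normalized = unicodedata.normalize("NFC", text).lower()
    return TOKEN.findall(normalized)

print(extract_words("Don't stop -- it's a well-known FACT!"))
```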
### Paragraph splitting
FineWeb's source text comes from HTML pages processed by trafilatura, a web content
extraction library. Paragraph boundaries are represented as single newlines (`\n`)
in the extracted text. We split on these newlines:
1. Split on single newlines
2. Trim leading and trailing whitespace from each paragraph
3. Discard fragments shorter than 20 characters, which typically correspond to
navigation elements, single-word headers, or other structural debris from the
original HTML
This simple approach works well in practice because trafilatura has already done the
hard work of extracting meaningful content blocks from the HTML.
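A minimal Python sketch of the three steps follows. The function name `split_paragraphs` is hypothetical, and the 20-character threshold is measured here in Python characters, which is an assumption about how the pipeline counts.

```python
def split_paragraphs(text, min_chars=20):
    """Split on single newlines, trim, and drop fragments under `min_chars`."""
    paragraphs = []
    for part in text.split("\n"):
        part = part.strip()
        if len(part) >= min_chars:
            paragraphs.append(part)
    return paragraphs

doc = "Home\n  This paragraph is long enough to keep.  \nMenu\nAnother full paragraph of body text."
print(split_paragraphs(doc))
```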
### N-gram extraction
N-grams are extracted by sliding a window of size *n* over the word token sequence
for each document. We compute bigrams (n=2), trigrams (n=3), 4-grams, and 5-grams.
| N | Name | Example from "the quick brown fox" |
|---|------|-------------------------------------|
| 2 | Bigram | "the quick", "quick brown", "brown fox" |
| 3 | Trigram | "the quick brown", "quick brown fox" |
| 4 | 4-gram | "the quick brown fox" |
| 5 | 5-gram | *(needs 5+ words)* |
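The sliding window from the table can be sketched in Python as follows. The helper names are hypothetical; the counter is keyed by `(n, ngram)` to mirror the `n` and `ngram` columns.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return space-joined n-grams from a list of word tokens."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def count_ngrams(tokens, sizes=(2, 3, 4, 5)):
    """Count n-grams of every size in `sizes` for one document."""
    counts = Counter()
    for n in sizes:
        counts.update((n, gram) for gram in ngrams(tokens, n))
    return counts

tokens = "the quick brown fox".split()
print(ngrams(tokens, 2))
print(ngrams(tokens, 5))  # empty: the document has fewer than 5 words
```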
To keep memory usage bounded, per-shard frequency maps are pruned when they exceed
1 million unique entries. During pruning, entries with a frequency of 1 are evicted
first. This means that very rare n-grams in large shards may be undercounted, but the
most frequent and analytically useful n-grams are preserved accurately.
## Dataset card
### Dataset summary
FineWeb NLP provides pre-segmented versions of HuggingFace's
[FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) dataset. FineWeb contains
approximately 25.9 billion source documents; the 444,665,356 documents from the 4 crawl
dumps processed so far have been split into sentences, paragraphs, words, and n-grams.
The `dump` field on every row identifies the Common Crawl snapshot, enabling temporal
analysis of English language use across the processed crawls (FineWeb as a whole spans
2013 to 2025).
### Data instances
**Sentence:**
```json
{
  "sentence": "The quick brown fox jumps over the lazy dog.",
  "doc_id": "f7ef49fc-6899-4d56-aaa7-bea5924802f3",
  "doc_url": "https://example.com/article",
  "position": 0,
  "length": 44,
  "dump": "CC-MAIN-2024-10"
}
```
**Word:**
```json
{
  "word": "the",
  "frequency": 12847,
  "doc_frequency": 9412,
  "dump": "CC-MAIN-2024-10"
}
```
**N-gram:**
```json
{
  "ngram": "of the",
  "n": 2,
  "frequency": 4523,
  "dump": "CC-MAIN-2024-10"
}
```
### Curation rationale
Sentence-level and word-level datasets are foundational for many areas of NLP research.
They are used to train sentence embeddings, build and evaluate language models, study
word frequency distributions and Zipf's law, analyze collocations and phrasal patterns,
and benchmark NLP tools. Having these units pre-extracted and ready to query saves
researchers significant time and computational resources. The temporal dimension
provided by Common Crawl snapshots also enables studies of how language use evolves
over time on the English-speaking web.
### Source data
All text originates from [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb).
FineWeb was constructed by extracting text from approximately 110 Common Crawl snapshots
using trafilatura, filtering with FastText for English (minimum confidence 0.65), applying
quality filters (Gopher, C4, FineWeb-specific), and deduplicating per crawl with MinHash.
We do not apply any additional filtering or deduplication beyond what FineWeb provides.
### Considerations for using the data
There are several important limitations to keep in mind when working with this dataset:
**English-only with threshold filtering.** FineWeb uses a minimum FastText confidence of
0.65 for English. Some documents near the threshold may contain mixed-language content.
Sentence splitting accuracy may be lower for these documents.
**Per-shard word frequencies.** Word and n-gram frequencies are computed per source shard,
not aggregated globally. To get corpus-level frequencies, aggregate with
`sum(frequency) GROUP BY word` in DuckDB or any query engine that can read Parquet.
**Temporal coverage.** Common Crawl snapshots are not uniformly distributed over time.
Some years have more snapshots than others, and crawl coverage varies. When comparing
word frequencies across crawls, be aware that differences may partly reflect changes in
crawl scope rather than genuine shifts in language use.
**No additional PII filtering.** This dataset does not apply any personally identifiable
information filtering beyond what was already done upstream by the FineWeb team. Web
text inherently contains names, email addresses, and other personal information.
### License
[ODC-By 1.0](https://opendatacommons.org/licenses/by/1-0/) (Open Data Commons Attribution License),
following FineWeb's license.
### Author
Created by **Duc-Tam Nguyen** ([tamnd](https://huggingface.co/tamnd)) as part of the
[open-index](https://huggingface.co/open-index) project.
### Citation
```bibtex
@misc{finewebnlp2026,
title = {FineWeb NLP: Sentences, Paragraphs, Words, and N-grams},
author = {Nguyen, Duc-Tam},
year = {2026},
url = {https://huggingface.co/datasets/open-index/fineweb-nlp},
note = {Derived from FineWeb (HuggingFaceFW/fineweb)}
}
@article{penedo2024fineweb,
title = {The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale},
author = {Guilherme Penedo and others},
year = {2024},
eprint = {2406.17557},
archivePrefix = {arXiv}
}
```
---
*Last updated: 2026-04-20 16:03 UTC*
Provider: open-index