
open-index/fineweb-nlp

Hugging Face · Updated 2026-04-21 · Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/open-index/fineweb-nlp
Resource description:
---
license: odc-by
task_categories:
- text-generation
- feature-extraction
- text-classification
language:
- en
pretty_name: "FineWeb NLP"
size_categories:
- 10B<n<100B
tags:
- parquet
- fineweb
- nlp
- sentences
- paragraphs
- words
- ngrams
- english
configs:
- config_name: sentences
  data_files:
  - split: train
    path: data/sentences/**/*.parquet
- config_name: paragraphs
  data_files:
  - split: train
    path: data/paragraphs/**/*.parquet
- config_name: words
  data_files:
  - split: train
    path: data/words/**/*.parquet
- config_name: ngrams
  data_files:
  - split: train
    path: data/ngrams/**/*.parquet
- config_name: sentences-CC-MAIN-2016-18
  data_files:
  - split: train
    path: data/sentences/CC-MAIN-2016-18/*.parquet
- config_name: paragraphs-CC-MAIN-2016-18
  data_files:
  - split: train
    path: data/paragraphs/CC-MAIN-2016-18/*.parquet
- config_name: words-CC-MAIN-2016-18
  data_files:
  - split: train
    path: data/words/CC-MAIN-2016-18/*.parquet
- config_name: ngrams-CC-MAIN-2016-18
  data_files:
  - split: train
    path: data/ngrams/CC-MAIN-2016-18/*.parquet
- config_name: sentences-CC-MAIN-2015-40
  data_files:
  - split: train
    path: data/sentences/CC-MAIN-2015-40/*.parquet
- config_name: paragraphs-CC-MAIN-2015-40
  data_files:
  - split: train
    path: data/paragraphs/CC-MAIN-2015-40/*.parquet
- config_name: words-CC-MAIN-2015-40
  data_files:
  - split: train
    path: data/words/CC-MAIN-2015-40/*.parquet
- config_name: ngrams-CC-MAIN-2015-40
  data_files:
  - split: train
    path: data/ngrams/CC-MAIN-2015-40/*.parquet
- config_name: sentences-CC-MAIN-2016-26
  data_files:
  - split: train
    path: data/sentences/CC-MAIN-2016-26/*.parquet
- config_name: paragraphs-CC-MAIN-2016-26
  data_files:
  - split: train
    path: data/paragraphs/CC-MAIN-2016-26/*.parquet
- config_name: words-CC-MAIN-2016-26
  data_files:
  - split: train
    path: data/words/CC-MAIN-2016-26/*.parquet
- config_name: ngrams-CC-MAIN-2016-26
  data_files:
  - split: train
    path: data/ngrams/CC-MAIN-2016-26/*.parquet
- config_name: sentences-CC-MAIN-2015-18
  data_files:
  - split: train
    path: data/sentences/CC-MAIN-2015-18/*.parquet
- config_name: paragraphs-CC-MAIN-2015-18
  data_files:
  - split: train
    path: data/paragraphs/CC-MAIN-2015-18/*.parquet
- config_name: words-CC-MAIN-2015-18
  data_files:
  - split: train
    path: data/words/CC-MAIN-2015-18/*.parquet
- config_name: ngrams-CC-MAIN-2015-18
  data_files:
  - split: train
    path: data/ngrams/CC-MAIN-2015-18/*.parquet
---

# FineWeb NLP

**14,465,384,769 sentences** and **7,251,855,857 paragraphs** from **444,665,356 English documents** (848.6 GB source data) in [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb).

Every sentence, paragraph, word frequency, and n-gram frequency, split with language-aware segmentation and continuously updated.

## Table of Contents

- [What is this?](#what-is-this)
- [What is being released?](#what-is-being-released)
- [Data organization](#data-organization)
- [Sentence distribution by crawl](#sentence-distribution-by-crawl)
- [Paragraph distribution by crawl](#paragraph-distribution-by-crawl)
- [Splitting quality overview](#splitting-quality-overview)
- [How to download and use this dataset](#how-to-download-and-use-this-dataset)
- [Dataset statistics](#dataset-statistics)
- [How it works](#how-it-works)
- [Splitting methodology](#splitting-methodology)
- [Dataset card](#dataset-card)

---

## What is this?

[FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) is HuggingFace's curated English web text corpus. It contains approximately 25.9 billion documents totaling 48.6 TB of text and 18.5 trillion tokens, drawn from 110 Common Crawl snapshots spanning 2013 to 2025. The text has been filtered for quality using Gopher filters, C4 filters, and FineWeb-specific heuristics, then deduplicated per crawl using MinHash.

Working directly with FineWeb requires downloading and processing tens of terabytes of parquet files. Most researchers need just the sentences, or just the word frequencies, or just a specific crawl period.
They should not have to process the entire corpus to get there.

**FineWeb NLP** solves this by pre-segmenting every document in FineWeb into four linguistically useful units:

| Type | Rows | What you get |
|------|------|-------------|
| **sentences** | 14,465,384,769 | One row per sentence, with source document ID, URL, and position index |
| **paragraphs** | 7,251,855,857 | One row per paragraph, with sentence count per paragraph |
| **words** | 974,348,267 | Per-shard word frequency and document frequency tables |
| **ngrams** | 51,023,752,730 | Per-shard bigram through 5-gram frequency tables |

Every row traces back to its source document through `doc_id` and `doc_url` fields. The `dump` field identifies which Common Crawl snapshot the document came from, allowing temporal analysis of language use across a decade of web content.

### Why per-shard frequency tables?

Words and n-grams are computed **per source shard** rather than aggregated into a single global table. FineWeb has 27,468 source shards, and building a single global frequency table would require holding billions of unique entries in memory simultaneously. By keeping frequencies per-shard, each output file stays small and self-contained.

Aggregation is straightforward. A single DuckDB query can combine all shards in seconds:

```sql
SELECT word, sum(frequency) as total_freq, sum(doc_frequency) as total_doc_freq
FROM 'hf://datasets/open-index/fineweb-nlp/data/words/**/*.parquet'
GROUP BY word
ORDER BY total_freq DESC
LIMIT 100;
```

## What is being released?

Four dataset configs, all stored as Snappy-compressed Parquet files:

### 1. Sentences (`config_name: sentences`)

| Column | Type | Description |
|--------|------|-------------|
| `sentence` | string | The extracted sentence |
| `doc_id` | string | Source document UUID from FineWeb |
| `doc_url` | string | Original web page URL |
| `position` | int32 | 0-based sentence index within the document |
| `length` | int32 | Sentence length in UTF-8 bytes (equal to `LENGTH(sentence)`) |
| `dump` | string | Common Crawl dump (e.g. `CC-MAIN-2024-10`) |

### 2. Paragraphs (`config_name: paragraphs`)

| Column | Type | Description |
|--------|------|-------------|
| `paragraph` | string | The paragraph text |
| `doc_id` | string | Source document UUID |
| `doc_url` | string | Original web page URL |
| `position` | int32 | 0-based paragraph index within the document |
| `length` | int32 | Paragraph length in UTF-8 bytes (equal to `LENGTH(paragraph)`) |
| `dump` | string | Common Crawl dump |
| `sentence_count` | int32 | Number of sentences detected in this paragraph |

### 3. Words (`config_name: words`)

| Column | Type | Description |
|--------|------|-------------|
| `word` | string | Lowercased, NFC-normalized word |
| `frequency` | int64 | Occurrence count within this shard |
| `doc_frequency` | int64 | Documents containing this word (within shard) |
| `dump` | string | Common Crawl dump |

### 4. N-grams (`config_name: ngrams`)

| Column | Type | Description |
|--------|------|-------------|
| `ngram` | string | Space-joined n-gram (e.g. "of the", "in the world") |
| `n` | int32 | N-gram size: 2 (bigram), 3 (trigram), 4, or 5 |
| `frequency` | int64 | Occurrence count within this shard |
| `dump` | string | Common Crawl dump |

## Data organization

```
open-index/fineweb-nlp/
├── README.md
├── stats.csv
└── data/
    ├── sentences/
    │   ├── CC-MAIN-2024-10/
    │   │   ├── 000_00000.parquet
    │   │   └── ...
    │   └── {dump}/{shard}.parquet
    ├── paragraphs/
    │   └── {dump}/{shard}.parquet
    ├── words/
    │   └── {dump}/{shard}.parquet
    └── ngrams/
        └── {dump}/{shard}.parquet
```

Each source FineWeb shard maps to exactly one output file per type per dump. Shard names match the source file names (e.g. `000_00000`, `005_00049`).

## Sentence distribution by crawl

```
CC-MAIN-2016-18  ████████████████████████████████████████ 5,051,263,821
CC-MAIN-2015-40  ██████████████████████████████████████ 4,832,015,199
CC-MAIN-2016-26  ██████████████████████████████████ 4,383,722,027
CC-MAIN-2015-18  █ 198,383,722
```

<details>
<summary>SQL to reproduce this chart</summary>

```sql
SELECT dump, count(*) as sentences
FROM 'hf://datasets/open-index/fineweb-nlp/data/sentences/**/*.parquet'
GROUP BY dump
ORDER BY sentences DESC
LIMIT 30;
```

</details>

## Paragraph distribution by crawl

```
CC-MAIN-2016-18  ████████████████████████████████████████ 2,542,010,464
CC-MAIN-2015-40  █████████████████████████████████████ 2,410,792,807
CC-MAIN-2016-26  ██████████████████████████████████ 2,198,498,865
CC-MAIN-2015-18  █ 100,553,721
```

<details>
<summary>SQL to reproduce this chart</summary>

```sql
SELECT dump, count(*) as paragraphs
FROM 'hf://datasets/open-index/fineweb-nlp/data/paragraphs/**/*.parquet'
GROUP BY dump
ORDER BY paragraphs DESC
LIMIT 20;
```

</details>

## Splitting quality overview

```
CC-MAIN-2016-26  ████████████████████████████████████████ 33.3
CC-MAIN-2015-40  ██████████████████████████████████████ 32.2
CC-MAIN-2016-18  ██████████████████████████████████████ 32.2
CC-MAIN-2015-18  ████████████████████████████████████ 30.7
```

The chart above shows the average number of sentences extracted per source document for each crawl snapshot. This metric serves as a rough proxy for content quality and structural richness. Crawls where the average is high tend to contain longer, well-structured articles with clear paragraph and sentence boundaries.
Crawls with lower averages typically have shorter source documents or contain more boilerplate content that was not fully filtered during FineWeb's quality filtering stage.

## How to download and use this dataset

The per-crawl examples below use `CC-MAIN-2016-18`, one of the four dumps published so far; substitute any dump listed in the per-crawl breakdown.

### 1. DuckDB (recommended for exploration)

DuckDB can query HuggingFace parquet files directly over HTTP without downloading anything to disk. This makes it the fastest way to explore the dataset.

```sql
-- Count sentences per crawl dump
SELECT dump, count(*) as sentences
FROM 'hf://datasets/open-index/fineweb-nlp/data/sentences/**/*.parquet'
GROUP BY dump
ORDER BY sentences DESC;

-- Read sentences from a specific crawl
SELECT sentence, doc_url
FROM 'hf://datasets/open-index/fineweb-nlp/data/sentences/CC-MAIN-2016-18/*.parquet'
LIMIT 20;

-- Top 100 most frequent English words
SELECT word, sum(frequency) as total_freq
FROM 'hf://datasets/open-index/fineweb-nlp/data/words/**/*.parquet'
GROUP BY word
ORDER BY total_freq DESC
LIMIT 100;

-- Most common bigrams
SELECT ngram, sum(frequency) as total_freq
FROM 'hf://datasets/open-index/fineweb-nlp/data/ngrams/**/*.parquet'
WHERE n = 2
GROUP BY ngram
ORDER BY total_freq DESC
LIMIT 50;

-- Average sentences per document per crawl
SELECT dump,
       count(DISTINCT doc_id) as docs,
       count(*) as sentences,
       round(count(*) * 1.0 / count(DISTINCT doc_id), 1) as avg_sent_per_doc
FROM 'hf://datasets/open-index/fineweb-nlp/data/sentences/**/*.parquet'
GROUP BY dump
ORDER BY sentences DESC
LIMIT 20;

-- Find sentences containing a specific phrase
SELECT sentence, doc_url, dump
FROM 'hf://datasets/open-index/fineweb-nlp/data/sentences/CC-MAIN-2016-18/*.parquet'
WHERE sentence ILIKE '%artificial intelligence%'
LIMIT 20;

-- Word frequency trends across crawls
SELECT dump, sum(frequency) as freq
FROM 'hf://datasets/open-index/fineweb-nlp/data/words/**/*.parquet'
WHERE word = 'ai'
GROUP BY dump
ORDER BY dump;
```

### 2. Python (datasets library)

```python
from datasets import load_dataset

# Stream all sentences (no full download needed)
ds = load_dataset("open-index/fineweb-nlp", "sentences", split="train", streaming=True)
for row in ds.take(10):
    print(f"[{row['dump']}] {row['sentence'][:100]}")

# Load paragraphs for a specific crawl (the config name must exist in the card metadata)
ds = load_dataset("open-index/fineweb-nlp", "paragraphs-CC-MAIN-2016-18", split="train", streaming=True)

# Word frequencies
ds = load_dataset("open-index/fineweb-nlp", "words", split="train", streaming=True)
for row in ds.take(20):
    print(f"{row['word']:20s} freq={row['frequency']:>12,} doc_freq={row['doc_frequency']:>8,}")

# N-gram analysis: lazily keep only the bigram rows
ds = load_dataset("open-index/fineweb-nlp", "ngrams", split="train", streaming=True)
bigrams = (row for row in ds if row["n"] == 2)
```

### 3. huggingface_hub CLI

```bash
# Download sentences from one crawl
huggingface-cli download open-index/fineweb-nlp --include "data/sentences/CC-MAIN-2016-18/*" --repo-type dataset

# Download all word frequencies
huggingface-cli download open-index/fineweb-nlp --include "data/words/**/*" --repo-type dataset

# Download everything for one crawl
huggingface-cli download open-index/fineweb-nlp --include "data/*/CC-MAIN-2016-18/*" --repo-type dataset
```

### 4. pandas + DuckDB

```python
import duckdb

conn = duckdb.connect()

# Sentences as DataFrame
df = conn.sql("""
    SELECT sentence, doc_url, position
    FROM 'hf://datasets/open-index/fineweb-nlp/data/sentences/CC-MAIN-2016-18/*.parquet'
    LIMIT 1000
""").df()
print(f"Loaded {len(df):,} sentences")
print(df.head(10))

# Word frequency analysis
words_df = conn.sql("""
    SELECT word, sum(frequency) as total_freq
    FROM 'hf://datasets/open-index/fineweb-nlp/data/words/**/*.parquet'
    GROUP BY word
    ORDER BY total_freq DESC
    LIMIT 200
""").df()
print(words_df)
```

## Dataset statistics

| Metric | Value |
|--------|-------|
| **Total sentences** | **14,465,384,769** |
| **Total paragraphs** | **7,251,855,857** |
| **Unique word entries** (per-shard) | 974,348,267 |
| **Total n-gram entries** (per-shard) | 51,023,752,730 |
| **Crawl dumps processed** | **4** |
| **Source documents** | **444,665,356** |
| **Source data processed** | **848.6 GB** |
| **Output parquet size** | **2.3 TB** |
| Avg sentence length | 94.9 chars |
| Avg paragraph length | 189.5 chars |
| Avg sentences per document | 32.5 |
| Avg paragraphs per document | 16.3 |
| Avg sentences per paragraph | 2.0 |

### Per-crawl breakdown

| # | Crawl Dump | Sentences | Paragraphs | Words | Avg sentence length (chars) | Avg paragraph length (chars) | Docs | Shards | Source size | Output size |
|---|------------|-----------|------------|-------|----------|----------|------|--------|--------|--------|
| 1 | `CC-MAIN-2016-18` | 5,051,263,821 | 2,542,010,464 | 11,019,704,068 | 95.2 | 189.3 | 156,845,706 | 162 | 297.4 GB | 1.1 TB |
| 2 | `CC-MAIN-2015-40` | 4,832,015,199 | 2,410,792,807 | 0 | 94.9 | 190.2 | 149,851,113 | 150 | 283.3 GB | 542.5 GB |
| 3 | `CC-MAIN-2016-26` | 4,383,722,027 | 2,198,498,865 | 7,155,644,839 | 94.6 | 188.7 | 131,505,027 | 150 | 256.0 GB | 483.1 GB |
| 4 | `CC-MAIN-2015-18` | 198,383,722 | 100,553,721 | 3,289,732,703 | 97.2 | 192.7 | 6,463,510 | 6 | 12.0 GB | 181.1 GB |

## How it works

The pipeline is a single Go binary that walks every shard FineWeb has ever published, splits the documents inside, and commits the results back to HuggingFace one shard at a time. The scale is what makes the design interesting: FineWeb is 48.6 TB spread across roughly 27,500 parquet shards from 110 Common Crawl snapshots, and any stage that tries to hold more than one shard's worth of data in memory or on disk will eventually exhaust the machine it runs on.

The core design choice is *process one shard end-to-end, persist nothing worth losing*. A shard is small enough to decompress into working memory, large enough to amortize the fixed cost of a HuggingFace commit, and self-contained enough that a crash mid-flight costs minutes rather than hours. Every other decision in the pipeline (the sequential download strategy, the lack of an external database, the refusal to batch commits across shards) flows from that principle.

### The stages

**Download.** Source shards are pulled sequentially from HuggingFace over plain HTTP. We do not fan out parallel downloads: the split stage keeps the CPU saturated on its own, and parallel downloads would only invite rate limits without meaningfully shortening wall-clock time. Downloads are idempotent by file size: a restart silently skips shards that are already fully on disk and re-fetches anything that was cut off mid-transfer.

**Read.** Shards are streamed row-by-row via `parquet-go`, in batches of 10,000 rows when the words and n-grams configs are enabled and up to 50,000 rows when only sentences and paragraphs are being extracted. The batch size is not arbitrary: per-worker frequency maps scale roughly linearly with the batch size, and 10K rows × 6 workers × ~500 words per document × four n-gram sizes is already enough data to push a naive implementation past a sensible memory ceiling. Reads are pipelined: the next batch is prefetched while the current one is being split, so there is no I/O stall between batches.
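
The batching and prefetching described in the read stage can be sketched in Python. This is only an illustration of the idea, not the pipeline's code (which is a Go binary built on `parquet-go`): the helper names `batched` and `prefetch`, and the choice of a background thread feeding a bounded queue, are assumptions made for this example.

```python
import queue
import threading
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")
_DONE = object()  # sentinel marking the end of the stream


def batched(rows: Iterable[T], size: int) -> Iterator[List[T]]:
    """Group a row stream into lists of at most `size` rows."""
    batch: List[T] = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # final, possibly short, batch
        yield batch


def prefetch(batches: Iterable[T], depth: int = 1) -> Iterator[T]:
    """Yield batches while a background thread reads up to `depth` ahead.

    The bounded queue caps how many batches sit in memory at once, so the
    reader can stay ahead of the consumer without running away from it.
    """
    buf: queue.Queue = queue.Queue(maxsize=depth)

    def producer() -> None:
        try:
            for batch in batches:
                buf.put(batch)  # blocks while the queue is full
        finally:
            buf.put(_DONE)  # always unblock the consumer

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buf.get()
        if item is _DONE:
            return
        yield item
```

A consumer would iterate `for batch in prefetch(batched(rows, 10_000)):` and split each batch while the next one is being read. Error propagation from the reader thread is deliberately left out of this sketch.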

**Split.** Each batch is sharded across worker goroutines (one per CPU) that independently run the segmentation logic described in the next section. Workers keep thread-local frequency maps for words and n-grams to avoid lock contention in the hot loop, and the per-worker maps are merged into the per-shard totals only at batch boundaries. That merge is the only synchronization point during processing. Frequency maps are pruned when they cross one million unique entries: entries with a count of one are evicted first, and if that is not enough, the next-lowest-frequency entries follow. Zipf's law makes this almost free in practice: the words and n-grams anyone will ever query sit in the head of the distribution, while the discarded tail is dominated by typos, OCR artifacts, and one-off URL fragments that were never useful signal to begin with.

**Write.** Sentences, paragraphs, words, and n-grams are written to four separate Snappy-compressed parquet files with 50,000 rows per row group. Snappy compresses web text to roughly half its raw size and decompresses fast enough that DuckDB can scan the dataset at full HTTP bandwidth without the CPU becoming the bottleneck. We deliberately chose Snappy over Zstandard after benchmarking both: Zstandard produced noticeably smaller files but was significantly slower on the read path, and read throughput is what matters for a dataset meant to be queried over `hf://` URLs. Row groups of 50,000 rows keep metadata overhead low while remaining small enough for DuckDB's predicate pushdown to skip irrelevant groups when users filter by `dump` or `doc_url`.

**Publish.** The four output files, a refreshed `stats.csv`, and a newly rendered `README.md` are committed to HuggingFace as a single LFS-aware commit. Either every file in the commit lands or none of them do, so a partial upload never leaves the dataset in a half-written state. HuggingFace rate limits are treated as first-class operational events.
A 429 response honors the `Retry-After` header when present and falls back to a two-minute wait when it is not; other transient errors are retried with a linear backoff (30, 60, 90, 120, 150 seconds) up to five attempts. Beyond that, the shard is skipped for this run and will be retried on the next pipeline invocation, a consequence of keeping `stats.csv` as the only state of record.

**Clean up.** After a successful publish, the source shard and the four output files are deleted. This is what lets the pipeline run indefinitely on a VM with 40–80 GB of free disk while processing tens of terabytes over the course of days. It also means `stats.csv` is the only signal that a shard has been completed: an absent output file is indistinguishable from one that never existed, and the stats file carries the full history.

### Resumability and state

The pipeline keeps exactly one piece of durable state: `stats.csv`, which records every completed (dump, shard) pair along with its counts and byte totals. On startup it reads the file, diffs the finished set against the list of source shards that still exist on HuggingFace, and starts working on the remainder. There is no database, no queue, no lock file, and no distributed coordination, just a flat CSV that happens to also be human-readable and checked into the published dataset.

An earlier iteration used DuckDB for state tracking, which worked but added operational overhead: backups, schema migrations, the occasional recovery from a partially written database file. Falling back to CSV removed an entire category of failures and costs almost nothing in performance. Even with 27,000+ rows, parsing the file at startup takes well under a second, and append-only writes are atomic at the OS level for small buffers.

The same `stats.csv` is committed to the HuggingFace repo on every shard publish, which means the dataset itself is its own ledger.
A fresh machine with no local state can clone the repo, read the CSV, and pick up exactly where the last machine left off.

### Resource budgets

The pipeline runs comfortably inside these ceilings on a 4-core VM with 8 GB of RAM:

| Resource | Budget | How |
|----------|--------|-----|
| **Memory** | ~200 MB resident | 10K-row parquet batches, frequency maps pruned at 1M entries |
| **Disk** | ~10 GB peak | One shard in flight, deleted after successful publish |
| **Network** | Sequential | One download and one commit at a time; backoff on 429 and 5xx |

These budgets are intentionally conservative. When the pipeline falls over, it is almost always because of something external (a HuggingFace Hub incident, a transient DNS failure, an OOM from some other process on the same VM), and the design means those failures cost minutes of lost work rather than hours.

## Splitting methodology

### Sentence splitting

Sentence segmentation uses punctuation and casing heuristics tuned for English web text. The rules are designed to be conservative, preferring to keep text together rather than over-splitting. For short texts (under 500 characters), we use [sentencex](https://github.com/wikimedia/sentencex), a Wikimedia project that provides language-specific sentence boundary detection with knowledge of English abbreviation patterns and punctuation norms.

| Rule | Example | Behavior |
|------|---------|----------|
| Period + space + uppercase | `world. The` | Split |
| Abbreviation + period | `Mr. Smith` | No split |
| Decimal number | `3.14 is` | No split |
| Single-letter initial | `J. K. Rowling` | No split |
| Exclamation/question | `really! What` | Split |
| Newline after 10+ chars | `long text\nNext` | Split |
| No space after period | `end.Next` | No split |

### Word splitting

Word extraction follows a straightforward pipeline designed to produce clean, normalized tokens suitable for frequency analysis:

1. NFC normalization (Unicode canonical composition) to ensure that equivalent character sequences are represented identically
2. Lowercase conversion for case-insensitive frequency counting
3. Splitting on non-letter, non-digit boundaries, while preserving apostrophes and hyphens that appear mid-word (e.g. "don't", "well-known")
4. Stripping of leading and trailing punctuation
5. Filtering of empty strings and pure-punctuation tokens

### Paragraph splitting

FineWeb's source text comes from HTML pages processed by trafilatura, a web content extraction library. Paragraph boundaries are represented as single newlines (`\n`) in the extracted text. We split on these newlines:

1. Split on single newlines
2. Trim leading and trailing whitespace from each paragraph
3. Discard fragments shorter than 20 characters, which typically correspond to navigation elements, single-word headers, or other structural debris from the original HTML

This simple approach works well in practice because trafilatura has already done the hard work of extracting meaningful content blocks from the HTML.

### N-gram extraction

N-grams are extracted by sliding a window of size *n* over the word token sequence for each document. We compute bigrams (n=2), trigrams (n=3), 4-grams, and 5-grams.

| N | Name | Example from "the quick brown fox" |
|---|------|-------------------------------------|
| 2 | Bigram | "the quick", "quick brown", "brown fox" |
| 3 | Trigram | "the quick brown", "quick brown fox" |
| 4 | 4-gram | "the quick brown fox" |
| 5 | 5-gram | *(needs 5+ words)* |

To keep memory usage bounded, per-shard frequency maps are pruned when they exceed 1 million unique entries. During pruning, entries with a frequency of 1 are evicted first. This means that very rare n-grams in large shards may be undercounted, but the most frequent and analytically useful n-grams are preserved accurately.
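
The word-splitting steps, the sliding-window extraction, and the pruning rule above can be approximated in a few lines of Python. This is a minimal sketch rather than the pipeline's Go implementation: the function names `split_words`, `ngrams`, and `prune`, the regular expression, and the decision to treat only the ASCII apostrophe and hyphen as mid-word joiners are all assumptions made for illustration.

```python
import re
import unicodedata
from collections import Counter
from typing import List

# Runs of letters/digits, optionally joined by a single ASCII apostrophe or
# hyphen between two such runs (assumption: curly quotes are not handled).
_WORD = re.compile(r"[^\W_]+(?:['\-][^\W_]+)*")


def split_words(text: str) -> List[str]:
    """NFC-normalize, lowercase, then extract word tokens.

    Leading/trailing punctuation and pure-punctuation tokens never match
    the pattern, so they are dropped implicitly.
    """
    return _WORD.findall(unicodedata.normalize("NFC", text).lower())


def ngrams(tokens: List[str], n: int) -> List[str]:
    """Slide a window of size n over the tokens; empty if too few tokens."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def prune(counts: Counter, max_entries: int) -> None:
    """Evict count-1 entries first, then the next-lowest counts, in place,
    until the map holds at most max_entries keys."""
    threshold = 1
    while len(counts) > max_entries:
        for key in [k for k, c in counts.items() if c <= threshold]:
            del counts[key]
        threshold += 1
```

For the table's example, `ngrams(split_words("the quick brown fox"), 2)` yields the three bigrams listed and `n=5` yields an empty list. Note that this `prune` removes every entry at a given count before checking the limit again, which is one plausible reading of "next-lowest-frequency entries follow"; the real eviction order may differ.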

## Dataset card

### Dataset summary

FineWeb NLP provides pre-segmented versions of HuggingFace's [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) dataset. Each of the approximately 25.9 billion source documents has been split into sentences, paragraphs, words, and n-grams. The `dump` field on every row identifies the Common Crawl snapshot, enabling temporal analysis of English language use from 2013 to 2025.

### Data instances

**Sentence:**

```json
{
  "sentence": "The quick brown fox jumps over the lazy dog.",
  "doc_id": "f7ef49fc-6899-4d56-aaa7-bea5924802f3",
  "doc_url": "https://example.com/article",
  "position": 0,
  "dump": "CC-MAIN-2024-10"
}
```

**Word:**

```json
{
  "word": "the",
  "frequency": 12847,
  "doc_frequency": 9412,
  "dump": "CC-MAIN-2024-10"
}
```

**N-gram:**

```json
{
  "ngram": "of the",
  "n": 2,
  "frequency": 4523,
  "dump": "CC-MAIN-2024-10"
}
```

### Curation rationale

Sentence-level and word-level datasets are foundational for many areas of NLP research. They are used to train sentence embeddings, build and evaluate language models, study word frequency distributions and Zipf's law, analyze collocations and phrasal patterns, and benchmark NLP tools. Having these units pre-extracted and ready to query saves researchers significant time and computational resources. The temporal dimension provided by Common Crawl snapshots also enables studies of how language use evolves over time on the English-speaking web.

### Source data

All text originates from [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb). FineWeb was constructed by extracting text from approximately 110 Common Crawl snapshots using trafilatura, filtering with FastText for English (minimum confidence 0.65), applying quality filters (Gopher, C4, FineWeb-specific), and deduplicating per crawl with MinHash. We do not apply any additional filtering or deduplication beyond what FineWeb provides.

### Considerations for using the data

There are several important limitations to keep in mind when working with this dataset:

**English-only with threshold filtering.** FineWeb uses a minimum FastText confidence of 0.65 for English. Some documents near the threshold may contain mixed-language content. Sentence splitting accuracy may be lower for these documents.

**Per-shard word frequencies.** Word and n-gram frequencies are computed per source shard, not aggregated globally. To get corpus-level frequencies, aggregate with `sum(frequency) GROUP BY word` in DuckDB or any query engine that can read Parquet.

**Temporal coverage.** Common Crawl snapshots are not uniformly distributed over time. Some years have more snapshots than others, and crawl coverage varies. When comparing word frequencies across crawls, be aware that differences may partly reflect changes in crawl scope rather than genuine shifts in language use.

**No additional PII filtering.** This dataset does not apply any personally identifiable information filtering beyond what was already done upstream by the FineWeb team. Web text inherently contains names, email addresses, and other personal information.

### License

[ODC-By 1.0](https://opendatacommons.org/licenses/by/1-0/) (Open Data Commons Attribution License), following FineWeb's license.

### Author

Created by **Duc-Tam Nguyen** ([tamnd](https://huggingface.co/tamnd)) as part of the [open-index](https://huggingface.co/open-index) project.

### Citation

```bibtex
@misc{finewebnlp2026,
  title = {FineWeb NLP: Sentences, Paragraphs, Words, and N-grams},
  author = {Nguyen, Duc-Tam},
  year = {2026},
  url = {https://huggingface.co/datasets/open-index/fineweb-nlp},
  note = {Derived from FineWeb (HuggingFaceFW/fineweb)}
}

@article{penedo2024fineweb,
  title = {The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale},
  author = {Guilherme Penedo and others},
  year = {2024},
  eprint = {2406.17557},
  archivePrefix = {arXiv}
}
```

---

*Last updated: 2026-04-20 16:03 UTC*
Provider:
open-index