guidelabs/fineweb-atlas
Source: Hugging Face | Updated: 2026-03-23 | Indexed: 2026-03-29
Download link:
https://hf-mirror.com/datasets/guidelabs/fineweb-atlas
Dataset description:
---
pretty_name: FineWeb Atlas (v0.1)
language:
- en
license: odc-by
task_categories:
- text-classification
- text-retrieval
- feature-extraction
size_categories:
- 10M<n<100M
annotations_creators:
- machine-generated
tags:
- fineweb
- concept-annotation
- topic-modeling
- text-mining
- cooccurrence
- taxonomy
source_datasets:
- HuggingFaceFW/fineweb
configs:
- config_name: concepts
  default: true
  data_files:
  - split: train
    path: fineweb-concept-atlas.parquet
- config_name: documents
  data_files:
  - split: train
    path: fineweb-atlas-documents/*.parquet
- config_name: chunks
  data_files:
  - split: train
    path: fineweb-atlas-annotated/*.parquet
- config_name: field_guide
  data_files:
  - split: train
    path: fineweb-atlas-annotated-reverse-index/*.parquet
dataset_info:
- config_name: concepts
  features:
  - name: concept_id
    dtype: uint32
  - name: concept_type
    dtype: string
  - name: name
    dtype: string
  - name: description
    dtype: string
  - name: taxonomy_lcc_path_primary
    dtype: string
  - name: chunk_count
    dtype: int64
  - name: chunk_prevalence
    dtype: float64
  splits:
  - name: train
    num_examples: 16790
- config_name: documents
  features:
  - name: document_text
    dtype: string
  - name: doc_int_id
    dtype: int32
  - name: chunk_count
    dtype: int32
  - name: document_token_count
    dtype: int32
  - name: chunk_char_starts
    sequence: int32
  - name: chunk_token_starts
    sequence: int32
  - name: chunk_token_counts
    sequence: int32
  - name: has_long_chunk
    dtype: bool
  - name: has_segmentation_error
    dtype: bool
  - name: content_ids
    sequence: int64
  - name: tone_ids
    sequence: int64
  - name: document_ids
    sequence: int64
  - name: entity_ids
    sequence: int64
  splits:
  - name: train
    num_examples: 14868862
- config_name: chunks
  features:
  - name: doc_int_id
    dtype: int32
  - name: chunk_id
    dtype: int16
  - name: chunk_text
    dtype: string
  - name: chunk_token_start
    dtype: int32
  - name: chunk_token_end
    dtype: int32
  - name: chunk_token_count
    dtype: int32
  - name: chunk_status
    dtype: string
  - name: tone_ids
    sequence: int64
  - name: entity_ids
    sequence: int64
  - name: content_ids
    sequence: int64
  - name: document_ids
    sequence: int64
  splits:
  - name: train
    num_examples: 95486049
- config_name: field_guide
  features:
  - name: concept_id
    dtype: int32
  - name: doc_int_id
    dtype: int32
  - name: chunk_id
    dtype: int16
  splits:
  - name: train
    num_examples: 1406432869
---
# FineWeb Atlas (v0.1)
**FineWeb Atlas** annotates 14.9 million [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) documents (95.5M chunks, 10.2B tokens) with **16,790 human-readable concepts** spanning entities, topics, tones, and document types. Each chunk receives ~15 concept labels on average. The release includes chunk- and document-level annotations, a concept metadata table with prevalence stats, a reverse index for concept-first retrieval, and a packed cooccurrence matrix.
For background on how the atlas was built, see the companion blog post: **[The FineWeb Concept Atlas](https://www.guidelabs.ai/post/the-fineweb-concept-atlas/)**.
## Quick start
```python
from datasets import load_dataset
repo_id = "guidelabs/fineweb-atlas"
# Concept metadata and prevalence stats (small enough to load fully)
concepts = load_dataset(repo_id, "concepts", split="train").to_pandas()
concept_name_by_id = dict(zip(concepts["concept_id"], concepts["name"]))
# One row per document (streaming recommended for the full dataset)
documents = load_dataset(repo_id, "documents", split="train", streaming=True)
first_doc = next(iter(documents))
# Original chunk-level rows
chunks = load_dataset(repo_id, "chunks", split="train", streaming=True)
first_chunk = next(iter(chunks))
```
Resolve concept IDs to human-readable names:
```python
def names_for_ids(ids, lookup):
    return [lookup[i] for i in ids]
# Example: show content concept names for the first chunk
print(names_for_ids(first_chunk["content_ids"], concept_name_by_id))
```
To work with the whole `documents` config as a pandas DataFrame, load it without streaming as shown below. This needs ample RAM, and the full `chunks` config is heavier still, so prefer `streaming=True` when memory is limited.
```python
from datasets import load_dataset
repo_id = "guidelabs/fineweb-atlas"
docs_df = load_dataset(repo_id, "documents", split="train").to_pandas()
```
### Loading the cooccurrence matrix
The cooccurrence matrix is stored as a numpy file and is not loadable via `load_dataset`. Download it directly from the dataset repository:
```python
import numpy as np
from huggingface_hub import hf_hub_download
path = hf_hub_download(
    repo_id="guidelabs/fineweb-atlas",
    filename="fineweb-concept-cooccurrence-matrix/fineweb-atlas-cooccurrence-upper-uint32.npy",
    repo_type="dataset",
)
packed = np.load(path, mmap_mode="r")
# Look up cooccurrence count for concepts i and j (where i <= j)
n = 16790
def cooc(i, j):
    if i > j:
        i, j = j, i
    return int(packed[i * n - (i * (i - 1)) // 2 + (j - i)])
# Example: how often do concepts 0 and 1 co-occur?
print(cooc(0, 1))
```
## Available configs
| Config | `load_dataset` call | Rows | Description |
|---|---|---|---|
| **`concepts`** (default) | `load_dataset("guidelabs/fineweb-atlas", "concepts")` | 16,790 | Concept metadata table with name, description, type, taxonomy, and prevalence stats. |
| **`documents`** | `load_dataset("guidelabs/fineweb-atlas", "documents")` | 14,868,862 | One row per document with full text, chunk character offsets, and document-level union concept lists by type. |
| **`chunks`** | `load_dataset("guidelabs/fineweb-atlas", "chunks")` | 95,486,049 | One row per chunk with chunk text, token span/count, quality status, and concept ID lists by type. |
| **`field_guide`** | `load_dataset("guidelabs/fineweb-atlas", "field_guide")` | 1,406,432,869 | One row per (concept_id, doc_int_id, chunk_id) assignment for concept-first retrieval. |
The **cooccurrence matrix** is a numpy file and must be loaded separately (see [Loading the cooccurrence matrix](#loading-the-cooccurrence-matrix) above).
All configs support `streaming=True`, which is recommended for the larger configs (`documents`, `chunks`, `field_guide`).
### Which config should I use?
- Start with `concepts` if you want to browse the concept inventory, descriptions, prevalence, or taxonomy.
- Use `documents` if you need full-document text plus document-level union labels.
- Use `chunks` if you need per-chunk labels, token spans, or chunk-local context.
- Use `field_guide` if you need reverse lookup from a concept to matching chunks.
## Shared IDs and joins
- `concept_id` is shared across all artifacts and is contiguous over `0..16789` (`16790` concepts total).
- `doc_int_id` is the shared document key across `documents`, `chunks`, and `field_guide`.
- `chunk_id` identifies a chunk within a document and joins `chunks` to `field_guide` when paired with `doc_int_id`.
- `document_text` in `documents` is reconstructed by directly concatenating the ordered chunk texts. The chunker preserves inter-chunk whitespace as a prefix on each subsequent chunk, so direct concatenation is lossless.
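The joins described above can be sketched with pandas on a few toy rows. The data values below are invented for illustration; only the column names and dtypes come from the schemas. Casting `concept_id` to a common integer type before merging is a precaution rather than a documented requirement, since `concepts` stores it as `uint32`, `field_guide` as `int32`, and the per-chunk lists as `int64`.

```python
import pandas as pd

# Toy stand-ins for the real configs (values are made up for illustration).
concepts = pd.DataFrame(
    {"concept_id": pd.Series([0, 1], dtype="uint32"), "name": ["neutral", "factual"]}
)
chunks = pd.DataFrame(
    {
        "doc_int_id": pd.Series([1, 1, 2], dtype="int32"),
        "chunk_id": pd.Series([0, 1, 0], dtype="int16"),
        "chunk_text": ["first chunk", " second chunk", "other doc"],
    }
)
field_guide = pd.DataFrame(
    {
        "concept_id": pd.Series([0, 1, 1], dtype="int32"),
        "doc_int_id": pd.Series([1, 1, 2], dtype="int32"),
        "chunk_id": pd.Series([0, 1, 0], dtype="int16"),
    }
)

# Align the concept key dtype on both sides before joining.
concepts_i64 = concepts.assign(concept_id=concepts["concept_id"].astype("int64"))
guide_i64 = field_guide.assign(concept_id=field_guide["concept_id"].astype("int64"))

# field_guide -> chunks on the composite key, then -> concepts on concept_id.
joined = (
    guide_i64.merge(chunks, on=["doc_int_id", "chunk_id"], how="inner")
    .merge(concepts_i64, on="concept_id", how="left")
)

# Chunk texts assigned the (toy) concept named "factual".
factual_texts = joined.loc[joined["name"] == "factual", "chunk_text"].tolist()
print(factual_texts)
```

On the real data, `field_guide` has about 1.4 billion rows, so filtering it to the concepts of interest before merging is likely to be necessary.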
## Schemas
### `documents`
One row per document.
| Column | Type | Description |
|---|---|---|
| `document_text` | `string` | Full document text (direct concatenation of chunk texts, identical to the original FineWeb source). |
| `doc_int_id` | `int32` | Document key, shared across all configs. |
| `chunk_count` | `int32` | Number of chunks in this document. |
| `document_token_count` | `int32` | Total tokens across all chunks. |
| `chunk_char_starts` | `list<int32>` | Character offset where each chunk begins in `document_text`. Chunk *i* spans `document_text[chunk_char_starts[i]:chunk_char_starts[i+1]]` (last chunk runs to end of string). |
| `chunk_token_starts` | `list<int32>` | Token offset where each chunk begins in the document's token sequence. |
| `chunk_token_counts` | `list<int32>` | Number of tokens in each chunk. |
| `has_long_chunk` | `bool` | Whether any chunk exceeded the 128-token target. |
| `has_segmentation_error` | `bool` | Whether sentence packing failed (entire document is one chunk). |
| `content_ids` | `list<int64>` | Union of content concept IDs across all chunks (sorted, deduplicated). |
| `tone_ids` | `list<int64>` | Union of tone concept IDs across all chunks (sorted, deduplicated). |
| `document_ids` | `list<int64>` | Union of document-type concept IDs across all chunks (sorted, deduplicated). |
| `entity_ids` | `list<int64>` | Union of entity concept IDs across all chunks (sorted, deduplicated). |
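As a minimal sketch of how `chunk_char_starts` can be used, the helper below slices `document_text` back into chunk strings. The function name `split_document` and the sample row are hypothetical; only the slicing rule comes from the column description above.

```python
def split_document(document_text, chunk_char_starts):
    """Return chunk strings by slicing document_text at the given start offsets."""
    # Append the end of the string so the last chunk runs to the end.
    bounds = list(chunk_char_starts) + [len(document_text)]
    return [document_text[a:b] for a, b in zip(bounds[:-1], bounds[1:])]


# Toy row shaped like a `documents` record (values invented for illustration).
row = {
    "document_text": "First sentence. Second sentence. Third.",
    "chunk_char_starts": [0, 15, 32],
}
pieces = split_document(row["document_text"], row["chunk_char_starts"])
print(pieces)

# Direct concatenation restores the original text.
print("".join(pieces) == row["document_text"])
```

In this toy row the second and third pieces start with a space, mirroring the whitespace-prefix convention that makes concatenation lossless.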
### `chunks`
One row per chunk.
| Column | Type | Description |
|---|---|---|
| `doc_int_id` | `int32` | Parent document key. |
| `chunk_id` | `int16` | Chunk index within the document (0-based). |
| `chunk_text` | `string` | Text content of this chunk. |
| `chunk_token_start` | `int32` | Inclusive token offset in the document. |
| `chunk_token_end` | `int32` | Exclusive token offset in the document. |
| `chunk_token_count` | `int32` | Number of tokens (`chunk_token_end - chunk_token_start`). |
| `chunk_status` | `string` | Quality status: `ok`, `long_chunk`, or `segmentation_error`. |
| `tone_ids` | `list<int64>` | Tone concept IDs assigned to this chunk. |
| `entity_ids` | `list<int64>` | Entity concept IDs assigned to this chunk. |
| `content_ids` | `list<int64>` | Content/topic concept IDs assigned to this chunk. |
| `document_ids` | `list<int64>` | Document-type concept IDs assigned to this chunk. |
### `field_guide`
One row per (concept, chunk) assignment. Useful for "which chunks mention concept X?" queries.
| Column | Type | Description |
|---|---|---|
| `concept_id` | `int32` | Concept identifier (0..16789). |
| `doc_int_id` | `int32` | Document key. |
| `chunk_id` | `int16` | Chunk index within the document. |
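A concept-first lookup over these rows might look like the sketch below. The helper `chunk_keys_for_concept` is hypothetical and works on any iterable of row dicts, so it is shown here on toy data rather than on the real config.

```python
from itertools import islice


def chunk_keys_for_concept(rows, concept_id, limit=5):
    """Return up to `limit` (doc_int_id, chunk_id) pairs assigned to one concept."""
    matches = (
        (r["doc_int_id"], r["chunk_id"])
        for r in rows
        if r["concept_id"] == concept_id
    )
    # islice stops the scan as soon as enough matches are found.
    return list(islice(matches, limit))


# Toy rows shaped like `field_guide` records (values invented for illustration).
rows = [
    {"concept_id": 7, "doc_int_id": 1, "chunk_id": 0},
    {"concept_id": 9, "doc_int_id": 1, "chunk_id": 0},
    {"concept_id": 7, "doc_int_id": 3, "chunk_id": 2},
]
print(chunk_keys_for_concept(rows, 7))
```

With the real data, `rows` could be the streaming dataset returned by `load_dataset("guidelabs/fineweb-atlas", "field_guide", split="train", streaming=True)`. This is a linear scan; the card does not state whether rows are sorted by `concept_id`, so the sketch does not assume it.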
### `concepts`
One row per concept.
| Column | Type | Description |
|---|---|---|
| `concept_id` | `uint32` | Concept identifier (0..16789). |
| `concept_type` | `string` | One of `entity`, `tone`, `content`, `document`. |
| `name` | `string` | Human-readable concept name. |
| `description` | `string` | Short description of the concept. |
| `taxonomy_lcc_path_primary` | `string` | Top-level Library of Congress Classification class (single letter, e.g. `H`, `T`, `Q`) or `"None"`. |
| `chunk_count` | `int64` | Number of chunks this concept appears in. |
| `chunk_prevalence` | `float64` | Fraction of all chunks containing this concept. |
### Cooccurrence matrix
Packed upper-triangular cooccurrence counts (with diagonal) over `concept_id`.
| File | Description |
|---|---|
| `fineweb-atlas-cooccurrence-upper-uint32.npy` | Numpy array, dtype `uint32`, length 140,960,445 (= n*(n+1)/2, n=16790). |
| `fineweb-atlas-cooccurrence-upper-uint32.json` | Metadata (dtype, dimensions, max value). |
Index mapping for `0 <= i <= j < n`:
```python
idx = i * n - (i * (i - 1)) // 2 + (j - i)
value = packed[idx] # equals C[i, j]
```
The diagonal entry `C[i, i]` is the number of chunks containing concept `i`, which is also available directly as `chunk_count` in the `concepts` config.
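Because the diagonal holds per-concept chunk counts, the packed array can also yield a conditional estimate such as P(j | i) = C[i, j] / C[i, i]. The sketch below is one way to do that; the helper names are hypothetical, and a small toy matrix stands in for the real file so the example runs on its own.

```python
import numpy as np


def packed_index(i, j, n):
    """Index of C[i, j] in the packed upper triangle (diagonal included)."""
    if i > j:
        i, j = j, i
    return i * n - (i * (i - 1)) // 2 + (j - i)


def cooc_row(packed, i, n):
    """Return the dense length-n vector C[i, :] as int64."""
    return np.array([packed[packed_index(i, j, n)] for j in range(n)], dtype=np.int64)


def conditional_prevalence(packed, i, n):
    """Estimate P(j | i) = C[i, j] / C[i, i] for every concept j."""
    row = cooc_row(packed, i, n)
    diag = row[i]
    return row / diag if diag > 0 else np.zeros(n)


# Toy symmetric matrix (values invented); pack it row-major like the real file.
dense = np.array([[4, 2, 1], [2, 5, 0], [1, 0, 3]], dtype=np.uint32)
n = dense.shape[0]
packed = dense[np.triu_indices(n)]

print(conditional_prevalence(packed, 0, n))
```

The Python loop in `cooc_row` is written for clarity rather than speed; for many rows of the full 16,790-concept matrix a vectorized index computation would be preferable.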
## Dataset statistics
| Metric | Value |
|---|---|
| Documents | 14,868,862 |
| Chunks | 95,486,049 |
| Total tokens | 10,183,028,973 |
| Concepts | 16,790 |
| Reverse-index rows | 1,406,432,869 |
| Avg. labels per chunk | 14.73 |
| Avg. chunks per document | 6.42 |
| Avg. tokens per document | 684.86 |
| Median chunks per document | 4 |
**Labels per chunk by type:**
| Type | Avg. per chunk | Share of all labels | Concept count |
|---|---|---|---|
| Tone | 6.90 | 46.8% | 587 |
| Content | 5.68 | 38.6% | 12,786 |
| Document | 1.52 | 10.3% | 31 |
| Entity | 0.63 | 4.3% | 3,386 |
**Chunk quality:** 97.5% `ok`, 2.5% `long_chunk`, 8 `segmentation_error`.
- **`ok`**: chunk is within the 128-token target.
- **`long_chunk`**: the sentence packer could not split the chunk further without breaking a sentence. These chunks exceed 128 tokens (median 171, max 54,408). Common causes are long unbroken paragraphs or code blocks.
- **`segmentation_error`**: sentence packing failed entirely and the whole document was placed into a single unsplit chunk. Only 8 documents (out of 14.9M) are affected, mostly very large spam pages or documents with unusual Unicode that confused the sentence splitter. These documents have concept annotations, but for the larger ones the labels may only reflect the beginning of the text due to annotation model context limits.
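The averages in the statistics table can be re-derived from the raw counts. The short check below assumes that each reverse-index row corresponds to exactly one label on one chunk, as the `field_guide` description implies.

```python
# Raw counts copied from the statistics table.
documents = 14_868_862
chunks = 95_486_049
tokens = 10_183_028_973
reverse_index_rows = 1_406_432_869

# One reverse-index row per (concept, chunk) assignment.
avg_labels_per_chunk = reverse_index_rows / chunks
avg_chunks_per_doc = chunks / documents
avg_tokens_per_doc = tokens / documents

print(round(avg_labels_per_chunk, 2))
print(round(avg_chunks_per_doc, 2))
print(round(avg_tokens_per_doc, 2))

# Per-type averages from the "Labels per chunk by type" table sum to the total.
per_type = {"tone": 6.90, "content": 5.68, "document": 1.52, "entity": 0.63}
print(round(sum(per_type.values()), 2))
```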
## How it was built
FineWeb Atlas was produced using the [ATLAS pipeline](https://www.guidelabs.ai/post/the-fineweb-concept-atlas/):
1. **Chunking.** FineWeb documents were split into chunks of ~128 tokens.
2. **LLM annotation.** Each chunk was annotated with structured concept tags by a language model, producing raw labels across four types: content topics, entities, tones, and document types.
3. **Concept consolidation.** Raw tags were embedded, clustered, and deduplicated into a canonical concept library of 16,790 concepts. Each concept received a human-readable name, description, and Library of Congress Classification path.
4. **Scalable prediction.** A trained set-prediction model applied the consolidated concept library to all 95.5M chunks.
5. **Quality filtering.** An LLM judge scored concept assignments, and a filtering policy determined final keep/drop decisions per concept-chunk pair.
For full details, see the [blog post](https://www.guidelabs.ai/post/the-fineweb-concept-atlas/).
## Common concepts: consider filtering
This release intentionally keeps very common labels, especially among `tone` and `document` concepts. They are useful for completeness, retrieval, and descriptive analysis, but they can dominate cooccurrence counts and downstream features if you are trying to surface more specific concepts.
In this release:
- `21` tone labels appear in at least `7%` of chunks.
- `8` document labels appear in at least `7%` of chunks.
- The most common overall labels are:
  - `matter-of-fact` (83.6%)
  - `Informational` (79.8%)
  - `Brief and clear` (69.0%)
  - `factual` (51.5%)
  - `neutral` (34.5%)
- The most common document label is `Short announcement/bulletin` (29.7%).
If you are training a downstream model, building sparse features, or trying to highlight more discriminative concepts, a reasonable first pass is to drop concepts above a prevalence threshold:
```python
from datasets import load_dataset
repo_id = "guidelabs/fineweb-atlas"
concepts = load_dataset(repo_id, "concepts", split="train").to_pandas()
common_ids = set(concepts.loc[concepts["chunk_prevalence"] >= 0.07, "concept_id"])
def drop_common(ids):
    return [cid for cid in ids if cid not in common_ids]
```
The right threshold depends on the use case; 7% is a simple starting point.
## Taxonomy distribution
Concepts are classified using the [Library of Congress Classification](https://www.loc.gov/catdir/cpso/lcco/) system. The top-level class distribution:
| Value | Domain | Concepts | Share |
|---|---|---|---|
| `H` | Social sciences | 3,895 | 23.2% |
| `T` | Technology | 1,959 | 11.7% |
| `G` | Geography, anthropology, recreation | 1,792 | 10.7% |
| `B` | Philosophy, psychology, religion | 1,332 | 7.9% |
| `R` | Medicine | 1,076 | 6.4% |
| `N` | Fine arts | 1,051 | 6.3% |
| `Q` | Science | 856 | 5.1% |
| `None` | No primary class assigned | 813 | 4.8% |
| `J` | Political science | 682 | 4.1% |
| `K` | Law | 539 | 3.2% |
| `D` | World history | 496 | 3.0% |
| `L` | Education | 491 | 2.9% |
| `P` | Language and literature | 476 | 2.8% |
| `E` | History of the USA | 314 | 1.9% |
| `S` | Agriculture | 247 | 1.5% |
| `C` | Auxiliary sciences of history | 195 | 1.2% |
| `A` | General works, reference | 190 | 1.1% |
| `Z` | Bibliography, library science | 144 | 0.9% |
| `U` | Military science | 119 | 0.7% |
| `F` | History of the Americas | 108 | 0.6% |
| `V` | Naval science | 15 | 0.1% |
This is the complete set of values for `taxonomy_lcc_path_primary` in the `concepts` config.
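A distribution like the one in the table can be computed from the `concepts` config with a simple `value_counts`. The sketch below runs on a toy frame with invented rows, and the `LCC_DOMAINS` mapping is a hypothetical helper that covers only a few of the classes listed above.

```python
import pandas as pd

# Partial code-to-domain mapping, taken from the table above.
LCC_DOMAINS = {
    "H": "Social sciences",
    "T": "Technology",
    "Q": "Science",
    "None": "No primary class assigned",  # stored as the string "None"
}

# Toy stand-in for the `concepts` config (values invented for illustration).
concepts = pd.DataFrame(
    {
        "concept_id": range(6),
        "taxonomy_lcc_path_primary": ["H", "H", "T", "None", "H", "Q"],
    }
)

counts = concepts["taxonomy_lcc_path_primary"].value_counts()
summary = pd.DataFrame({"concepts": counts, "share": counts / counts.sum()})
summary["domain"] = [LCC_DOMAINS.get(code, "unknown") for code in summary.index]
print(summary)
```

Unassigned concepts carry the literal string `"None"`, so they show up as an ordinary category and are not dropped as missing values.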
## Relationship to FineWeb
This dataset annotates the official 10B-token subsample of [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) (`sample/10BT`), covering all 14.9M documents in that subsample. `doc_int_id` is a 1-based integer assigned sequentially during chunking and is stable within this release. It is not a FineWeb row index, but `document_text` in the `documents` config is identical to the original FineWeb source text and can be used for matching.
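One way to match on text, sketched under assumptions, is to hash both sides and join on the digest. The card does not prescribe a method; the helper names below are hypothetical, and the sketch assumes FineWeb rows expose `text` and `id` fields. The sample rows and IDs are invented.

```python
import hashlib


def text_key(text):
    """Stable digest of a document's text, used as the join key."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def build_fineweb_lookup(fineweb_rows):
    """Map text digest -> FineWeb `id` (assumes rows carry `text` and `id`)."""
    return {text_key(r["text"]): r["id"] for r in fineweb_rows}


def match_documents(atlas_rows, lookup):
    """Return {doc_int_id: FineWeb id, or None when no identical text is found}."""
    return {
        r["doc_int_id"]: lookup.get(text_key(r["document_text"]))
        for r in atlas_rows
    }


# Toy rows (invented) standing in for FineWeb sample/10BT and the `documents` config.
fineweb_rows = [
    {"id": "fw-a", "text": "Hello world. This is a page."},
    {"id": "fw-b", "text": "Another page entirely."},
]
atlas_rows = [
    {"doc_int_id": 1, "document_text": "Another page entirely."},
    {"doc_int_id": 2, "document_text": "Text that is not in the toy sample."},
]
matches = match_documents(atlas_rows, build_fineweb_lookup(fineweb_rows))
print(matches)
```

If two source documents share identical text, this lookup keeps only the last `id` seen, so duplicates would need separate handling.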
## Notes and caveats
- **Machine-generated labels.** Concept assignments include noise, especially for rarer concepts.
- **Source text quality.** Some chunk text is noisy or garbled from the original web scrape; concept quality depends on chunk quality.
- **No PII filtering.** This dataset inherits the personal information characteristics of FineWeb. Web-scraped text may contain names, emails, or other personal information.
- **English only.** Annotations were produced for English text. Non-English passages that appear in FineWeb may have unreliable labels.
- **`documents` is derived.** The `documents` config is built by concatenating chunks. The text is identical to the FineWeb source, but for per-chunk detail (individual chunk statuses, per-chunk concept lists) use the `chunks` config.
## Versioning
This is `v0.1`, the initial research release. The concept inventory, annotation model, and filtering policy may be revised in future versions. Breaking changes (e.g., concept ID renumbering) will increment the version.
## Suggested citation
If you use this dataset, please cite:
```bibtex
@misc{monson_fineweb_concept_atlas_2026,
  author = {Nathaniel Monson},
  title = {The FineWeb Concept Atlas},
  year = {2026},
  howpublished = {\url{https://www.guidelabs.ai/post/the-fineweb-concept-atlas/}},
  note = {Guide Labs}
}
```
Provider: guidelabs



