
guidelabs/fineweb-atlas

Hugging Face · Updated 2026-03-23 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/guidelabs/fineweb-atlas
Resource description:
---
pretty_name: FineWeb Atlas (v0.1)
language:
- en
license: odc-by
task_categories:
- text-classification
- text-retrieval
- feature-extraction
size_categories:
- 10M<n<100M
annotations_creators:
- machine-generated
tags:
- fineweb
- concept-annotation
- topic-modeling
- text-mining
- cooccurrence
- taxonomy
source_datasets:
- HuggingFaceFW/fineweb
configs:
- config_name: concepts
  default: true
  data_files:
  - split: train
    path: fineweb-concept-atlas.parquet
- config_name: documents
  data_files:
  - split: train
    path: fineweb-atlas-documents/*.parquet
- config_name: chunks
  data_files:
  - split: train
    path: fineweb-atlas-annotated/*.parquet
- config_name: field_guide
  data_files:
  - split: train
    path: fineweb-atlas-annotated-reverse-index/*.parquet
dataset_info:
- config_name: concepts
  features:
  - name: concept_id
    dtype: uint32
  - name: concept_type
    dtype: string
  - name: name
    dtype: string
  - name: description
    dtype: string
  - name: taxonomy_lcc_path_primary
    dtype: string
  - name: chunk_count
    dtype: int64
  - name: chunk_prevalence
    dtype: float64
  splits:
  - name: train
    num_examples: 16790
- config_name: documents
  features:
  - name: document_text
    dtype: string
  - name: doc_int_id
    dtype: int32
  - name: chunk_count
    dtype: int32
  - name: document_token_count
    dtype: int32
  - name: chunk_char_starts
    sequence: int32
  - name: chunk_token_starts
    sequence: int32
  - name: chunk_token_counts
    sequence: int32
  - name: has_long_chunk
    dtype: bool
  - name: has_segmentation_error
    dtype: bool
  - name: content_ids
    sequence: int64
  - name: tone_ids
    sequence: int64
  - name: document_ids
    sequence: int64
  - name: entity_ids
    sequence: int64
  splits:
  - name: train
    num_examples: 14868862
- config_name: chunks
  features:
  - name: doc_int_id
    dtype: int32
  - name: chunk_id
    dtype: int16
  - name: chunk_text
    dtype: string
  - name: chunk_token_start
    dtype: int32
  - name: chunk_token_end
    dtype: int32
  - name: chunk_token_count
    dtype: int32
  - name: chunk_status
    dtype: string
  - name: tone_ids
    sequence: int64
  - name: entity_ids
    sequence: int64
  - name: content_ids
    sequence: int64
  - name: document_ids
    sequence: int64
  splits:
  - name: train
    num_examples: 95486049
- config_name: field_guide
  features:
  - name: concept_id
    dtype: int32
  - name: doc_int_id
    dtype: int32
  - name: chunk_id
    dtype: int16
  splits:
  - name: train
    num_examples: 1406432869
---

# FineWeb Atlas (v0.1)

**FineWeb Atlas** annotates 14.9 million [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) documents (95.5M chunks, 10.2B tokens) with **16,790 human-readable concepts** spanning entities, topics, tones, and document types. Each chunk receives ~15 concept labels on average. The release includes chunk- and document-level annotations, a concept metadata table with prevalence stats, a reverse index for concept-first retrieval, and a packed cooccurrence matrix.

For background on how the atlas was built, see the companion blog post: **[The FineWeb Concept Atlas](https://www.guidelabs.ai/post/the-fineweb-concept-atlas/)**.

## Quick start

```python
from datasets import load_dataset

repo_id = "guidelabs/fineweb-atlas"

# Concept metadata and prevalence stats (small enough to load fully)
concepts = load_dataset(repo_id, "concepts", split="train").to_pandas()
concept_name_by_id = dict(zip(concepts["concept_id"], concepts["name"]))

# One row per document (streaming recommended for the full dataset)
documents = load_dataset(repo_id, "documents", split="train", streaming=True)
first_doc = next(iter(documents))

# Original chunk-level rows
chunks = load_dataset(repo_id, "chunks", split="train", streaming=True)
first_chunk = next(iter(chunks))
```

Resolve concept IDs to human-readable names:

```python
def names_for_ids(ids, lookup):
    return [lookup[i] for i in ids]

# Example: show content concept names for the first chunk
print(names_for_ids(first_chunk["content_ids"], concept_name_by_id))
```

Load the full `documents` config into a pandas DataFrame locally. Prefer `streaming=True` unless you have ample RAM; the full `chunks` config is heavier.

```python
from datasets import load_dataset

repo_id = "guidelabs/fineweb-atlas"
docs_df = load_dataset(repo_id, "documents", split="train").to_pandas()
```

### Loading the cooccurrence matrix

The cooccurrence matrix is stored as a numpy file and is not loadable via `load_dataset`. Download it directly from the dataset repository:

```python
import numpy as np
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="guidelabs/fineweb-atlas",
    filename="fineweb-concept-cooccurrence-matrix/fineweb-atlas-cooccurrence-upper-uint32.npy",
    repo_type="dataset",
)
packed = np.load(path, mmap_mode="r")

# Look up cooccurrence count for concepts i and j (where i <= j)
n = 16790

def cooc(i, j):
    if i > j:
        i, j = j, i
    return int(packed[i * n - (i * (i - 1)) // 2 + (j - i)])

# Example: how often do concepts 0 and 1 co-occur?
print(cooc(0, 1))
```

## Available configs

| Config | `load_dataset` call | Rows | Description |
|---|---|---|---|
| **`concepts`** (default) | `load_dataset("guidelabs/fineweb-atlas", "concepts")` | 16,790 | Concept metadata table with name, description, type, taxonomy, and prevalence stats. |
| **`documents`** | `load_dataset("guidelabs/fineweb-atlas", "documents")` | 14,868,862 | One row per document with full text, chunk character offsets, and document-level union concept lists by type. |
| **`chunks`** | `load_dataset("guidelabs/fineweb-atlas", "chunks")` | 95,486,049 | One row per chunk with chunk text, token span/count, quality status, and concept ID lists by type. |
| **`field_guide`** | `load_dataset("guidelabs/fineweb-atlas", "field_guide")` | 1,406,432,869 | One row per (concept_id, doc_int_id, chunk_id) assignment for concept-first retrieval. |

The **cooccurrence matrix** is a numpy file and must be loaded separately (see [Loading the cooccurrence matrix](#loading-the-cooccurrence-matrix) above).
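The packed index used for the cooccurrence matrix can be sanity-checked offline before downloading the real file. The following is a minimal sketch, not part of the release: it builds a small symmetric matrix from made-up counts, packs its upper triangle in row-major order (the layout the index formula implies), and confirms that the formula recovers every entry. The helper name `packed_index` and the toy size `n = 5` are illustrative assumptions.

```python
import numpy as np


def packed_index(i: int, j: int, n: int) -> int:
    """Offset of C[i, j] in a row-major packed upper triangle (diagonal included)."""
    if i > j:
        i, j = j, i  # the matrix is symmetric, so order the pair
    return i * n - (i * (i - 1)) // 2 + (j - i)


# Toy symmetric count matrix with made-up values (not real atlas data).
n = 5
rng = np.random.default_rng(0)
upper = np.triu(rng.integers(0, 100, size=(n, n), dtype=np.uint32))
dense = upper + np.triu(upper, k=1).T

# np.triu_indices walks the upper triangle row by row, matching the formula.
packed = dense[np.triu_indices(n)]

assert packed.shape[0] == n * (n + 1) // 2
for i in range(n):
    for j in range(n):
        assert packed[packed_index(i, j, n)] == dense[i, j]
```

With `n = 16790`, the last valid offset is `packed_index(16789, 16789, 16790)`, which equals 140,960,444, one less than the documented array length.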
All configs support `streaming=True`, which is recommended for the larger configs (`documents`, `chunks`, `field_guide`).

### Which config should I use?

- Start with `concepts` if you want to browse the concept inventory, descriptions, prevalence, or taxonomy.
- Use `documents` if you need full-document text plus document-level union labels.
- Use `chunks` if you need per-chunk labels, token spans, or chunk-local context.
- Use `field_guide` if you need reverse lookup from a concept to matching chunks.

## Shared IDs and joins

- `concept_id` is shared across all artifacts and is contiguous over `0..16789` (`16790` concepts total).
- `doc_int_id` is the shared document key across `documents`, `chunks`, and `field_guide`.
- `chunk_id` identifies a chunk within a document and joins `chunks` to `field_guide` when paired with `doc_int_id`.
- `document_text` in `documents` is reconstructed by directly concatenating the ordered chunk texts. The chunker preserves inter-chunk whitespace as a prefix on each subsequent chunk, so direct concatenation is lossless.

## Schemas

### `documents`

One row per document.

| Column | Type | Description |
|---|---|---|
| `document_text` | `string` | Full document text (direct concatenation of chunk texts, identical to the original FineWeb source). |
| `doc_int_id` | `int32` | Document key, shared across all configs. |
| `chunk_count` | `int32` | Number of chunks in this document. |
| `document_token_count` | `int32` | Total tokens across all chunks. |
| `chunk_char_starts` | `list<int32>` | Character offset where each chunk begins in `document_text`. Chunk *i* spans `document_text[chunk_char_starts[i]:chunk_char_starts[i+1]]` (last chunk runs to end of string). |
| `chunk_token_starts` | `list<int32>` | Token offset where each chunk begins in the document's token sequence. |
| `chunk_token_counts` | `list<int32>` | Number of tokens in each chunk. |
| `has_long_chunk` | `bool` | Whether any chunk exceeded the 128-token target. |
| `has_segmentation_error` | `bool` | Whether sentence packing failed (entire document is one chunk). |
| `content_ids` | `list<int64>` | Union of content concept IDs across all chunks (sorted, deduplicated). |
| `tone_ids` | `list<int64>` | Union of tone concept IDs across all chunks (sorted, deduplicated). |
| `document_ids` | `list<int64>` | Union of document-type concept IDs across all chunks (sorted, deduplicated). |
| `entity_ids` | `list<int64>` | Union of entity concept IDs across all chunks (sorted, deduplicated). |

### `chunks`

One row per chunk.

| Column | Type | Description |
|---|---|---|
| `doc_int_id` | `int32` | Parent document key. |
| `chunk_id` | `int16` | Chunk index within the document (0-based). |
| `chunk_text` | `string` | Text content of this chunk. |
| `chunk_token_start` | `int32` | Inclusive token offset in the document. |
| `chunk_token_end` | `int32` | Exclusive token offset in the document. |
| `chunk_token_count` | `int32` | Number of tokens (`chunk_token_end - chunk_token_start`). |
| `chunk_status` | `string` | Quality status: `ok`, `long_chunk`, or `segmentation_error`. |
| `tone_ids` | `list<int64>` | Tone concept IDs assigned to this chunk. |
| `entity_ids` | `list<int64>` | Entity concept IDs assigned to this chunk. |
| `content_ids` | `list<int64>` | Content/topic concept IDs assigned to this chunk. |
| `document_ids` | `list<int64>` | Document-type concept IDs assigned to this chunk. |

### `field_guide`

One row per (concept, chunk) assignment. Useful for "which chunks mention concept X?" queries.

| Column | Type | Description |
|---|---|---|
| `concept_id` | `int32` | Concept identifier (0..16789). |
| `doc_int_id` | `int32` | Document key. |
| `chunk_id` | `int16` | Chunk index within the document. |

### `concepts`

One row per concept.

| Column | Type | Description |
|---|---|---|
| `concept_id` | `uint32` | Concept identifier (0..16789). |
| `concept_type` | `string` | One of `entity`, `tone`, `content`, `document`. |
| `name` | `string` | Human-readable concept name. |
| `description` | `string` | Short description of the concept. |
| `taxonomy_lcc_path_primary` | `string` | Top-level Library of Congress Classification class (single letter, e.g. `H`, `T`, `Q`) or `"None"`. |
| `chunk_count` | `int64` | Number of chunks this concept appears in. |
| `chunk_prevalence` | `float64` | Fraction of all chunks containing this concept. |

### Cooccurrence matrix

Packed upper-triangular cooccurrence counts (with diagonal) over `concept_id`.

| File | Description |
|---|---|
| `fineweb-atlas-cooccurrence-upper-uint32.npy` | Numpy array, dtype `uint32`, length 140,960,445 (= `n*(n+1)/2`, n=16790). |
| `fineweb-atlas-cooccurrence-upper-uint32.json` | Metadata (dtype, dimensions, max value). |

Index mapping for `0 <= i <= j < n`:

```python
idx = i * n - (i * (i - 1)) // 2 + (j - i)
value = packed[idx]  # equals C[i, j]
```

The diagonal entry `C[i, i]` is the number of chunks containing concept `i`, which is also available directly as `chunk_count` in the `concepts` config.

## Dataset statistics

| Metric | Value |
|---|---|
| Documents | 14,868,862 |
| Chunks | 95,486,049 |
| Total tokens | 10,183,028,973 |
| Concepts | 16,790 |
| Reverse-index rows | 1,406,432,869 |
| Avg. labels per chunk | 14.73 |
| Avg. chunks per document | 6.42 |
| Avg. tokens per document | 684.86 |
| Median chunks per document | 4 |

**Labels per chunk by type:**

| Type | Avg. per chunk | Share of all labels | Concept count |
|---|---|---|---|
| Tone | 6.90 | 46.8% | 587 |
| Content | 5.68 | 38.6% | 12,786 |
| Document | 1.52 | 10.3% | 31 |
| Entity | 0.63 | 4.3% | 3,386 |

**Chunk quality:** 97.5% `ok`, 2.5% `long_chunk`, 8 `segmentation_error`.

- **`ok`**: chunk is within the 128-token target.
- **`long_chunk`**: the sentence packer could not split the chunk further without breaking a sentence. These chunks exceed 128 tokens (median 171, max 54,408). Common causes are long unbroken paragraphs or code blocks.
- **`segmentation_error`**: sentence packing failed entirely and the whole document was placed into a single unsplit chunk. Only 8 documents (out of 14.9M) are affected, mostly very large spam pages or documents with unusual Unicode that confused the sentence splitter. These documents have concept annotations, but for the larger ones the labels may only reflect the beginning of the text due to annotation model context limits.

## How it was built

FineWeb Atlas was produced using the [ATLAS pipeline](https://www.guidelabs.ai/post/the-fineweb-concept-atlas/):

1. **Chunking.** FineWeb documents were split into chunks of ~128 tokens.
2. **LLM annotation.** Each chunk was annotated with structured concept tags by a language model, producing raw labels across four types: content topics, entities, tones, and document types.
3. **Concept consolidation.** Raw tags were embedded, clustered, and deduplicated into a canonical concept library of 16,790 concepts. Each concept received a human-readable name, description, and Library of Congress Classification path.
4. **Scalable prediction.** A trained set-prediction model applied the consolidated concept library to all 95.5M chunks.
5. **Quality filtering.** An LLM judge scored concept assignments, and a filtering policy determined final keep/drop decisions per concept-chunk pair.

For full details, see the [blog post](https://www.guidelabs.ai/post/the-fineweb-concept-atlas/).

## Common concepts: consider filtering

This release intentionally keeps very common labels, especially among `tone` and `document` concepts. They are useful for completeness, retrieval, and descriptive analysis, but they can dominate cooccurrence counts and downstream features if you are trying to surface more specific concepts.

In this release:

- `21` tone labels appear in at least `7%` of chunks.
- `8` document labels appear in at least `7%` of chunks.
- The most common overall labels are:
  - `matter-of-fact` (83.6%)
  - `Informational` (79.8%)
  - `Brief and clear` (69.0%)
  - `factual` (51.5%)
  - `neutral` (34.5%)
- The most common document label is `Short announcement/bulletin` (29.7%).

If you are training a downstream model, building sparse features, or trying to highlight more discriminative concepts, a reasonable first pass is to drop concepts above a prevalence threshold:

```python
from datasets import load_dataset

repo_id = "guidelabs/fineweb-atlas"
concepts = load_dataset(repo_id, "concepts", split="train").to_pandas()
common_ids = set(concepts.loc[concepts["chunk_prevalence"] >= 0.07, "concept_id"])

def drop_common(ids):
    return [cid for cid in ids if cid not in common_ids]
```

The right threshold depends on the use case; 7% is a simple starting point.

## Taxonomy distribution

Concepts are classified using the [Library of Congress Classification](https://www.loc.gov/catdir/cpso/lcco/) system. The top-level class distribution:

| Value | Domain | Concepts | Share |
|---|---|---|---|
| `H` | Social sciences | 3,895 | 23.2% |
| `T` | Technology | 1,959 | 11.7% |
| `G` | Geography, anthropology, recreation | 1,792 | 10.7% |
| `B` | Philosophy, psychology, religion | 1,332 | 7.9% |
| `R` | Medicine | 1,076 | 6.4% |
| `N` | Fine arts | 1,051 | 6.3% |
| `Q` | Science | 856 | 5.1% |
| `None` | No primary class assigned | 813 | 4.8% |
| `J` | Political science | 682 | 4.1% |
| `K` | Law | 539 | 3.2% |
| `D` | World history | 496 | 3.0% |
| `L` | Education | 491 | 2.9% |
| `P` | Language and literature | 476 | 2.8% |
| `E` | History of the USA | 314 | 1.9% |
| `S` | Agriculture | 247 | 1.5% |
| `C` | Auxiliary sciences of history | 195 | 1.2% |
| `A` | General works, reference | 190 | 1.1% |
| `Z` | Bibliography, library science | 144 | 0.9% |
| `U` | Military science | 119 | 0.7% |
| `F` | History of the Americas | 108 | 0.6% |
| `V` | Naval science | 15 | 0.1% |

This is the complete set of values for `taxonomy_lcc_path_primary` in the `concepts` config.

## Relationship to FineWeb

This dataset annotates the official 10B-token subsample of [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) (`sample/10BT`), covering all 14.9M documents in that subsample.

`doc_int_id` is a 1-based integer assigned sequentially during chunking and is stable within this release. It is not a FineWeb row index, but `document_text` in the `documents` config is identical to the original FineWeb source text and can be used for matching.

## Notes and caveats

- **Machine-generated labels.** Concept assignments include noise, especially for rarer concepts.
- **Source text quality.** Some chunk text is noisy or garbled from the original web scrape; concept quality depends on chunk quality.
- **No PII filtering.** This dataset inherits the personal information characteristics of FineWeb. Web-scraped text may contain names, emails, or other personal information.
- **English only.** Annotations were produced for English text. Non-English passages that appear in FineWeb may have unreliable labels.
- **`documents` is derived.** The `documents` config is built by concatenating chunks. The text is identical to the FineWeb source, but for per-chunk detail (individual chunk statuses, per-chunk concept lists) use the `chunks` config.

## Versioning

This is `v0.1`, the initial research release. The concept inventory, annotation model, and filtering policy may be revised in future versions. Breaking changes (e.g., concept ID renumbering) will increment the version.

## Suggested citation

If you use this dataset, please cite:

```bibtex
@misc{monson_fineweb_concept_atlas_2026,
  author       = {Nathaniel Monson},
  title        = {The FineWeb Concept Atlas},
  year         = {2026},
  howpublished = {\url{https://www.guidelabs.ai/post/the-fineweb-concept-atlas/}},
  note         = {Guide Labs}
}
```
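## Appendix: matching atlas documents to FineWeb rows (sketch)

The "Relationship to FineWeb" section notes that `document_text` is identical to the FineWeb source text and can be used for matching. One possible way to do that, sketched here under stated assumptions, is to fingerprint each text with SHA-256 and look atlas documents up in a fingerprint-to-id table built from FineWeb. The helper names are hypothetical, the rows below are toy values rather than real data, and the FineWeb field names `text` and `id` are assumed rather than taken from this dataset card. Documents with exactly duplicated text would collide under this scheme, so treat it as a starting point only.

```python
import hashlib
from typing import Dict, Iterable, Mapping, Optional


def text_key(text: str) -> str:
    """Stable fingerprint of a document's exact text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def build_fineweb_index(fineweb_rows: Iterable[Mapping[str, str]]) -> Dict[str, str]:
    """Map text fingerprint -> FineWeb id (assumed field names: `text`, `id`)."""
    return {text_key(row["text"]): row["id"] for row in fineweb_rows}


def match_atlas_row(atlas_row: Mapping[str, object], index: Dict[str, str]) -> Optional[str]:
    """Return the FineWeb id whose text equals this atlas document, or None."""
    return index.get(text_key(str(atlas_row["document_text"])))


# Toy rows with made-up values, standing in for streamed dataset rows.
fineweb_rows = [
    {"id": "fw-a", "text": "First page text."},
    {"id": "fw-b", "text": "Second page text."},
]
atlas_row = {"doc_int_id": 1, "document_text": "Second page text."}

index = build_fineweb_index(fineweb_rows)
print(match_atlas_row(atlas_row, index))  # prints fw-b
```

For the full 14.9M-document subsample, the index would hold one 64-character digest per document, so storing raw digest bytes or a truncated prefix may be preferable to hex strings.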
Provider:
guidelabs