HCAI-Lab/dolma3_olmo3_corpus_manifest
Hugging Face · updated 2026-03-19 · indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/HCAI-Lab/dolma3_olmo3_corpus_manifest
Resource description:
---
dataset_info:
  features:
  - name: doc_id
    dtype: string
  - name: shard_path
    dtype: string
  - name: source_family
    dtype: string
  - name: weborganizer_topic
    dtype: string
  - name: weborganizer_format
    dtype: string
  - name: bin_id
    dtype: int32
  - name: estimated_token_count
    dtype: int64
  - name: token_count
    dtype: int64
  - name: document_length_chars
    dtype: int64
  - name: original_word_count
    dtype: int64
  - name: language
    dtype: string
  - name: year
    dtype: int32
  - name: url_domain
    dtype: string
  - name: topic_url_label_id
    dtype: uint8
  - name: topic_url
    dtype: string
  - name: topic_url_confidence
    dtype: float32
  - name: topic_url_probs
    sequence: float32
  - name: topic_nourl_label_id
    dtype: uint8
  - name: topic_nourl
    dtype: string
  - name: topic_nourl_confidence
    dtype: float32
  - name: topic_nourl_probs
    sequence: float32
  - name: format_url_label_id
    dtype: uint8
  - name: format_url
    dtype: string
  - name: format_url_confidence
    dtype: float32
  - name: format_url_probs
    sequence: float32
  - name: format_nourl_label_id
    dtype: uint8
  - name: format_nourl
    dtype: string
  - name: format_nourl_confidence
    dtype: float32
  - name: format_nourl_probs
    sequence: float32
  - name: quality_label_id
    dtype: uint8
  - name: quality_score
    dtype: float32
  - name: quality_high_prob
    dtype: float32
  - name: quality_low_prob
    dtype: float32
  - name: quality_confidence
    dtype: float32
  - name: sidecar_schema_version
    dtype: string
configs:
- config_name: default
  data_files: "data/*.parquet"
size_categories:
- 1B<n<10B
tags:
- olmo
- dolma
- weborganizer
- corpus-manifest
- data-attribution
license: odc-by
---
# OLMo3 Corpus Manifest (SOC-95)
Metadata-only manifest for the OLMo3 training corpus. One row per document, no text payload.
## Corpus statistics
| Property | Value |
|----------|-------|
| Total documents | 1,098,646,162 |
| Total tokens | 2,116,727,590,753 |
| Source families | common_crawl (95.8%), olmocr_science_pdfs (4.2%) |
| Mean quality score | 0.096 |
| Median tokens/doc | 768 |
| Mean tokens/doc | 1,927 |
| Quality coverage | 99.92% |
| WebOrganizer topics | 24 categories |
| WebOrganizer formats | 24 categories |
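
As a quick consistency check on the table above, the mean tokens per document can be recomputed from the two totals. The sketch below uses only the figures listed in the table; the variable names are illustrative and not part of the dataset.

```python
# Figures copied from the corpus statistics table.
TOTAL_DOCUMENTS = 1_098_646_162
TOTAL_TOKENS = 2_116_727_590_753
MEDIAN_TOKENS_PER_DOC = 768

# Mean tokens per document is total tokens divided by total documents.
mean_tokens_per_doc = TOTAL_TOKENS / TOTAL_DOCUMENTS
print(f"mean tokens/doc: {mean_tokens_per_doc:.1f}")

# Ratio of mean to median: values well above 1 indicate a right-skewed
# length distribution (a minority of very long documents).
mean_to_median = mean_tokens_per_doc / MEDIAN_TOKENS_PER_DOC
print(f"mean/median ratio: {mean_to_median:.2f}")
```

The recomputed mean rounds to the 1,927 reported in the table. The mean being roughly 2.5 times the median suggests that a minority of long documents contributes a large share of the tokens.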
## Schema
Each row represents one document in the OLMo3 training pool with:
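- **Identity:** `doc_id`, `shard_path`, `source_family`
- **Content stats:** `estimated_token_count`, `token_count`, `document_length_chars`, `original_word_count`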
- **Temporal:** `language`, `year`, `url_domain`
- **WebOrganizer labels:** topic and format classifications from both URL and no-URL classifiers, including top-1 labels, confidences, and full 24-class probability vectors
- **Quality:** binary quality label, quality score, high/low probabilities, confidence
- **Sampling:** `bin_id` (topic x format bin identifier for stratified sampling)
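
The `bin_id` and `quality_score` columns can be combined to draw a stratified sample without loading the full manifest into memory. The following is a minimal sketch, not the pipeline used to build the corpus: `stratified_sample` is a hypothetical helper, rows are assumed to be plain dicts keyed by the column names in the schema, and a missing quality score is assumed to appear as `None`. The toy rows stand in for real manifest rows.

```python
import random
from collections import defaultdict
from typing import Any, Dict, Iterable, List

Row = Dict[str, Any]


def stratified_sample(
    rows: Iterable[Row],
    per_bin: int,
    min_quality: float = 0.0,
    seed: int = 0,
) -> Dict[int, List[Row]]:
    """Keep up to `per_bin` rows per `bin_id` using one-pass reservoir sampling.

    Rows whose `quality_score` is missing (assumed to be None) or below
    `min_quality` are skipped before sampling.
    """
    rng = random.Random(seed)
    reservoirs: Dict[int, List[Row]] = defaultdict(list)
    seen: Dict[int, int] = defaultdict(int)  # eligible rows seen so far, per bin

    for row in rows:
        score = row.get("quality_score")
        if score is None or score < min_quality:
            continue
        bin_id = row["bin_id"]
        seen[bin_id] += 1
        bucket = reservoirs[bin_id]
        if len(bucket) < per_bin:
            bucket.append(row)
        else:
            # Replace an existing entry with probability per_bin / seen[bin_id].
            j = rng.randrange(seen[bin_id])
            if j < per_bin:
                bucket[j] = row
    return dict(reservoirs)


# Toy rows for illustration only; real rows carry the full schema.
toy_rows = [
    {"doc_id": f"doc-{i}", "bin_id": i % 3, "quality_score": (i % 10) / 10}
    for i in range(30)
]
sample = stratified_sample(toy_rows, per_bin=2, min_quality=0.2)
for bin_id, picked in sorted(sample.items()):
    print(bin_id, [r["doc_id"] for r in picked])
```

Because the function only needs an iterable, the same code should work on a streamed view of the manifest, for example rows yielded by `datasets.load_dataset(..., streaming=True)`, provided each row exposes `bin_id` and `quality_score` as dict keys.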
## Source
Built from the deduplicated Dolma3 training pool (~6T tokens, 58,360 source shards) with WebOrganizer sidecar labels produced by the SOC-91 classification pipeline.
## Interactive explorer
Browse the corpus analysis interactively: [SOC-95 Corpus Explorer](https://huggingface.co/spaces/HCAI-Lab/soc95-corpus-explorer)
Provider:
HCAI-Lab



