
HCAI-Lab/dolma3_olmo3_corpus_manifest

Source: Hugging Face. Updated 2026-03-19. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/HCAI-Lab/dolma3_olmo3_corpus_manifest
Resource description:
---
dataset_info:
  features:
  - name: doc_id
    dtype: string
  - name: shard_path
    dtype: string
  - name: source_family
    dtype: string
  - name: weborganizer_topic
    dtype: string
  - name: weborganizer_format
    dtype: string
  - name: bin_id
    dtype: int32
  - name: estimated_token_count
    dtype: int64
  - name: token_count
    dtype: int64
  - name: document_length_chars
    dtype: int64
  - name: original_word_count
    dtype: int64
  - name: language
    dtype: string
  - name: year
    dtype: int32
  - name: url_domain
    dtype: string
  - name: topic_url_label_id
    dtype: uint8
  - name: topic_url
    dtype: string
  - name: topic_url_confidence
    dtype: float32
  - name: topic_url_probs
    sequence: float32
  - name: topic_nourl_label_id
    dtype: uint8
  - name: topic_nourl
    dtype: string
  - name: topic_nourl_confidence
    dtype: float32
  - name: topic_nourl_probs
    sequence: float32
  - name: format_url_label_id
    dtype: uint8
  - name: format_url
    dtype: string
  - name: format_url_confidence
    dtype: float32
  - name: format_url_probs
    sequence: float32
  - name: format_nourl_label_id
    dtype: uint8
  - name: format_nourl
    dtype: string
  - name: format_nourl_confidence
    dtype: float32
  - name: format_nourl_probs
    sequence: float32
  - name: quality_label_id
    dtype: uint8
  - name: quality_score
    dtype: float32
  - name: quality_high_prob
    dtype: float32
  - name: quality_low_prob
    dtype: float32
  - name: quality_confidence
    dtype: float32
  - name: sidecar_schema_version
    dtype: string
configs:
- config_name: default
  data_files: "data/*.parquet"
size_categories:
- 1B<n<10B
tags:
- olmo
- dolma
- weborganizer
- corpus-manifest
- data-attribution
license: odc-by
---

# OLMo3 Corpus Manifest (SOC-95)

Metadata-only manifest for the OLMo3 training corpus. One row per document, no text payload.

## Corpus statistics

| Property | Value |
|----------|-------|
| Total documents | 1,098,646,162 |
| Total tokens | 2,116,727,590,753 |
| Source families | common_crawl (95.8%), olmocr_science_pdfs (4.2%) |
| Mean quality score | 0.096 |
| Median tokens/doc | 768 |
| Mean tokens/doc | 1,927 |
| Quality coverage | 99.92% |
| WebOrganizer topics | 24 categories |
| WebOrganizer formats | 24 categories |

## Schema

Each row represents one document in the OLMo3 training pool with:

- **Identity:** `doc_id`, `shard_path`, `source_family`
- **Content stats:** `estimated_token_count`, `token_count`, `document_length_chars`, `original_word_count`
- **Temporal:** , ,
- **WebOrganizer labels:** topic and format classifications from both URL and no-URL classifiers, including top-1 labels, confidences, and full 24-class probability vectors
- **Quality:** binary quality label, quality score, high/low probabilities, confidence
- **Sampling:** `bin_id` (topic x format bin identifier for stratified sampling)

## Source

Built from the deduplicated Dolma3 training pool (~6T tokens, 58,360 source shards) with WebOrganizer sidecar labels produced by the SOC-91 classification pipeline.

## Interactive explorer

Browse the corpus analysis interactively: [SOC-95 Corpus Explorer](https://huggingface.co/spaces/HCAI-Lab/soc95-corpus-explorer)

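The schema describes `bin_id` as a topic x format bin identifier intended for stratified sampling. The card does not say how sampling was actually performed, so the following is only a minimal sketch of one way to use the column: draw up to a fixed number of rows from each bin. The function name `stratified_sample`, the `per_bin` and `seed` parameters, and the toy rows are all hypothetical; only the column names `doc_id`, `bin_id`, and `token_count` come from the manifest schema.

```python
import random
from collections import defaultdict


def stratified_sample(rows, per_bin, seed=0):
    """Draw up to `per_bin` rows from each `bin_id` stratum (hypothetical helper)."""
    rng = random.Random(seed)  # seeded so the draw is reproducible

    # Group rows by their stratum key.
    by_bin = defaultdict(list)
    for row in rows:
        by_bin[row["bin_id"]].append(row)

    # Sample without replacement inside each bin; small bins are taken whole.
    sample = []
    for bin_id in sorted(by_bin):
        group = by_bin[bin_id]
        sample.extend(rng.sample(group, min(per_bin, len(group))))
    return sample


# Toy stand-in for manifest rows (made-up values, not real corpus data).
rows = [
    {"doc_id": f"doc-{i}", "bin_id": i % 3, "token_count": 100 * (i + 1)}
    for i in range(10)
]

picked = stratified_sample(rows, per_bin=2)
for row in picked:
    print(row["bin_id"], row["doc_id"], row["token_count"])
```

With the real manifest the rows would come from the `data/*.parquet` files rather than an in-memory list, and at roughly 1.1 billion rows the grouping step would need to be done out of core or per shard. The per-bin cap is one design choice among several; sampling proportionally to bin size or to `token_count` would be equally reasonable depending on the goal.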
Provider:
HCAI-Lab