mjbommar/oglm-curriculum-pretrain
Source: Hugging Face. Updated 2025-12-08; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/mjbommar/oglm-curriculum-pretrain
---
license: cc-by-4.0
task_categories:
- text-generation
- fill-mask
language:
- en
tags:
- education
- curriculum
- pretraining
- lexicon
- dictionary
- opengloss
- synthetic
size_categories:
- 100K<n<1M
---
# OGLM Curriculum Pretraining Dataset
High-quality educational text data for language model pretraining, derived from
the [OpenGloss](https://huggingface.co/datasets/mjbommar/opengloss-dictionary)
synthetic encyclopedic dictionary and related curriculum materials.
## Background
This dataset is derived from **OpenGloss**, a synthetic encyclopedic dictionary
and semantic knowledge graph for English that integrates lexicographic definitions,
encyclopedic context, etymological histories, and semantic relationships in a
unified resource. OpenGloss contains 537K senses across 150K lexemes, with 9.1M
semantic edges, 1M usage examples, 3M collocations, and 60M words of encyclopedic
content.
For more details on the source data, see the paper:
[OpenGloss: A Synthetic Encyclopedic Dictionary and Semantic Knowledge Graph](https://arxiv.org/abs/2511.18622)
## Dataset Description
This dataset contains **453,409 records** with **483,664,021 total words**
(~483.7M words), averaging **1067 words per record**.
### Content Types
| Formatter | Records | Percentage |
|-----------|---------|------------|
| lexicon_rich | 290,275 | 64.0% |
| wikidata_encyclopedia | 86,920 | 19.2% |
| question_rich | 34,597 | 7.6% |
| reasoning_rich | 19,333 | 4.3% |
| artifact_rich | 13,353 | 2.9% |
| wikidata_sample | 3,384 | 0.7% |
| relationship_rich | 2,109 | 0.5% |
| strategy_rich | 1,417 | 0.3% |
| artifact_text | 743 | 0.2% |
| draft_rewrite_rich | 409 | 0.1% |
| chapter_list | 203 | <0.1% |
| course_rich | 167 | <0.1% |
| curriculum_document | 102 | <0.1% |
| concepts_objectives | 100 | <0.1% |
| chapter_resources | 64 | <0.1% |
| chapter_differentiation | 41 | <0.1% |
| chapter_activity | 41 | <0.1% |
| chapter_figure | 39 | <0.1% |
| chapter_assessment | 32 | <0.1% |
| chapter_full | 30 | <0.1% |
| draft_instruction_pair | 22 | <0.1% |
| lesson | 21 | <0.1% |
| chapter_generic | 7 | <0.1% |
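The percentages in the table follow directly from the record counts. As a minimal sketch, the snippet below recomputes a few of them from the total of 453,409 records; the helper name `share_percent` is hypothetical and only three rows are reproduced.

```python
TOTAL_RECORDS = 453_409

# A few rows copied from the table above.
COUNTS = {
    "lexicon_rich": 290_275,
    "wikidata_encyclopedia": 86_920,
    "chapter_generic": 7,
}


def share_percent(count: int, total: int = TOTAL_RECORDS) -> float:
    """Return the share of `count` in `total`, as a percentage with one decimal."""
    return round(100 * count / total, 1)


for name, count in COUNTS.items():
    print(f"{name}: {share_percent(count)}%")
```

Formatters with only a few hundred records or fewer contribute well under 0.1% of the dataset, which matters when sampling by formatter.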
### Data Fields
- `text` (string): The formatted educational content
- `source` (string): Source file path for provenance
- `formatter` (string): Which formatter produced this record
- `word_count` (int): Number of words in the text
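As a sketch of how these fields might be consumed, the snippet below models one record as a `TypedDict` and recomputes a word count by splitting on whitespace. The splitting rule is an assumption: the card does not say how `word_count` was computed, so a recomputed value may differ slightly from the stored one. The names `OglmRecord` and `approx_word_count` are hypothetical, and the example record is made up.

```python
from typing import TypedDict


class OglmRecord(TypedDict):
    """Shape of one record, following the field list above."""

    text: str
    source: str
    formatter: str
    word_count: int


def approx_word_count(text: str) -> int:
    """Count whitespace-separated tokens (an assumed definition of 'word')."""
    return len(text.split())


# Made-up record for illustration; not taken from the dataset.
example: OglmRecord = {
    "text": "A lexeme is a unit of lexical meaning.",
    "source": "example/path.json",
    "formatter": "lexicon_rich",
    "word_count": 8,
}

print(approx_word_count(example["text"]) == example["word_count"])
```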
### Splits
- **train**: 448,874 records (~99%)
- **validation**: 4,534 records (~1%)
## Content Overview
### Lexicon Entries (lexicon_rich)
Dictionary-style entries with:
- Multiple parts of speech (noun, verb, adjective, etc.)
- Detailed definitions with examples
- Synonyms, antonyms, related terms
- Etymology and usage notes
- Semantic relationships (broader/narrower terms)
### Educational Articles (artifact_rich)
Wikidata-grounded educational content including:
- Biographies of notable figures
- Historical analyses
- Scientific explanations
- Geographic and cultural information
### Course Materials (course_rich)
Full curriculum plans with:
- Learning objectives
- Unit breakdowns
- Essential questions
- Performance tasks and assessments
### Instructional Content (draft_instruction_pair)
Structured educational content with:
- Clear instructions
- Step-by-step explanations
- Practice examples
## Usage
```python
from datasets import load_dataset

# Load the dataset
ds = load_dataset("mjbommar/oglm-curriculum-pretrain")

# Access splits
train_data = ds["train"]
val_data = ds["validation"]

# Example: iterate over records
for record in train_data:
    text = record["text"]
    source = record["source"]
    formatter = record["formatter"]
    word_count = record["word_count"]
```
### Filtering by Formatter
```python
# Get only lexicon entries
lexicon_data = ds["train"].filter(lambda x: x["formatter"] == "lexicon_rich")
# Get only long-form content (>500 words)
long_content = ds["train"].filter(lambda x: x["word_count"] > 500)
```
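Before filtering, it can help to see how many records each formatter contributes. The sketch below tallies formatter names and selects records by formatter and minimum length. It works on any iterable of record dictionaries, so it can be tried on a few made-up records without downloading anything; `formatter_counts` and `select_records` are hypothetical helper names, not part of the `datasets` library.

```python
from collections import Counter
from typing import Collection, Iterable, Iterator, Mapping

Record = Mapping[str, object]


def formatter_counts(records: Iterable[Record]) -> Counter:
    """Tally how many records each formatter produced."""
    return Counter(str(r["formatter"]) for r in records)


def select_records(
    records: Iterable[Record],
    formatters: Collection[str],
    min_words: int = 0,
) -> Iterator[Record]:
    """Yield records from the given formatters with at least `min_words` words."""
    for r in records:
        if r["formatter"] in formatters and int(r["word_count"]) >= min_words:
            yield r


# Made-up records for illustration only.
sample = [
    {"text": "...", "source": "a", "formatter": "lexicon_rich", "word_count": 900},
    {"text": "...", "source": "b", "formatter": "lexicon_rich", "word_count": 300},
    {"text": "...", "source": "c", "formatter": "course_rich", "word_count": 1500},
]

print(formatter_counts(sample))
print(len(list(select_records(sample, {"lexicon_rich"}, min_words=500))))
```

With the real dataset, the same helpers can be called on `ds["train"]`, since iterating a split yields one dictionary per record.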
### Streaming Large Datasets
```python
from datasets import load_dataset

# Stream without downloading the entire dataset
ds = load_dataset("mjbommar/oglm-curriculum-pretrain", streaming=True)

for record in ds["train"]:
    # Process record
    pass
```
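When streaming, it is often enough to look at the first few records that match a condition and then stop. The sketch below does this with `itertools.islice` over any iterable of records, so the stream is consumed only as far as needed; `take_matching` is a hypothetical helper, and the sample records are made up.

```python
from itertools import islice
from typing import Callable, Iterable, List, Mapping

Record = Mapping[str, object]


def take_matching(
    records: Iterable[Record],
    predicate: Callable[[Record], bool],
    n: int,
) -> List[Record]:
    """Return the first `n` records satisfying `predicate`, reading lazily."""
    return list(islice((r for r in records if predicate(r)), n))


# Made-up stream for illustration only.
fake_stream = (
    {"text": f"record {i}", "formatter": "question_rich" if i % 2 else "lesson",
     "source": "demo", "word_count": 200 + i}
    for i in range(100)
)

first_questions = take_matching(
    fake_stream, lambda r: r["formatter"] == "question_rich", 3
)
print([r["text"] for r in first_questions])
```

With the streamed dataset, the first argument would be the `ds["train"]` split loaded with `streaming=True`.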
## Data Quality
- **Minimum word count**: ~186 words (all records are substantial)
- **Maximum word count**: ~3,300 words (complex educational concepts)
- **Median word count**: ~1,050 words
- **No empty or near-empty content**
- **Consistent formatting structure**
- **Rich semantic information**
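The word-count figures above can be rechecked from the `word_count` field. A minimal sketch, assuming the counts are available as an iterable of integers; `summarize_word_counts` is a hypothetical helper and the sample values are made up, chosen only to resemble the ranges quoted above.

```python
from statistics import median
from typing import Iterable, NamedTuple


class WordCountSummary(NamedTuple):
    minimum: int
    maximum: int
    median: float


def summarize_word_counts(counts: Iterable[int]) -> WordCountSummary:
    """Compute the minimum, maximum, and median of a set of word counts."""
    values = list(counts)
    if not values:
        raise ValueError("no word counts supplied")
    return WordCountSummary(min(values), max(values), median(values))


# Made-up counts for illustration only.
print(summarize_word_counts([186, 950, 1050, 1200, 3300]))
```

On the loaded dataset, the input could be the `word_count` column of a split, for example `ds["train"]["word_count"]`.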
## Processing Pipeline
This dataset was created using a custom formatting pipeline that:
1. Reads structured JSON curriculum data
2. Auto-detects schema type based on field presence
3. Applies appropriate formatter for each schema
4. Streams output to JSONL with periodic flushing
5. Tracks statistics and word counts
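The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the pipeline used to build the dataset: the schema rules, the input field names (`word`, `senses`, `course_title`, `units`), the two toy formatters, and the whitespace word count are all hypothetical. Only the output field names and the formatter labels come from this card.

```python
import io
import json
from collections import Counter
from typing import Callable, Dict, Iterable, Optional, TextIO

# Hypothetical schema rules: (required input fields, formatter name). First match wins.
SCHEMA_RULES = [
    ({"word", "senses"}, "lexicon_rich"),
    ({"course_title", "units"}, "course_rich"),
]


def detect_formatter(item: dict) -> Optional[str]:
    """Step 2: pick a formatter based on which fields are present."""
    for required, name in SCHEMA_RULES:
        if required.issubset(item.keys()):
            return name
    return None


def format_lexicon(item: dict) -> str:
    return "\n".join([f"# {item['word']}"] + [f"- {s}" for s in item["senses"]])


def format_course(item: dict) -> str:
    return "\n".join([f"# {item['course_title']}"] + [f"## {u}" for u in item["units"]])


FORMATTERS: Dict[str, Callable[[dict], str]] = {
    "lexicon_rich": format_lexicon,
    "course_rich": format_course,
}


def run_pipeline(items: Iterable[dict], out: TextIO, flush_every: int = 1000) -> Counter:
    """Steps 3-5: format each item, write JSONL, flush periodically, track stats."""
    stats: Counter = Counter()
    for i, item in enumerate(items, start=1):
        name = detect_formatter(item)
        if name is None:
            stats["skipped"] += 1
            continue
        text = FORMATTERS[name](item)
        record = {
            "text": text,
            "source": item.get("source", ""),
            "formatter": name,
            "word_count": len(text.split()),  # assumed definition of a word
        }
        out.write(json.dumps(record, ensure_ascii=False) + "\n")
        stats[name] += 1
        stats["words"] += record["word_count"]
        if i % flush_every == 0:
            out.flush()
    out.flush()
    return stats


# Made-up input items standing in for step 1 (reading structured JSON).
items = [
    {"word": "gloss", "senses": ["a brief note"], "source": "a.json"},
    {"course_title": "Biology", "units": ["Cells"]},
    {"unknown": 1},
]
buffer = io.StringIO()
print(run_pipeline(items, buffer, flush_every=2))
```

Dispatching through a name-to-function table keeps detection and formatting separate, so a new schema needs only one rule and one formatter.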
## License
This dataset is released under [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/).
## Citation
If you use this dataset, please cite the OpenGloss paper:
```bibtex
@misc{bommarito2025opengloss,
title={OpenGloss: A Synthetic Encyclopedic Dictionary and Semantic Knowledge Graph},
author={Michael J. Bommarito II},
year={2025},
eprint={2511.18622},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2511.18622},
}
```
## Related Resources
- **OpenGloss Dictionary**: [mjbommar/opengloss-dictionary](https://huggingface.co/datasets/mjbommar/opengloss-dictionary)
- **Paper**: [arXiv:2511.18622](https://arxiv.org/abs/2511.18622)
## Contact
For questions or issues, please open a discussion on the
[dataset page](https://huggingface.co/datasets/mjbommar/oglm-curriculum-pretrain/discussions).
Provider:
mjbommar



