mjbommar/curriculum-001-pretrain
Hugging Face | Updated 2025-11-24 | Indexed 2025-12-20
Download link:
https://hf-mirror.com/datasets/mjbommar/curriculum-001-pretrain
Description:
---
task_categories:
- text-generation
- fill-mask
language:
- en
size_categories:
- 100K<n<1M
license: cc-by-4.0
---
# Curriculum Training Data - PRETRAIN
This dataset contains 364,848 records for pretraining.
## Dataset Statistics
- **Total Records**: 364,848
- **Train**: 346,605 records
- **Validation**: 18,243 records
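As a quick consistency check, the two split sizes listed above should add up to the reported total. The short sketch below redoes that arithmetic and derives the validation share; the variable names are illustrative and not part of the dataset.

```python
# Split sizes as reported in the statistics above.
train_records = 346_605
validation_records = 18_243

# The two splits together should equal the reported total of 364,848.
total_records = train_records + validation_records

# Fraction of all records held out for validation.
validation_share = validation_records / total_records

print(f"total: {total_records:,}")
print(f"validation share: {validation_share:.2%}")
```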
## Schema
```json
{
  "text": "string",
  "source": "string",
  "char_count": "int64",
  "metadata": "string (JSON - source-specific fields)"
}
```
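The schema can be turned into a lightweight record check. The following is a minimal sketch, not part of the dataset tooling: `EXPECTED_FIELDS`, `validate_record`, and the sample record are hypothetical names and data introduced for illustration, and `int64` is assumed to arrive as a Python `int` once a record is loaded.

```python
import json

# Expected top-level fields and their Python types, taken from the schema above.
# Assumption: "int64" values surface as plain Python ints after loading.
EXPECTED_FIELDS = {
    "text": str,
    "source": str,
    "char_count": int,
    "metadata": str,
}


def validate_record(record):
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            problems.append(f"{name}: expected {expected_type.__name__}")
    # The metadata field is a string that should itself decode as JSON.
    if isinstance(record.get("metadata"), str):
        try:
            json.loads(record["metadata"])
        except json.JSONDecodeError:
            problems.append("metadata: not valid JSON")
    return problems


# Hypothetical record, made up for illustration.
sample = {"text": "example", "source": "lexicon", "char_count": 7, "metadata": "{}"}
print(validate_record(sample))
```

Returning a list of problems rather than raising on the first one makes it easier to report everything wrong with a record in a single pass.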
## Example Record
```json
{
"text": "# Côa (Q14653)\n\nCôa (Q14653) is a river in northern Portugal that ultimately feeds the Douro, a major river system in the region. It runs about 140 kilometers in length and drains a watershed of 2,521 square kilometers within the Douro drainage basin. The river’s course extends roughly between 41.0809°N, -7.1047°W and 40.2748°N, -6.9245°W, a path that traces a southward arc across parts of the Portuguese landscape and connects diverse ecosystems along its banks. In this sense, it is a significant contributor to the hydrology of the northern Iberian peninsula, playing a part in the broader network of rivers that shape the region’s geography.\n\nTwo tributaries feed the Côa, designated by the Wikidata identifiers Q10362237 and Q10362318, which collect rainfall and runoff from the surrounding lands. These streams combine with the main river’s flow as it continues toward the Douro, and in due course the waters enter the Douro proper. Through this connection, the Côa helps su...
```
## Data Sources
- `lexicon`: ~4,098 records (sampled)
- `encyclopedias`: ~2,291 records (sampled)
- `alea_legal`: ~1,511 records (sampled)
- `questions`: ~1,299 records (sampled)
- `drafts`: ~438 records (sampled)
- `wikidata_samples`: ~108 records (sampled)
- `math`: ~60 records (sampled)
- `relationships`: ~50 records (sampled)
- `chapters`: ~49 records (sampled)
- `strategy`: ~47 records (sampled)
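Because the counts above are approximate and based on a sample, it may be useful to recompute the distribution on whatever subset is at hand. The sketch below shows one way to tally the top-level `source` field; `source_counts` and the in-memory `sample_records` list are hypothetical and stand in for real records.

```python
from collections import Counter


def source_counts(records):
    """Count records per value of the top-level 'source' field."""
    return Counter(record["source"] for record in records)


# Hypothetical mini-sample standing in for real records.
sample_records = [
    {"source": "lexicon"},
    {"source": "lexicon"},
    {"source": "encyclopedias"},
    {"source": "alea_legal"},
]

# Print sources from most to least frequent.
for source, count in source_counts(sample_records).most_common():
    print(f"{source}: {count}")
```

The helper accepts any iterable of dict-like records, so a loaded split or a subset of one (for example `train_data.select(range(1000))`) could be passed in place of the toy list.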
## Usage
```python
from datasets import load_dataset
import json

dataset = load_dataset('mjbommar/curriculum-001-pretrain')
train_data = dataset['train']
val_data = dataset['validation']

# Filter by source (promoted to top-level for easy filtering)
lexicon_data = train_data.filter(lambda x: x['source'] == 'lexicon')
alea_data = train_data.filter(lambda x: x['source'] == 'alea_legal')

# Access source-specific metadata (stored as JSON)
for record in train_data.select(range(10)):
    extra_metadata = json.loads(record['metadata'])
    print(f"Source: {record['source']}, Chars: {record['char_count']}")
```
## Schema Notes
- **Top-level fields** (`source`, `char_count`): Universal fields promoted for easy filtering/sorting
- **metadata field**: JSON string containing source-specific fields (varies by source)
- This structure enables efficient filtering while maintaining source-specific details
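The split between top-level fields and JSON-encoded metadata suggests a two-step pattern: filter cheaply on `source` or `char_count` first, then decode `metadata` only for the records that remain. The following is a minimal sketch of that idea on plain dictionaries; `decode_metadata`, the sample records, and the `pos` and `qid` metadata keys are invented for illustration and may not match the real source-specific fields.

```python
import json


def decode_metadata(record):
    """Return a copy of the record with 'metadata' parsed from JSON text into a dict."""
    decoded = dict(record)
    decoded["metadata"] = json.loads(record["metadata"])
    return decoded


# Hypothetical records; the 'pos' and 'qid' metadata keys are invented.
records = [
    {"text": "short", "source": "lexicon", "char_count": 5,
     "metadata": '{"pos": "noun"}'},
    {"text": "a longer passage", "source": "encyclopedias", "char_count": 16,
     "metadata": '{"qid": "Q1"}'},
]

# Filter on the top-level field first, then decode metadata only for the survivors.
long_records = [decode_metadata(r) for r in records if r["char_count"] >= 10]
print(long_records[0]["metadata"])
```

Filtering before decoding avoids parsing JSON for records that would be discarded anyway, which is the efficiency the promoted fields are meant to provide.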
## License
CC-BY-4.0
Provider:
mjbommar



