mjbommar/curriculum-001-sft
Source: Hugging Face. Updated 2025-11-24; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/mjbommar/curriculum-001-sft
Description:
---
task_categories:
- text-generation
- question-answering
language:
- en
size_categories:
- 100K<n<1M
license: cc-by-4.0
---
# Curriculum Training Data - SFT
This dataset contains 983,217 records for supervised fine-tuning (SFT).
## Dataset Statistics
- **Total Records**: 983,217
- **Train**: 786,573 records
- **Validation**: 98,322 records
- **Test**: 98,322 records
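The split sizes above add up to the stated total and come out close to an 80/10/10 split. A quick arithmetic check, using only the numbers listed in this card:

```python
# Split sizes as listed in the Dataset Statistics section of this card.
splits = {"train": 786_573, "validation": 98_322, "test": 98_322}
total = sum(splits.values())

# The three splits should add up to the stated total record count.
assert total == 983_217

# Share of each split as a fraction of the total.
shares = {name: count / total for name, count in splits.items()}
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```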
## Schema
```json
{
  "text": "string",
  "source": "string",
  "char_count": "int64",
  "metadata": "string (JSON - source-specific fields)"
}
```
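As a minimal sketch, a record can be checked against the schema above before it goes into a training pipeline. The helper name `validate_record` and the sample record are hypothetical and exist only for illustration; `int64` is assumed to arrive as a plain Python `int`.

```python
import json

# Field names and types taken from the Schema section above.
# Assumption: int64 values arrive as plain Python ints.
EXPECTED_FIELDS = {"text": str, "source": str, "char_count": int, "metadata": str}


def validate_record(record: dict) -> list[str]:
    """Return a list of problems found in one record (empty if none).

    Hypothetical helper for illustration; not part of the dataset tooling.
    """
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            problems.append(f"{name} should be {expected_type.__name__}")
    # The metadata field is documented as a JSON-encoded string.
    if isinstance(record.get("metadata"), str):
        try:
            json.loads(record["metadata"])
        except json.JSONDecodeError:
            problems.append("metadata is not valid JSON")
    return problems


# A made-up record, used only to exercise the helper.
sample = {
    "text": "example text",
    "source": "lexicon",
    "char_count": 12,
    "metadata": json.dumps({"task_type": "antonyms"}),
}
print(validate_record(sample))
```

Returning a list of problems rather than raising on the first one makes it easier to log every issue in a malformed record at once.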
## Example Record
```json
{
  "prompt": "What are antonyms for 'math book edition'?",
  "completion": "nonmath edition, general edition",
  "metadata": {
    "source": "lexicon",
    "task_type": "antonyms",
    "word": "math book edition",
    "file": "math_book_edition.json",
    "prompt_chars": 42,
    "completion_chars": 32
  }
}
```
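The `prompt_chars` and `completion_chars` values in the example can be checked against the strings themselves. This sketch assumes the counts are plain character (code point) lengths, which the card does not state explicitly, and treats `metadata` as a nested dict, as the example shows it.

```python
# The example record from this card, reproduced as a Python dict.
record = {
    "prompt": "What are antonyms for 'math book edition'?",
    "completion": "nonmath edition, general edition",
    "metadata": {
        "source": "lexicon",
        "task_type": "antonyms",
        "word": "math book edition",
        "file": "math_book_edition.json",
        "prompt_chars": 42,
        "completion_chars": 32,
    },
}

meta = record["metadata"]

# Assumption: the *_chars fields count characters as len() does in Python.
prompt_ok = len(record["prompt"]) == meta["prompt_chars"]
completion_ok = len(record["completion"]) == meta["completion_chars"]
print(prompt_ok, completion_ok)
```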
## Data Sources
- `lexicon`: ~8,982 records (sampled)
- `alea_legal`: ~460 records (sampled)
- `questions`: ~307 records (sampled)
- `drafts`: ~139 records (sampled)
- `wikidata_samples`: ~33 records (sampled)
- `relationships`: ~31 records (sampled)
- `strategy`: ~14 records (sampled)
- `wikidata_encyclopedias`: ~12 records (sampled)
- `math`: ~10 records (sampled)
- `courses`: ~8 records (sampled)
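The sampled counts above can be turned into rough per-source shares. The counts are approximate and come from a sample, so the percentages are only indicative; whether they carry over to the full dataset is an assumption.

```python
# Approximate per-source counts from the Data Sources list (sampled).
sampled_counts = {
    "lexicon": 8_982,
    "alea_legal": 460,
    "questions": 307,
    "drafts": 139,
    "wikidata_samples": 33,
    "relationships": 31,
    "strategy": 14,
    "wikidata_encyclopedias": 12,
    "math": 10,
    "courses": 8,
}

# Total number of records covered by the listed counts.
sample_size = sum(sampled_counts.values())

# Share of each source within the sample, largest first.
shares = {source: count / sample_size for source, count in sampled_counts.items()}
for source, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{source}: {share:.2%}")
```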
## Usage
```python
import json

from datasets import load_dataset

# Load all splits from the Hugging Face Hub
dataset = load_dataset('mjbommar/curriculum-001-sft')
train_data = dataset['train']
val_data = dataset['validation']
test_data = dataset['test']

# Filter by source (promoted to top-level for easy filtering)
lexicon_data = train_data.filter(lambda x: x['source'] == 'lexicon')
alea_data = train_data.filter(lambda x: x['source'] == 'alea_legal')

# Access source-specific metadata (stored as a JSON string)
for record in train_data.select(range(10)):
    extra_metadata = json.loads(record['metadata'])
    print(f"Source: {record['source']}, Chars: {record['char_count']}")
```
## Schema Notes
- **Top-level fields** (`source`, `char_count`): Universal fields promoted for easy filtering/sorting
- **metadata field**: JSON string containing source-specific fields (varies by source)
- This structure enables efficient filtering while maintaining source-specific details
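Because the `metadata` keys vary by source, it can help to see which keys each source actually uses. The following is a minimal sketch on plain dicts; the helper name `metadata_keys_by_source`, the sample records, and the `doc_id` key are hypothetical and used only for illustration.

```python
import json
from collections import defaultdict


def metadata_keys_by_source(records):
    """Map each source to the set of metadata keys seen in its records.

    Hypothetical helper; expects an iterable of dicts with a 'source'
    string and a JSON-encoded 'metadata' string, per the schema notes.
    """
    keys = defaultdict(set)
    for record in records:
        extra = json.loads(record["metadata"])
        keys[record["source"]].update(extra.keys())
    return dict(keys)


# Made-up records for illustration; "doc_id" is a hypothetical key.
records = [
    {"source": "lexicon", "char_count": 74,
     "metadata": json.dumps({"task_type": "antonyms", "word": "math book edition"})},
    {"source": "lexicon", "char_count": 60,
     "metadata": json.dumps({"task_type": "antonyms"})},
    {"source": "alea_legal", "char_count": 512,
     "metadata": json.dumps({"doc_id": "example-1"})},
]

result = metadata_keys_by_source(records)
for source, key_set in result.items():
    print(source, sorted(key_set))
```

Since iterating over a loaded split yields one dict per row, the same helper should also accept a small slice such as `train_data.select(range(1000))`.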
## License
CC-BY-4.0
Provider:
mjbommar



