
mjbommar/curriculum-001-sft

Source: Hugging Face. Updated 2025-11-24; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/mjbommar/curriculum-001-sft
Resource description:
---
task_categories:
- text-generation
- question-answering
language:
- en
size_categories:
- 100K<n<1M
license: cc-by-4.0
---

# Curriculum Training Data - SFT

This dataset contains 983,217 records for sft training.

## Dataset Statistics

- **Total Records**: 983,217
- **Train**: 786,573 records
- **Validation**: 98,322 records
- **Test**: 98,322 records

## Schema

```json
{
  "text": "string",
  "source": "string",
  "char_count": "int64",
  "metadata": "string (JSON - source-specific fields)"
}
```

## Example Record

```json
{
  "prompt": "What are antonyms for 'math book edition'?",
  "completion": "nonmath edition, general edition",
  "metadata": {
    "source": "lexicon",
    "task_type": "antonyms",
    "word": "math book edition",
    "file": "math_book_edition.json",
    "prompt_chars": 42,
    "completion_chars": 32
  }
}
```

## Data Sources

- `lexicon`: ~8,982 records (sampled)
- `alea_legal`: ~460 records (sampled)
- `questions`: ~307 records (sampled)
- `drafts`: ~139 records (sampled)
- `wikidata_samples`: ~33 records (sampled)
- `relationships`: ~31 records (sampled)
- `strategy`: ~14 records (sampled)
- `wikidata_encyclopedias`: ~12 records (sampled)
- `math`: ~10 records (sampled)
- `courses`: ~8 records (sampled)

## Usage

```python
from datasets import load_dataset
import json

dataset = load_dataset('mjbommar/curriculum-001-sft')
train_data = dataset['train']
val_data = dataset['validation']
test_data = dataset['test']

# Filter by source (promoted to top-level for easy filtering)
lexicon_data = train_data.filter(lambda x: x['source'] == 'lexicon')
alea_data = train_data.filter(lambda x: x['source'] == 'alea_legal')

# Access source-specific metadata (stored as JSON)
for record in train_data.select(range(10)):
    extra_metadata = json.loads(record['metadata'])
    print(f"Source: {record['source']}, Chars: {record['char_count']}")
```

## Schema Notes

- **Top-level fields** (`source`, `char_count`): Universal fields promoted for easy filtering/sorting
- **metadata field**: JSON string containing source-specific fields (varies by source)
- This structure enables efficient filtering while maintaining source-specific details

## License

CC-BY-4.0
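The Schema section describes `metadata` as a JSON string, while the Example Record shows it as a nested object. The sketch below is one way to read the field defensively in either form and to tally the `task_type` values seen in the example. The helper names `parse_metadata` and `count_task_types` and the in-memory `sample` records are hypothetical and only for illustration; they are not part of the dataset or of the `datasets` library.

```python
import json
from collections import Counter
from typing import Any, Iterable


def parse_metadata(record: dict[str, Any]) -> dict[str, Any]:
    """Return a record's metadata as a dict, whether it is stored as
    JSON text (per the Schema) or as a nested object (per the Example Record)."""
    raw = record.get("metadata")
    if raw is None or raw == "":
        return {}  # missing or empty metadata
    if isinstance(raw, dict):
        return raw  # already decoded
    return json.loads(raw)  # JSON string form


def count_task_types(records: Iterable[dict[str, Any]]) -> Counter:
    """Count records per metadata['task_type']; records without one count as 'unknown'."""
    counts: Counter = Counter()
    for record in records:
        counts[parse_metadata(record).get("task_type", "unknown")] += 1
    return counts


# Hypothetical records shaped like the card's examples (not real dataset rows).
sample = [
    {"source": "lexicon", "metadata": '{"task_type": "antonyms", "word": "math book edition"}'},
    {"source": "lexicon", "metadata": {"task_type": "antonyms"}},
    {"source": "math", "metadata": ""},
]
task_counts = count_task_types(sample)
print(task_counts)
```

The same helpers should work on rows from `train_data.select(range(10))` in the Usage snippet, since each row is yielded as a plain dict; whether every source actually carries a `task_type` key is an assumption, which is why the lookup falls back to `'unknown'`.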
Provider:
mjbommar