
mjbommar/curriculum-001-pretrain

Hugging Face | Updated 2025-11-24 | Indexed 2025-12-20
Download link:
https://hf-mirror.com/datasets/mjbommar/curriculum-001-pretrain
Resource description:
---
task_categories:
- text-generation
- fill-mask
language:
- en
size_categories:
- 100K<n<1M
license: cc-by-4.0
---

# Curriculum Training Data - PRETRAIN

This dataset contains 364,848 records for pretraining.

## Dataset Statistics

- **Total Records**: 364,848
- **Train**: 346,605 records
- **Validation**: 18,243 records

## Schema

```json
{
  "text": "string",
  "source": "string",
  "char_count": "int64",
  "metadata": "string (JSON - source-specific fields)"
}
```

## Example Record

```json
{
  "text": "# Côa (Q14653)\n\nCôa (Q14653) is a river in northern Portugal that ultimately feeds the Douro, a major river system in the region. It runs about 140 kilometers in length and drains a watershed of 2,521 square kilometers within the Douro drainage basin. The river’s course extends roughly between 41.0809°N, -7.1047°W and 40.2748°N, -6.9245°W, a path that traces a southward arc across parts of the Portuguese landscape and connects diverse ecosystems along its banks. In this sense, it is a significant contributor to the hydrology of the northern Iberian peninsula, playing a part in the broader network of rivers that shape the region’s geography.\n\nTwo tributaries feed the Côa, designated by the Wikidata identifiers Q10362237 and Q10362318, which collect rainfall and runoff from the surrounding lands. These streams combine with the main river’s flow as it continues toward the Douro, and in due course the waters enter the Douro proper. Through this connection, the Côa helps su...
```

## Data Sources

- `lexicon`: ~4,098 records (sampled)
- `encyclopedias`: ~2,291 records (sampled)
- `alea_legal`: ~1,511 records (sampled)
- `questions`: ~1,299 records (sampled)
- `drafts`: ~438 records (sampled)
- `wikidata_samples`: ~108 records (sampled)
- `math`: ~60 records (sampled)
- `relationships`: ~50 records (sampled)
- `chapters`: ~49 records (sampled)
- `strategy`: ~47 records (sampled)

## Usage

```python
from datasets import load_dataset
import json

dataset = load_dataset('mjbommar/curriculum-001-pretrain')
train_data = dataset['train']
val_data = dataset['validation']

# Filter by source (promoted to top-level for easy filtering)
lexicon_data = train_data.filter(lambda x: x['source'] == 'lexicon')
alea_data = train_data.filter(lambda x: x['source'] == 'alea_legal')

# Access source-specific metadata (stored as JSON)
for record in train_data.select(range(10)):
    extra_metadata = json.loads(record['metadata'])
    print(f"Source: {record['source']}, Chars: {record['char_count']}")
```

## Schema Notes

- **Top-level fields** (`source`, `char_count`): Universal fields promoted for easy filtering/sorting
- **metadata field**: JSON string containing source-specific fields (varies by source)
- This structure enables efficient filtering while maintaining source-specific details

## License

CC-BY-4.0
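As a supplement to the Usage and Schema Notes sections of the card, here is a minimal offline sketch of how the promoted top-level fields and the JSON-encoded `metadata` field might be handled together. It works on plain dictionaries shaped like the documented schema, so it runs without downloading anything. The helper names (`make_record`, `parse_metadata`, `count_by_source`) and the metadata keys (`qid`, `pos`) are hypothetical and chosen only for illustration; the card does not say which keys each source actually stores.

```python
import json
from collections import Counter


def make_record(text, source, metadata):
    """Build a record in the documented schema (hypothetical helper)."""
    return {
        "text": text,
        "source": source,
        "char_count": len(text),
        # The card states that metadata is stored as a JSON string.
        "metadata": json.dumps(metadata),
    }


def parse_metadata(record):
    """Return a copy of the record with `metadata` decoded into a dict."""
    decoded = dict(record)
    raw = record.get("metadata")
    decoded["metadata"] = json.loads(raw) if raw else {}
    return decoded


def count_by_source(records, min_chars=0):
    """Count records per `source`, keeping those with char_count >= min_chars."""
    return Counter(r["source"] for r in records if r["char_count"] >= min_chars)


# Invented sample records; the metadata keys are assumptions, not dataset facts.
sample_records = [
    make_record("# Côa (Q14653)", "wikidata_samples", {"qid": "Q14653"}),
    make_record("river: a large natural stream of water", "lexicon", {"pos": "noun"}),
    make_record("bank", "lexicon", {}),
]

decoded = [parse_metadata(r) for r in sample_records]
counts = count_by_source(sample_records, min_chars=10)

print(decoded[0]["metadata"])
print(dict(counts))
```

Because a `datasets.Dataset` yields one dictionary per row when iterated, the same helpers should apply to `dataset['train']` from the Usage section. For the full 346,605-record training split, filtering on `source` or `char_count` before decoding `metadata` avoids parsing JSON for rows that are discarded anyway.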
Provider:
mjbommar