AlgorithmicResearchGroup/ai-sft

Name: AlgorithmicResearchGroup/ai-sft
Creator: AlgorithmicResearchGroup
Published: 2026-04-11 22:42:30
License: No description available

Hugging Face: updated 2026-04-11, indexed 2026-04-12

Download link:

https://hf-mirror.com/datasets/AlgorithmicResearchGroup/ai-sft

Description:

---
pretty_name: AI SFT Dataset
language:
- en
license: other
task_categories:
- text-generation
size_categories:
- 1M<n<10M
tags:
- supervised-fine-tuning
- research
- code
- papers
- ai-research
configs:
- config_name: default
  data_files:
  - split: train
    path: train.parquet
  - split: validation
    path: val.parquet
  - split: full
    path: full.parquet
---

# AI SFT Dataset

A unified supervised fine-tuning dataset built from public [Algorithmic Research Group](https://algorithmicresearchgroup.com/opensource.html) Hugging Face sources. Designed for training models capable of AI research reasoning, this dataset aggregates instruction-following examples spanning research code generation, scientific QA, and technical problem solving.

## Dataset Summary

| Statistic | Count |
|-----------|-------|
| Total records | 2,729,918 |
| Train split | 2,593,122 |
| Validation split | 136,796 |
| Rejected records | 199,678 |

## Dataset Structure

### Files

| File | Description |
|------|-------------|
| `train.parquet` | Training split |
| `val.parquet` | Validation split |
| `full.parquet` | Canonical fields plus helper columns |
| `canonical.parquet` | Public schema only |
| `rejected.parquet` | Rejected rows with reasons and raw source payload |
| `stats.json` | Build statistics |
| `mixture_recipe.yaml` | Mixture recipe configuration |

### Canonical Fields

| Field | Type | Description |
|-------|------|-------------|
| `example_id` | string | Unique identifier for each example |
| `task_family` | string | Category of the task (e.g., `research_code_generation`) |
| `instruction` | string | Task instruction |
| `context` | string | Additional context (nullable) |
| `choices` | string | Multiple choice options (nullable) |
| `target` | string | Target/expected output |
| `target_format` | string | Format of the target (e.g., `python`, `text`) |
| `grounded` | int64 | Whether the example is grounded in source material |
| `source_dataset` | string | Source dataset name |
| `source_keys` | string | Keys from source data |
| `loss_weight` | float64 | Weight for loss computation |

### Helper Columns (full export)

| Field | Description |
|-------|-------------|
| `split` | Data split identifier |
| `root_id` | Root identifier |
| `rendered_input` | Rendered input text |
| `quality_flags` | Quality assessment flags |

## Usage

```python
from datasets import load_dataset

ds = load_dataset("AlgorithmicResearchGroup/ai-sft", split="train")

# or stream
ds = load_dataset("AlgorithmicResearchGroup/ai-sft", streaming=True, split="train")
for sample in ds:
    print(sample["task_family"], sample["instruction"][:100])
    break
```

## Source

Built from public datasets in the [AlgorithmicResearchGroup](https://huggingface.co/AlgorithmicResearchGroup) Hugging Face organization, including [ArXivDLInstruct](https://huggingface.co/datasets/AlgorithmicResearchGroup/ArXivDLInstruct) and other research-focused collections.

## Citation

```bibtex
@misc{ai_sft_2024,
  title={AI SFT Dataset},
  author={Algorithmic Research Group},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/datasets/AlgorithmicResearchGroup/ai-sft}
}
```
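As a supplement to the canonical fields listed above, the following is a minimal sketch of how a training script might consume them. It runs on two made-up in-memory rows rather than the real dataset. The prompt layout in `build_prompt`, the `scientific_qa` category name, and the way `loss_weight` is applied in `weighted_mean_loss` are all assumptions for illustration. The dataset card only says that `context` and `choices` are nullable and that `loss_weight` is a "weight for loss computation"; it does not document how the `rendered_input` helper column is built or how the weight should enter the loss.

```python
from typing import Any, Mapping, Sequence

# Hypothetical rows that reuse the canonical field names.
# The values are invented for this sketch and are not taken from the dataset.
ROWS = [
    {
        "example_id": "demo-1",
        "task_family": "research_code_generation",
        "instruction": "Write a function that adds two numbers.",
        "context": None,
        "choices": None,
        "target": "def add(a, b):\n    return a + b",
        "target_format": "python",
        "loss_weight": 1.0,
    },
    {
        "example_id": "demo-2",
        "task_family": "scientific_qa",  # assumed category name
        "instruction": "How many epochs does the paper train for?",
        "context": "The paper trains for 10 epochs.",
        "choices": None,
        "target": "10",
        "target_format": "text",
        "loss_weight": 0.5,
    },
]


def build_prompt(row: Mapping[str, Any]) -> str:
    """Join instruction, context, and choices, skipping empty nullable fields.

    This layout is an assumption; it is not the dataset's `rendered_input`.
    """
    parts = [row["instruction"]]
    if row.get("context"):
        parts.append("Context:\n" + row["context"])
    if row.get("choices"):
        parts.append("Choices:\n" + row["choices"])
    return "\n\n".join(parts)


def weighted_mean_loss(losses: Sequence[float], weights: Sequence[float]) -> float:
    """Average per-example losses, scaling each by its weight (one possible scheme)."""
    if len(losses) != len(weights):
        raise ValueError("losses and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(l * w for l, w in zip(losses, weights)) / total_weight


# Select one task family, as one might before building a training mixture.
code_rows = [r for r in ROWS if r["task_family"] == "research_code_generation"]

# Render a prompt for every row.
prompts = [build_prompt(r) for r in ROWS]

# Placeholder per-example losses standing in for real model outputs.
example_losses = [2.0, 4.0]
mean_loss = weighted_mean_loss(example_losses, [r["loss_weight"] for r in ROWS])
```

With the real data, the same functions could be applied to samples yielded by `load_dataset`, since each sample is a dictionary keyed by the canonical field names.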

Provider:

AlgorithmicResearchGroup
