
AlgorithmicResearchGroup/ai-sft

Hugging Face | Updated 2026-04-11 | Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/AlgorithmicResearchGroup/ai-sft
Resource overview:
---
pretty_name: AI SFT Dataset
language:
- en
license: other
task_categories:
- text-generation
size_categories:
- 1M<n<10M
tags:
- supervised-fine-tuning
- research
- code
- papers
- ai-research
configs:
- config_name: default
  data_files:
  - split: train
    path: train.parquet
  - split: validation
    path: val.parquet
  - split: full
    path: full.parquet
---

# AI SFT Dataset

A unified supervised fine-tuning dataset built from public [Algorithmic Research Group](https://algorithmicresearchgroup.com/opensource.html) Hugging Face sources. Designed for training models capable of AI research reasoning, this dataset aggregates instruction-following examples spanning research code generation, scientific QA, and technical problem solving.

## Dataset Summary

| Statistic | Count |
|-----------|-------|
| Total records | 2,729,918 |
| Train split | 2,593,122 |
| Validation split | 136,796 |
| Rejected records | 199,678 |

## Dataset Structure

### Files

| File | Description |
|------|-------------|
| `train.parquet` | Training split |
| `val.parquet` | Validation split |
| `full.parquet` | Canonical fields plus helper columns |
| `canonical.parquet` | Public schema only |
| `rejected.parquet` | Rejected rows with reasons and raw source payload |
| `stats.json` | Build statistics |
| `mixture_recipe.yaml` | Mixture recipe configuration |

### Canonical Fields

| Field | Type | Description |
|-------|------|-------------|
| `example_id` | string | Unique identifier for each example |
| `task_family` | string | Category of the task (e.g., `research_code_generation`) |
| `instruction` | string | Task instruction |
| `context` | string | Additional context (nullable) |
| `choices` | string | Multiple choice options (nullable) |
| `target` | string | Target/expected output |
| `target_format` | string | Format of the target (e.g., `python`, `text`) |
| `grounded` | int64 | Whether the example is grounded in source material |
| `source_dataset` | string | Source dataset name |
| `source_keys` | string | Keys from source data |
| `loss_weight` | float64 | Weight for loss computation |

### Helper Columns (full export)

| Field | Description |
|-------|-------------|
| `split` | Data split identifier |
| `root_id` | Root identifier |
| `rendered_input` | Rendered input text |
| `quality_flags` | Quality assessment flags |

## Usage

```python
from datasets import load_dataset

ds = load_dataset("AlgorithmicResearchGroup/ai-sft", split="train")

# or stream
ds = load_dataset("AlgorithmicResearchGroup/ai-sft", streaming=True, split="train")
for sample in ds:
    print(sample["task_family"], sample["instruction"][:100])
    break
```

## Source

Built from public datasets in the [AlgorithmicResearchGroup](https://huggingface.co/AlgorithmicResearchGroup) Hugging Face organization, including [ArXivDLInstruct](https://huggingface.co/datasets/AlgorithmicResearchGroup/ArXivDLInstruct) and other research-focused collections.

## Citation

```bibtex
@misc{ai_sft_2024,
  title={AI SFT Dataset},
  author={Algorithmic Research Group},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/datasets/AlgorithmicResearchGroup/ai-sft}
}
```
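
## Example: building prompt/completion pairs from the canonical fields

The canonical fields listed above (`instruction`, nullable `context` and `choices`, `target`, `loss_weight`) could be turned into prompt/completion pairs along the following lines. This is a minimal sketch and not the dataset's own rendering: the function names `render_prompt` and `to_sft_record`, the `Context:` and `Choices:` labels, and the fallback weight of 1.0 are all assumptions made for illustration. The `rendered_input` helper column in the full export may use a different template.

```python
from typing import Any, Dict, Optional


def render_prompt(example: Dict[str, Any]) -> str:
    """Join instruction, context, and choices into one prompt string.

    The section labels are hypothetical; nullable fields that are
    None or empty are skipped.
    """
    parts = [example["instruction"].strip()]

    context: Optional[str] = example.get("context")
    if context:
        parts.append("Context:\n" + context.strip())

    choices: Optional[str] = example.get("choices")
    if choices:
        parts.append("Choices:\n" + choices.strip())

    return "\n\n".join(parts)


def to_sft_record(example: Dict[str, Any]) -> Dict[str, Any]:
    """Map one canonical row to a prompt/completion/weight record."""
    weight = example.get("loss_weight")
    return {
        "prompt": render_prompt(example),
        "completion": example["target"],
        # Assumption: rows without a loss_weight get a neutral weight of 1.0.
        "weight": 1.0 if weight is None else float(weight),
    }


if __name__ == "__main__":
    # Hand-written row for illustration, not taken from the dataset.
    row = {
        "instruction": "Pick the correct answer.",
        "context": "A short abstract.",
        "choices": "A) yes\nB) no",
        "target": "A",
        "loss_weight": 0.5,
    }
    print(to_sft_record(row))
```

With streaming enabled, the same function could be applied lazily, for example via `ds.map(to_sft_record)` on the object returned by `load_dataset`, so that rows are converted as they are read instead of all at once.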
Provided by:
AlgorithmicResearchGroup