five

pomazanbohdan/rustforge-personal-rust-dataset

收藏
Hugging Face2026-03-26 更新2026-03-29 收录
下载链接:
https://hf-mirror.com/datasets/pomazanbohdan/rustforge-personal-rust-dataset
下载链接
链接失效反馈
官方服务:
资源简介:
--- language: - en pretty_name: RustForge Personal Rust Dataset license: other task_categories: - text-generation tags: - rust - code - synthetic - instruction - chatml - unsloth - sft size_categories: - 10K<n<100K configs: - config_name: default data_files: - split: train path: data/train-*.jsonl --- # RustForge Personal Rust Dataset Current version: `0.5.4` Target Hub repo: `pomazanbohdan/rustforge-personal-rust-dataset` Source repository: - [pomazanbohdan/llm-model-rust-test](https://github.com/pomazanbohdan/llm-model-rust-test) ## Scale - total records: `56,000` - format: ChatML-style `messages` - storage: sharded JSONL ## Methodology This dataset is built as a Rust-specific, benchmark-aligned training corpus rather than a generic code dump. Construction principles: - Rust edition target is fixed to `2024` - examples are generated by task family, not only by category count - the corpus is designed around real failure modes from a Rust evaluation suite - records are emitted in one unified ChatML-compatible format for direct SFT use The dataset generation workflow in the repository follows these stages: 1. Define a Rust task mix across compile repair, semantic implementation, bugfix, edition migration, async, unsafe, macros, API refactor, doctest, and Cargo/workspace tasks. 2. Generate sharded dataset rows from editable batch manifests under `hf-source/`. 3. Assign each generated row to an explicit `family_id` so quality can be tracked per template family. 4. Run normalized deduplication to measure semantic repetition. 5. Run tiered validation: - `cheap`: `cargo check` + `cargo fmt --check` - `medium`: adds `cargo clippy -- -D warnings` and `cargo test --no-run` - `full`: adds `cargo doc --no-deps` and `cargo test --doc` 6. Run family-first cascade and parallel depth validation to expand verified coverage without exhaustively validating every row. 7. Build smaller, higher-confidence priority subsets for lower-cost fine-tuning workflows. ## Design Goal The goal is to train Rust-specialized coding models that perform well on: - compile correctness - semantic correctness - Rust 2024 migration behavior - async and concurrency patterns - unsafe and FFI boundaries - macro updates - API refactoring - doctest and Cargo workspace maintenance ## Intended Uses Recommended uses: - supervised fine-tuning of Rust-oriented coding models - continued domain adaptation of general coding models toward Rust 2024 - curriculum construction for repair, migration, and maintenance-style Rust tasks - building smaller high-confidence subsets for lower-cost training This dataset is especially suited for models that should edit or generate Rust code in a crate or workspace context rather than only answer general programming questions. ## Limitations - the current release is predominantly synthetic, although it is benchmark-aligned - the corpus is fully audited in the current release, but it is still predominantly synthetic rather than mined from production repositories - some quality signals are tracked at the template-family level rather than by exhaustively validating every row - the dataset is optimized for modern Rust application and library workflows, not for every possible Rust domain such as embedded, `no_std`, kernel, or GPU-specific development For stricter training mixes, use the repository tooling to build validated or priority subsets on top of the canonical corpus. ## Quality Snapshot Current quality work recorded in the repository includes: - family-based generation with explicit `family_id` - normalized semantic deduplication - tiered validation across `cheap`, `medium`, and `full` execution gates - exact-id tail fills to close the active-corpus audit gap - current-only reports for dataset quality and family depth - optimized priority-train builders for lower-cost fine-tuning Current snapshot: - canonical corpus size: `56,000` rows - current audited rows on the rebuilt corpus: `56,000` - failed audited rows: `0` - stable audited families: `56/56` - global family-depth floor: `1000` - the active corpus is now fully audited end-to-end - all `13` dataset categories are currently at `A+` - unique semantic keys: `55,100` - semantic uniqueness rate: `98.39%` Reference reports: - [Current dataset status](https://github.com/pomazanbohdan/llm-model-rust-test/blob/main/reports/current-dataset-status.md) - [Current family depth](https://github.com/pomazanbohdan/llm-model-rust-test/blob/main/reports/current-family-depth.md) - [Current dataset diversity](https://github.com/pomazanbohdan/llm-model-rust-test/blob/main/reports/current-dataset-diversity.md) ## Category mix | Category | Count | | --- | ---: | | api_refactor | 4000 | | async_concurrency_fix | 6000 | | cargo_workspace_fix | 5000 | | clippy_fmt_cleanup | 3000 | | compile_repair | 5000 | | doctest_doc_fix | 3000 | | edition2024_migration | 6000 | | macro_fix | 3000 | | review_preference | 1000 | | rust_qa | 3000 | | semantic_impl | 8000 | | test_driven_bugfix | 4000 | | unsafe_ffi_fix | 5000 | ## Unsloth compatibility Use `messages` as the conversation field. If a UI asks for pairwise mapping, use `prompt` and `completion`. ## Loading in Unsloth This dataset is structured to work with ChatML-style training flows. Typical loading path: 1. Load the dataset from Hugging Face. 2. Use the `messages` column as the chat conversation field. 3. Keep `prompt` and `completion` only as convenience columns for inspection or alternative trainers. Example with `datasets`: ```python from datasets import load_dataset dataset = load_dataset( "pomazanbohdan/rustforge-personal-rust-dataset", split="train", ) print(dataset.column_names) print(dataset[0]["messages"]) ``` If your Unsloth or trainer UI expects an instruction-response mapping instead of chat messages, use: - input: `prompt` - output: `completion` ## Repository Layout The canonical implementation lives in: - [repository root](https://github.com/pomazanbohdan/llm-model-rust-test) - [`hf-source/`](https://github.com/pomazanbohdan/llm-model-rust-test/tree/main/hf-source) - [`scripts/`](https://github.com/pomazanbohdan/llm-model-rust-test/tree/main/scripts) - [`reports/`](https://github.com/pomazanbohdan/llm-model-rust-test/tree/main/reports)
提供机构:
pomazanbohdan
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作