pomazanbohdan/rustforge-personal-rust-dataset
收藏Hugging Face2026-03-26 更新2026-03-29 收录
下载链接:
https://hf-mirror.com/datasets/pomazanbohdan/rustforge-personal-rust-dataset
下载链接
链接失效反馈官方服务:
资源简介:
---
language:
- en
pretty_name: RustForge Personal Rust Dataset
license: other
task_categories:
- text-generation
tags:
- rust
- code
- synthetic
- instruction
- chatml
- unsloth
- sft
size_categories:
- 10K<n<100K
configs:
- config_name: default
data_files:
- split: train
path: data/train-*.jsonl
---
# RustForge Personal Rust Dataset
Current version: `0.5.4`
Target Hub repo: `pomazanbohdan/rustforge-personal-rust-dataset`
Source repository:
- [pomazanbohdan/llm-model-rust-test](https://github.com/pomazanbohdan/llm-model-rust-test)
## Scale
- total records: `56,000`
- format: ChatML-style `messages`
- storage: sharded JSONL
## Methodology
This dataset is built as a Rust-specific, benchmark-aligned training corpus rather than a generic code dump.
Construction principles:
- Rust edition target is fixed to `2024`
- examples are generated by task family, not only by category count
- the corpus is designed around real failure modes from a Rust evaluation suite
- records are emitted in one unified ChatML-compatible format for direct SFT use
The dataset generation workflow in the repository follows these stages:
1. Define a Rust task mix across compile repair, semantic implementation, bugfix, edition migration, async, unsafe, macros, API refactor, doctest, and Cargo/workspace tasks.
2. Generate sharded dataset rows from editable batch manifests under `hf-source/`.
3. Assign each generated row to an explicit `family_id` so quality can be tracked per template family.
4. Run normalized deduplication to measure semantic repetition.
5. Run tiered validation:
- `cheap`: `cargo check` + `cargo fmt --check`
- `medium`: adds `cargo clippy -- -D warnings` and `cargo test --no-run`
- `full`: adds `cargo doc --no-deps` and `cargo test --doc`
6. Run family-first cascade and parallel depth validation to expand verified coverage without exhaustively validating every row.
7. Build smaller, higher-confidence priority subsets for lower-cost fine-tuning workflows.
## Design Goal
The goal is to train Rust-specialized coding models that perform well on:
- compile correctness
- semantic correctness
- Rust 2024 migration behavior
- async and concurrency patterns
- unsafe and FFI boundaries
- macro updates
- API refactoring
- doctest and Cargo workspace maintenance
## Intended Uses
Recommended uses:
- supervised fine-tuning of Rust-oriented coding models
- continued domain adaptation of general coding models toward Rust 2024
- curriculum construction for repair, migration, and maintenance-style Rust tasks
- building smaller high-confidence subsets for lower-cost training
This dataset is especially suited for models that should edit or generate Rust code in a crate or workspace context rather than only answer general programming questions.
## Limitations
- the current release is predominantly synthetic, although it is benchmark-aligned
- the corpus is fully audited in the current release, but it is still predominantly synthetic rather than mined from production repositories
- some quality signals are tracked at the template-family level rather than by exhaustively validating every row
- the dataset is optimized for modern Rust application and library workflows, not for every possible Rust domain such as embedded, `no_std`, kernel, or GPU-specific development
For stricter training mixes, use the repository tooling to build validated or priority subsets on top of the canonical corpus.
## Quality Snapshot
Current quality work recorded in the repository includes:
- family-based generation with explicit `family_id`
- normalized semantic deduplication
- tiered validation across `cheap`, `medium`, and `full` execution gates
- exact-id tail fills to close the active-corpus audit gap
- current-only reports for dataset quality and family depth
- optimized priority-train builders for lower-cost fine-tuning
Current snapshot:
- canonical corpus size: `56,000` rows
- current audited rows on the rebuilt corpus: `56,000`
- failed audited rows: `0`
- stable audited families: `56/56`
- global family-depth floor: `1000`
- the active corpus is now fully audited end-to-end
- all `13` dataset categories are currently at `A+`
- unique semantic keys: `55,100`
- semantic uniqueness rate: `98.39%`
Reference reports:
- [Current dataset status](https://github.com/pomazanbohdan/llm-model-rust-test/blob/main/reports/current-dataset-status.md)
- [Current family depth](https://github.com/pomazanbohdan/llm-model-rust-test/blob/main/reports/current-family-depth.md)
- [Current dataset diversity](https://github.com/pomazanbohdan/llm-model-rust-test/blob/main/reports/current-dataset-diversity.md)
## Category mix
| Category | Count |
| --- | ---: |
| api_refactor | 4000 |
| async_concurrency_fix | 6000 |
| cargo_workspace_fix | 5000 |
| clippy_fmt_cleanup | 3000 |
| compile_repair | 5000 |
| doctest_doc_fix | 3000 |
| edition2024_migration | 6000 |
| macro_fix | 3000 |
| review_preference | 1000 |
| rust_qa | 3000 |
| semantic_impl | 8000 |
| test_driven_bugfix | 4000 |
| unsafe_ffi_fix | 5000 |
## Unsloth compatibility
Use `messages` as the conversation field. If a UI asks for pairwise mapping, use `prompt` and `completion`.
## Loading in Unsloth
This dataset is structured to work with ChatML-style training flows.
Typical loading path:
1. Load the dataset from Hugging Face.
2. Use the `messages` column as the chat conversation field.
3. Keep `prompt` and `completion` only as convenience columns for inspection or alternative trainers.
Example with `datasets`:
```python
from datasets import load_dataset
dataset = load_dataset(
"pomazanbohdan/rustforge-personal-rust-dataset",
split="train",
)
print(dataset.column_names)
print(dataset[0]["messages"])
```
If your Unsloth or trainer UI expects an instruction-response mapping instead of chat messages, use:
- input: `prompt`
- output: `completion`
## Repository Layout
The canonical implementation lives in:
- [repository root](https://github.com/pomazanbohdan/llm-model-rust-test)
- [`hf-source/`](https://github.com/pomazanbohdan/llm-model-rust-test/tree/main/hf-source)
- [`scripts/`](https://github.com/pomazanbohdan/llm-model-rust-test/tree/main/scripts)
- [`reports/`](https://github.com/pomazanbohdan/llm-model-rust-test/tree/main/reports)
提供机构:
pomazanbohdan



