youmyron/bits-py-dataset

Name: youmyron/bits-py-dataset
Creator: youmyron
Published: 2026-03-30 08:55:49
License: No description provided

Hugging Face · Updated 2026-03-30 · Indexed 2026-04-12

Download link:

https://hf-mirror.com/datasets/youmyron/bits-py-dataset

Resource description:

---
pretty_name: bits-py canonical dataset
license: mit
language:
- en
tags:
- python
- code
- instruction-following
- autotrain
- lora
- dataset
- fastapi
- pandas
- numpy
- machine-learning
task_categories:
- text-generation
size_categories:
- 10K<n<100K
configs:
- config_name: default
  data_files:
  - split: train
    path: train.jsonl
  - split: validation
    path: validation.jsonl
  - split: test
    path: test.jsonl
---

# bits-py canonical dataset

Revision label: `v2026-03-30-r2`

This is the first published Hugging Face dataset revision for the `bits-py` adapter project. It packages the canonical supervised fine-tuning corpus used to train a lightweight Python-specialist LoRA on top of `deepseek-ai/DeepSeek-R1-Distill-Qwen-32B`.

## Project goal

Train a lightweight adapter that improves the base model on practical Python work, especially:

- clean Pythonic code
- data pipelines
- pandas / numpy transforms
- FastAPI endpoints
- async workflows
- ML training and inference scripts

## Repo

- Hugging Face dataset repo: `youmyron/bits-py-dataset`
- URL: <https://huggingface.co/datasets/youmyron/bits-py-dataset>
- Published revision label for this package: `v2026-03-30-r2`

## Intended use

- primary use: supervised fine-tuning (`llm-sft`)
- preferred trainer flow: Hugging Face AutoTrain Advanced
- chat template: `tokenizer`
- expected text column: `messages`
- intended downstream model family: `deepseek-ai/DeepSeek-R1-Distill-Qwen-32B`
- intended artifact: lightweight LoRA/QLoRA-style adapter for fast model swapping

## Dataset layout

```text
README.md
manifest.json
autotrain-config.yaml
train.jsonl
validation.jsonl
test.jsonl
```

## Split counts

- train: **8143**
- validation: **1003**
- test: **962**
- total: **10108**

## Source mix

### By source kind

- raw: **10000** examples (**98.9%**)
- curated: **108** examples (**1.1%**)

### By source file

- `python_code_instructions_18k_alpaca.jsonl`: **5000**
- `self_oss_instruct_sc2_exec_filter_50k.jsonl`: **5000**
- `domain_gap_boost_v1.jsonl` (curated): **108**

Ignored during canonical build:

- `codefeedback_filtered_instruction.jsonl` (empty)
- `magicoder_oss_instruct_75k.jsonl` (empty)

## Topic coverage

- `general_python`: **7459** (**73.8%**)
- `pandas_numpy`: **1506** (**14.9%**)
- `data_pipelines`: **1357** (**13.4%**)
- `ml_scripts`: **483** (**4.8%**)
- `async_fastapi`: **77** (**0.8%**)

Notes:

- examples can carry more than one topic label, so topic totals can exceed the dataset total
- the curated augmentation was added to improve domain coverage for async/FastAPI, pandas/numpy, and ML workflow tasks

## Schema

Each JSONL row contains:

```json
{
  "id": "python_code_instructions_18k_alpaca:17",
  "messages": [
    {"role": "user", "content": "..."},
    {"role": "assistant", "content": "..."}
  ],
  "split": "train",
  "topics": ["pandas_numpy"],
  "source_file": "python_code_instructions_18k_alpaca.jsonl",
  "source_record": 17,
  "metadata": {
    "raw_source_file": "python_code_instructions_18k_alpaca.jsonl",
    "raw_record": 17,
    "source_format": "sharegpt",
    "canonical_split": "train"
  }
}
```

Validation rules used for packaging:

- required fields: `id`, `split`, `messages`, `metadata`, `source`, `topics`
- first message role must be `user`
- last message role must be `assistant`
- message contents must be non-empty
- ids must be unique across splits
- packaged split counts must match the canonical manifest

## Data preparation notes

- split assignment is deterministic
- exact duplicate removal removed **24** records
- near-duplicate removal removed **0** records
- canonical build seed: `42`
- target validation ratio: `0.1`
- target test ratio: `0.1`

## Limitations

- this is a narrow task-oriented code dataset, not a general reasoning benchmark
- topic balance is intentionally uneven; `general_python` dominates the corpus
- async/FastAPI coverage improved with curated augmentation but remains relatively small
- records come from a mixture of harvested and curated instruction data, so style and difficulty are not perfectly uniform
- the dataset is intended for adapter training and internal evaluation, not as a claim of broad software engineering completeness or safety
- outputs generated from models trained on this dataset still require normal code review, testing, and security scrutiny before production use

## AutoTrain mapping

Recommended AutoTrain dataset settings:

- task/trainer: `llm-sft`
- train split: `train`
- valid split: `validation`
- chat template: `tokenizer`
- column mapping: `text_column -> messages`

The packaged `autotrain-config.yaml` in this repo matches the intended lightweight adapter training flow.

## Revisioning

This package corresponds to canonical dataset revision `v2026-03-30-r2`.

For future revisions:

1. rebuild canonical data with `python3 scripts/build_canonical_dataset.py`
2. package the next revision with `python3 scripts/package_hf_dataset.py --dataset-version <next-version>`
3. validate with `python3 scripts/package_hf_dataset.py --dataset-version <next-version> --dry-run`
4. upload the new package contents to the same dataset repo
5. update this card so the revision label, counts, source mix, and topic coverage match the new package exactly
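
The packaging validation rules listed under the schema can be approximated with a short standalone check. The sketch below is not the project's `package_hf_dataset.py`; the function names `validate_row`, `validate_rows`, and `load_jsonl` are hypothetical and exist only for illustration. The required-field list is copied from the card as written. The card's example row shows `source_file` rather than `source`, so the list is left overridable instead of guessing which one is intended.

```python
import json
from collections import Counter

# Required fields exactly as listed in the card's validation rules.
# Assumption: callers may override this if the real packager differs.
REQUIRED_FIELDS = ("id", "split", "messages", "metadata", "source", "topics")


def validate_row(row, required_fields=REQUIRED_FIELDS):
    """Return a list of human-readable problems for one parsed JSONL row."""
    problems = [f"missing field: {name}" for name in required_fields if name not in row]
    messages = row.get("messages") or []
    if not messages:
        problems.append("messages is empty")
        return problems
    if messages[0].get("role") != "user":
        problems.append("first message role must be 'user'")
    if messages[-1].get("role") != "assistant":
        problems.append("last message role must be 'assistant'")
    if any(not str(m.get("content", "")).strip() for m in messages):
        problems.append("message content must be non-empty")
    return problems


def validate_rows(rows, expected_counts=None, required_fields=REQUIRED_FIELDS):
    """Check every row, id uniqueness across splits, and optional split counts.

    Returns (errors, split_counts).
    """
    errors = []
    seen_ids = set()
    split_counts = Counter()
    for index, row in enumerate(rows):
        for problem in validate_row(row, required_fields):
            errors.append(f"row {index}: {problem}")
        row_id = row.get("id")
        if row_id is not None:
            if row_id in seen_ids:
                errors.append(f"row {index}: duplicate id {row_id!r}")
            seen_ids.add(row_id)
        split_counts[row.get("split")] += 1
    if expected_counts is not None:
        for split, expected in expected_counts.items():
            actual = split_counts.get(split, 0)
            if actual != expected:
                errors.append(f"split {split!r}: expected {expected} rows, found {actual}")
    return errors, dict(split_counts)


def load_jsonl(path):
    """Read a JSONL file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

To check a downloaded copy, one could concatenate `load_jsonl("train.jsonl")`, `load_jsonl("validation.jsonl")`, and `load_jsonl("test.jsonl")`, then call `validate_rows` with `expected_counts={"train": 8143, "validation": 1003, "test": 962}`, the split counts stated in this card.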

Provided by:

youmyron
