
youmyron/bits-py-dataset

Source: Hugging Face. Updated 2026-03-30; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/youmyron/bits-py-dataset
Resource description:
---
pretty_name: bits-py canonical dataset
license: mit
language:
- en
tags:
- python
- code
- instruction-following
- autotrain
- lora
- dataset
- fastapi
- pandas
- numpy
- machine-learning
task_categories:
- text-generation
size_categories:
- 10K<n<100K
configs:
- config_name: default
  data_files:
  - split: train
    path: train.jsonl
  - split: validation
    path: validation.jsonl
  - split: test
    path: test.jsonl
---

# bits-py canonical dataset

Revision label: `v2026-03-30-r2`

This is the first published Hugging Face dataset revision for the `bits-py` adapter project. It packages the canonical supervised fine-tuning corpus used to train a lightweight Python-specialist LoRA on top of `deepseek-ai/DeepSeek-R1-Distill-Qwen-32B`.

## Project goal

Train a lightweight adapter that improves the base model on practical Python work, especially:

- clean Pythonic code
- data pipelines
- pandas / numpy transforms
- FastAPI endpoints
- async workflows
- ML training and inference scripts

## Repo

- Hugging Face dataset repo: `youmyron/bits-py-dataset`
- URL: <https://huggingface.co/datasets/youmyron/bits-py-dataset>
- Published revision label for this package: `v2026-03-30-r2`

## Intended use

- primary use: supervised fine-tuning (`llm-sft`)
- preferred trainer flow: Hugging Face AutoTrain Advanced
- chat template: `tokenizer`
- expected text column: `messages`
- intended downstream model family: `deepseek-ai/DeepSeek-R1-Distill-Qwen-32B`
- intended artifact: lightweight LoRA/QLoRA-style adapter for fast model swapping

## Dataset layout

```text
README.md
manifest.json
autotrain-config.yaml
train.jsonl
validation.jsonl
test.jsonl
```

## Split counts

- train: **8143**
- validation: **1003**
- test: **962**
- total: **10108**

## Source mix

### By source kind

- raw: **10000** examples (**98.9%**)
- curated: **108** examples (**1.1%**)

### By source file

- `python_code_instructions_18k_alpaca.jsonl`: **5000**
- `self_oss_instruct_sc2_exec_filter_50k.jsonl`: **5000**
- `domain_gap_boost_v1.jsonl` (curated): **108**

Ignored during canonical build:

- `codefeedback_filtered_instruction.jsonl` (empty)
- `magicoder_oss_instruct_75k.jsonl` (empty)

## Topic coverage

- `general_python`: **7459** (**73.8%**)
- `pandas_numpy`: **1506** (**14.9%**)
- `data_pipelines`: **1357** (**13.4%**)
- `ml_scripts`: **483** (**4.8%**)
- `async_fastapi`: **77** (**0.8%**)

Notes:

- examples can carry more than one topic label, so topic totals can exceed the dataset total
- the curated augmentation was added to improve domain coverage for async/FastAPI, pandas/numpy, and ML workflow tasks

## Schema

Each JSONL row contains:

```json
{
  "id": "python_code_instructions_18k_alpaca:17",
  "messages": [
    {"role": "user", "content": "..."},
    {"role": "assistant", "content": "..."}
  ],
  "split": "train",
  "topics": ["pandas_numpy"],
  "source_file": "python_code_instructions_18k_alpaca.jsonl",
  "source_record": 17,
  "metadata": {
    "raw_source_file": "python_code_instructions_18k_alpaca.jsonl",
    "raw_record": 17,
    "source_format": "sharegpt",
    "canonical_split": "train"
  }
}
```

Validation rules used for packaging:

- required fields: `id`, `split`, `messages`, `metadata`, `source`, `topics`
- first message role must be `user`
- last message role must be `assistant`
- message contents must be non-empty
- ids must be unique across splits
- packaged split counts must match the canonical manifest

## Data preparation notes

- split assignment is deterministic
- exact duplicate removal removed **24** records
- near-duplicate removal removed **0** records
- canonical build seed: `42`
- target validation ratio: `0.1`
- target test ratio: `0.1`

## Limitations

- this is a narrow task-oriented code dataset, not a general reasoning benchmark
- topic balance is intentionally uneven; `general_python` dominates the corpus
- async/FastAPI coverage improved with curated augmentation but remains relatively small
- records come from a mixture of harvested and curated instruction data, so style and difficulty are not perfectly uniform
- the dataset is intended for adapter training and internal evaluation, not as a claim of broad software engineering completeness or safety
- outputs generated from models trained on this dataset still require normal code review, testing, and security scrutiny before production use

## AutoTrain mapping

Recommended AutoTrain dataset settings:

- task/trainer: `llm-sft`
- train split: `train`
- valid split: `validation`
- chat template: `tokenizer`
- column mapping: `text_column -> messages`

The packaged `autotrain-config.yaml` in this repo matches the intended lightweight adapter training flow.

## Revisioning

This package corresponds to canonical dataset revision `v2026-03-30-r2`.

For future revisions:

1. rebuild canonical data with `python3 scripts/build_canonical_dataset.py`
2. package the next revision with `python3 scripts/package_hf_dataset.py --dataset-version <next-version>`
3. validate with `python3 scripts/package_hf_dataset.py --dataset-version <next-version> --dry-run`
4. upload the new package contents to the same dataset repo
5. update this card so the revision label, counts, source mix, and topic coverage match the new package exactly
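The validation rules listed in the card's Schema section can be approximated by a small standalone checker. The sketch below is not the project's `package_hf_dataset.py`; the function names (`read_jsonl`, `validate_rows`, `check_split_counts`) are hypothetical and exist only for illustration. Two assumptions are worth noting: the required-field set is passed as a parameter, because the card's example row does not show a `source` field even though the rules list one, and the expected split counts are supplied as a plain dict taken from the card, because the layout of `manifest.json` is not described.

```python
import json
from collections import Counter

# Required top-level fields, copied from the card's validation rules.
REQUIRED_FIELDS = ("id", "split", "messages", "metadata", "source", "topics")

# Split counts as stated in the card for revision v2026-03-30-r2.
CARD_SPLIT_COUNTS = {"train": 8143, "validation": 1003, "test": 962}


def read_jsonl(path):
    """Load one JSON object per non-blank line from a JSONL file."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def validate_rows(rows, required=REQUIRED_FIELDS):
    """Return a list of human-readable problems; an empty list means all rows pass."""
    problems = []
    seen_ids = set()
    for index, row in enumerate(rows):
        label = row.get("id", f"row {index}")

        missing = [name for name in required if name not in row]
        if missing:
            problems.append(f"{label}: missing fields {missing}")

        messages = row.get("messages") or []
        if not messages:
            problems.append(f"{label}: no messages")
        else:
            if messages[0].get("role") != "user":
                problems.append(f"{label}: first message role is not user")
            if messages[-1].get("role") != "assistant":
                problems.append(f"{label}: last message role is not assistant")
            for message in messages:
                content = message.get("content")
                if not isinstance(content, str) or not content.strip():
                    problems.append(f"{label}: empty message content")
                    break

        # Rows from all splits are expected in one call, so this check
        # covers uniqueness across splits.
        row_id = row.get("id")
        if row_id is not None:
            if row_id in seen_ids:
                problems.append(f"{label}: duplicate id")
            seen_ids.add(row_id)
    return problems


def check_split_counts(rows, expected):
    """Compare per-split row counts with an expected mapping of split name to count."""
    actual = Counter(row.get("split") for row in rows)
    return [
        f"split {name}: expected {count}, found {actual.get(name, 0)}"
        for name, count in expected.items()
        if actual.get(name, 0) != count
    ]
```

A typical call would concatenate the rows from `train.jsonl`, `validation.jsonl`, and `test.jsonl` via `read_jsonl`, then pass the combined list to `validate_rows` and to `check_split_counts` together with `CARD_SPLIT_COUNTS`. Collecting problems in a list rather than raising on the first failure makes it easier to review every issue in a package at once.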
Provider:
youmyron