five

NickIBrody/python-code-instructions-85k

收藏
Hugging Face2026-04-20 更新2026-04-26 收录
下载链接:
https://hf-mirror.com/datasets/NickIBrody/python-code-instructions-85k
下载链接
链接失效反馈
官方服务:
资源简介:
--- license: other task_categories: - text-generation language: - en tags: - code - python - instruction-tuning - code-generation - alpaca size_categories: - 10K<n<100K pretty_name: Python Code Instructions - 85K Improved Release --- # Python Code Instructions - 85K Instruction-tuning dataset of Python functions paired with short natural-language instructions derived from repository docstrings. ## What changed in this release This release keeps the original public rows and format, but makes the dataset easier to use responsibly: - exact duplicate rows were removed again using normalized `instruction + output` hashing - deterministic `train`, `validation`, and `test` splits were added - the dataset card now documents provenance limitations and licensing risk more explicitly ## Dataset Summary | Field | Value | |---|---| | Source rows downloaded | 85,903 | | Rows after exact deduplication | 85,903 | | Exact duplicates dropped in this pass | 0 | | Train rows | 83,323 | | Validation rows | 1,747 | | Test rows | 833 | | Format | JSONL (`instruction`, `output`, `system`) | | Primary language | English instructions, Python code | ## Format Each example contains: ```json { "instruction": "Format a duration for status output.", "output": "def format_elapsed(seconds: float) -> str:\n ...", "system": "As a Python code expert, you are capable of creating scripts from specifications." } ``` ## Recommended Use - supervised fine-tuning for Python code generation - lightweight instruction-following experiments - data augmentation alongside stronger code corpora with provenance metadata ## Important Limitations - `instruction` values are derived from the first sentence of function docstrings, so this is closer to docstring-to-code supervision than to real user prompts - examples are single functions only and often depend on surrounding repository context that is not included - the `system` field is constant across rows; many training pipelines can inject it in the prompt template instead of consuming it from every sample - validation and test splits are deterministic hash splits, not repository-level decontaminated benchmarks ## Provenance And Licensing This dataset was assembled from public GitHub repositories, but the released rows do not currently include per-example provenance fields such as repository, file path, commit, or source license. Because of that, this release should **not** be treated as a clean MIT-licensed dataset. The repository-level licensing status of individual examples may vary, and downstream users should perform their own legal review before production or commercial use. ## Fields Missing For Serious Research The current release still lacks: - per-example repository / file / commit provenance - per-example source license metadata - repository-level split isolation - contamination analysis against downstream benchmarks ## Split Construction Splits are derived deterministically from a SHA-256 hash of normalized `instruction + output`: - `test`: 1% - `validation`: 2% - `train`: 97% This keeps the split stable across rebuilds from the same released file. ## Basic Stats - average instruction length: 56.18 characters - average output length: 696.41 characters ## Safer Next Steps To make this dataset genuinely strong rather than just serviceable, the next rebuild should add: 1. repository, path, commit, function name, and license fields for each row 2. repository-level deduplication and split assignment 3. explicit filtering for generated code, vendored code, and test fixtures 4. a published extraction script for reproducibility
提供机构:
NickIBrody
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作