Noushad999/ML-1M-Syntax-Validated-Python-Code

Name: Noushad999/ML-1M-Syntax-Validated-Python-Code
Creator: Noushad999
Published: 2026-01-18 21:43:43
License: No description provided

Hugging Face | Updated 2026-01-18 | Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/Noushad999/ML-1M-Syntax-Validated-Python-Code


Resource description:

---
dataset_info:
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 3570206109
    num_examples: 1014742
  download_size: 1511077874
  dataset_size: 3570206109
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
license: mit
task_categories:
- text-generation
language:
- en
tags:
- code
- code-generation
- machine-learning
- python
- artificial-intelligence
size_categories:
- 1M<n<10M
pretty_name: ML-1M Syntax-Validated Python Code
---

# ML-1M Syntax-Validated Python Code

## Dataset Summary

**ML-1M Syntax-Validated Python Code** is a large-scale corpus containing over **1 million machine-learning-oriented Python programs** derived from *The Stack*, a permissively licensed collection of open-source source code.

The dataset is constructed through **heuristic ML-domain filtering**, **syntactic validation**, and **basic safety checks**. It is intended to support **empirical analysis of real-world ML code**, **executability and dependency studies**, and the **training or evaluation of code-generation models** that must operate on realistic, imperfect software artifacts.

> ⚠️ **Important:** This dataset does **not** guarantee runtime executability, semantic correctness, or environment completeness.

---

## Motivation

Publicly shared machine learning code frequently fails to run “out of the box” due to missing dependencies, implicit project context, version drift, or environment-specific assumptions. Despite this, many existing code datasets implicitly assume executability or do not explicitly characterize such limitations.

This dataset is designed to reflect **in-the-wild ML Python code** and to enable research on:

- software executability
- dependency fragility
- reproducibility challenges
- realistic machine-learning code generation

---

## Source Data

### Primary Source

- **The Stack (Python subset)**: a large-scale corpus of permissively licensed source code (Kocetkov et al., 2022)

### Processing

- Files are filtered, transformed, and reformatted.
- No new code is authored; all programs originate from public repositories.

---

## Dataset Structure

Each record in the dataset follows a lightweight instruction-style JSON schema:

- **instruction**: A generic placeholder string included solely for compatibility with instruction-tuning pipelines.
- **input**: Typically empty; retained for schema consistency.
- **output**: Raw Python source code corresponding to a machine-learning-related program.

The dataset should be treated primarily as a **code corpus**, rather than as a curated instruction-following benchmark.

---

## Intended Uses

This dataset is suitable for the following research and development purposes:

- Empirical software engineering studies of machine-learning Python code
- Measurement of executability, dependency-related failures, and runtime error modes
- Analysis of machine-learning software ecosystems at scale
- Training or evaluation of code-generation models on realistic, in-the-wild ML code

---

## Limitations

The dataset intentionally reflects real-world ML software practice and therefore has the following limitations:

- Programs are **not guaranteed to execute successfully**
- Dependencies are **not resolved or version-pinned**
- Local or project-specific imports may be present
- External resources such as datasets, GPUs, credentials, or configuration files may be implicitly assumed

These limitations are intrinsic to the source material and are preserved to maintain realism.

---

## Licensing

This dataset is derived from **The Stack**, which aggregates permissively licensed open-source source code. No additional licensing restrictions are imposed beyond those inherited from the original sources.

Users are responsible for complying with the licenses of individual code fragments when redistributing, modifying, or executing the contents of this dataset.

---

## Citation

If you use this dataset in academic research, tooling, or model training, please cite both **The Stack** and this derived corpus.

---

### BibTeX

```bibtex
@article{kocetkov2022stack,
  title   = {The Stack: 3 TB of permissively licensed source code},
  author  = {Kocetkov, Denis and Li, Raymond and Ben Allal, Loubna and others},
  journal = {arXiv preprint arXiv:2211.15533},
  year    = {2022}
}

@dataset{ramim2026ml1m,
  title     = {ML-1M Syntax-Validated Python Code},
  author    = {Ramim, Md Noushad Jahan},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/datasets/Noushad999/ML-1M-Syntax-Validated-Python-Code}
}
```

## Maintainer

Md Noushad Jahan Ramim
University of Asia Pacific
contactwithnoushad@gmail.com

Provider:

Noushad999
