kd13/stack-v2-mini

Name: kd13/stack-v2-mini
Creator: kd13
Published: 2026-04-28 12:21:38
License: no description provided

Hugging Face · Updated 2026-04-28 · Indexed 2026-05-03

Download link:

https://hf-mirror.com/datasets/kd13/stack-v2-mini

Description:

---
license: bigcode-openrail-m
task_categories:
- text-generation
- fill-mask
language:
- en
pretty_name: mini-multicode
size_categories:
- 100K<n<1M
---

# Stack v2 Clean - 200K Multi-Language Code Subset

A cleaned and filtered subset of [bigcode/the-stack-v2](https://huggingface.co/datasets/bigcode/the-stack-v2).

## Dataset Summary

This dataset was assembled to support fine-tuning models such as ModernBERT, Llama, and Qwen on programming language data. It provides a balanced, deduplicated, and filtered collection of real-world source files drawn from public repositories indexed by Software Heritage and curated by the BigCode project.

| Language    | Files       |
|-------------|-------------|
| Python      | 40,001      |
| JavaScript  | 40,001      |
| Java        | 40,001      |
| C++         | 40,001      |
| Go          | 40,001      |
| **Total**   | **200,005** |

## Limitations and Considerations

This dataset is a relatively small sample of the full Stack v2 corpus and is not intended for training large code generation models from scratch. Files retain their original licenses as classified upstream, and users are responsible for verifying license compatibility with their downstream use cases.

No personally identifiable information (PII) removal pass has been applied beyond what is present in the upstream Stack v2 release; users redistributing derivative artifacts should consider running a PII scrubbing pass such as `bigcode-pii` before publication.

Near-duplicate detection (for example, MinHash-based) was not applied and may be beneficial for some training scenarios.

## Source and License

The underlying data originates from `bigcode/the-stack-v2`, which is governed by the BigCode project's terms of use and the original repository licenses of each source file. Users of this derivative dataset must comply with the upstream Stack v2 terms, available on the [original dataset page](https://huggingface.co/datasets/bigcode/the-stack-v2). The `license_type` field is preserved from the upstream dataset to support license-aware filtering.
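As an illustration of license-aware filtering on the `license_type` field, here is a minimal sketch that works on plain Python dictionaries. Only `license_type` is named in this card; the `language` and `content` keys, the sample records, and the assumption that permissively licensed files carry the value `"permissive"` (as in the upstream Stack v2 card) are illustrative and should be checked against the actual schema.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple


def filter_by_license(
    records: Iterable[Dict[str, str]],
    allowed: Tuple[str, ...] = ("permissive",),  # assumed upstream value
) -> Iterator[Dict[str, str]]:
    """Yield only records whose `license_type` is in the allowed set."""
    allowed_set = set(allowed)
    for record in records:
        # Records without a `license_type` key are dropped, not guessed at.
        if record.get("license_type") in allowed_set:
            yield record


# Hypothetical rows; real rows come from the dataset itself.
sample = [
    {"language": "Python", "license_type": "permissive", "content": "print('hi')"},
    {"language": "Go", "license_type": "no_license", "content": "package main"},
    {"language": "Java", "license_type": "permissive", "content": "class A {}"},
]

kept = list(filter_by_license(sample))
counts = Counter(record["language"] for record in kept)
print(len(kept), dict(counts))
```

With the Hugging Face `datasets` library, the same predicate could likely be passed to `Dataset.filter` after loading `kd13/stack-v2-mini`; the available split names are not stated in this card, so inspect the loaded object before relying on one.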
## Citation

If you use this dataset, please cite the original Stack v2 release:

```
@article{lozhkov2024starcoder,
  title={StarCoder 2 and The Stack v2: The Next Generation},
  author={Lozhkov, Anton and others},
  journal={arXiv preprint arXiv:2402.19173},
  year={2024}
}
```
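The Limitations section notes that near-duplicate detection (for example, MinHash-based) was not applied. For readers who want to experiment with one, the following is a minimal standard-library sketch of MinHash over token shingles. It is not the procedure used by this dataset or by BigCode; the shingle size, the number of hash functions, and all function names are illustrative choices.

```python
import hashlib
import re
from typing import List, Set


def shingles(text: str, k: int = 3) -> Set[str]:
    """Return the set of k-token shingles of `text` (word-like tokens only)."""
    tokens = re.findall(r"\w+", text)
    if len(tokens) < k:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def minhash_signature(text: str, num_hashes: int = 64, k: int = 3) -> List[int]:
    """Compute a MinHash signature: one minimum hash value per salted hash."""
    items = [s.encode("utf-8") for s in shingles(text, k)]
    signature = []
    for seed in range(num_hashes):
        # A different salt per seed gives a family of independent-looking hashes.
        salt = seed.to_bytes(8, "little")
        signature.append(
            min(
                int.from_bytes(
                    hashlib.blake2b(item, digest_size=8, salt=salt).digest(), "big"
                )
                for item in items
            )
        )
    return signature


def estimated_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """Fraction of matching positions, an estimate of Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


original = "def add(a, b):\n    return a + b\n\nprint(add(1, 2))"
near_copy = "def add(a, b):\n    return a + b\n\nprint(add(1, 3))"
unrelated = 'package main\n\nimport "fmt"\n\nfunc main() { fmt.Println("hello") }'

sig_original = minhash_signature(original)
near_score = estimated_jaccard(sig_original, minhash_signature(near_copy))
far_score = estimated_jaccard(sig_original, minhash_signature(unrelated))
print(near_score, far_score)
```

Pairs whose estimated similarity exceeds a chosen threshold can then be treated as near-duplicate candidates. Comparing every pair is quadratic in the number of files, so at the scale of 200,005 files a locality-sensitive hashing index over the signatures would normally replace the pairwise comparison.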

Provider:

kd13
