kd13/stack-v2-mini

Source: Hugging Face | Updated: 2026-04-28 | Indexed: 2026-05-03
Download link:
https://hf-mirror.com/datasets/kd13/stack-v2-mini
Resource description:
---
license: bigcode-openrail-m
task_categories:
- text-generation
- fill-mask
language:
- en
pretty_name: mini-multicode
size_categories:
- 100K<n<1M
---

# Stack v2 Clean: 200K Multi-Language Code Subset

A cleaned and filtered subset of [bigcode/the-stack-v2](https://huggingface.co/datasets/bigcode/the-stack-v2).

## Dataset Summary

This dataset was assembled to support fine-tuning models such as ModernBERT, Llama, Qwen on programming language data. It provides a balanced, deduplicated, and filtered collection of real-world source files drawn from public repositories indexed by Software Heritage and curated by the BigCode project.

| Language    | Files       |
|-------------|-------------|
| Python      | 40,001      |
| JavaScript  | 40,001      |
| Java        | 40,001      |
| C++         | 40,001      |
| Go          | 40,001      |
| **Total**   | **200,005** |

## Limitations and Considerations

This dataset is a relatively small sample of the full Stack v2 corpus and is not intended for training large code generation models from scratch. Files retain their original licenses as classified upstream, and users are responsible for verifying license compatibility with their downstream use cases. No personally identifiable information removal pass has been applied beyond what is present in the upstream Stack v2 release; users redistributing derivative artifacts should consider running a PII scrubbing pass such as `bigcode-pii` before publication. Near-duplicate detection (for example, MinHash-based) was not applied and may be beneficial for some training scenarios.

## Source and License

The underlying data originates from `bigcode/the-stack-v2`, which is governed by the BigCode project's terms of use and the original repository licenses of each source file. Users of this derivative dataset must comply with the upstream Stack v2 terms, available on the [original dataset page](https://huggingface.co/datasets/bigcode/the-stack-v2). The `license_type` field is preserved from the upstream dataset to support license-aware filtering.

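
The license-aware filtering mentioned above could look roughly like the following. This is a minimal sketch, not the dataset author's tooling: the `filter_by_license` helper, the `content` field, and the sample rows are hypothetical, and the `license_type` values `"permissive"` and `"no_license"` are assumed from the upstream Stack v2 schema and should be checked against the actual data.

```python
from typing import Dict, Iterable, Iterator

# Assumption: "permissive" is one of the upstream license_type values.
# Inspect the real data before relying on this default.
DEFAULT_ALLOWED = frozenset({"permissive"})


def filter_by_license(
    records: Iterable[Dict[str, str]],
    allowed: Iterable[str] = DEFAULT_ALLOWED,
) -> Iterator[Dict[str, str]]:
    """Yield only the records whose license_type is in `allowed`."""
    allowed_set = set(allowed)
    for record in records:
        # Records without a license_type field are dropped, not kept.
        if record.get("license_type") in allowed_set:
            yield record


# Hypothetical in-memory rows standing in for dataset records.
sample_rows = [
    {"content": "print('hi')", "license_type": "permissive"},
    {"content": "int main() {}", "license_type": "no_license"},
    {"content": "package main", "license_type": "permissive"},
]

kept_rows = list(filter_by_license(sample_rows))
print(len(kept_rows))  # prints 2
```

When the data is loaded with the Hugging Face `datasets` library, the same predicate can be passed to `Dataset.filter` instead of iterating by hand.
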
## Citation

If you use this dataset, please cite the original Stack v2 release:

```
@article{lozhkov2024starcoder,
  title={StarCoder 2 and The Stack v2: The Next Generation},
  author={Lozhkov, Anton and others},
  journal={arXiv preprint arXiv:2402.19173},
  year={2024}
}
```

Provided by:
kd13