
Cyrile/dataset-the-stack-v2-dedup-sub

Source: Hugging Face. Updated 2025-04-01. Indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/Cyrile/dataset-the-stack-v2-dedup-sub
Resource description:
---
dataset_info:
- config_name: C
  features:
  - name: blob_id
    dtype: string
  - name: directory_id
    dtype: string
  - name: path
    dtype: string
  - name: content_id
    dtype: string
  - name: detected_licenses
    sequence: string
  - name: license_type
    dtype: string
  - name: repo_name
    dtype: string
  - name: snapshot_id
    dtype: string
  - name: revision_id
    dtype: string
  - name: branch_name
    dtype: string
  - name: visit_date
    dtype: timestamp[ns]
  - name: revision_date
    dtype: timestamp[ns]
  - name: committer_date
    dtype: timestamp[ns]
  - name: github_id
    dtype: int64
  - name: star_events_count
    dtype: int64
  - name: fork_events_count
    dtype: int64
  - name: gha_license_id
    dtype: string
  - name: gha_event_created_at
    dtype: timestamp[ns]
  - name: gha_created_at
    dtype: timestamp[ns]
  - name: gha_language
    dtype: string
  - name: src_encoding
    dtype: string
  - name: language
    dtype: string
  - name: is_vendor
    dtype: bool
  - name: is_generated
    dtype: bool
  - name: length_bytes
    dtype: int64
  - name: extension
    dtype: string
  - name: filename
    dtype: string
  - name: content
    dtype: string
  splits:
  - name: train
    num_bytes: 64658205790
    num_examples: 6022485
  download_size: 20685881003
  dataset_size: 64658205790
- config_name: C++
  features:
  - name: blob_id
    dtype: string
  - name: directory_id
    dtype: string
  - name: path
    dtype: string
  - name: content_id
    dtype: string
  - name: detected_licenses
    sequence: string
  - name: license_type
    dtype: string
  - name: repo_name
    dtype: string
  - name: snapshot_id
    dtype: string
  - name: revision_id
    dtype: string
  - name: branch_name
    dtype: string
  - name: visit_date
    dtype: timestamp[ns]
  - name: revision_date
    dtype: timestamp[ns]
  - name: committer_date
    dtype: timestamp[ns]
  - name: github_id
    dtype: int64
  - name: star_events_count
    dtype: int64
  - name: fork_events_count
    dtype: int64
  - name: gha_license_id
    dtype: string
  - name: gha_event_created_at
    dtype: timestamp[ns]
  - name: gha_created_at
    dtype: timestamp[ns]
  - name: gha_language
    dtype: string
  - name: src_encoding
    dtype: string
  - name: language
    dtype: string
  - name: is_vendor
    dtype: bool
  - name: is_generated
    dtype: bool
  - name: length_bytes
    dtype: int64
  - name: extension
    dtype: string
  - name: filename
    dtype: string
  - name: content
    dtype: string
  splits:
  - name: train
    num_bytes: 105532622911
    num_examples: 12003027
  download_size: 41089347301
  dataset_size: 105532622911
- config_name: Java
  features:
  - name: blob_id
    dtype: string
  - name: directory_id
    dtype: string
  - name: path
    dtype: string
  - name: content_id
    dtype: string
  - name: detected_licenses
    sequence: string
  - name: license_type
    dtype: string
  - name: repo_name
    dtype: string
  - name: snapshot_id
    dtype: string
  - name: revision_id
    dtype: string
  - name: branch_name
    dtype: string
  - name: visit_date
    dtype: timestamp[ns]
  - name: revision_date
    dtype: timestamp[ns]
  - name: committer_date
    dtype: timestamp[ns]
  - name: github_id
    dtype: int64
  - name: star_events_count
    dtype: int64
  - name: fork_events_count
    dtype: int64
  - name: gha_license_id
    dtype: string
  - name: gha_event_created_at
    dtype: timestamp[ns]
  - name: gha_created_at
    dtype: timestamp[ns]
  - name: gha_language
    dtype: string
  - name: src_encoding
    dtype: string
  - name: language
    dtype: string
  - name: is_vendor
    dtype: bool
  - name: is_generated
    dtype: bool
  - name: length_bytes
    dtype: int64
  - name: extension
    dtype: string
  - name: filename
    dtype: string
  - name: content
    dtype: string
  splits:
  - name: train
    num_bytes: 96702367410
    num_examples: 23840009
  download_size: 35433610745
  dataset_size: 96702367410
- config_name: JavaScript
  features:
  - name: blob_id
    dtype: string
  - name: directory_id
    dtype: string
  - name: path
    dtype: string
  - name: content_id
    dtype: string
  - name: detected_licenses
    sequence: string
  - name: license_type
    dtype: string
  - name: repo_name
    dtype: string
  - name: snapshot_id
    dtype: string
  - name: revision_id
    dtype: string
  - name: branch_name
    dtype: string
  - name: visit_date
    dtype: timestamp[ns]
  - name: revision_date
    dtype: timestamp[ns]
  - name: committer_date
    dtype: timestamp[ns]
  - name: github_id
    dtype: int64
  - name: star_events_count
    dtype: int64
  - name: fork_events_count
    dtype: int64
  - name: gha_license_id
    dtype: string
  - name: gha_event_created_at
    dtype: timestamp[ns]
  - name: gha_created_at
    dtype: timestamp[ns]
  - name: gha_language
    dtype: string
  - name: src_encoding
    dtype: string
  - name: language
    dtype: string
  - name: is_vendor
    dtype: bool
  - name: is_generated
    dtype: bool
  - name: length_bytes
    dtype: int64
  - name: extension
    dtype: string
  - name: filename
    dtype: string
  - name: content
    dtype: string
  splits:
  - name: train
    num_bytes: 247875047930
    num_examples: 22850027
  download_size: 102329028722
  dataset_size: 247875047930
- config_name: Python
  features:
  - name: blob_id
    dtype: string
  - name: directory_id
    dtype: string
  - name: path
    dtype: string
  - name: content_id
    dtype: string
  - name: detected_licenses
    sequence: string
  - name: license_type
    dtype: string
  - name: repo_name
    dtype: string
  - name: snapshot_id
    dtype: string
  - name: revision_id
    dtype: string
  - name: branch_name
    dtype: string
  - name: visit_date
    dtype: timestamp[ns]
  - name: revision_date
    dtype: timestamp[ns]
  - name: committer_date
    dtype: timestamp[ns]
  - name: github_id
    dtype: int64
  - name: star_events_count
    dtype: int64
  - name: fork_events_count
    dtype: int64
  - name: gha_license_id
    dtype: string
  - name: gha_event_created_at
    dtype: timestamp[ns]
  - name: gha_created_at
    dtype: timestamp[ns]
  - name: gha_language
    dtype: string
  - name: src_encoding
    dtype: string
  - name: language
    dtype: string
  - name: is_vendor
    dtype: bool
  - name: is_generated
    dtype: bool
  - name: length_bytes
    dtype: int64
  - name: extension
    dtype: string
  - name: filename
    dtype: string
  - name: content
    dtype: string
  splits:
  - name: train
    num_bytes: 83396371824
    num_examples: 18065153
  download_size: 34470948501
  dataset_size: 83396371824
configs:
- config_name: C
  data_files:
  - split: train
    path: C/train-*
- config_name: C++
  data_files:
  - split: train
    path: C++/train-*
- config_name: Java
  data_files:
  - split: train
    path: Java/train-*
- config_name: JavaScript
  data_files:
  - split: train
    path: JavaScript/train-*
- config_name: Python
  data_files:
  - split: train
    path: Python/train-*
license: other
task_categories:
- text-generation
tags:
- Python
- Java
- JavaScript
- C/C++
size_categories:
- 10M<n<100M
---

# The Stack v2 Subset with File Contents (Python, Java, JavaScript, C, C++)

**TempestTeam/dataset-the-stack-v2-dedup-sub**

## Dataset Summary

This dataset is a **language-filtered and self-contained subset** of [bigcode/the-stack-v2-dedup](https://huggingface.co/datasets/bigcode/the-stack-v2-dedup), part of the BigCode Project.

It contains only files written in the following programming languages:

- **Python** 🐍
- **Java** ☕
- **JavaScript** 📜
- **C** ⚙️
- **C++** ⚙️

Unlike the original dataset, which only includes metadata and Software Heritage IDs, **this subset includes the actual file contents**, enabling out-of-the-box training and analysis of code models, without requiring SWH downloads or AWS credentials.

---

## Use Cases

This dataset is intended for:

- Pretraining or fine-tuning Code LLMs on high-quality and permissively licensed code
- Language-specific evaluation or benchmarking
- Research on code representation, generation, or completion across the 5 selected languages

---

## How to Use

```python
from datasets import load_dataset

ds = load_dataset(
    "TempestTeam/dataset-the-stack-v2-dedup-sub",
    name="Python",
    split="train",
    streaming=True
)
```

---

## Dataset Structure

Each example in the dataset contains the following fields (inherited from the original Stack v2):

- `content` (string): **The full file content**, decoded in UTF-8
- `language` (string): Programming language of the file (detected by go-enry / linguist)
- `path` (string): File path within the repository
- `repo_name` (string): Repository name on GitHub
- `detected_licenses` (list of strings): SPDX license identifiers
- `license_type` (string): License type: `permissive` or `no_license`
- `is_vendor` (bool): Whether the file is from a dependency
- `is_generated` (bool): Whether the file is detected as generated
- `length_bytes` (int): File size in bytes
- Plus other metadata like:
  - `blob_id`, `directory_id`, `revision_id`, `snapshot_id`, `visit_date`, `committer_date`
  - GitHub metadata: `github_id`, `gha_language`, `gha_license_id`, etc.

---

## Source Dataset

This dataset is derived from:

👉 [bigcode/the-stack-v2-dedup](https://huggingface.co/datasets/bigcode/the-stack-v2-dedup)

The full Stack v2 dataset is built from the Software Heritage archive and GitHub Archive metadata, and spans 600+ programming languages. This subset narrows the focus to 5 popular languages while **retaining full file content**.

---

## Curation Rationale

The five selected languages are among the most widely used in open-source projects and code LLM research. By focusing on this curated set, we reduce dataset size, eliminate irrelevant files, and speed up experimentation while preserving linguistic diversity and utility for real-world applications.

---

## License and Legal Considerations

- Only **permissively licensed** files (or those with no license) are included.
- Licensing information is provided at file level.
- The dataset may contain personal information (e.g., emails, keys) present in public repositories. Sensitive data has been reduced via deduplication but may still exist.
- Usage must comply with the original license of each file.

To request the removal of your code, refer to the [BigCode opt-out process](https://github.com/bigcode-project/bigcode-dataset#how-to-remove-your-code-from-the-dataset).

---

## Citation

If you use this dataset, please cite the original Stack v2 paper:

```bibtex
@misc{lozhkov2024starcoder,
  title={StarCoder 2 and The Stack v2: The Next Generation},
  author={Anton Lozhkov and Raymond Li and Loubna Ben Allal and others},
  year={2024},
  eprint={2402.19173},
  archivePrefix={arXiv},
  primaryClass={cs.SE}
}
```
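
## Example: Filtering on Metadata Fields

The fields listed under Dataset Structure (`license_type`, `is_vendor`, `is_generated`, `length_bytes`) lend themselves to a simple per-example filter. The following is a minimal sketch, not part of the original dataset card: the helper names `keep_example` and `filter_examples` and the `max_bytes` threshold are hypothetical choices made for illustration, and the sample records are made up so the snippet runs without downloading anything.

```python
from typing import Any, Dict, Iterable, Iterator


def keep_example(example: Dict[str, Any], max_bytes: int = 1_000_000) -> bool:
    """Return True for permissively licensed, first-party, hand-written files.

    `max_bytes` is an illustrative cutoff (an assumption, not a value
    recommended by the dataset authors).
    """
    return (
        example.get("license_type") == "permissive"
        and not example.get("is_vendor", False)
        and not example.get("is_generated", False)
        and example.get("length_bytes", 0) <= max_bytes
    )


def filter_examples(
    examples: Iterable[Dict[str, Any]], max_bytes: int = 1_000_000
) -> Iterator[Dict[str, Any]]:
    """Lazily yield only the examples accepted by `keep_example`."""
    for example in examples:
        if keep_example(example, max_bytes):
            yield example


# Made-up records that mimic a few of the dataset's columns.
sample = [
    {"path": "a.py", "license_type": "permissive",
     "is_vendor": False, "is_generated": False, "length_bytes": 120},
    {"path": "b.py", "license_type": "no_license",
     "is_vendor": False, "is_generated": False, "length_bytes": 80},
    {"path": "vendor/c.py", "license_type": "permissive",
     "is_vendor": True, "is_generated": False, "length_bytes": 300},
    {"path": "d_pb2.py", "license_type": "permissive",
     "is_vendor": False, "is_generated": True, "length_bytes": 5000},
]

kept = [ex["path"] for ex in filter_examples(sample)]
print(kept)
```

Because the filter is a generator over plain dictionaries, it should also work on the streaming dataset returned by `load_dataset(..., streaming=True)`, either by wrapping the dataset in `filter_examples(ds)` or by passing the predicate to the dataset's own `filter` method (`ds.filter(keep_example)`).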
Provider:
Cyrile