the-stack-metadata

Name: the-stack-metadata
Creator: maas
Published: 2025-12-05 11:37:37
License: No description provided

ModelScope Community. Updated 2025-12-05; indexed 2025-11-03.

Download link:

https://modelscope.cn/datasets/bigcode/the-stack-metadata


Resource overview:

# Dataset Card for The Stack Metadata

## Table of Contents

- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Changelog](#changelog)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
- [Dataset Structure](#dataset-structure)
  - [Data Fields](#data-fields)
  - [Usage Example](#usage-example)
- [Dataset Creation](#dataset-creation)
- [Considerations for Using the Data](#considerations-for-using-the-data)
- [Additional Information](#additional-information)
- [Terms of Use for The Stack](#terms-of-use-for-the-stack)

## Dataset Description

- **Homepage:** https://www.bigcode-project.org/
- **Repository:** https://github.com/bigcode-project
- **Paper:** https://arxiv.org/abs/2211.15533
- **Leaderboard:** N/A
- **Point of Contact:** contact@bigcode-project.org

### Changelog

|Release|Description|
|-|-|
|v1.1| First release of the metadata, matching The Stack v1.1|
|v1.2| Metadata dataset matching The Stack v1.2|

### Dataset Summary

This dataset provides additional information about the repositories used for The Stack. It contains file paths, detected licenses, and some other repository-level information.

### Supported Tasks and Leaderboards

The main task is to recreate the repository structure from the files of The Stack. The dataset can also be used to compute statistics and to run custom filtering or aggregation operations on The Stack.

## Dataset Structure

### Data Fields

![set structure](images/structure.png)

The dataset is split into 944 buckets by repository. In addition to the fields shown in the image, `ri` contains `min_repo_event_datetime`, the earliest date and time of an event for a repository after Jan 1 2015.

![set usage](images/usage.png)

As an example of an aggregation operation on The Stack, the image above shows conceptually how the star count (and the issue and PR counts) is selected for a file. Each unique file can be part of multiple repositories, so The Stack releases unique files and aggregates meta information (e.g. stars) from all repositories a file belongs to. For example, `max_stars_count` is the maximum number of stars over all repositories the file is part of.

The metadata allows you to reconstruct repository directory structures. For each repository in the `ri` table, take all of its files from the `fi` table, find them in The Stack by their `hexsha`, and save each file's content under the path that the `fi` table records for that repository. For speed, it is preferable to index The Stack by `hexsha` first.

### Usage Example

Restore the folder structure for the Python files in the numpy repository:

```python
import datasets
from pathlib import Path
from tqdm.auto import tqdm
import pandas as pd

# Assuming the metadata is cloned into the local folder /data/hf_repos/the-stack-metadata,
# The Stack is cloned into the local folder /data/hf_repos/the-stack-v1.1,
# and the destination folder is /repo_workdir/numpy_restored
the_stack_meta_path = Path('/data/hf_repos/the-stack-metadata')
the_stack_path = Path('/data/hf_repos/the-stack-v1.1')
repo_dst_root = Path('/repo_workdir/numpy_restored')
repo_name = 'numpy/numpy'

# Get the bucket with the numpy repo info.
# The commented-out loop searches every bucket; the result is hard-coded below.
# meta_bucket_path = None
# for fn in tqdm(list((the_stack_meta_path/'data').glob('*/ri.parquet'))):
#     df = pd.read_parquet(fn)
#     if any(df['name'] == repo_name):
#         meta_bucket_path = fn.parent  # the bucket folder, not the parquet file
#         break
meta_bucket_path = the_stack_meta_path / 'data/255_944'

# Get the repository id from the repo name
ri_id = pd.read_parquet(
    meta_bucket_path / 'ri.parquet'
).query(
    f'`name` == "{repo_name}"'
)['id'].to_list()[0]

# Get file information for the repository, skipping empty and deleted files
files_info = pd.read_parquet(
    meta_bucket_path / 'fi.parquet'
).query(
    f'`ri_id` == {ri_id} and `size` != 0 and `is_deleted` == False'
)

# Convert the file information into a dictionary keyed by language and then by hexsha.
# A repo can contain more than one file with the same hexsha, so we gather
# all paths per unique hexsha.
files_info_dict = {
    k: v.groupby('hexsha')['path'].apply(list).to_dict()
    for k, v in files_info.groupby('lang_ex')
}

# Load the Python part of The Stack.
# `ignore_verifications` is the argument used in the original card; newer
# `datasets` releases use `verification_mode='no_checks'` instead.
ds = datasets.load_dataset(
    str(the_stack_path/'data/python'),
    num_proc=10, ignore_verifications=True
)

# Save the content of the Python files of the numpy repository in their appropriate locations
def save_file_content(example, files_info_dict, repo_dst_root):
    if example['hexsha'] in files_info_dict:
        for el in files_info_dict[example['hexsha']]:
            path = repo_dst_root / el
            path.parent.mkdir(parents=True, exist_ok=True)
            path.write_text(example['content'])

ds.map(
    save_file_content,
    fn_kwargs={'files_info_dict': files_info_dict['Python'], 'repo_dst_root': repo_dst_root},
    num_proc=10
)
```

## Dataset Creation

Please refer to [the section](https://huggingface.co/datasets/bigcode/the-stack#dataset-creation) in The Stack.

## Considerations for Using the Data

Please refer to [the section](https://huggingface.co/datasets/bigcode/the-stack#considerations-for-using-the-data) in The Stack.

## Additional Information

Please refer to [the section](https://huggingface.co/datasets/bigcode/the-stack#additional-information) in The Stack.

## Terms of Use for The Stack

Please refer to [the section](https://huggingface.co/datasets/bigcode/the-stack#terms-of-use-for-the-stack) in The Stack.
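
As a supplement to the aggregation described under Data Fields, the following is a minimal sketch of how a per-file maximum star count could be derived from the two metadata tables. It uses small in-memory rows instead of the real parquet files so that it runs on its own. The keys `id`, `name`, `ri_id`, `hexsha` and `path` follow the usage example in the card; the `stars` key, the repository names and the hash values are hypothetical placeholders, since the actual star field name is only shown in the structure image.

```python
# Toy stand-in for rows of the `ri` (repository info) table.
# `stars` is a hypothetical field name chosen for illustration.
ri_rows = [
    {"id": 1, "name": "org/a", "stars": 10},
    {"id": 2, "name": "org/b", "stars": 250},
    {"id": 3, "name": "org/c", "stars": 3},
]

# Toy stand-in for rows of the `fi` (file info) table.
# The same hexsha appears in several repositories, as the card describes.
fi_rows = [
    {"ri_id": 1, "hexsha": "aaa", "path": "src/x.py"},
    {"ri_id": 2, "hexsha": "aaa", "path": "lib/x.py"},
    {"ri_id": 1, "hexsha": "bbb", "path": "README.md"},
    {"ri_id": 3, "hexsha": "bbb", "path": "docs/README.md"},
]


def max_stars_per_file(ri_rows, fi_rows):
    """Return {hexsha: max stars over all repositories containing that file}."""
    # Look up each repository's star count by its id.
    stars_by_repo = {row["id"]: row["stars"] for row in ri_rows}

    best = {}
    for row in fi_rows:
        stars = stars_by_repo.get(row["ri_id"])
        if stars is None:
            # File references a repository that is missing from `ri_rows`.
            continue
        sha = row["hexsha"]
        best[sha] = max(best.get(sha, stars), stars)
    return best


max_stars = max_stars_per_file(ri_rows, fi_rows)
print(max_stars)
```

With the real data, the same join would more likely be done in pandas on the `ri.parquet` and `fi.parquet` files of a bucket; the plain-Python form is used here only to keep the sketch dependency-free.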


Provider:

maas

Created:

2025-10-11
