Meta Kaggle Code
Source: www.kaggle.com · Updated: 2024-12-03 · Indexed: 2025-03-24
Download link:
https://www.kaggle.com/kaggle/meta-kaggle-code
Description:
# Explore our public notebook content!
Meta Kaggle Code is an extension to our popular [Meta Kaggle](https://www.kaggle.com/datasets/kaggle/meta-kaggle) dataset. This extension contains all the raw source code from hundreds of thousands of public, Apache 2.0 licensed Python and R notebook versions on Kaggle used to analyze Datasets, make submissions to Competitions, and more. This represents nearly a decade of data spanning a period of tremendous evolution in the ways ML work is done.
## Why we’re releasing this dataset
By collecting all of this code created by Kaggle’s community in one dataset, we hope to make it easier for the world to research and share insights about trends in our industry. With the growing significance of AI-assisted development, we expect this data can also be used to fine-tune models for ML-specific code generation tasks.
Meta Kaggle Code is also a continuation of our commitment to open data and research. This new dataset is a companion to [Meta Kaggle](https://www.kaggle.com/datasets/kaggle/meta-kaggle), which we originally released in 2016. On top of Meta Kaggle, our community has shared nearly 1,000 public code examples. Research papers written using Meta Kaggle have [examined](https://arxiv.org/abs/2107.11929) how data scientists collaboratively solve problems, [analyzed](https://proceedings.neurips.cc/paper/2019/hash/ee39e503b6bedf0c98c388b7e8589aca-Abstract.html) overfitting in machine learning competitions, [compared](https://arxiv.org/abs/2006.08334) discussions between Kaggle and Stack Overflow communities, [and more](https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=%22meta+kaggle%22&btnG=).
The best part is that Meta Kaggle enriches Meta Kaggle Code. By joining the datasets together, you can easily understand which competitions code was run against, the progression tier of the code’s author, how many votes a notebook had, what kinds of comments it received, and much, much more. We hope the new potential for uncovering deep insights into how ML code is written feels just as limitless to you as it does to us!
## Sensitive data
While we have made an attempt to filter out notebooks containing potentially sensitive information published by Kaggle users, the dataset may still contain such information. Research, publications, applications, etc. relying on this data should only use or report on publicly available, non-sensitive information.
## Joining with Meta Kaggle
The files contained here are a subset of the `KernelVersions` in Meta Kaggle. The file names match the ids in the `KernelVersions` CSV file. Whereas Meta Kaggle contains data for all interactive and commit sessions, Meta Kaggle Code contains only data for commit sessions.
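As a rough illustration of the join described above, the sketch below matches code files to `KernelVersions` rows by file name. It assumes the id column in the CSV is named `Id` and that each file name (without its extension) is the version id. The sample paths, the `Title` column, and the helper name `join_code_files` are hypothetical and only serve the example.

```python
import csv
import io
from pathlib import PurePosixPath
from typing import Iterable, TextIO


def join_code_files(paths: Iterable[str], kernel_versions_csv: TextIO,
                    id_column: str = "Id") -> list[dict]:
    """Attach a code file path to each KernelVersions row that has one.

    Assumption: the file stem (name without extension) is the version id,
    and the CSV exposes that id in `id_column`.
    """
    # Index the code files by the integer id encoded in the file name.
    path_by_id = {}
    for path in paths:
        stem = PurePosixPath(path).stem
        if stem.isdigit():  # skip anything that is not named after an id
            path_by_id[int(stem)] = path

    # Keep only the rows that have a matching code file (an inner join).
    joined = []
    for row in csv.DictReader(kernel_versions_csv):
        version_id = int(row[id_column])
        if version_id in path_by_id:
            joined.append({**row, "CodePath": path_by_id[version_id]})
    return joined


# Hypothetical sample data standing in for the real files and CSV.
sample_paths = ["123/456/123456789.ipynb", "123/456/123456790.py", "README.md"]
sample_csv = io.StringIO(
    "Id,Title\n"
    "123456789,Example notebook\n"
    "42,Interactive-only version\n"
)
result = join_code_files(sample_paths, sample_csv)
```

Rows without a matching file (for example, versions from interactive sessions) drop out of the result, which mirrors the subset relationship described above. For the full CSV, a dataframe library would likely be more convenient than the standard-library `csv` module used here.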
## File organization
The files are organized into a two-level directory structure. Each top-level folder contains up to 1 million files; for example, folder 123 contains all versions from 123,000,000 to 123,999,999. Each subfolder contains up to 1 thousand files; for example, 123/456 contains all versions from 123,456,000 to 123,456,999. In practice, each folder will have far fewer than 1 thousand files, because versions from private and interactive sessions are not included.
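The folder arithmetic above can be sketched as follows. The function name `version_folder` and the `pad` parameter are hypothetical. Whether the real folder names are zero-padded is not stated here, so the default leaves them unpadded and `pad` is only a convenience to adjust if the actual layout differs.

```python
def version_folder(version_id: int, pad: int = 0) -> str:
    """Return the 'top/sub' folder for a kernel version id.

    The top-level folder is the id divided by one million; the subfolder is
    the thousands group within that million. `pad` optionally zero-pads both
    parts (an assumption; check it against the actual folder names).
    """
    if version_id < 0:
        raise ValueError("version_id must be non-negative")
    top = version_id // 1_000_000          # e.g. 123 for 123,456,789
    sub = (version_id // 1_000) % 1_000    # e.g. 456 for 123,456,789
    return f"{str(top).zfill(pad)}/{str(sub).zfill(pad)}"


example = version_folder(123_456_789)
```

With the example id from the text, `version_folder(123_456_789)` yields `123/456`, matching the range 123,456,000 to 123,456,999.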
The ipynb files in this dataset hosted on Kaggle do not contain the output cells. If the outputs are required, the full set of ipynbs with the outputs embedded can be obtained from this public GCS bucket: `kaggle-meta-kaggle-code-downloads`. Note that this is a "requester pays" bucket. This means you will need a GCP account with billing enabled to download. Learn more here: https://cloud.google.com/storage/docs/requester-pays
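One possible way to read from a requester-pays bucket in Python is the third-party `google-cloud-storage` client, passing your own billing project as `user_project`. The sketch below only lists object names, since the object layout inside the bucket is not described here. The function name `list_objects`, its parameters, and the project id you pass in are assumptions for illustration, and running it requires valid GCP credentials.

```python
BUCKET_NAME = "kaggle-meta-kaggle-code-downloads"


def list_objects(billing_project: str, prefix: str = "", limit: int = 10) -> list[str]:
    """List up to `limit` object names from the requester-pays bucket.

    `billing_project` is the id of your own GCP project with billing enabled;
    requests against the bucket are charged to it.
    """
    # Imported lazily so the module loads even without the library installed
    # (pip install google-cloud-storage).
    from google.cloud import storage

    client = storage.Client(project=billing_project)
    # user_project tells GCS which project to bill for requester-pays access.
    bucket = client.bucket(BUCKET_NAME, user_project=billing_project)
    return [blob.name for blob in bucket.list_blobs(prefix=prefix, max_results=limit)]
```

Once an object name is known, the same `bucket` object can fetch it with `bucket.blob(name).download_to_filename(local_path)`. Listing and downloading both incur charges on the billing project.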
## Questions / Comments
We love feedback! Let us know in the [Discussion](https://www.kaggle.com/datasets/kaggle/meta-kaggle-code/discussion) tab.
Happy Kaggling!
Provider:
Kaggle



