
eth_py150_open

ModelScope community (魔搭社区) · Updated 2025-12-05 · Indexed 2025-07-12
Download link:
https://modelscope.cn/datasets/google-research-datasets/eth_py150_open
Description:
# Dataset Card for ethpy150open

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** https://www.sri.inf.ethz.ch/py150
- **Repository:** https://github.com/google-research-datasets/eth_py150_open
- **Paper:** https://proceedings.icml.cc/static/paper_files/icml/2020/5401-Paper.pdf
- **Leaderboard:** None
- **Point of Contact:** Aditya Kanade <kanade@iisc.ac.in>, Petros Maniatis <maniatis@google.com>

### Dataset Summary

A redistributable subset of the [ETH Py150 corpus](https://www.sri.inf.ethz.ch/py150), introduced in the ICML 2020 paper ['Learning and Evaluating Contextual Embedding of Source Code'](https://proceedings.icml.cc/static/paper_files/icml/2020/5401-Paper.pdf)

### Supported Tasks and Leaderboards

[More Information Needed]

### Languages

English

## Dataset Structure

List of dicts of

    {
        "filepath": The relative URL containing the path to the file on GitHub
        "license": The license used for that specific file or repository
    }

### Data Instances

    {
        "filepath": "0rpc/zerorpc-python/setup.py",
        "license": "mit"
    },
    {
        "filepath": "0rpc/zerorpc-python/zerorpc/heartbeat.py",
        "license": "mit"
    },

### Data Fields

- `filepath`: The relative URL containing the path to the file on GitHub
- `license`: The license used for that specific file or repository

### Data Splits

|               | Train | Valid | Test  |
| ------------- | ----- | ----- | ----- |
| Dataset Split | 74749 | 8302  | 41457 |

## Dataset Creation

The original dataset is at https://www.sri.inf.ethz.ch/py150

### Curation Rationale

To generate a more redistributable version of the dataset

### Source Data

#### Initial Data Collection and Normalization

All the urls are filepaths relative to GitHub and the master branch was used as available at the time

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

Apache License 2.0

### Citation Information

    @inproceedings{kanade2020learning,
      title={Learning and Evaluating Contextual Embedding of Source Code},
      author={Kanade, Aditya and Maniatis, Petros and Balakrishnan, Gogul and Shi, Kensen},
      booktitle={International Conference on Machine Learning},
      pages={5110--5121},
      year={2020},
      organization={PMLR}
    }

### Contributions

Thanks to [@Bharat123rox](https://github.com/Bharat123rox) for adding this dataset.

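The record layout under Dataset Structure and the path convention under Initial Data Collection and Normalization can be illustrated with a minimal Python sketch. It is not part of the official tooling. It reuses the two sample records and the split sizes quoted in the card. The helper names (`to_raw_url`, `license_counts`, `split_fractions`) are hypothetical, and the sketch assumes that each `filepath` has the shape `<owner>/<repo>/<path inside repo>` and that the file is still reachable on a branch named `master`, which may no longer hold for every repository.

```python
import json
from collections import Counter

# Two sample records copied from the "Data Instances" section of the card.
SAMPLE_MANIFEST = """
[
  {"filepath": "0rpc/zerorpc-python/setup.py", "license": "mit"},
  {"filepath": "0rpc/zerorpc-python/zerorpc/heartbeat.py", "license": "mit"}
]
"""

# Split sizes copied from the "Data Splits" table.
SPLIT_SIZES = {"train": 74749, "valid": 8302, "test": 41457}

# Host that serves raw file contents for public GitHub repositories.
RAW_BASE = "https://raw.githubusercontent.com"


def to_raw_url(filepath: str, branch: str = "master") -> str:
    """Build a raw-content URL from a manifest filepath.

    ASSUMPTION: filepath is '<owner>/<repo>/<path inside repo>', as in the
    sample records, and the file still exists on the given branch.
    """
    owner, repo, path = filepath.split("/", 2)
    return f"{RAW_BASE}/{owner}/{repo}/{branch}/{path}"


def license_counts(records):
    """Count how many records carry each license identifier."""
    return Counter(record["license"] for record in records)


def split_fractions(sizes):
    """Return each split's share of the total number of files."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


records = json.loads(SAMPLE_MANIFEST)

for record in records:
    print(record["license"], to_raw_url(record["filepath"]))

print(dict(license_counts(records)))

for name, fraction in split_fractions(SPLIT_SIZES).items():
    print(f"{name}: {fraction:.1%}")
```

Since the manifest stores only paths and licenses, any downstream use has to fetch the file contents separately. Filtering on the `license` field before fetching is one way to keep only files whose terms suit a given use.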
Provider: maas
Created: 2025-07-07