
neural_code_search

ModelScope community (魔搭社区) | Updated 2025-11-27 | Indexed 2025-05-24
Download link:
https://modelscope.cn/datasets/facebook/neural_code_search
Resource description:
# Dataset Card for Neural Code Search

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [facebookresearch / Neural-Code-Search-Evaluation-Dataset](https://github.com/facebookresearch/Neural-Code-Search-Evaluation-Dataset/tree/master/data)
- **Repository:** [GitHub](https://github.com/facebookresearch/Neural-Code-Search-Evaluation-Dataset.git)
- **Paper:** [arXiv](https://arxiv.org/pdf/1908.09804.pdf)

### Dataset Summary

Neural-Code-Search-Evaluation-Dataset is an evaluation dataset of natural language queries paired with code snippets, released in the hope that future work in this area can use it as a common benchmark. We also provide the results of two code search models (NCS, UNIF) from recent work.

### Supported Tasks and Leaderboards

[More Information Needed]

### Languages

EN - English

## Dataset Structure

### Data Instances

#### Search Corpus

The search corpus is indexed using all method bodies parsed from 24,549 GitHub repositories. In total, there are 4,716,814 methods in this corpus. Given a natural language query, a code search model finds relevant code snippets (i.e. method bodies) in this corpus. For each method in the corpus, this release provides the fields listed under Data Fields.

#### Evaluation Dataset

The evaluation dataset is composed of 287 Stack Overflow question and answer pairs.

### Data Fields

#### Search Corpus

- id: Each method in the corpus has a unique numeric identifier. This ID is also referenced in the evaluation dataset.
- filepath: The file path, in the format `:owner/:repo/relative-file-path-to-the-repo`.
- method_name: Name of the method.
- start_line: Starting line number of the method in the file.
- end_line: Ending line number of the method in the file.
- url: GitHub link to the method body, with the commit ID and line numbers encoded.

#### Evaluation Dataset

- stackoverflow_id: Stack Overflow post ID.
- question: Title of the Stack Overflow post.
- question_url: URL of the Stack Overflow post.
- answer: Code snippet answer to the question.

### Data Splits

[More Information Needed]

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

The most popular Android repositories on GitHub (ranked by number of stars) are used to create the search corpus. For each indexed repository, we provide a link specific to the commit that was used. In total, there are 24,549 repositories.

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

The dataset is provided for research purposes only. Please check the dataset license for additional information.

## Additional Information

### Dataset Curators

Hongyu Li, Seohyun Kim and Satish Chandra

### Licensing Information

CC-BY-NC 4.0 (Attribution-NonCommercial 4.0 International)

### Citation Information

arXiv:1908.09804 [cs.SE]

### Contributions

Thanks to [@vinaykudari](https://github.com/vinaykudari) for adding this dataset.
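
As an illustration of the search corpus fields described in the card, the following is a minimal Python sketch of how one corpus entry might be modeled. The field names come from the card; the class name `CorpusMethod`, its helper methods, the sample values, and the shape of the sample URL are all hypothetical. It also assumes that `start_line` and `end_line` are both inclusive, which the card does not state.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class CorpusMethod:
    """One search-corpus entry, using the field names listed in the card."""

    id: int
    filepath: str
    method_name: str
    start_line: int
    end_line: int
    url: str

    def split_filepath(self) -> Tuple[str, str, str]:
        """Split ':owner/:repo/relative-path' into owner, repo, relative path."""
        owner, repo, relative_path = self.filepath.split("/", 2)
        return owner, repo, relative_path

    def line_count(self) -> int:
        """Lines spanned by the method (assumption: both bounds inclusive)."""
        return self.end_line - self.start_line + 1


# Hypothetical record: every value below is made up for illustration,
# including the URL layout, which is not taken from the dataset.
example = CorpusMethod(
    id=1,
    filepath="example-owner/example-repo/app/src/main/java/Example.java",
    method_name="onCreate",
    start_line=10,
    end_line=24,
    url="https://github.com/example-owner/example-repo/blob/COMMIT_ID/"
        "app/src/main/java/Example.java#L10-L24",
)

owner, repo, relative_path = example.split_filepath()
span = example.line_count()
```

Splitting with `split("/", 2)` keeps any further slashes inside the relative path, so nested directories are preserved as a single third component.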

Provider:
maas
Created:
2025-05-20