
Python Annotated Code Search (PACS) Datasets & Pretrained Models

Source: NIAID Data Ecosystem (indexed 2026-03-12)
Download link:
https://zenodo.org/record/4001601
Description:
This upload contains the datasets and pre-trained models used for the paper "Neural Code Search Revisited: Enhancing Code Snippet Retrieval through Natural Language Intent". The code for easily loading these datasets and models will be made available here: http://github.com/nokia/codesearch

Datasets

There are three types of datasets:
- snippet collections (code snippets + natural language descriptions): so-ds-feb20, staqc-py-cleaned, conala-curated
- code search evaluation data (queries linked to relevant snippets of one of the snippet collections): so-ds-feb20-{valid|test}, staqc-py-raw-{valid|test}, conala-curated-0.5-test
- training data (datasets used to train code retrieval models): so-duplicates-pacs-train, so-python-question-titles-feb20

The staqc-py-cleaned snippet collection and the conala-curated datasets were derived from existing corpora:
- staqc-py-cleaned was derived from the Python StaQC snippet collection. See https://github.com/LittleYUYU/StackOverflow-Question-Code-Dataset, LICENSE.
- conala-curated was derived from the conala corpus. See https://conala-corpus.github.io/, LICENSE.

The other datasets were mined directly from a recent Stack Overflow dump (https://archive.org/details/stackexchange, LICENSE).

Pre-trained models

Each model can embed queries and (annotated) code snippets in the same space. The models are released under a BSD 3-Clause License.
- ncs-embedder-so-ds-feb20
- ncs-embedder-staqc-py
- tnbow-embedder-so-ds-feb20
- use-embedder-pacs
- ensemble-embedder-pacs
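To make the idea of embedding queries and annotated snippets "in the same space" concrete, here is a minimal sketch of retrieval by cosine similarity. It does not load the released models and does not use the nokia/codesearch API: the `embed` function is a toy bag-of-words stand-in, and the snippet records, the `search` helper, and the `reciprocal_rank` helper are hypothetical names chosen for illustration. Reciprocal rank is shown only as one common way to score a ranking against evaluation data that links queries to relevant snippets; the description above does not say which metric the paper uses.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def embed(text):
    """Toy stand-in for a trained embedder: an L2-normalised bag of words.

    Assumption: a real model would return a dense vector instead.
    """
    counts = Counter(tokenize(text))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {tok: c / norm for tok, c in counts.items()} if norm else {}


def cosine(a, b):
    """Cosine similarity of two already-normalised sparse vectors."""
    return sum(weight * b.get(tok, 0.0) for tok, weight in a.items())


def search(query, snippets, top_k=3):
    """Rank snippets by similarity between the query and description + code."""
    q = embed(query)
    scored = [
        (cosine(q, embed(s["description"] + " " + s["code"])), i)
        for i, s in enumerate(snippets)
    ]
    # Highest score first; ties broken by snippet index for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in scored[:top_k]]


def reciprocal_rank(ranking, relevant):
    """Return 1 / rank of the first relevant snippet, or 0.0 if none is found."""
    for position, idx in enumerate(ranking, start=1):
        if idx in relevant:
            return 1.0 / position
    return 0.0


# Hypothetical annotated snippets (description + code), not taken from the datasets.
snippets = [
    {"description": "read a json file into a dict",
     "code": "with open(path) as f: data = json.load(f)"},
    {"description": "reverse a list",
     "code": "items[::-1]"},
    {"description": "sort a dictionary by value",
     "code": "sorted(d.items(), key=lambda kv: kv[1])"},
]

ranking = search("how to reverse a list", snippets)
score = reciprocal_rank(ranking, relevant={1})
print(ranking, score)
```

Because the description is embedded together with the code, a natural language query can match a snippet whose code alone shares few tokens with it, which is the intuition behind annotating snippets with descriptions.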
Created:
2020-08-27