Python Annotated Code Search (PACS) Datasets & Pretrained Models
Source: NIAID Data Ecosystem. Indexed: 2026-03-12
Download link:
https://zenodo.org/record/4001601
Description:
This upload contains the datasets and pre-trained models used in the paper "Neural Code Search Revisited: Enhancing Code Snippet Retrieval through Natural Language Intent". Code for loading these datasets and models will be made available at http://github.com/nokia/codesearch
Datasets
There are three types of datasets:
snippet collections (code snippets + natural language descriptions): so-ds-feb20, staqc-py-cleaned, conala-curated
code search evaluation data (queries linked to relevant snippets of one of the snippet collections): so-ds-feb20-{valid|test}, staqc-py-raw-{valid|test}, conala-curated-0.5-test
training data (datasets used to train code retrieval models): so-duplicates-pacs-train, so-python-question-titles-feb20
The staqc-py-cleaned snippet collection and the conala-curated datasets were derived from existing corpora:
staqc-py-cleaned was derived from the Python StaQC snippet collection. See https://github.com/LittleYUYU/StackOverflow-Question-Code-Dataset, LICENSE.
conala-curated was derived from the CoNaLa corpus. See https://conala-corpus.github.io/, LICENSE.
The other datasets were mined directly from a recent Stack Overflow dump (https://archive.org/details/stackexchange, LICENSE).
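The code search evaluation data described above links each query to its relevant snippets in a snippet collection. As a rough illustration of how such links can be used to score a retrieval system, the sketch below computes mean reciprocal rank and recall@k on toy data. The record layout (`query`, `relevant_ids`), the snippet ids, and the choice of metrics are assumptions made for this example; they are not taken from the released files or from the paper.

```python
from typing import List, Set

def reciprocal_rank(ranked_ids: List[str], relevant_ids: Set[str]) -> float:
    """Return 1/rank of the first relevant snippet, or 0.0 if none is retrieved."""
    for rank, snippet_id in enumerate(ranked_ids, start=1):
        if snippet_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def recall_at_k(ranked_ids: List[str], relevant_ids: Set[str], k: int) -> float:
    """Return the fraction of relevant snippets found in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for snippet_id in ranked_ids[:k] if snippet_id in relevant_ids)
    return hits / len(relevant_ids)

# Hypothetical evaluation records: each query is linked to relevant snippet ids.
eval_records = [
    {"query": "read csv into dataframe", "relevant_ids": {"s1"}},
    {"query": "plot histogram", "relevant_ids": {"s3", "s4"}},
]

# Hypothetical ranked output of some retrieval model, keyed by query.
rankings = {
    "read csv into dataframe": ["s2", "s1", "s3", "s4"],
    "plot histogram": ["s3", "s1", "s4", "s2"],
}

# Average the per-query reciprocal ranks to get the mean reciprocal rank.
mrr = sum(
    reciprocal_rank(rankings[rec["query"]], rec["relevant_ids"])
    for rec in eval_records
) / len(eval_records)

print(f"MRR: {mrr:.2f}")
```

The same loop works for any ranking source, so a retrieval model can be swapped in by replacing the `rankings` dictionary with its output.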
Pre-trained models
Each model can embed queries and (annotated) code snippets in the same space. The models are released under a BSD 3-Clause License.
ncs-embedder-so-ds-feb20
ncs-embedder-staqc-py
tnbow-embedder-so-ds-feb20
use-embedder-pacs
ensemble-embedder-pacs
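Because the models embed queries and snippets in the same space, retrieval can be done by comparing a query vector against the snippet vectors. The sketch below ranks snippets by cosine similarity using hand-written toy vectors. It does not load any of the released models: the vectors, the snippet ids, and the helper names `cosine_similarity` and `rank_snippets` are hypothetical, and cosine similarity is an assumed scoring function rather than one confirmed by the source.

```python
import math
from typing import Dict, List, Sequence, Tuple

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_snippets(
    query_vec: Sequence[float],
    snippet_vecs: Dict[str, Sequence[float]],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the top_k (snippet id, score) pairs, highest similarity first."""
    scored = [
        (snippet_id, cosine_similarity(query_vec, vec))
        for snippet_id, vec in snippet_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy stand-ins for embeddings that a real model would produce.
snippet_vecs = {
    "s1": [1.0, 0.0, 0.0],
    "s2": [0.0, 1.0, 0.0],
    "s3": [0.7, 0.7, 0.0],
}
query_vec = [0.9, 0.1, 0.0]

results = rank_snippets(query_vec, snippet_vecs, top_k=2)
for snippet_id, score in results:
    print(snippet_id, round(score, 3))
```

For collections of realistic size, the per-snippet loop would typically be replaced by a matrix product over precomputed, normalized snippet embeddings.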
Created:
2020-08-27



