Python Annotated Code Search (PACS) Datasets & Pretrained Models

Source: NIAID Data Ecosystem (indexed 2026-03-12)

Download link:

https://zenodo.org/record/4001601

Resource description:

This upload contains datasets and pre-trained models used for the paper "Neural Code Search Revisited: Enhancing Code Snippet Retrieval through Natural Language Intent". The code for easily loading these datasets and models will be made available here: http://github.com/nokia/codesearch

Datasets

There are three types of datasets:

- Snippet collections (code snippets + natural language descriptions): so-ds-feb20, staqc-py-cleaned, conala-curated
- Code search evaluation data (queries linked to relevant snippets of one of the snippet collections): so-ds-feb20-{valid|test}, staqc-py-raw-{valid|test}, conala-curated-0.5-test
- Training data (datasets used to train code retrieval models): so-duplicates-pacs-train, so-python-question-titles-feb20

The staqc-py-cleaned snippet collection and the conala-curated datasets were derived from existing corpora:

- staqc-py-cleaned was derived from the Python StaQC snippet collection. See https://github.com/LittleYUYU/StackOverflow-Question-Code-Dataset, LICENSE.
- conala-curated was derived from the conala corpus. See https://conala-corpus.github.io/, LICENSE.

The other datasets were mined directly from a recent Stack Overflow dump (https://archive.org/details/stackexchange, LICENSE).

Pre-trained models

Each model can embed queries and (annotated) code snippets in the same space. The models are released under a BSD 3-Clause License.

- ncs-embedder-so-ds-feb20
- ncs-embedder-staqc-py
- tnbow-embedder-so-ds-feb20
- use-embedder-pacs
- ensemble-embedder-pacs
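
To illustrate what embedding queries and annotated snippets "in the same space" makes possible, here is a minimal sketch of similarity-based retrieval. It does not use the released models or the nokia/codesearch loading code: `toy_embed`, `cosine`, and `rank_snippets` are hypothetical names, and the bag-of-words count vector is only a stand-in assumption for a real embedder. The snippet records are made-up examples, not entries from the datasets.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for a shared embedder: a bag-of-words count vector."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tokens)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_snippets(query, snippets):
    """Return snippet indices sorted by descending similarity to the query.

    Each snippet is embedded from its description plus its code, mirroring
    the idea of an annotated snippet (assumption: simple concatenation).
    """
    q = toy_embed(query)
    scored = [
        (cosine(q, toy_embed(s["description"] + " " + s["code"])), i)
        for i, s in enumerate(snippets)
    ]
    # Highest score first; ties broken by original position.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in scored]


# Made-up annotated snippets for illustration only.
snippets = [
    {"description": "read a csv file into a dataframe", "code": "df = pd.read_csv(path)"},
    {"description": "sort a list in reverse order", "code": "sorted(items, reverse=True)"},
]
ranking = rank_snippets("how to sort a list in reverse", snippets)
print(ranking)
```

Because the query and the snippets pass through the same embedding function, a single similarity measure is enough to rank them. A learned embedder would replace `toy_embed` while the ranking step stays the same.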

Created:

2020-08-27
