nasa-impact/nasa-science-code-benchmark-v0.1

Name: nasa-impact/nasa-science-code-benchmark-v0.1
Creator: nasa-impact
Published: 2026-04-10 18:41:15
License: CC-BY-4.0

Source: Hugging Face. Updated: 2026-04-10. Indexed: 2026-02-07.

Download link:

https://hf-mirror.com/datasets/nasa-impact/nasa-science-code-benchmark-v0.1

Resource description:

---
license: cc-by-4.0
---

# NASA Code Retrieval Benchmark v0.1

> **Note:** This dataset has been superseded by [nasa-impact/nasa-science-code-benchmark-v0.1.1](https://huggingface.co/datasets/nasa-impact/nasa-science-code-benchmark-v0.1.1), which introduces a hierarchical structure, official Hugging Face dataset configurations, and evaluation by NASA science division. Please use v0.1.1 for new work.

This dataset provides a code retrieval benchmark based on code from **7 programming languages** (Python, C, C++, Java, JavaScript, Fortran, and Matlab) sourced from NASA's GitHub repositories. It serves as the **held-out test set** for evaluating information retrieval models trained on NASA science code.

The primary task is to retrieve a relevant code snippet (from the corpus) given a natural language query, which can be a docstring or a code identifier (e.g., function or class name).

## Licensing and Intellectual Property

This dataset is released under **CC-BY-4.0** and contains **only structured metadata and annotations** produced by the dataset authors. It does **not** redistribute original source code from the indexed repositories.

The `corpus.jsonl` file contains **placeholders** for original code content rather than the source code itself. This is by design to respect the intellectual property and licensing terms of individual repository owners.

Users who wish to populate the corpus for research purposes may do so by fetching content directly from the source repositories using the replication scripts provided in the companion repository:

👉 [NASA-IMPACT/github-code-discovery](https://github.com/NASA-IMPACT/github-code-discovery) (see `scripts/code_snippet/`)

Please ensure you comply with the licensing terms of each individual repository when using the fetched content.

## Related Resources

- **Retrieval benchmark system**: [NASA-IMPACT/nasa-science-repo-benchmark](https://github.com/NASA-IMPACT/nasa-science-repo-benchmark)
- **Data collection & replication scripts**: [NASA-IMPACT/github-code-discovery](https://github.com/NASA-IMPACT/github-code-discovery)
- **Latest version**: [nasa-impact/nasa-science-code-benchmark-v0.1.1](https://huggingface.co/datasets/nasa-impact/nasa-science-code-benchmark-v0.1.1)

## Source Data & Training

This benchmark is the official test set corresponding to the following training and validation datasets.

* [nasa-impact/nasa-science-function-code-docstring](https://huggingface.co/datasets/nasa-impact/nasa-science-function-code-docstring)
* [nasa-impact/nasa-science-class-code-docstring](https://huggingface.co/datasets/nasa-impact/nasa-science-class-code-docstring)
* [nasa-impact/nasa-science-function-code-identifier](https://huggingface.co/datasets/nasa-impact/nasa-science-function-code-identifier)
* [nasa-impact/nasa-science-class-code-identifier](https://huggingface.co/datasets/nasa-impact/nasa-science-class-code-identifier)

---

## Dataset Statistics

* **Total unique corpus entries:** 117,950
* **Total unique query entries:** 119,720

---

## Dataset Structure

The dataset follows a standard Information Retrieval format with three main components: a **corpus**, a set of **queries**, and query-relevance judgments (**qrels**).

### Data Fields

* **`corpus.jsonl`**: A collection of all unique code snippets (functions and classes) from all languages.
    * `_id`: A unique string identifier for the code snippet.
    * `text`: ⚠️ **Placeholder**; use replication scripts to populate with original source code.
* **`queries.jsonl`**: A collection of all unique queries (docstrings and identifiers).
    * `_id`: A unique string identifier for the query.
    * `text`: The natural language query.
* **`qrels/`**: A directory containing Tab-Separated Values (TSV) files that map queries to their relevant code snippets. This is the ground truth for evaluation. Each file has the format `query-id corpus-id score`.

### Qrels (Query-Relevance Pairs)

The dataset provides several qrels files to evaluate performance across different programming languages and query types.

#### By Programming Language

| Qrels File | Description | Size |
| :--- | :--- | :--- |
| `python.tsv` | All query-corpus pairs for the Python dataset. | 64,110 |
| `c.tsv` | All query-corpus pairs for the C dataset. | 17,149 |
| `c++.tsv` | All query-corpus pairs for the C++ dataset. | 14,975 |
| `java.tsv` | All query-corpus pairs for the Java dataset. | 14,088 |
| `javascript.tsv` | All query-corpus pairs for the JavaScript dataset. | 5,159 |
| `fortran.tsv` | All query-corpus pairs for the Fortran dataset. | 3,586 |
| `matlab.tsv` | All query-corpus pairs for the Matlab dataset. | 653 |

#### By Query Type

| Qrels File | Description | Size |
| :--- | :--- | :--- |
| `nasa_science_function_code_docstring_heldout.tsv` | Pairs where the query is a **function docstring**. | 61,083 |
| `nasa_science_function_code_identifier_heldout.tsv` | Pairs where the query is a **function name** (identifier). | 32,742 |
| `nasa_science_class_code_docstring_heldout.tsv` | Pairs where the query is a **class docstring**. | 13,355 |
| `nasa_science_class_code_identifier_heldout.tsv` | Pairs where the query is a **class name** (identifier). | 12,540 |

---
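The three components described above can be read with the Python standard library alone. The following is a minimal sketch, not an official loader: the helper names `read_jsonl` and `read_qrels` are hypothetical, the sample IDs are invented for illustration, and the handling of a possible header row in the TSV files is an assumption (the card only states the column order `query-id corpus-id score`).

```python
import csv
import io
import json
from collections import defaultdict


def read_jsonl(stream):
    """Map each record's `_id` to its `text` field (works for corpus and queries)."""
    records = {}
    for line in stream:
        line = line.strip()
        if line:  # tolerate blank lines
            row = json.loads(line)
            records[row["_id"]] = row["text"]
    return records


def read_qrels(stream):
    """Parse tab-separated `query-id, corpus-id, score` rows into {query_id: {corpus_id: score}}."""
    qrels = defaultdict(dict)
    for row in csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) != 3:
            continue
        query_id, corpus_id, score = row
        try:
            relevance = int(score)
        except ValueError:
            # Assumption: a non-numeric score column marks a header row, so skip it.
            continue
        qrels[query_id][corpus_id] = relevance
    return dict(qrels)


# In-memory stand-ins with invented IDs; with real files one would instead use
# e.g. open("corpus.jsonl", encoding="utf-8") (local path is an assumption).
sample_corpus = io.StringIO('{"_id": "c1", "text": "PLACEHOLDER"}\n')
sample_qrels = io.StringIO("query-id\tcorpus-id\tscore\nq1\tc1\t1\n")

print(read_jsonl(sample_corpus))
print(read_qrels(sample_qrels))
```

Because `text` in `corpus.jsonl` is a placeholder, the loaded corpus is only useful for retrieval after it has been populated with the replication scripts.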

Provider:

nasa-impact

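Collected summary

Dataset introduction

Construction

In scientific computing and aerospace engineering, code retrieval is essential for improving software reuse and knowledge discovery. The NASA Science Code Retrieval Benchmark v0.1 was built by systematically collecting source code in seven programming languages (Python, C, C++, Java, JavaScript, Fortran, and Matlab) from NASA's GitHub repositories. The dataset uses the standard information retrieval format with three parts: a corpus, a query set, and query-relevance judgments. Each corpus entry consists of a unique identifier and placeholder text; to respect intellectual property, the actual source code must be fetched from the original repositories with the companion scripts. The query set covers natural language descriptions (such as docstrings) and code identifiers (such as function or class names), and is split by programming language and by query type into multiple evaluation subsets, forming a rigorously structured held-out test set.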

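Features

As a benchmark for evaluating information retrieval models, the dataset stands out for covering several programming languages widely used in space science, reflecting the practical need for cross-language code retrieval. It is large, with 117,950 unique corpus entries and 119,720 unique query entries, which supports statistically meaningful evaluation. Its design is modular and extensible: separate query-relevance files allow fine-grained performance analysis by language or by query type. The dataset also follows intellectual property rules strictly, providing only structured metadata and annotations and leaving researchers to obtain the source code themselves in a compliant way, which balances academic exploration against legal constraints.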

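Usage

To use the dataset for code retrieval research, researchers first run the replication scripts from the companion repository to fetch the actual source code for each identifier in the corpus, producing a complete retrieval corpus. Evaluation relies on the standard corpus, query set, and query-relevance files, and measures how well a retrieval model returns the relevant code snippets for a given query. The dataset supports evaluation along several dimensions: results can be analyzed for a specific programming language (such as Python or Fortran) or by query type (such as docstring or identifier). The dataset is also paired with the training and validation sets published by NASA, giving a consistent experimental setup for model training and testing and supporting systematic progress in code retrieval for space science.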
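The evaluation step described above can be sketched as follows. The source does not name specific metrics, so recall@k and mean reciprocal rank (MRR@k) are assumptions chosen because they are common in retrieval evaluation; the function names, the `run` structure (query ID mapped to a ranked list of corpus IDs), and the sample IDs are all hypothetical.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant corpus IDs that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant_ids) / len(relevant_ids)


def reciprocal_rank(ranked_ids, relevant_ids, k):
    """1 / rank of the first relevant result within the top k, or 0.0 if none."""
    for rank, corpus_id in enumerate(ranked_ids[:k], start=1):
        if corpus_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def evaluate(run, qrels, k=10):
    """Average both metrics over every query that has at least one relevant item.

    run:   {query_id: [corpus_id, ...]} ranked best-first (model output)
    qrels: {query_id: {corpus_id: score}} ground truth from one qrels file
    """
    recalls, reciprocal_ranks = [], []
    for query_id, judged in qrels.items():
        relevant = {cid for cid, score in judged.items() if score > 0}
        if not relevant:
            continue
        ranked = run.get(query_id, [])  # a missing query counts as a miss
        recalls.append(recall_at_k(ranked, relevant, k))
        reciprocal_ranks.append(reciprocal_rank(ranked, relevant, k))
    n = len(recalls)
    return {
        f"recall@{k}": sum(recalls) / n if n else 0.0,
        f"mrr@{k}": sum(reciprocal_ranks) / n if n else 0.0,
    }


# Invented example: q1 finds its snippet at rank 2, q2 misses entirely.
qrels = {"q1": {"c1": 1}, "q2": {"c3": 1}}
run = {"q1": ["c2", "c1", "c3"], "q2": ["c1", "c2", "c4"]}
print(evaluate(run, qrels, k=2))  # {'recall@2': 0.5, 'mrr@2': 0.25}
```

Passing the judgments from a single qrels file (for example `fortran.tsv` or one of the per-query-type files) as `qrels` restricts the averages to that subset, which is one way to obtain the per-language and per-query-type breakdowns.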

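Background and challenges

Background overview

At the intersection of scientific computing and software engineering, code retrieval plays a key role in improving the reuse of research code and the efficiency of knowledge discovery. The NASA Science Code Retrieval Benchmark v0.1 was created by the NASA-IMPACT team as a cross-language code retrieval test set driven by natural language queries. The dataset is drawn from code in seven languages (Python, C, C++, Java, JavaScript, Fortran, and Matlab) in NASA's GitHub repositories. Its core research question is how to retrieve the relevant code snippet precisely from a natural language description such as a docstring or a code identifier. As the official held-out test set, it provides a standardized benchmark for evaluating information retrieval models on scientific code and encourages closer integration of code intelligence and software engineering research.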

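Current challenges

The dataset targets a core challenge in scientific code retrieval: aligning natural language queries with code semantics across diverse programming languages and domain-specific terminology. Building it involved several difficulties. During data collection, the license agreements and intellectual property restrictions of the different source repositories had to be reconciled, which is why the dataset provides only structured metadata and placeholders and users must fetch the original code themselves with the replication scripts. When handling multi-language code, the syntactic differences between languages had to be normalized, and annotations had to remain consistent across query types (such as function docstrings and class-name identifiers). In addition, scientific code often contains complex algorithms and specialized terminology, which places higher demands on a retrieval model's domain adaptation and generalization.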

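Common scenarios

Classic use case

In scientific computing and software engineering, code retrieval is important for developer productivity. As a held-out test set, the NASA Science Code Retrieval Benchmark v0.1 is designed to evaluate how well information retrieval models match natural language queries to code snippets in multiple languages. The classic use case is: given a natural language query (such as a function docstring or a code identifier), retrieve the most relevant code snippet from a corpus covering seven programming languages (Python, C, C++, Java, JavaScript, Fortran, and Matlab). This mirrors how developers look for an existing implementation by describing it in text, and gives models a standardized evaluation framework.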
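To make the retrieval task concrete, here is a toy lexical baseline that ranks snippets by token overlap with the query. It is only an illustration under stated assumptions: the Jaccard scoring, the helper names `tokenize` and `rank`, and the three sample snippets are all invented here, and nothing in the source suggests the benchmark authors use this method. Real systems would typically use a trained retrieval model.

```python
import re


def tokenize(text):
    """Split identifiers and prose into a set of lowercase alphanumeric tokens."""
    # Insert a space at camelCase boundaries, e.g. "readFitsHeader" -> "read Fits Header".
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", text)
    # Underscores and punctuation act as separators because they are not matched.
    return set(re.findall(r"[a-z0-9]+", spaced.lower()))


def rank(query, corpus, top_k=3):
    """Return up to top_k corpus IDs ordered by Jaccard similarity to the query."""
    query_tokens = tokenize(query)
    scored = []
    for corpus_id, text in corpus.items():
        snippet_tokens = tokenize(text)
        union = query_tokens | snippet_tokens
        score = len(query_tokens & snippet_tokens) / len(union) if union else 0.0
        scored.append((score, corpus_id))
    # Highest score first; ties broken by corpus ID for deterministic output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [corpus_id for _, corpus_id in scored[:top_k]]


# Invented snippets standing in for a populated corpus.
corpus = {
    "c1": "def compute_orbit_period(semi_major_axis, mu): ...",
    "c2": "class TelemetryParser: def parse(self, packet): ...",
    "c3": "function readFitsHeader(path) { ... }",
}

print(rank("compute orbit period", corpus))  # docstring-style query
print(rank("TelemetryParser", corpus))       # identifier-style query
```

The two calls correspond to the two query types in the benchmark: a docstring-like description and a bare class or function name.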

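Academic problems addressed

The dataset addresses several key research problems in code retrieval: semantic matching across programming languages, the gap between natural language and code, and efficient information retrieval over large scientific codebases. By providing structured metadata, query-relevance judgments, and multi-language coverage, it supports evaluating how well retrieval models generalize to real scientific computing settings. Its value lies in giving the research community a standardized reference benchmark that helps researchers design more precise algorithms for code discovery in complex scientific software.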

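Derived and related work

Work related to this dataset includes retrieval-model optimization methods developed on the companion training sets published by the NASA-IMPACT organization (such as nasa-science-function-code-docstring). The related project github-code-discovery provides the data replication scripts that let researchers build the complete corpus, and the nasa-science-repo-benchmark system extends the benchmark's evaluation capabilities. Together these efforts promote the adoption of code retrieval in the scientific computing community and laid the groundwork for later versions such as v0.1.1.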

The content above was collected and summarized by 遇见数据集.
