nasa-impact/nasa-science-github-repos
Source: Hugging Face. Updated: 2026-04-07. Indexed: 2026-02-07.
Download link:
https://hf-mirror.com/datasets/nasa-impact/nasa-science-github-repos
Description:
This dataset contains a curated index of 5,264 GitHub repositories relevant to the NASA Science Mission Directorate (SMD). It was built to improve the discoverability of open-source scientific software, which is often buried on large code-hosting platforms. The dataset aggregates repositories from four primary sources (SDE Dump, SME Curated Lists, EO Knowledge Graph, and ASCL), cleans their documentation, and enriches them with additional context crawled from links within the READMEs. Each repository has been classified into one of the five NASA SMD divisions (or marked as non-relevant) using an LLM-based pipeline. Supported tasks include topic classification, information retrieval, and RAG (Retrieval-Augmented Generation). The dataset includes fields such as repository name, URL, description, and README content. The data creation process involves source collection, filtering and context expansion, and classification.
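The aggregation step described above can be illustrated with a minimal sketch. It assumes each source yields a list of GitHub URLs and that repositories listed by more than one source are merged under a normalized `owner/name` key; the description does not state how duplicates are handled, so that behaviour, the helper names, and the example URLs are all hypothetical.

```python
from urllib.parse import urlparse


def normalize_repo_url(url: str) -> str:
    """Reduce a GitHub URL to a lowercase 'owner/name' key (hypothetical helper)."""
    path = urlparse(url.strip()).path.strip("/")
    if path.endswith(".git"):
        path = path[: -len(".git")]
    # Keep only the first two path segments; deeper paths point inside the repo.
    owner, name = path.split("/")[:2]
    return f"{owner}/{name}".lower()


def merge_sources(sources: dict[str, list[str]]) -> dict[str, set[str]]:
    """Map each unique repository key to the set of sources that listed it."""
    merged: dict[str, set[str]] = {}
    for source_name, urls in sources.items():
        for url in urls:
            merged.setdefault(normalize_repo_url(url), set()).add(source_name)
    return merged


# Illustrative URLs only; they are not taken from the dataset.
example_sources = {
    "SDE Dump": ["https://github.com/example-org/example-repo"],
    "ASCL": [
        "https://github.com/Example-Org/example-repo.git",
        "https://github.com/other-org/tool/",
    ],
}
merged = merge_sources(example_sources)
```

Tracking the contributing sources per repository, rather than keeping a flat list, makes it possible to see which entries were found by more than one source.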
Provider:
nasa-impact



