NICHE: A Curated Dataset of Engineered Machine Learning Projects in Python
Source: DataCite Commons | Updated: 2023-01-27 | Indexed: 2024-08-18
Download link:
https://figshare.com/articles/dataset/NICHE_A_Curated_Dataset_of_Engineered_Machine_Learning_Projects_in_Python/21967265
Description:
Machine learning (ML) has gained much attention and has been incorporated into our daily lives. While there are numerous publicly available ML projects on open source platforms such as GitHub, there have been limited attempts to filter those projects and curate ML projects of high quality. The limited availability of such high-quality datasets poses an obstacle to understanding ML projects. To help clear this obstacle, we present NICHE, a manually labelled dataset consisting of 572 ML projects. Based on evidence of good software engineering practices, we label 441 of these projects as engineered and 131 as non-engineered. In this repository we provide the "NICHE.csv" file, which contains the list of project names along with their labels, descriptive information for every dimension, and several basic statistics, such as the number of stars and commits. This dataset can help researchers understand the practices that are followed in high-quality ML projects. It can also be used as a benchmark for classifiers designed to identify engineered ML projects.
GitHub page: https://github.com/soarsmu/NICHE
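As a rough illustration of the benchmark use described above, the following Python sketch counts labels and scores a toy classifier against the manual labels. The column names ("project", "label", "stars"), the sample rows, and the star-threshold baseline are all assumptions made for this example; they are not taken from the actual NICHE.csv header or from the authors' method, so check the real file before adapting it.

```python
import csv
import io
from collections import Counter

# Hypothetical column names; verify them against the real NICHE.csv header.
LABEL_COL = "label"
STARS_COL = "stars"


def read_rows(handle):
    """Parse CSV text from a file-like object into a list of dicts."""
    return list(csv.DictReader(handle))


def label_counts(rows):
    """Count how many projects carry each label."""
    return Counter(row[LABEL_COL] for row in rows)


def accuracy(rows, predict):
    """Fraction of rows where predict(row) matches the manual label."""
    if not rows:
        raise ValueError("no rows to evaluate")
    hits = sum(1 for row in rows if predict(row) == row[LABEL_COL])
    return hits / len(rows)


def star_baseline(row):
    """Toy baseline (an assumption, not from the dataset authors)."""
    return "engineered" if int(row[STARS_COL]) >= 100 else "non-engineered"


# Made-up rows standing in for the real file so the sketch runs on its own.
SAMPLE = """project,label,stars
owner-a/repo-a,engineered,1200
owner-b/repo-b,non-engineered,15
owner-c/repo-c,engineered,340
"""

rows = read_rows(io.StringIO(SAMPLE))
counts = label_counts(rows)
baseline_accuracy = accuracy(rows, star_baseline)

print(dict(counts))
print(baseline_accuracy)
```

To run this against the downloaded file, replace the io.StringIO(SAMPLE) handle with open("NICHE.csv", newline="", encoding="utf-8") and adjust the column names. If the label column matches the description, the counts should come out as 441 engineered and 131 non-engineered projects.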
Provider:
figshare
Created:
2023-01-27



