GIGO revisited: ML publications' approaches to training data

Name: GIGO revisited: ML publications' approaches to training data
Creator: OpenDataLab
Published: 2026-06-07 08:30:18
License: no description available

OpenDataLab; updated 2026-06-07; indexed 2024-05-09

Download link:

https://opendatalab.org.cn/OpenDataLab/GIGO_revisited_ML_publications_etc

Description:

A random sample of 200 machine learning publications was systematically analyzed by a team of annotators, who asked up to 15 questions about how each publication discusses its training data. Supervised machine learning, in which models are automatically derived from labeled training data, is only as good as the quality of that data. This study builds on prior work that investigated the extent to which applied ML publications within a single domain (social media platforms) followed "best practices" for labeling training data. In this paper, we expand that scope by examining publications that apply supervised machine learning across a broader range of disciplines, with a focus on human-labeled data. We report the extent to which a random sample of ML application papers across disciplines gives specific details about whether best practices were followed, while acknowledging that a wider range of application domains necessarily produces more diverse labeling and annotation methods. Because much of machine learning research and education focuses only on what is done once a "ground truth" or "gold standard" of training data is available, it is especially important to discuss the equally important question of whether such data is reliable in the first place. This determination becomes increasingly complex across specialized fields, as labeling can range from tasks requiring little to no background knowledge to tasks that must be performed by someone with professional expertise.

Provider:

OpenDataLab

Created:

2022-06-28

Collected summary

Dataset introduction

Background and challenges

Background overview

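The dataset is based on a random sample of 200 machine learning publications. Annotators systematically analyzed how each publication discusses its human-labeled training data, focusing on whether best practices were followed in applications of supervised machine learning across disciplines. The study extends prior work and stresses the importance of discussing training data reliability, particularly given the added complexity of applications in specialized fields.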

The content above was collected and summarized by 遇见数据集.