liuhyuu/NetEaseCrowd

Name: liuhyuu/NetEaseCrowd
Creator: liuhyuu
Published: 2024-06-05 09:31:30
License: not listed here; the dataset card below specifies CC-BY-SA-4.0

Hugging Face: updated 2024-06-05, indexed 2024-06-22

Download link:

https://hf-mirror.com/datasets/liuhyuu/NetEaseCrowd

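Resource overview:

NetEaseCrowd is a large-scale crowdsourcing annotation dataset built on NetEase's mature crowdsourcing platform. It contains about 2,400 workers, 1,000,000 tasks, and 6,000,000 annotations collected over 6 months. Ground truth is provided for every task, and a timestamp is recorded for every annotation. The tasks are gesture comparison tasks in which annotators pick out the gesture that differs from the others. Key characteristics are large-scale data collection, complete timestamp records, and multiple task types. The data is stored as CSV files; each record represents one interaction between a worker and a task and includes the task ID, task set ID, worker ID, answer, completion time, ground truth, and capability ID.

Provider:

liuhyuu

Summary of Original Information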

🧑‍🤝‍🧑 NetEaseCrowd: A Dataset for Long-term and Online Crowdsourcing Truth Inference

Introduction

NetEaseCrowd is a large-scale crowdsourcing annotation dataset based on a mature Chinese data crowdsourcing platform of NetEase Inc. It contains about 2,400 workers, 1,000,000 tasks, and 6,000,000 annotations collected over approximately 6 months. The dataset provides ground truths for all tasks and records timestamps for all annotations.

Task

The dataset is built on a gesture comparison task. Each task presents three choices: two similar gestures and one different gesture. Annotators must pick out the different one.

Comparison with Existing Datasets

Compared to existing crowdsourcing datasets, NetEaseCrowd has the following characteristics:

Characteristic	Existing Datasets	NetEaseCrowd Dataset
Scalability	Small sizes	Large-scale with 6 million annotations
Timestamps	No timestamps	Complete timestamps over a 6-month period
Task Type	Single type	Various task types with different capabilities

Dataset Statistics

The basic statistics of NetEaseCrowd and other datasets are as follows:

Dataset	#Worker	#Task	#Groundtruth	#Anno	Avg(#Anno/worker)	Avg(#Anno/task)	Timestamp	Task type
NetEaseCrowd	2,413	999,799	999,799	6,016,319	2,493.3	6.0	✔︎	Multiple
Adult	825	11,040	333	92,721	112.4	8.4	✘	Single
Birds	39	108	108	4,212	108.0	39.0	✘	Single
Dog	109	807	807	8,070	74.0	10.0	✘	Single
CF	461	300	300	1,720	3.7	5.7	✘	Single
CF_amt	110	300	300	6,030	54.8	20.1	✘	Single
Emotion	38	700	565	7,000	184.2	10.0	✘	Single
Smile	64	2,134	159	30,319	473.7	14.2	✘	Single
Face	27	584	584	5,242	194.1	9.0	✘	Single
Fact	57	42,624	576	216,725	3,802.2	5.1	✘	Single
MS	44	700	700	2,945	66.9	4.2	✘	Single
Product	176	8,315	8,315	24,945	141.7	3.0	✘	Single
RTE	164	800	800	8,000	48.8	10.0	✘	Single
Sentiment	1,960	98,980	1,000	569,375	290.5	5.8	✘	Single
SP	203	4,999	4,999	27,746	136.7	5.6	✘	Single
SP_amt	143	500	500	10,000	69.9	20.0	✘	Single
Trec	762	19,033	2,275	88,385	116.0	4.6	✘	Single
Tweet	85	1,000	1,000	20,000	235.3	20.0	✘	Single
Web	177	2,665	2,653	15,567	87.9	5.8	✘	Single
ZenCrowd_us	74	2,040	2,040	12,190	164.7	6.0	✘	Single
ZenCrowd_in	25	2,040	2,040	11,205	448.2	5.5	✘	Single
ZenCrowd_all	78	2,040	2,040	21,855	280.2	10.7	✘	Single
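
As a quick sanity check, the two average columns can be recomputed from the raw counts. The sketch below uses only the figures in the NetEaseCrowd row of the table; the variable names are illustrative.

```python
# Figures copied from the NetEaseCrowd row of the statistics table.
workers = 2_413
tasks = 999_799
annotations = 6_016_319

# Avg(#Anno/worker) and Avg(#Anno/task) are plain ratios of the counts.
avg_per_worker = annotations / workers
avg_per_task = annotations / tasks

print(f"{avg_per_worker:.1f}")  # 2493.3
print(f"{avg_per_task:.1f}")    # 6.0
```

Both values match the table once rounded to one decimal place.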

Data Content and Format

Obtain the Data

The dataset can be accessed in two ways:

Directly download from Hugging Face.
Download partitions from the GitHub repository and concatenate them.
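
For the second option, the following is a minimal sketch of concatenating CSV partitions using only the Python standard library. It assumes that every partition is a CSV file carrying the same header row; the actual partition layout in the GitHub repository is not described here, so the function name `concat_csv_partitions` and the file paths in the usage comment are hypothetical.

```python
import csv
from pathlib import Path
from typing import Iterable


def concat_csv_partitions(parts: Iterable[Path], out_path: Path) -> int:
    """Concatenate CSV partitions that share one header; return the data-row count.

    Assumption: each partition starts with the same header row.
    """
    rows_written = 0
    header = None
    with open(out_path, "w", newline="", encoding="utf-8") as out_file:
        writer = csv.writer(out_file)
        for part in parts:
            with open(part, newline="", encoding="utf-8") as in_file:
                reader = csv.reader(in_file)
                part_header = next(reader, None)
                if part_header is None:
                    continue  # skip empty partition files
                if header is None:
                    # Write the header once, taken from the first partition.
                    header = part_header
                    writer.writerow(header)
                elif part_header != header:
                    raise ValueError(f"header mismatch in {part}")
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written


# Hypothetical usage; the directory and file names are placeholders:
# parts = sorted(Path("partitions").glob("*.csv"))
# concat_csv_partitions(parts, Path("NetEaseCrowd_full.csv"))
```

Writing the output outside the partition directory avoids picking up the merged file as a partition on a second run.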

Dataset Format

Each record in the dataset represents an interaction between a worker and a task, with the following columns:

taskId: Unique id of the annotated task.
tasksetId: Unique id of the task set.
workerId: Unique id of the worker.
answer: Annotation given by the worker.
completeTime: Timestamp of annotation completion.
truth: Ground truth of the annotated task.
capability: Id of the capability required by the task set.

Data Sample

tasksetId	taskId	workerId	answer	completeTime	truth	capability
6980	1012658482844795232	64	2	1661917345953	1	69
6980	1012658482844795232	150	1	1661871234755	1	69
6980	1012658482844795232	263	0	1661855450281	1	69
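
To illustrate how the columns fit together, the sketch below parses the three sample rows and checks each answer against the ground truth. Two assumptions are made: `completeTime` is read as a Unix timestamp in milliseconds (its 13-digit magnitude suggests this, but the source does not state the unit), and the rows are tab-separated as displayed here, whereas the distributed files are CSV. The helper `to_utc` is illustrative.

```python
import csv
import io
from datetime import datetime, timezone

# The three sample rows shown above, tab-separated as displayed.
SAMPLE = (
    "tasksetId\ttaskId\tworkerId\tanswer\tcompleteTime\ttruth\tcapability\n"
    "6980\t1012658482844795232\t64\t2\t1661917345953\t1\t69\n"
    "6980\t1012658482844795232\t150\t1\t1661871234755\t1\t69\n"
    "6980\t1012658482844795232\t263\t0\t1661855450281\t1\t69\n"
)


def to_utc(ms: str) -> datetime:
    """Convert a millisecond Unix timestamp to UTC (the unit is an assumption)."""
    return datetime.fromtimestamp(int(ms) / 1000, tz=timezone.utc)


rows = list(csv.DictReader(io.StringIO(SAMPLE), delimiter="\t"))

for row in rows:
    is_correct = row["answer"] == row["truth"]
    print(row["workerId"], to_utc(row["completeTime"]).isoformat(timespec="seconds"), is_correct)

# Workers whose answer agrees with the ground truth for this task.
correct_workers = [row["workerId"] for row in rows if row["answer"] == row["truth"]]
```

In this sample the three workers gave three different answers to the same task, and only worker 150 matches the ground truth.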

Baseline Models

Several truth inference methods have been tested on the dataset, with results as follows:

Method	Accuracy	F1-score
MV	0.92695	0.92692
DS	0.95178	0.94817
MACE	0.95991	0.94957
Wawa	0.94814	0.94445
ZeroBasedSkill	0.94898	0.94585
GLAD	0.95064	0.95058
EBCC	0.91071	0.90996
ZC	0.95305	0.95301
TiReMGE	0.92713	0.92706
LAA	0.94173	0.94169
BiLA	0.88036	0.87896
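
For orientation, the simplest of these baselines can be sketched in a few lines. MV conventionally denotes majority voting; the code below is an illustrative sketch under that reading, not the implementation used to produce the numbers above. The tie-breaking rule (smallest label wins) is an assumption, and the toy data is made up rather than drawn from NetEaseCrowd.

```python
from collections import Counter, defaultdict


def majority_vote(annotations):
    """Map each task id to its most frequent answer.

    annotations: iterable of (task_id, answer) pairs.
    """
    votes = defaultdict(Counter)
    for task_id, answer in annotations:
        votes[task_id][answer] += 1
    inferred = {}
    for task_id, counter in votes.items():
        top = max(counter.values())
        # Assumption: ties are broken by picking the smallest label.
        inferred[task_id] = min(a for a, c in counter.items() if c == top)
    return inferred


def accuracy(inferred, truth):
    """Fraction of tasks whose inferred answer equals the ground truth."""
    hits = sum(inferred[task_id] == truth[task_id] for task_id in truth)
    return hits / len(truth)


# Made-up toy data for illustration only; not taken from NetEaseCrowd.
toy_annotations = [
    ("t1", 0), ("t1", 0), ("t1", 2),
    ("t2", 1), ("t2", 2), ("t2", 2),
    ("t3", 1), ("t3", 0), ("t3", 2),  # three-way tie
]
toy_truth = {"t1": 0, "t2": 2, "t3": 1}

inferred = majority_vote(toy_annotations)
print(inferred, accuracy(inferred, toy_truth))
```

On the toy data, tasks `t1` and `t2` are inferred correctly, while the tied task `t3` falls to label 0 under the assumed tie-break, giving an accuracy of 2/3.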

License

The NetEaseCrowd dataset is licensed under CC-BY-SA-4.0.

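Collected Summary

Dataset Introduction

Background and Challenges

Background Overview

NetEaseCrowd is a large-scale crowdsourced annotation dataset with about 2,400 workers, 1,000,000 tasks, and 6,000,000 annotations spanning 6 months. It is suited to research on long-term and online crowdsourcing truth inference. The dataset provides ground-truth labels for all tasks and timestamps for all annotations, which gives it considerable practical and research value.

The content above was collected and summarized by 遇见数据集.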
