🧑‍🤝‍🧑 NetEaseCrowd: A Dataset for Long-term and Online Crowdsourcing Truth Inference
Introduction
NetEaseCrowd is a large-scale crowdsourcing annotation dataset built from a mature Chinese data crowdsourcing platform run by NetEase Inc. It contains about 2,400 workers, 1,000,000 tasks, and 6,000,000 annotations collected over approximately 6 months. The dataset provides a ground truth for every task and a timestamp for every annotation.
Task
The dataset is built on a gesture comparison task. Each task presents three choices: two show similar gestures and one shows a different gesture. Annotators must pick the one that differs.
Comparison with Existing Datasets
Compared to existing crowdsourcing datasets, NetEaseCrowd has the following characteristics:
| Characteristic | Existing Datasets | NetEaseCrowd Dataset |
|---|---|---|
| Scale | Small | Large-scale, with about 6 million annotations |
| Timestamps | No timestamps | Complete timestamps over a 6-month period |
| Task Type | Single type | Multiple task types requiring different capabilities |
Dataset Statistics
The basic statistics of NetEaseCrowd and other datasets are as follows:
| Dataset | #Worker | #Task | #Groundtruth | #Anno | Avg(#Anno/worker) | Avg(#Anno/task) | Timestamp | Task type |
|---|---|---|---|---|---|---|---|---|
| NetEaseCrowd | 2,413 | 999,799 | 999,799 | 6,016,319 | 2,493.3 | 6.0 | ✔︎ | Multiple |
| Adult | 825 | 11,040 | 333 | 92,721 | 112.4 | 8.4 | ✘ | Single |
| Birds | 39 | 108 | 108 | 4,212 | 108.0 | 39.0 | ✘ | Single |
| Dog | 109 | 807 | 807 | 8,070 | 74.0 | 10.0 | ✘ | Single |
| CF | 461 | 300 | 300 | 1,720 | 3.7 | 5.7 | ✘ | Single |
| CF_amt | 110 | 300 | 300 | 6,030 | 54.8 | 20.1 | ✘ | Single |
| Emotion | 38 | 700 | 565 | 7,000 | 184.2 | 10.0 | ✘ | Single |
| Smile | 64 | 2,134 | 159 | 30,319 | 473.7 | 14.2 | ✘ | Single |
| Face | 27 | 584 | 584 | 5,242 | 194.1 | 9.0 | ✘ | Single |
| Fact | 57 | 42,624 | 576 | 216,725 | 3,802.2 | 5.1 | ✘ | Single |
| MS | 44 | 700 | 700 | 2,945 | 66.9 | 4.2 | ✘ | Single |
| Product | 176 | 8,315 | 8,315 | 24,945 | 141.7 | 3.0 | ✘ | Single |
| RTE | 164 | 800 | 800 | 8,000 | 48.8 | 10.0 | ✘ | Single |
| Sentiment | 1,960 | 98,980 | 1,000 | 569,375 | 290.5 | 5.8 | ✘ | Single |
| SP | 203 | 4,999 | 4,999 | 27,746 | 136.7 | 5.6 | ✘ | Single |
| SP_amt | 143 | 500 | 500 | 10,000 | 69.9 | 20.0 | ✘ | Single |
| Trec | 762 | 19,033 | 2,275 | 88,385 | 116.0 | 4.6 | ✘ | Single |
| Tweet | 85 | 1,000 | 1,000 | 20,000 | 235.3 | 20.0 | ✘ | Single |
| Web | 177 | 2,665 | 2,653 | 15,567 | 87.9 | 5.8 | ✘ | Single |
| ZenCrowd_us | 74 | 2,040 | 2,040 | 12,190 | 164.7 | 6.0 | ✘ | Single |
| ZenCrowd_in | 25 | 2,040 | 2,040 | 11,205 | 448.2 | 5.5 | ✘ | Single |
| ZenCrowd_all | 78 | 2,040 | 2,040 | 21,855 | 280.2 | 10.7 | ✘ | Single |
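The two averaged columns follow from the counts: Avg(#Anno/worker) is #Anno divided by #Worker, and Avg(#Anno/task) is #Anno divided by #Task. The sketch below shows one way such statistics could be derived from raw annotation pairs. The `dataset_stats` helper and the toy data are illustrative only and are not part of the dataset's tooling.

```python
def dataset_stats(annotations):
    """Summarize an iterable of (worker_id, task_id) annotation pairs."""
    annotations = list(annotations)
    if not annotations:
        raise ValueError("need at least one annotation")
    workers = {worker for worker, _ in annotations}
    tasks = {task for _, task in annotations}
    n_anno = len(annotations)
    return {
        "workers": len(workers),
        "tasks": len(tasks),
        "annotations": n_anno,
        # Averages rounded to one decimal, matching the table's precision.
        "avg_anno_per_worker": round(n_anno / len(workers), 1),
        "avg_anno_per_task": round(n_anno / len(tasks), 1),
    }


# Toy data (hypothetical ids): 3 workers, 3 tasks, 6 annotations.
toy = [
    ("w1", "t1"), ("w2", "t1"), ("w3", "t1"),
    ("w1", "t2"), ("w2", "t2"),
    ("w1", "t3"),
]
stats = dataset_stats(toy)
```

Applying the same division to the NetEaseCrowd row (6,016,319 annotations, 2,413 workers, 999,799 tasks) reproduces the reported 2,493.3 and 6.0.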
Data Content and Format
Obtain the Data
The dataset can be accessed in two ways:
- Download it directly from Hugging Face.
- Download partitions from the GitHub repository and concatenate them.
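For the second option, the concatenation step might look like the sketch below. It assumes the partitions are CSV files that each start with the same header row; that layout, the `concat_csv_partitions` helper, and the file names in the usage comment are assumptions, so check the repository for the actual partition format before relying on it.

```python
import csv


def concat_csv_partitions(part_paths, out_path):
    """Concatenate CSV partitions into one file, writing the header once.

    Returns the number of data rows written.
    """
    header = None
    rows_written = 0
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        for path in part_paths:
            with open(path, newline="", encoding="utf-8") as part:
                reader = csv.reader(part)
                part_header = next(reader, None)
                if part_header is None:
                    continue  # empty partition, nothing to copy
                if header is None:
                    header = part_header
                    writer.writerow(header)
                elif part_header != header:
                    raise ValueError(f"header mismatch in {path}")
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written


# Hypothetical usage; the partition names are placeholders:
# concat_csv_partitions(["part_0.csv", "part_1.csv"], "NetEaseCrowd.csv")
```

Streaming row by row keeps memory use flat, which matters for a file with about 6 million records.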
Dataset Format
Each record in the dataset represents an interaction between a worker and a task, with the following columns:
- taskId: Unique id of the annotated task.
- tasksetId: Unique id of the task set.
- workerId: Unique id of the worker.
- answer: Annotation given by the worker.
- completeTime: Timestamp of annotation completion.
- truth: Ground truth of the annotated task.
- capability: Id of the capability required by the task set.
Data Sample
| tasksetId | taskId | workerId | answer | completeTime | truth | capability |
|---|---|---|---|---|---|---|
| 6980 | 1012658482844795232 | 64 | 2 | 1661917345953 | 1 | 69 |
| 6980 | 1012658482844795232 | 150 | 1 | 1661871234755 | 1 | 69 |
| 6980 | 1012658482844795232 | 263 | 0 | 1661855450281 | 1 | 69 |
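The sample rows can be parsed as in the sketch below, which makes two assumptions not stated above: that the data is serialized as CSV with these column names, and that `completeTime` is a Unix epoch timestamp in milliseconds (the 13-digit values suggest so, and would place the sample on 30 and 31 August 2022 UTC). The `parse_records` helper and the `completedAt` field are illustrative names.

```python
import csv
import io
from datetime import datetime, timezone

# The three sample rows, rewritten as CSV (assumed serialization).
SAMPLE = """tasksetId,taskId,workerId,answer,completeTime,truth,capability
6980,1012658482844795232,64,2,1661917345953,1,69
6980,1012658482844795232,150,1,1661871234755,1,69
6980,1012658482844795232,263,0,1661855450281,1,69
"""


def parse_records(text):
    """Parse CSV text into dicts of ints, sorted by completion time."""
    records = []
    for row in csv.DictReader(io.StringIO(text)):
        # Keep ids as Python ints: taskId exceeds 2**53, so a float
        # representation would silently lose precision.
        rec = {key: int(value) for key, value in row.items()}
        # Assumption: completeTime is Unix epoch milliseconds.
        seconds, millis = divmod(rec["completeTime"], 1000)
        rec["completedAt"] = datetime.fromtimestamp(
            seconds, tz=timezone.utc
        ).replace(microsecond=millis * 1000)
        records.append(rec)
    # Chronological order, as an online truth inference setting would see it.
    records.sort(key=lambda r: r["completeTime"])
    return records


records = parse_records(SAMPLE)
for rec in records:
    print(rec["workerId"], rec["answer"], rec["completedAt"].isoformat())
```

Sorting by `completeTime` replays the annotations in arrival order, which is the natural input for the long-term and online setting the dataset targets.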
Baseline Models
Several truth inference methods have been tested on the dataset, with results as follows:
| Method | Accuracy | F1-score |
|---|---|---|
| MV | 0.92695 | 0.92692 |
| DS | 0.95178 | 0.94817 |
| MACE | 0.95991 | 0.94957 |
| Wawa | 0.94814 | 0.94445 |
| ZeroBasedSkill | 0.94898 | 0.94585 |
| GLAD | 0.95064 | 0.95058 |
| EBCC | 0.91071 | 0.90996 |
| ZC | 0.95305 | 0.95301 |
| TiReMGE | 0.92713 | 0.92706 |
| LAA | 0.94173 | 0.94169 |
| BiLA | 0.88036 | 0.87896 |
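To illustrate what a truth inference method does, here is a minimal sketch of majority voting, assuming that is what MV denotes in the table. It is not the implementation behind the reported numbers; the tie-breaking rule (smallest label wins) and the toy data are assumptions made for the example.

```python
from collections import Counter, defaultdict


def majority_vote(annotations):
    """Infer one label per task from (task_id, answer) pairs."""
    votes = defaultdict(Counter)
    for task_id, answer in annotations:
        votes[task_id][answer] += 1
    inferred = {}
    for task_id, counts in votes.items():
        top = max(counts.values())
        # Assumed tie-break: pick the smallest label among the most voted.
        inferred[task_id] = min(
            label for label, count in counts.items() if count == top
        )
    return inferred


def accuracy(inferred, truths):
    """Fraction of inferred labels that match the ground truth."""
    correct = sum(1 for task, label in inferred.items() if truths[task] == label)
    return correct / len(inferred)


# Toy data with hypothetical task ids and three-way answers (0, 1, 2).
toy_annotations = [
    ("t1", 1), ("t1", 1), ("t1", 2),
    ("t2", 0), ("t2", 2), ("t2", 2),
    ("t3", 0), ("t3", 1),  # tie, resolved to label 0
]
toy_truths = {"t1": 1, "t2": 2, "t3": 1}

inferred = majority_vote(toy_annotations)
acc = accuracy(inferred, toy_truths)
```

Majority voting treats every worker as equally reliable. Methods that instead estimate per-worker reliability, such as DS, can weight answers accordingly.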
License
The NetEaseCrowd dataset is licensed under CC-BY-SA-4.0.




