stefan-it/germeval14_no_wikipedia
Source: Hugging Face. Updated 2024-05-29; indexed 2024-06-12.
Download link:
https://hf-mirror.com/datasets/stefan-it/germeval14_no_wikipedia
Resource description:
---
license: cc-by-4.0
task_categories:
- token-classification
language:
- de
---
# Filtered GermEval 2014 NER Dataset
This repository hosts a filtered version of the great [GermEval 2014](https://sites.google.com/site/germeval2014ner/) NER Dataset.
An analysis of the annotated examples shows that the dataset is heavily biased towards Wikipedia articles.
# Dataset Stats
The tables below list the 10 most frequent source domains (labelled "TLD") of the annotated examples in the training, development and test splits:
## Training Split
| TLD | Number of examples (Percentage) |
|:---------------------|:--------------------------------- |
| wikipedia.org | 12,007 (50.03%) |
| welt.de | 662 (2.76%) |
| spiegel.de | 512 (2.13%) |
| tagesspiegel.de | 424 (1.77%) |
| handelsblatt.com | 369 (1.54%) |
| fr-aktuell.de | 344 (1.43%) |
| sueddeutsche.de | 308 (1.28%) |
| abendblatt.de | 283 (1.18%) |
| berlinonline.de | 255 (1.06%) |
| szon.de | 249 (1.04%) |
## Development Split
| TLD | Number of examples (Percentage) |
|:---------------------|:--------------------------------- |
| wikipedia.org | 1,119 (50.86%) |
| welt.de | 46 (2.09%) |
| spiegel.de | 43 (1.95%) |
| fr-aktuell.de | 38 (1.73%) |
| tagesspiegel.de | 37 (1.68%) |
| handelsblatt.com | 35 (1.59%) |
| sueddeutsche.de | 28 (1.27%) |
| szon.de | 25 (1.14%) |
| feedsportal.com | 24 (1.09%) |
| berlinonline.de | 22 (1.00%) |
## Test Split
| TLD | Number of examples (Percentage) |
|:---------------------|:--------------------------------- |
| wikipedia.org | 2,547 (49.94%) |
| welt.de | 139 (2.73%) |
| spiegel.de | 88 (1.73%) |
| tagesspiegel.de | 86 (1.69%) |
| handelsblatt.com | 84 (1.65%) |
| sueddeutsche.de | 78 (1.53%) |
| abendblatt.de | 72 (1.41%) |
| fr-aktuell.de | 62 (1.22%) |
| berlinonline.de | 59 (1.16%) |
| szon.de | 57 (1.12%) |
## Summary
In every split, around 50% of the annotated examples come from Wikipedia!
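The per-domain counts above can be reproduced along the following lines. This is a minimal sketch, not the author's notebook: it assumes the GermEval 2014 files mark each sentence with a header line of the form `#<TAB>url<TAB>[date]`, and it approximates the "TLD" column by the last two labels of the host name. The helper names and the sample URLs are invented for illustration.

```python
from collections import Counter
from urllib.parse import urlparse


def registered_domain(url: str) -> str:
    """Reduce a URL to its last two host labels, e.g. de.wikipedia.org -> wikipedia.org."""
    host = urlparse(url).netloc.lower()
    return ".".join(host.split(".")[-2:])


def count_domains(lines):
    """Count source domains from sentence header lines ('#<TAB>url<TAB>[date]', assumed format)."""
    counts = Counter()
    for line in lines:
        if line.startswith("#\t"):
            url = line.split("\t")[1]
            counts[registered_domain(url)] += 1
    return counts


# Invented sample in the assumed format; real data would be read from the split files.
sample = [
    "#\thttp://de.wikipedia.org/wiki/Beispiel\t[2009-10-17]",
    "1\tBerlin\tB-LOC\tO",
    "",
    "#\thttp://www.welt.de/beispiel-artikel\t[2010-01-01]",
    "1\tHamburg\tB-LOC\tO",
    "",
    "#\thttp://de.wikipedia.org/wiki/Noch_ein_Beispiel\t[2009-10-17]",
    "1\tMünchen\tB-LOC\tO",
]

counts = count_domains(sample)
total = sum(counts.values())
for domain, n in counts.most_common(10):
    # Same layout as the tables: count followed by the share of all examples.
    print(f"{domain}: {n:,} ({100 * n / total:.2f}%)")
```

Taking the last two host labels is a simplification: it works for domains such as `welt.de`, but would mis-handle multi-part suffixes such as `.co.uk`.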
# Filtered Version & Motivation
We therefore provide a version of the GermEval 2014 dataset with all Wikipedia examples filtered out. Here is one scenario that motivates it:
Imagine you are pretraining a language model and want to measure its named entity recognition performance on GermEval 2014. Of course, you also want to compare it against other existing language models.
What would be the easiest way to get high performance on the GermEval 2014 dataset? You can literally pretrain a language model on Wikipedia only (just as [I did](https://huggingface.co/gwlms))!
It will outperform even models that are pretrained on 100+ GB! See the great [ScandEval leaderboard](https://scandeval.com/german-nlu/) and have a look at the `gwlms` models.
However, a model pretrained on Wikipedia only will perform worse on other downstream tasks such as question answering.
This Wikipedia-filtered version can therefore help to compare LMs more fairly.
## Stats for Filtered Version
The tables below show the same statistics for the filtered version of the GermEval 2014 dataset. Percentages are relative to the examples that remain after filtering:
### Training Split
| TLD | Number of examples (Percentage) |
|:---------------------|:--------------------------------- |
| welt.de | 662 (5.52%) |
| spiegel.de | 512 (4.27%) |
| tagesspiegel.de | 424 (3.54%) |
| handelsblatt.com | 369 (3.08%) |
| fr-aktuell.de | 344 (2.87%) |
| sueddeutsche.de | 308 (2.57%) |
| abendblatt.de | 283 (2.36%) |
| berlinonline.de | 255 (2.13%) |
| szon.de | 249 (2.08%) |
| n-tv.de | 195 (1.63%) |
### Development Split
| TLD | Number of examples (Percentage) |
|:---------------------|:--------------------------------- |
| welt.de | 46 (4.26%) |
| spiegel.de | 43 (3.98%) |
| fr-aktuell.de | 38 (3.52%) |
| tagesspiegel.de | 37 (3.42%) |
| handelsblatt.com | 35 (3.24%) |
| sueddeutsche.de | 28 (2.59%) |
| szon.de | 25 (2.31%) |
| feedsportal.com | 24 (2.22%) |
| berlinonline.de | 22 (2.04%) |
| rp-online.de | 21 (1.94%) |
### Test Split
| TLD | Number of examples (Percentage) |
|:---------------------|:--------------------------------- |
| welt.de | 139 (5.44%) |
| spiegel.de | 88 (3.45%) |
| tagesspiegel.de | 86 (3.37%) |
| handelsblatt.com | 84 (3.29%) |
| sueddeutsche.de | 78 (3.06%) |
| abendblatt.de | 72 (2.82%) |
| fr-aktuell.de | 62 (2.43%) |
| berlinonline.de | 59 (2.31%) |
| szon.de | 57 (2.23%) |
| feedsportal.com | 52 (2.04%) |
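
The counts in the filtered tables are unchanged; only the denominators shrink, which is why each share roughly doubles. The short check below illustrates this for `welt.de`. It is a sketch under one assumption: the unfiltered split sizes of 24,000 / 2,200 / 5,100 examples are not stated in this card, but they are the sizes implied by the Wikipedia counts and percentages in the tables above.

```python
# Unfiltered split sizes are an assumption (implied by the tables above);
# the Wikipedia and welt.de counts are taken from the tables.
splits = {
    "train": {"total": 24_000, "wikipedia": 12_007, "welt.de": 662},
    "dev": {"total": 2_200, "wikipedia": 1_119, "welt.de": 46},
    "test": {"total": 5_100, "wikipedia": 2_547, "welt.de": 139},
}


def share(count: int, total: int) -> float:
    """Percentage of `count` in `total`, rounded to two decimals as in the tables."""
    return round(100 * count / total, 2)


for name, s in splits.items():
    remaining = s["total"] - s["wikipedia"]  # examples left after filtering
    before = share(s["welt.de"], s["total"])
    after = share(s["welt.de"], remaining)
    print(f"{name}: {remaining:,} examples remain; welt.de {before}% -> {after}%")
```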
# Dataset Creation
We provide a notebook that shows how to recreate this filtered version of GermEval 2014. It can be found [here](https://huggingface.co/datasets/stefan-it/germeval14_no_wikipedia/blob/main/CreateDataset.ipynb).
Additionally, we provide a dataset loader for the awesome Flair library!
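For readers who only want the idea behind the filtering step, here is a minimal sketch. It is not the code from the notebook: it assumes the original GermEval 2014 layout, where blank lines separate sentences and each sentence starts with a header line `#<TAB>url<TAB>[date]`. All function names and the sample data are invented for illustration.

```python
def read_sentences(lines):
    """Group lines into sentences; blank lines separate sentences."""
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if line:
            sentence.append(line)
        elif sentence:
            yield sentence
            sentence = []
    if sentence:
        yield sentence


def is_from_wikipedia(sentence):
    """True if the header line ('#<TAB>url<TAB>[date]', assumed format) points to wikipedia.org."""
    header = sentence[0]
    return header.startswith("#\t") and "wikipedia.org" in header.split("\t")[1]


def drop_wikipedia(lines):
    """Return all sentences whose source is not Wikipedia."""
    return [s for s in read_sentences(lines) if not is_from_wikipedia(s)]


# Invented sample in the assumed format.
sample = [
    "#\thttp://de.wikipedia.org/wiki/Beispiel\t[2009-10-17]",
    "1\tBerlin\tB-LOC\tO",
    "",
    "#\thttp://www.welt.de/beispiel-artikel\t[2010-01-01]",
    "1\tHamburg\tB-LOC\tO",
]

kept = drop_wikipedia(sample)
print(f"kept {len(kept)} of {len(list(read_sentences(sample)))} sentences")
```

Matching on the header URL keeps the token and label lines untouched, so the remaining sentences can be written back out in the same format.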
# Licence
We keep the original license of the GermEval 2014 dataset (CC-BY-4.0).
Provided by:
stefan-it
Summary of the original information
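Filtered GermEval 2014 NER Dataset: overview
Basic dataset information
- License: CC-BY-4.0
- Task category: token classification
- Language: German
Dataset statistics
- Bias in the original dataset: the dataset draws heavily on Wikipedia articles; around 50% of the annotated examples in each split come from wikipedia.org.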
Split details
- Training split:
  - Main source: wikipedia.org (50.03%)
  - Other sources: welt.de, spiegel.de, tagesspiegel.de and others
- Development split:
  - Main source: wikipedia.org (50.86%)
  - Other sources: welt.de, spiegel.de, fr-aktuell.de and others
- Test split:
  - Main source: wikipedia.org (49.94%)
  - Other sources: welt.de, spiegel.de, tagesspiegel.de and others
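Statistics for the filtered dataset
- Purpose of filtering: examples from Wikipedia were removed to allow a fairer comparison between language models.
- Training split:
  - Main source: welt.de (5.52%)
  - Other sources: spiegel.de, tagesspiegel.de, handelsblatt.com and others
- Development split:
  - Main source: welt.de (4.26%)
  - Other sources: spiegel.de, fr-aktuell.de, tagesspiegel.de and others
- Test split:
  - Main source: welt.de (5.44%)
  - Other sources: spiegel.de, tagesspiegel.de, handelsblatt.com and others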
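Dataset creation
- A Jupyter notebook is provided that shows how to recreate this filtered version of the dataset.
License
- The license of the original GermEval 2014 dataset (CC-BY-4.0) is kept.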



