
stefan-it/germeval14_no_wikipedia

Hugging Face | Updated 2024-05-29 | Indexed 2024-06-12
Download link:
https://hf-mirror.com/datasets/stefan-it/germeval14_no_wikipedia
Description:
---
license: cc-by-4.0
task_categories:
- token-classification
language:
- de
---

# Filtered GermEval 2014 NER Dataset

This repository hosts a filtered version of the great [GermEval 2014](https://sites.google.com/site/germeval2014ner/) NER Dataset.

An analysis of the annotated examples shows that the dataset is heavily biased towards Wikipedia articles.

# Dataset Stats

The tables below list the 10 most frequent source domains (labelled TLD) from which annotated examples were retrieved, for the training, development and test splits:

## Training Split

| TLD              | Number of examples (Percentage) |
|:-----------------|:--------------------------------|
| wikipedia.org    | 12,007 (50.03%)                 |
| welt.de          | 662 (2.76%)                     |
| spiegel.de       | 512 (2.13%)                     |
| tagesspiegel.de  | 424 (1.77%)                     |
| handelsblatt.com | 369 (1.54%)                     |
| fr-aktuell.de    | 344 (1.43%)                     |
| sueddeutsche.de  | 308 (1.28%)                     |
| abendblatt.de    | 283 (1.18%)                     |
| berlinonline.de  | 255 (1.06%)                     |
| szon.de          | 249 (1.04%)                     |

## Development Split

| TLD              | Number of examples (Percentage) |
|:-----------------|:--------------------------------|
| wikipedia.org    | 1,119 (50.86%)                  |
| welt.de          | 46 (2.09%)                      |
| spiegel.de       | 43 (1.95%)                      |
| fr-aktuell.de    | 38 (1.73%)                      |
| tagesspiegel.de  | 37 (1.68%)                      |
| handelsblatt.com | 35 (1.59%)                      |
| sueddeutsche.de  | 28 (1.27%)                      |
| szon.de          | 25 (1.14%)                      |
| feedsportal.com  | 24 (1.09%)                      |
| berlinonline.de  | 22 (1.00%)                      |

## Test Split

| TLD              | Number of examples (Percentage) |
|:-----------------|:--------------------------------|
| wikipedia.org    | 2,547 (49.94%)                  |
| welt.de          | 139 (2.73%)                     |
| spiegel.de       | 88 (1.73%)                      |
| tagesspiegel.de  | 86 (1.69%)                      |
| handelsblatt.com | 84 (1.65%)                      |
| sueddeutsche.de  | 78 (1.53%)                      |
| abendblatt.de    | 72 (1.41%)                      |
| fr-aktuell.de    | 62 (1.22%)                      |
| berlinonline.de  | 59 (1.16%)                      |
| szon.de          | 57 (1.12%)                      |

## Summary

In every split, around 50% of the annotated examples come from Wikipedia!
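
Per-domain counts and percentages of the kind shown in the tables above could be produced roughly as follows. This is a minimal sketch, not the code used for this dataset: it assumes each example carries a source string that is either a URL or a bare host name, and the helper names `source_domain` and `domain_stats` as well as the toy inputs are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse


def source_domain(source: str) -> str:
    """Reduce a source string to a coarse domain such as 'wikipedia.org'.

    Assumption: the source is a full URL or a bare host name. Keeping the
    last two host labels is a naive stand-in for public-suffix handling.
    """
    host = urlparse(source).netloc if "://" in source else source.split("/")[0]
    return ".".join(host.lower().split(".")[-2:])


def domain_stats(sources, top_n=10):
    """Return (domain, count, percentage) rows for the most frequent domains."""
    counts = Counter(source_domain(s) for s in sources)
    total = sum(counts.values())
    return [(d, c, round(100 * c / total, 2)) for d, c in counts.most_common(top_n)]


# Toy input (hypothetical sources, not taken from the dataset).
sources = [
    "http://de.wikipedia.org/wiki/Berlin",
    "http://de.wikipedia.org/wiki/Hamburg",
    "http://www.welt.de/politik/article1.html",
    "spiegel.de",
]
rows = domain_stats(sources)
for domain, count, pct in rows:
    print(f"| {domain} | {count} ({pct:.2f}%) |")
```

The percentage is always taken relative to the total number of examples in the split, so removing a large source changes every other domain's share even though its count stays the same.
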
# Filtered Version & Motivation

We now create a version of the GermEval 2014 dataset with all Wikipedia examples filtered out.

Here is one scenario that illustrates the main motivation:

Imagine you are pretraining a nice language model and you want to measure its named entity recognition performance on GermEval 2014. Of course, you also want to compare its performance to that of other existing language models.

What would be the easiest way to get high performance on the GermEval 2014 dataset?

Yes, you can literally pretrain a language model on Wikipedia only (just as [I did](https://huggingface.co/gwlms))! It will outperform even models that were pretrained on more than 100 GB of data! See the great [ScandEval leaderboard](https://scandeval.com/german-nlu/) and have a look at the `gwlms` models.

However, a model pretrained on Wikipedia only will perform worse on other downstream tasks such as Question Answering.

So this Wikipedia-filtered version could help to achieve fairer comparisons between LMs.

## Stats for Filtered Version

We now present the stats for the filtered version of the GermEval 2014 dataset:

### Training Split

| TLD              | Number of examples (Percentage) |
|:-----------------|:--------------------------------|
| welt.de          | 662 (5.52%)                     |
| spiegel.de       | 512 (4.27%)                     |
| tagesspiegel.de  | 424 (3.54%)                     |
| handelsblatt.com | 369 (3.08%)                     |
| fr-aktuell.de    | 344 (2.87%)                     |
| sueddeutsche.de  | 308 (2.57%)                     |
| abendblatt.de    | 283 (2.36%)                     |
| berlinonline.de  | 255 (2.13%)                     |
| szon.de          | 249 (2.08%)                     |
| n-tv.de          | 195 (1.63%)                     |

### Development Split

| TLD              | Number of examples (Percentage) |
|:-----------------|:--------------------------------|
| welt.de          | 46 (4.26%)                      |
| spiegel.de       | 43 (3.98%)                      |
| fr-aktuell.de    | 38 (3.52%)                      |
| tagesspiegel.de  | 37 (3.42%)                      |
| handelsblatt.com | 35 (3.24%)                      |
| sueddeutsche.de  | 28 (2.59%)                      |
| szon.de          | 25 (2.31%)                      |
| feedsportal.com  | 24 (2.22%)                      |
| berlinonline.de  | 22 (2.04%)                      |
| rp-online.de     | 21 (1.94%)                      |

### Test Split

| TLD              | Number of examples (Percentage) |
|:-----------------|:--------------------------------|
| welt.de          | 139 (5.44%)                     |
| spiegel.de       | 88 (3.45%)                      |
| tagesspiegel.de  | 86 (3.37%)                      |
| handelsblatt.com | 84 (3.29%)                      |
| sueddeutsche.de  | 78 (3.06%)                      |
| abendblatt.de    | 72 (2.82%)                      |
| fr-aktuell.de    | 62 (2.43%)                      |
| berlinonline.de  | 59 (2.31%)                      |
| szon.de          | 57 (2.23%)                      |
| feedsportal.com  | 52 (2.04%)                      |

# Dataset Creation

We provide a notebook that shows how to recreate this filtered version of GermEval 2014. It can be found [here](https://huggingface.co/datasets/stefan-it/germeval14_no_wikipedia/blob/main/CreateDataset.ipynb).

Additionally, we provide a dataset loader for the awesome Flair library!

# Licence

We keep the original license of the GermEval 2014 dataset (CC-BY-4.0).
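
As a rough illustration of the filtering step described under Dataset Creation (the linked notebook is the authoritative version), the sketch below drops every example whose source mentions `wikipedia.org` and then recomputes a domain's share. The example layout (dicts with `source`, `tokens` and `tags` keys), the helper names, and the training-split total of 24,000 are assumptions; the total is inferred from 12,007 examples being reported as 50.03%.

```python
def is_wikipedia(source: str) -> bool:
    """Assumption: an example is Wikipedia-derived if its source string
    mentions 'wikipedia.org'."""
    return "wikipedia.org" in source.lower()


def drop_wikipedia(examples):
    """Keep only examples whose 'source' field is not from Wikipedia.

    Each example is a dict with the hypothetical keys 'source', 'tokens', 'tags'.
    """
    return [ex for ex in examples if not is_wikipedia(ex["source"])]


def share(count: int, total: int) -> float:
    """Percentage of `count` in `total`, rounded to two decimals."""
    return round(100 * count / total, 2)


# Toy examples (hypothetical, not taken from the dataset).
examples = [
    {"source": "http://de.wikipedia.org/wiki/Berlin", "tokens": ["Berlin"], "tags": ["B-LOC"]},
    {"source": "welt.de", "tokens": ["Merkel"], "tags": ["B-PER"]},
]
filtered = drop_wikipedia(examples)

# Training-split arithmetic with the counts reported in the tables.
train_total = 24_000                    # assumed total, inferred from 12,007 = 50.03%
train_filtered = train_total - 12_007   # examples left after removing Wikipedia
print(share(662, train_total))          # welt.de share before filtering
print(share(662, train_filtered))       # welt.de share after filtering
```

Because roughly half of each split is removed, the counts of the remaining domains stay unchanged while their percentages approximately double.
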
Provider:
stefan-it
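Summary of Original Information

Filtered GermEval 2014 NER Dataset: Overview

Basic Dataset Information

  • License: CC-BY-4.0
  • Task category: token classification
  • Language: German

Dataset Statistics

  • Bias in the original dataset: the dataset is drawn mainly from Wikipedia articles; about 50% of the annotated examples in each split come from wikipedia.org.

Split Details

  • Training split:

    • Main source: wikipedia.org (50.03%)
    • Other sources: welt.de, spiegel.de, tagesspiegel.de, etc.
  • Development split:

    • Main source: wikipedia.org (50.86%)
    • Other sources: welt.de, spiegel.de, fr-aktuell.de, etc.
  • Test split:

    • Main source: wikipedia.org (49.94%)
    • Other sources: welt.de, spiegel.de, tagesspiegel.de, etc.

Statistics for the Filtered Dataset

  • Purpose of filtering: examples from Wikipedia were removed to allow a fairer comparison of different language models.

  • Training split:

    • Main source: welt.de (5.52%)
    • Other sources: spiegel.de, tagesspiegel.de, handelsblatt.com, etc.
  • Development split:

    • Main source: welt.de (4.26%)
    • Other sources: spiegel.de, fr-aktuell.de, tagesspiegel.de, etc.
  • Test split:

    • Main source: welt.de (5.44%)
    • Other sources: spiegel.de, tagesspiegel.de, handelsblatt.com, etc.

Dataset Creation

  • A Jupyter notebook is provided that shows how to recreate this filtered version of the dataset.

License

  • The license of the original GermEval 2014 dataset (CC-BY-4.0) is retained.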