and-effect/mdk_gov_data_titles_clf

Name: and-effect/mdk_gov_data_titles_clf
Creator: and-effect
Published: 2023-05-25 12:43:42
License: No description available

Hugging Face, updated 2023-05-25, indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/and-effect/mdk_gov_data_titles_clf

Resource description:

---
annotations_creators: crowdsourced
language_creators: other
language: de
multilinguality: monolingual
size_categories:
- 1K<n<10K
source_datasets: extended
task_categories:
- text-classification
pretty_name: GOVDATA dataset titles labelled
license: cc-by-4.0
---

# Dataset Card for MDK

This dataset was created as part of the [Bertelsmann Foundation's](https://www.bertelsmann-stiftung.de/de/startseite) [Musterdatenkatalog (MDK)](https://www.bertelsmann-stiftung.de/de/unsere-projekte/smart-country/musterdatenkatalog) project. The MDK provides an overview of Open Data in municipalities in Germany. It is intended to help municipalities in Germany, as well as data analysts and journalists, to get an overview of the topics covered and the extent to which cities have already published datasets.

## Dataset Description

### Dataset Summary

The dataset is an annotated corpus of 1258 records based on the metadata of the datasets from [GOVDATA](https://www.govdata.de/). GovData is a data portal that aims to make cities' data available in a standardized way. The annotation maps the titles of the datasets to a taxonomy containing categories such as 'Verkehr - KFZ - Messung' or 'Abfallwirtschaft - Abfallkalender'. Through this assignment, the names of the datasets can be normalized and grouped. In total, the taxonomy consists of 250 categories. Each category is divided into two levels:

- Level 1: "Thema" (topic) ![](taxonomy_elinor.png)
- Level 2: "Bezeichnung" (label).

The first dash divides the levels. For example:

![](topic_label_example.png)

You can find an interactive view of the taxonomy with all labels [here](https://huggingface.co/spaces/and-effect/Musterdatenkatalog).

The repository contains a small and a large version of the data. The small version is for testing purposes only. The large dataset contains all 1258 entries. The large and small datasets are each split into a training and a testing dataset.
In addition, the large dataset folder contains a validation dataset that has been annotated separately. The validation dataset is an additional dataset that we created for the evaluation of the algorithm. It also consists of data from GOVDATA and has the same structure as the test and training datasets.

### Languages

The language data is German.

## Dataset Structure

### Data Fields

| dataset | size |
|-----|-----|
| small/train | 18.96 KB |
| small/test | 6.13 KB |
| large/train | 517.77 KB |
| large/test | 118.66 KB |

An example record looks as follows:

```json
{
  "doc_id": "a063d3b7-4c09-421e-9849-073dc8939e76",
  "title": "Dienstleistungen Alphabetisch sortiert April 2019",
  "description": "CSV-Datei mit allen Dienstleistungen der Kreisverwaltung Kleve. Sortiert nach AlphabetStand 01.04.2019",
  "labels_name": "Sonstiges - Sonstiges",
  "labels": 166
}
```

The data fields are the same among all splits:

- doc_id (uuid): identifier for each document
- title (str): dataset title from GOVDATA
- description (str): description of the dataset
- labels_name (str): annotation with labels from taxonomy
- labels (int): labels indexed from 0 to 250

### Data Splits

| dataset_name | dataset_splits | train_size | test_size | validation_size |
|-----|-----|-----|-----|-----|
| dataset_large | train, test, validation | 1009 | 249 | 101 |
| dataset_small | train, test | 37 | 13 | None |

## Dataset Creation

The dataset was created through multiple manual annotation rounds.

### Source Data

The data comes from [GOVDATA](https://www.govdata.de/), an open data portal of Germany. It aims to provide central access to administrative data from the federal, state and local governments, making the data available in one place and thus easier to use. The available data is structured in 13 categories ranging from finance to international topics, health, education, and science and technology.
[GOVDATA](https://www.govdata.de/) offers a [CKAN API](https://ckan.govdata.de/) to make requests and provides metadata for each data entry.

#### Initial Data Collection and Normalization

Several sources were used for the annotation process. A sample of actual datasets was collected from [GOVDATA](https://www.govdata.de/). For the sample, 50 records were drawn for each group. Additional samples come from the previous version of the [MDK](https://github.com/bertelsmannstift/Musterdatenkatalog), which contains older data from [GOVDATA](https://www.govdata.de/). Some of the datasets from the old [MDK](https://github.com/bertelsmannstift/Musterdatenkatalog) already contained an annotation, but since the taxonomy is not the same, the data were re-annotated. A sample was drawn from each source (randomly and by manual selection), resulting in a total of 1258 titles.

### Annotations

#### Annotation process

The data was annotated in four rounds and one additional test round. In each round, a percentage of the data was allocated to all annotators to calculate the inter-annotator agreement using Cohen's kappa. The following table shows the results of the annotation rounds:

| | **Cohen's Kappa** | **Number of Annotators** | **Number of Documents** |
| ------------------ | :--------------: | ------------------------ | ----------------------- |
| **Test Round** | .77 | 6 | 50 |
| **Round 1** | .41 | 2 | 120 |
| **Round 2** | .76 | 4 | 480 |
| **Round 3** | .71 | 3 | 420 |
| **Round 4** | .87 | 2 | 416 |
| **Validation set** | - | 1 | 177 |

In addition, a validation set was generated by the dataset curators.

#### Who are the annotators?

Annotators are all employees from [&effect data solutions GmbH](https://www.and-effect.com/).
The taxonomy, as well as rules and problems in the assignment of datasets, was discussed in advance of the development of the taxonomy and the annotation in two workshops with experts and representatives of the open data community and local governments, as well as with the project members of the [Musterdatenkatalog](https://www.bertelsmann-stiftung.de/de/unsere-projekte/smart-country/musterdatenkatalog) from the Bertelsmann Foundation. On this basis, the [&effect](https://www.and-effect.com/) employees were instructed in the annotation by the curators of the datasets.

## Considerations for Using the Data

The dataset for the annotation process was generated by sampling from [GOVDATA](https://www.govdata.de/) and from data previously collected from GOVDATA. The data on GOVDATA is continuously updated and data can get deleted. Thus, there is no guarantee that data entries included here will still be available.

### Social Impact of Dataset

Since 2017, the German government has been promoting systematic and free access to public administration data with first laws on open data in municipalities. In this way, it aims to contribute to the development of a [knowledge society](https://www.verwaltung-innovativ.de/DE/Startseite/startseite_node.html). The categorization of cities' open data in a standardized and detailed taxonomy supports this process of making municipal data freely and openly accessible in a structured form.

### Discussion of Biases (non-ethical)

The data was mainly sampled at random from the categories available on GOVDATA. Although all categories were sampled, there is still some imbalance in the data. For example, entries for the concept 'Raumordnung, Raumplanung und Raumentwicklung - Bebauungsplan' make up the majority class. Although data was also selected manually, entries could not be found for all concepts. However, for 95% of concepts at least one data entry is available.
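The imbalance and coverage figures discussed under "Discussion of Biases" can be recomputed from the `labels_name` field. The following is a minimal sketch, assuming the records are available as a list of dicts with the fields from this card. The helper names `label_distribution` and `taxonomy_coverage`, the toy records, and the four-entry toy taxonomy are hypothetical and are not taken from the dataset.

```python
from collections import Counter

def label_distribution(records):
    """Return (label, count, share) tuples, most frequent label first."""
    counts = Counter(r["labels_name"] for r in records)
    total = sum(counts.values())
    return [(label, n, n / total) for label, n in counts.most_common()]

def taxonomy_coverage(records, taxonomy):
    """Share of taxonomy categories that have at least one record."""
    seen = {r["labels_name"] for r in records}
    return len(seen & set(taxonomy)) / len(taxonomy)

# Hypothetical toy data for illustration only.
BPLAN = "Raumordnung, Raumplanung und Raumentwicklung - Bebauungsplan"
toy_taxonomy = [
    BPLAN,
    "Verkehr - KFZ - Messung",
    "Abfallwirtschaft - Abfallkalender",
    "Sonstiges - Sonstiges",
]
toy_records = [
    {"labels_name": BPLAN},
    {"labels_name": BPLAN},
    {"labels_name": BPLAN},
    {"labels_name": "Verkehr - KFZ - Messung"},
    {"labels_name": "Sonstiges - Sonstiges"},
]

distribution = label_distribution(toy_records)
coverage = taxonomy_coverage(toy_records, toy_taxonomy)

for label, n, share in distribution:
    print(f"{n:3d}  {share:6.1%}  {label}")
print(f"coverage: {coverage:.0%}")
```

Run against the real training split, the first row of the distribution would show how dominant the majority class is, which is useful when choosing evaluation metrics or class weights.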
## Additional Information

### Dataset Curators

Friederike Bauer

Rahkakavee Baskaran

### Licensing Information

CC BY 4.0

Provider:

and-effect

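Summary of original information

Dataset overview

Basic information

Name: GOVDATA dataset titles labelled
Language: German (de)
License: CC BY 4.0
Multilinguality: monolingual
Task category: text classification
Size: 1K<n<10K
Source datasets: extended from GOVDATA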

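Dataset description

Summary: The dataset contains 1258 records annotated on the basis of the metadata of datasets from GOVDATA. GOVDATA is a portal that aims to make cities' data available in a standardized way.
Annotation: Dataset titles are mapped to a taxonomy of 250 categories; each category has two levels: topic (Thema) and label (Bezeichnung).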
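
According to the dataset card, the first dash in a category name separates the topic from the label. A minimal sketch of that split follows, assuming the separator is always written as a dash surrounded by spaces, as in the examples on the card. The function name `split_label` is hypothetical.

```python
from typing import Tuple

def split_label(labels_name: str) -> Tuple[str, str]:
    """Split a category name into (Thema, Bezeichnung) at the first ' - '.

    Assumption: the level separator is ' - ' with surrounding spaces, so
    hyphenated words without spaces are left intact.
    """
    thema, sep, bezeichnung = labels_name.partition(" - ")
    if not sep:
        raise ValueError(f"no ' - ' separator in {labels_name!r}")
    return thema.strip(), bezeichnung.strip()

# Only the first dash divides the levels; later dashes stay in the label.
print(split_label("Verkehr - KFZ - Messung"))            # ('Verkehr', 'KFZ - Messung')
print(split_label("Abfallwirtschaft - Abfallkalender"))  # ('Abfallwirtschaft', 'Abfallkalender')
```

Grouping records by the first element of the returned tuple gives a coarser, topic-level view of the same annotations.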

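Dataset structure

Data fields:
- doc_id (uuid): document identifier
- title (str): dataset title from GOVDATA
- description (str): description of the dataset
- labels_name (str): annotation with a label from the taxonomy
- labels (int): label index from 0 to 250
Data splits:
- Large dataset: training set (1009 records), test set (249 records), validation set (101 records)
- Small dataset: training set (37 records), test set (13 records)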
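
The field list and split sizes above can be expressed as a small typed record with basic validation. This is a sketch under stated assumptions: the `Record` class and `parse_record` helper are hypothetical names, the label range check simply mirrors the "0 to 250" wording of the card, and the example record is the one shown on the card.

```python
import json
import uuid
from dataclasses import dataclass

@dataclass(frozen=True)
class Record:
    doc_id: uuid.UUID
    title: str
    description: str
    labels_name: str
    labels: int

def parse_record(raw: dict) -> Record:
    """Convert one raw JSON object into a validated Record."""
    label = int(raw["labels"])
    if not 0 <= label <= 250:  # range as stated on the dataset card
        raise ValueError(f"label index out of range: {label}")
    return Record(
        doc_id=uuid.UUID(raw["doc_id"]),  # raises ValueError if malformed
        title=str(raw["title"]),
        description=str(raw["description"]),
        labels_name=str(raw["labels_name"]),
        labels=label,
    )

example = json.loads("""
{
  "doc_id": "a063d3b7-4c09-421e-9849-073dc8939e76",
  "title": "Dienstleistungen Alphabetisch sortiert April 2019",
  "description": "CSV-Datei mit allen Dienstleistungen der Kreisverwaltung Kleve. Sortiert nach AlphabetStand 01.04.2019",
  "labels_name": "Sonstiges - Sonstiges",
  "labels": 166
}
""")
record = parse_record(example)

# Split sizes as listed above; train + test of the large version
# add up to the 1258 annotated titles.
SPLIT_SIZES = {
    "large": {"train": 1009, "test": 249, "validation": 101},
    "small": {"train": 37, "test": 13},
}
large_total = SPLIT_SIZES["large"]["train"] + SPLIT_SIZES["large"]["test"]
print(record.labels_name, record.labels, large_total)
```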

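Dataset creation

Source data: The data comes from GOVDATA, Germany's open data portal.
Initial data collection and normalization: Samples were drawn from GOVDATA and from the previous version of the MDK, both randomly and by manual selection, for a total of 1258 titles.
Annotation process: The data was annotated in four rounds plus one test round; inter-annotator agreement was calculated using Cohen's kappa.
Annotators: All annotators are employees of &effect data solutions GmbH.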
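
For readers unfamiliar with the agreement measure mentioned above, Cohen's kappa for two annotators is (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The sketch below is a plain pairwise implementation with hypothetical toy labels; the card does not state how rounds with more than two annotators were aggregated, so this is not necessarily the curators' exact procedure.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a.keys() & count_b.keys()) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (p_o - p_e) / (1 - p_e)

# Hypothetical annotations of six titles by two annotators.
annotator_1 = ["A", "A", "B", "B", "C", "A"]
annotator_2 = ["A", "B", "B", "B", "C", "A"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 2))  # 17/23, about 0.74
```

With many categories and few items per category, chance agreement p_e is small, so kappa stays close to the raw agreement rate.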

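Considerations for using the dataset

Data availability: Because the data on GOVDATA is continuously updated, entries included in the dataset may no longer be available.

Social impact

Impact: The dataset supports the German government's policy of opening up public administration data and contributes to the development of a knowledge society.

Discussion of biases

Data imbalance: Although all categories were sampled, some imbalance remains in the data; for example, the category 'Raumordnung, Raumplanung und Raumentwicklung - Bebauungsplan' (spatial planning and development - development plan) is the majority class.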
