Alienmaster/wikipedia_leipzig_de_2021

Name: Alienmaster/wikipedia_leipzig_de_2021
Creator: Alienmaster
Published: 2024-04-18 13:56:34
License: CC-BY-SA-4.0

Source: Hugging Face. Updated: 2024-04-18. Indexed: 2024-06-12.

Download link:

https://hf-mirror.com/datasets/Alienmaster/wikipedia_leipzig_de_2021


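Summary:

This dataset contains data splits of different sizes (from 10k to 1mio) collected from the German Wikipedia in 2021. Every data element is labeled neutral. The dataset is monolingual (German), licensed under cc-by-sa-4.0, falls in the size category 100K<n<1M, and has the task category text classification. The source of the data can be found via the link provided.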

Provider:

Alienmaster

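Summary of original information

Dataset overview

Basic information

Language: German (de)
Multilinguality: monolingual
License: CC-BY-SA-4.0
Size category: 100K<n<1M
Task category: text classification
Pretty name: Leipzig Corpora Wikipedia 2021 German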

Configuration details

Config name: default
Data files:
- 10k: path "10k.parquet"
- 30k: path "30k.parquet"
- 100k: path "100k.parquet"
- 1mio: path "1mio.parquet"
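
The split-to-file mapping above can be sketched in Python as follows. This is a minimal illustration, not an official loader: the helper names `parquet_url` and `load_split` are hypothetical, the `huggingface.co/.../resolve/main/...` URL pattern is an assumption about where the files are served, and `load_split` assumes the `datasets` library is installed and that the split names match the listing.

```python
# Repository ID and split-to-file mapping, copied from the listing above.
REPO_ID = "Alienmaster/wikipedia_leipzig_de_2021"
SPLIT_FILES = {
    "10k": "10k.parquet",
    "30k": "30k.parquet",
    "100k": "100k.parquet",
    "1mio": "1mio.parquet",
}


def _check_split(split: str) -> None:
    """Raise KeyError if the split is not one of the four listed above."""
    if split not in SPLIT_FILES:
        raise KeyError(f"unknown split {split!r}; expected one of {sorted(SPLIT_FILES)}")


def parquet_url(split: str, revision: str = "main") -> str:
    """Build a direct file URL for one split (hypothetical helper; URL pattern assumed)."""
    _check_split(split)
    return (
        f"https://huggingface.co/datasets/{REPO_ID}"
        f"/resolve/{revision}/{SPLIT_FILES[split]}"
    )


def load_split(split: str = "10k"):
    """Load one split via the `datasets` library (hypothetical helper; needs network)."""
    # Imported lazily so the pure-Python helpers above work without the library.
    from datasets import load_dataset

    _check_split(split)
    return load_dataset(REPO_ID, split=split)
```

Starting with the 10k split, for example `load_split("10k")`, keeps the first download small before moving on to the 1mio file.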

Data description

Content source: German Wikipedia, 2021
Collection period: 2021
Labels: every element is labeled "neutral"

Citation

@inproceedings{goldhahn-etal-2012-building,
    title = "Building Large Monolingual Dictionaries at the {L}eipzig Corpora Collection: From 100 to 200 Languages",
    author = "Goldhahn, Dirk and Eckart, Thomas and Quasthoff, Uwe",
    editor = "Calzolari, Nicoletta and Choukri, Khalid and Declerck, Thierry and Do{\u{g}}an, Mehmet U{\u{g}}ur and Maegaard, Bente and Mariani, Joseph and Moreno, Asuncion and Odijk, Jan and Piperidis, Stelios",
    booktitle = "Proceedings of the Eighth International Conference on Language Resources and Evaluation ({LREC}'12)",
    month = may,
    year = "2012",
    address = "Istanbul, Turkey",
    publisher = "European Language Resources Association (ELRA)",
    url = "http://www.lrec-conf.org/proceedings/lrec2012/pdf/327_Paper.pdf",
    pages = "759--765",
    abstract = "The Leipzig Corpora Collection offers free online access to 136 monolingual dictionaries enriched with statistical information. In this paper we describe current advances of the project in collecting and processing text data automatically for a large number of languages. Our main interest lies in languages of low density, where only few text data exists online. The aim of this approach is to create monolingual dictionaries and statistical information for a high number of new languages and to expand the existing dictionaries, opening up new possibilities for linguistic typology and other research. Focus of this paper will be set on the infrastructure for the automatic acquisition of large amounts of monolingual text in many languages from various sources. Preliminary results of the collection of text data will be presented. The mainly language-independent framework for preprocessing, cleaning and creating the corpora and computing the necessary statistics will also be depicted.",
}
