
Alienmaster/wikipedia_leipzig_de_2021

Source: Hugging Face. Updated: 2024-04-18. Indexed: 2024-06-12.
Download link:
https://hf-mirror.com/datasets/Alienmaster/wikipedia_leipzig_de_2021
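Resource summary:
This dataset contains data splits of various sizes (from 10k to 1mio) collected from the German Wikipedia in 2021. Every data element is labeled neutral. The dataset is in German and monolingual, is licensed under cc-by-sa-4.0, falls in the 100K<n<1M size category, and is intended for text classification. The source of the dataset can be found at the link provided.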

Provider:
Alienmaster
Summary of original information

Dataset overview

Basic information

  • Language: German (de)
  • Multilinguality: monolingual
  • License: CC-BY-SA-4.0
  • Size category: 100K<n<1M
  • Task category: text classification
  • Pretty name: Leipzig Corpora Wikipedia 2021 German

Configuration details

  • Configuration name: default
  • Data files:
    • 10k: path "10k.parquet"
    • 30k: path "30k.parquet"
    • 100k: path "100k.parquet"
    • 1mio: path "1mio.parquet"
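
The configuration above maps four split names to parquet files. A minimal sketch of how one size variant might be loaded with the Hugging Face `datasets` library follows. The helper name `load_split` and the constants `REPO_ID` and `DATA_FILES` are hypothetical and introduced only for illustration; the call assumes the `datasets` package is installed and the repository is reachable.

```python
REPO_ID = "Alienmaster/wikipedia_leipzig_de_2021"

# Split name -> parquet file, copied from the "default" configuration listed above.
DATA_FILES = {
    "10k": "10k.parquet",
    "30k": "30k.parquet",
    "100k": "100k.parquet",
    "1mio": "1mio.parquet",
}


def load_split(name: str):
    """Load one size variant (hypothetical helper; needs `datasets` and network access)."""
    if name not in DATA_FILES:
        raise ValueError(f"unknown split {name!r}; expected one of {sorted(DATA_FILES)}")
    # Imported lazily so the mapping above is usable without the library installed.
    from datasets import load_dataset

    # Passing data_files restricts the download to the single requested parquet file.
    return load_dataset(REPO_ID, data_files={name: DATA_FILES[name]}, split=name)
```

Calling `load_split("10k")` would fetch only the smallest file, which is a reasonable starting point before moving to the 1mio variant.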

Data description

  • Content source: the German Wikipedia, 2021
  • Collection period: 2021
  • Labels: every element is labeled "neutral"
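
Because every element carries the same label, a quick sanity check after loading is to count the labels and confirm that only "neutral" occurs. The sketch below works on any iterable of row dictionaries. The column name `label` and the two sample rows are assumptions made for illustration; they are not taken from the dataset.

```python
from collections import Counter


def label_counts(rows, label_key="label"):
    """Count how often each label value occurs (label_key is an assumed column name)."""
    return Counter(row[label_key] for row in rows)


def is_single_label(rows, expected="neutral", label_key="label"):
    """Return True only if at least one row exists and every row has the expected label."""
    return set(label_counts(rows, label_key)) == {expected}


# Invented placeholder rows, not real dataset content.
sample = [
    {"text": "Beispielsatz eins.", "label": "neutral"},
    {"text": "Beispielsatz zwei.", "label": "neutral"},
]
print(label_counts(sample))
print(is_single_label(sample))
```

A single-label corpus cannot train a classifier on its own, so a check like this mainly guards against mixing in rows from another source by accident.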

Citation information

@inproceedings{goldhahn-etal-2012-building,
    title = "Building Large Monolingual Dictionaries at the {L}eipzig Corpora Collection: From 100 to 200 Languages",
    author = "Goldhahn, Dirk and Eckart, Thomas and Quasthoff, Uwe",
    editor = "Calzolari, Nicoletta and Choukri, Khalid and Declerck, Thierry and Do{\u{g}}an, Mehmet U{\u{g}}ur and Maegaard, Bente and Mariani, Joseph and Moreno, Asuncion and Odijk, Jan and Piperidis, Stelios",
    booktitle = "Proceedings of the Eighth International Conference on Language Resources and Evaluation ({LREC}'12)",
    month = may,
    year = "2012",
    address = "Istanbul, Turkey",
    publisher = "European Language Resources Association (ELRA)",
    url = "http://www.lrec-conf.org/proceedings/lrec2012/pdf/327_Paper.pdf",
    pages = "759--765",
    abstract = "The Leipzig Corpora Collection offers free online access to 136 monolingual dictionaries enriched with statistical information. In this paper we describe current advances of the project in collecting and processing text data automatically for a large number of languages. Our main interest lies in languages of “low density”, where only few text data exists online. The aim of this approach is to create monolingual dictionaries and statistical information for a high number of new languages and to expand the existing dictionaries, opening up new possibilities for linguistic typology and other research. Focus of this paper will be set on the infrastructure for the automatic acquisition of large amounts of monolingual text in many languages from various sources. Preliminary results of the collection of text data will be presented. The mainly language-independent framework for preprocessing, cleaning and creating the corpora and computing the necessary statistics will also be depicted.",
}
