coral-nlp/german-commons
Source: Hugging Face. Updated: 2026-01-22. Indexed: 2025-10-25.
Download link:
https://hf-mirror.com/datasets/coral-nlp/german-commons
Resource description:
A comprehensive collection of German-language text under open licenses, intended for training German language models. The dataset contains 154.56 billion tokens across 35.78 million documents, drawn from 41 sources and spanning 7 thematic domains: Web Commons, Political Commons, Legal Commons, News Commons, Economics Commons, Cultural Commons, and Scientific Commons. Each record contains the fields id (unique document identifier), source (source dataset name), subset (thematic subset), text (document text), license (applicable license), num_tokens (token count), perplexity (text perplexity), and ocr_score (OCR quality score).
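
The record layout described above can be sketched in Python as follows. This is only an illustration: the field names come from the description, but the field types, the handling of a missing ocr_score, and the helper names `filter_records` and `total_tokens` are assumptions and not part of the dataset or any library.

```python
from typing import Iterable, Iterator, Optional, TypedDict


class Record(TypedDict):
    # Field names are taken from the description; the types are assumptions.
    id: str
    source: str
    subset: str
    text: str
    license: str
    num_tokens: int
    perplexity: float
    ocr_score: Optional[float]  # assumed to be possibly missing (None)


def filter_records(
    records: Iterable[Record],
    subset: Optional[str] = None,
    max_perplexity: Optional[float] = None,
    min_ocr_score: Optional[float] = None,
) -> Iterator[Record]:
    """Hypothetical helper: lazily yield records that pass every given threshold."""
    for rec in records:
        if subset is not None and rec["subset"] != subset:
            continue
        if max_perplexity is not None and rec["perplexity"] > max_perplexity:
            continue
        score = rec["ocr_score"]
        # Assumption: records without an OCR score are kept, not dropped.
        if min_ocr_score is not None and score is not None and score < min_ocr_score:
            continue
        yield rec


def total_tokens(records: Iterable[Record]) -> int:
    """Hypothetical helper: sum the num_tokens field over the given records."""
    return sum(rec["num_tokens"] for rec in records)
```

Because `filter_records` accepts any iterable of dictionaries, it could in principle be applied to a streamed copy of the data, for example one obtained with `load_dataset("coral-nlp/german-commons", streaming=True)` from the Hugging Face `datasets` library. The available split names and the exact strings stored in the subset field are not stated here and would need to be checked against the dataset card.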
Provider:
coral-nlp



