stefan-it/german-dbmdz-bert-corpus
Source: Hugging Face. Updated 2023-12-22; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/stefan-it/german-dbmdz-bert-corpus
Description:
---
license: cc-by-sa-3.0
language:
- de
---
# German DBMDZ BERT Corpus
This dataset includes all corpora that were used for pretraining the [German DBMDZ BERT Models](https://github.com/dbmdz/berts?tab=readme-ov-file#german-bert).
It consists of a Wikipedia dump and corpora from [OPUS](https://opus.nlpl.eu/):
| Filename | Description | Creation Date | File Size |
| ------------------- | ------------------ | ------------ | --------- |
| `dewiki.txt` | Wikipedia Dump | May 2019 | 5.1GB |
| `eubookshop.txt` | OPUS EUbookshop | November 2018 | 2.2GB |
| `news.2018.txt` | OPUS News corpora | January 2019 | 4.1GB |
| `opensubtitles.txt` | OPUS OpenSubtitles | November 2018 | 1.3GB |
| `paracrawl.txt` | OPUS ParaCrawl | November 2018 | 3.1GB |
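
As a rough planning aid, the table above can be turned into a small manifest that reports the total download size and builds a per-file URL. This is a minimal sketch, not part of the original card: the helper names `total_size_gb` and `resolve_url` are hypothetical, and it assumes the files sit at the repository root and are reachable through the Hub's usual `/datasets/<repo>/resolve/<revision>/<file>` URL layout. Pass a different `base_url` to use a mirror.

```python
from urllib.parse import quote

REPO_ID = "stefan-it/german-dbmdz-bert-corpus"

# File names and sizes in GB, copied from the table above.
CORPUS_FILES = {
    "dewiki.txt": 5.1,
    "eubookshop.txt": 2.2,
    "news.2018.txt": 4.1,
    "opensubtitles.txt": 1.3,
    "paracrawl.txt": 3.1,
}


def total_size_gb(files=CORPUS_FILES):
    """Sum the listed file sizes, rounded to one decimal like the table."""
    return round(sum(files.values()), 1)


def resolve_url(filename, base_url="https://huggingface.co", revision="main"):
    """Build a direct download URL for one corpus file.

    Assumption: the file lives at the repository root and the host follows
    the standard Hub "resolve" URL layout.
    """
    if filename not in CORPUS_FILES:
        raise KeyError(f"unknown corpus file: {filename}")
    return f"{base_url}/datasets/{REPO_ID}/resolve/{revision}/{quote(filename)}"


if __name__ == "__main__":
    print(f"Total size: {total_size_gb()} GB")
    for name in CORPUS_FILES:
        print(resolve_url(name))
```

The listed sizes add up to 15.8GB, so it is worth checking free disk space before fetching everything.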
Provider:
stefan-it



