flax-community/german_common_crawl
Source: Hugging Face · Updated 2023-10-02 · Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/flax-community/german_common_crawl
Description:
---
language:
- de
---
The dataset script is more or less ready and one file has correctly been converted so far: `https://opendata.iisys.de/systemintegration/Datasets/CommonCrawl/head/de_head_0000_2015-48.tar.gz`
You can try downloading the file as follows:
```python
from datasets import load_dataset
ds = load_dataset("flax-community/german_common_crawl", "first")
```
This can be done on your local computer and should take only around 2 GB of disk space.
Note, however, that this loads only the first of more than 100 files.
We now need to add **all** other files to this repo. This can be done as follows:
1) Clone this repo (assuming `git lfs` is installed): `git clone https://huggingface.co/datasets/flax-community/german_common_crawl`
2) For each file from
`https://opendata.iisys.de/systemintegration/Datasets/CommonCrawl/head/de_head_0000_2016-18.tar.gz` through `https://opendata.iisys.de/systemintegration/Datasets/CommonCrawl/middle/de_middle_0009_2019-47.tar.gz`,
run the command `./convert_file.sh <file_name>`. This command downloads the file via `wget`, filters out all text that is below a threshold as explained here: https://opendata.iisys.de/systemintegration/Datasets/CommonCrawl/middle/de_middle_0009_2019-47.tar.gz, and then converts the file into the correct format.
3) Upload the file to this repo:
`git add . && git commit -m "add file x" && git push`
Ideally this can be done in a loop on a computer that has enough CPU memory. (Note that if this is done on a TPU VM, make sure to disable the TPU via `export JAX_PLATFORM_NAME=cpu`.)
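The loop mentioned above could be sketched roughly as follows. This is only an illustration, not a script that ships with the repo: the helper names `build_commands` and `process_files` are made up here, and the sketch assumes it is run from inside the cloned repo, where `./convert_file.sh` lives. It sets `JAX_PLATFORM_NAME=cpu` for the child processes and defaults to a dry run so the commands can be checked before anything is downloaded or pushed.

```python
import os
import subprocess


def build_commands(file_name):
    """Return the commands for one file, in the order of steps 2) and 3)."""
    return [
        ["./convert_file.sh", file_name],
        ["git", "add", "."],
        ["git", "commit", "-m", f"add file {file_name}"],
        ["git", "push"],
    ]


def process_files(file_names, dry_run=True):
    """Run (or, with dry_run=True, only collect) the commands for every file."""
    # Disable the TPU for child processes, as recommended for TPU VMs.
    env = dict(os.environ, JAX_PLATFORM_NAME="cpu")
    planned = []
    for file_name in file_names:
        for cmd in build_commands(file_name):
            planned.append(cmd)
            if not dry_run:
                # check=True aborts the loop at the first failing command.
                subprocess.run(cmd, check=True, env=env)
    return planned


if __name__ == "__main__":
    # Two file names taken from the examples above; the full list must be
    # supplied by whoever runs the conversion.
    example_files = [
        "de_head_0000_2016-18.tar.gz",
        "de_middle_0009_2019-47.tar.gz",
    ]
    for cmd in process_files(example_files, dry_run=True):
        print(" ".join(cmd))
```

Calling `process_files(..., dry_run=False)` would actually execute the commands. Committing and pushing after every file keeps each upload small, at the cost of one push per file.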
Also, descriptions and file names have to be added correctly to the `dataset.py` script.
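When collecting the file names for the script, it may help to split each name into its parts. The sketch below assumes that every file follows the pattern visible in the example URLs above (`de_<part>_<4-digit shard>_<crawl id>.tar.gz`); that pattern is inferred from those examples only, and the names `CrawlFile` and `parse_file_name` are hypothetical, not part of the dataset script.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Pattern inferred from names such as "de_head_0000_2015-48.tar.gz".
FILE_NAME_RE = re.compile(r"^de_([a-z]+)_(\d{4})_(\d{4}-\d{2})\.tar\.gz$")


@dataclass(frozen=True)
class CrawlFile:
    part: str    # e.g. "head" or "middle"
    shard: int   # e.g. 0 for "0000"
    crawl: str   # e.g. "2015-48"


def parse_file_name(url_or_name: str) -> Optional[CrawlFile]:
    """Parse a file name or full URL; return None if it does not match."""
    name = url_or_name.rsplit("/", 1)[-1]
    match = FILE_NAME_RE.match(name)
    if match is None:
        return None
    part, shard, crawl = match.groups()
    return CrawlFile(part=part, shard=int(shard), crawl=crawl)


if __name__ == "__main__":
    url = (
        "https://opendata.iisys.de/systemintegration/Datasets/"
        "CommonCrawl/head/de_head_0000_2015-48.tar.gz"
    )
    print(parse_file_name(url))
```

Names that do not match return `None`, which makes it easy to spot files that deviate from the assumed pattern before they are listed in the script.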
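Provider:
flax-community
Summary of original information
Dataset overview
Dataset name
- Name: german_common_crawl
- Owner: flax-community
Dataset content
- Language: German (de)
- Number of files: more than 100
- Example file:
  - Link: https://opendata.iisys.de/systemintegration/Datasets/CommonCrawl/head/de_head_0000_2015-48.tar.gz
  - Format: tar.gz
Dataset processing
- File processing: the `convert_file.sh` script downloads, filters, and converts each file.
- Filter criterion: text must meet a specific threshold.
Dataset loading
- Loading method: import `load_dataset` from `datasets` and call `load_dataset("flax-community/german_common_crawl", "first")`.
- Storage requirement: about 2 GB of disk space.
Dataset maintenance
- Maintenance steps:
  - Clone the repository: `git clone https://huggingface.co/datasets/flax-community/german_common_crawl`
  - Process each file: run `./convert_file.sh <file_name>`
  - Upload the file: run `git add . && git commit -m "add file x" && git push`
Notes
- When working on a TPU VM, disable the TPU: `export JAX_PLATFORM_NAME=cpu`
- The descriptions and file names in the `dataset.py` script need to be updated.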



