flax-community/german_common_crawl

Name: flax-community/german_common_crawl
Creator: flax-community
Published: 2023-10-02 16:46:37
License: no description available

Source: Hugging Face. Updated 2023-10-02. Indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/flax-community/german_common_crawl

Description:

---
language:
- de
---

The dataset script is more or less ready, and one file has been converted correctly so far: `https://opendata.iisys.de/systemintegration/Datasets/CommonCrawl/head/de_head_0000_2015-48.tar.gz`

You can try downloading the file as follows:

```python
from datasets import load_dataset

ds = load_dataset("flax-community/german_common_crawl", "first")
```

This can be done on your local computer and should only take around 2GB of disk space.

However, this only loads the first of >100 files. We now need to add **all** other files to this repo. This can be done as follows:

1) Clone this repo (assuming `git lfs` is installed): `git clone https://huggingface.co/datasets/flax-community/german_common_crawl`

2) For each file in the range `https://opendata.iisys.de/systemintegration/Datasets/CommonCrawl/head/de_head_0000_2016-18.tar.gz` - `https://opendata.iisys.de/systemintegration/Datasets/CommonCrawl/middle/de_middle_0009_2019-47.tar.gz`, run the command `./convert_file.sh <file_name>`. This command downloads the file via `wget`, filters out all text that is below a threshold as explained here: https://opendata.iisys.de/systemintegration/Datasets/CommonCrawl/middle/de_middle_0009_2019-47.tar.gz and then converts the file into the correct format.

3) Upload the file to this repo: `git add . && git commit -m "add file x" && git push`

Ideally this can be done in a loop on a computer that has enough CPU memory. (Note that if this is done on a TPU VM, make sure to disable the TPU via `export JAX_PLATFORM_NAME=cpu`.) Also, some description and file names still have to be added correctly to the dataset.py script.
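The convert-and-upload loop described above could be scripted roughly as follows. This is only a sketch under assumptions: the helper names `commands_for` and `process_files` are hypothetical, the script is assumed to run from inside the cloned repo, and the exact argument that `convert_file.sh` expects (bare file name or full URL) is not specified in the description, so the strings passed in are placeholders.

```python
import os
import subprocess


def commands_for(file_name):
    """Return the commands for one file, in the order of steps 2) and 3)."""
    return [
        ["./convert_file.sh", file_name],                  # download, filter, convert
        ["git", "add", "."],
        ["git", "commit", "-m", f"add file {file_name}"],
        ["git", "push"],
    ]


def process_files(file_names, run=subprocess.run, cwd="."):
    """Convert and upload each file in turn, stopping at the first failure.

    `run` is injectable so the loop can be dry-run without touching git.
    """
    # Equivalent of `export JAX_PLATFORM_NAME=cpu` for the child processes,
    # which only matters when this runs on a TPU VM.
    env = dict(os.environ, JAX_PLATFORM_NAME="cpu")
    for file_name in file_names:
        for cmd in commands_for(file_name):
            # check=True raises CalledProcessError if a command fails,
            # so a broken conversion is never committed and pushed.
            run(cmd, cwd=cwd, env=env, check=True)
```

Passing a stand-in such as `run=lambda cmd, **kw: print(cmd)` prints the commands instead of executing them, which is a cheap way to check the file list before starting a long run.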

Provided by:

flax-community

Summary of original information

Dataset overview

Dataset name

Name: german_common_crawl
Owner: flax-community

Dataset contents

Language: German (de)
Number of files: more than 100
Example file:
- Link: https://opendata.iisys.de/systemintegration/Datasets/CommonCrawl/head/de_head_0000_2015-48.tar.gz
- Format: tar.gz

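Dataset processing

File processing: the convert_file.sh script downloads, filters, and converts each file.
Filter criterion: text below a certain threshold is removed.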

Dataset loading

Loading method: use the load_dataset function, e.g. `from datasets import load_dataset` followed by `ds = load_dataset("flax-community/german_common_crawl", "first")`
Storage required: about 2GB of disk space.

Dataset maintenance

Maintenance steps:
1. Clone the repo: git clone https://huggingface.co/datasets/flax-community/german_common_crawl
2. Process each file: run ./convert_file.sh <file_name>
3. Upload the file: run git add . && git commit -m "add file x" && git push

Notes

When working on a TPU VM, disable the TPU: export JAX_PLATFORM_NAME=cpu
The description and file names in the dataset.py script still need to be updated.
