
flax-community/german_common_crawl

Source: Hugging Face. Updated: 2023-10-02. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/flax-community/german_common_crawl
Resource description:
---
language:
- de
---

The dataset script is more or less ready, and one file has been converted correctly so far: `https://opendata.iisys.de/systemintegration/Datasets/CommonCrawl/head/de_head_0000_2015-48.tar.gz`

You can try downloading the file as follows:

```python
from datasets import load_dataset

ds = load_dataset("flax-community/german_common_crawl", "first")
```

This can be done on your local computer and should only take around 2GB of disk space. However, this loads only the first of more than 100 files.

We now need to add **all** other files to this repo. This can be done as follows:

1) Clone this repo (assuming `git lfs` is installed): `git clone https://huggingface.co/datasets/flax-community/german_common_crawl`

2) For each file (`https://opendata.iisys.de/systemintegration/Datasets/CommonCrawl/head/de_head_0000_2016-18.tar.gz` - `https://opendata.iisys.de/systemintegration/Datasets/CommonCrawl/middle/de_middle_0009_2019-47.tar.gz`), run the command `./convert_file.sh <file_name>`. This command downloads the file via `wget`, filters out all text that is below a threshold as explained here: https://opendata.iisys.de/systemintegration/Datasets/CommonCrawl/middle/de_middle_0009_2019-47.tar.gz and then converts the file into the correct format.

3) Upload the file to this repo: `git add . && git commit -m "add file x" && git push`

Ideally this can be done in a loop on a computer that has enough CPU memory. (Note: if this is done on a TPU VM, make sure to disable the TPU via `export JAX_PLATFORM_NAME=cpu`.) Also, some descriptions and file names still have to be added correctly to the dataset.py script.
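The loop suggested in the README could be sketched roughly as follows. This is a minimal sketch, not the maintainers' script: the function names `build_commands` and `process_files` are hypothetical, and it assumes `convert_file.sh` accepts the file name exactly as passed in. Only `./convert_file.sh`, the `git` commands, and `JAX_PLATFORM_NAME=cpu` come from the README. It defaults to a dry run so nothing is executed until `dry_run=False` is set.

```python
import os
import subprocess


def build_commands(file_name):
    """Return the per-file commands in the order the README lists them."""
    return [
        ["./convert_file.sh", file_name],                 # download, filter, convert
        ["git", "add", "."],
        ["git", "commit", "-m", f"add file {file_name}"],
        ["git", "push"],
    ]


def process_files(file_names, dry_run=True):
    """Run (or, in a dry run, only collect) the commands for every file."""
    # Disable the TPU for child processes, as the README advises on TPU VMs.
    env = dict(os.environ, JAX_PLATFORM_NAME="cpu")
    planned = []
    for name in file_names:
        for cmd in build_commands(name):
            planned.append(cmd)
            if not dry_run:
                # check=True stops the loop at the first failing command.
                subprocess.run(cmd, check=True, env=env)
    return planned


# Dry run: print what would be executed for one file named in the README.
for cmd in process_files(["de_head_0000_2016-18.tar.gz"]):
    print(" ".join(cmd))
```

Run it from inside the cloned repository. Stopping at the first failure (`check=True`) keeps a half-converted file from being committed and pushed.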
Provider:
flax-community
Summary of original information

Dataset overview

Dataset name

  • Name: german_common_crawl
  • Owner: flax-community

Dataset contents

  • Language: German (de)
  • Number of files: more than 100
  • Example file:
    • Link: https://opendata.iisys.de/systemintegration/Datasets/CommonCrawl/head/de_head_0000_2015-48.tar.gz
    • Format: tar.gz
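Since the file names still have to be entered into the dataset.py script, it may help to split an archive URL into its parts. The sketch below is an assumption based only on the three URLs quoted on this page: the pattern `de_<segment>_<index>_<year>-<number>.tar.gz` is inferred from them, and the names `ArchiveName`, `parse_archive_url`, and the field labels are hypothetical rather than taken from the dataset script.

```python
import re
from typing import NamedTuple


class ArchiveName(NamedTuple):
    segment: str  # directory-level label seen in the URLs: "head" or "middle"
    index: int    # four-digit running number, e.g. 0000 -> 0
    crawl: str    # trailing "<year>-<number>" tag, e.g. "2015-48"


# Pattern inferred from the example URLs only; other files may deviate.
_PATTERN = re.compile(r"de_(head|middle)_(\d{4})_(\d{4}-\d{2})\.tar\.gz$")


def parse_archive_url(url):
    """Split an archive URL (or bare file name) into its three parts."""
    match = _PATTERN.search(url)
    if match is None:
        raise ValueError(f"unexpected archive name: {url}")
    segment, index, crawl = match.groups()
    return ArchiveName(segment, int(index), crawl)


example = (
    "https://opendata.iisys.de/systemintegration/Datasets/CommonCrawl/"
    "head/de_head_0000_2015-48.tar.gz"
)
print(parse_archive_url(example))
```

Raising on an unexpected name, instead of skipping it silently, makes it obvious when a file does not follow the assumed pattern.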

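Dataset processing

  • File processing: the convert_file.sh script downloads, filters, and converts each file.
  • Filter criterion: text below a threshold is removed.

Dataset loading

  • Loading method: call load_dataset, e.g.: from datasets import load_dataset; ds = load_dataset("flax-community/german_common_crawl", "first")

  • Storage requirement: about 2GB of disk space (for the first file only).

Dataset maintenance

  • Maintenance steps:
    1. Clone the repository: git clone https://huggingface.co/datasets/flax-community/german_common_crawl
    2. Process each file: run ./convert_file.sh <file_name>
    3. Upload the file: run git add . && git commit -m "add file x" && git push

Notes

  • When working on a TPU VM, disable the TPU first: export JAX_PLATFORM_NAME=cpu
  • The descriptions and file names in the dataset.py script still need to be updated.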