
Aleph-Alpha-GermanWeb

ModelScope community. Updated 2025-12-05, indexed 2025-12-06.
Download link:
https://modelscope.cn/datasets/Aleph-Alpha/Aleph-Alpha-GermanWeb
Resource description:
# AlephAlphaGermanWeb

Aleph-Alpha-GermanWeb is a new German-language dataset that combines heuristic and model-based filtering techniques with synthetic data generation to achieve SOTA performance in German-language benchmarks. The dataset draws from three sources: (1) Common Crawl web data, (2) FineWeb2, and (3) synthetically-generated data conditioned on actual, organic web data.

In our [**accompanying paper**](https://arxiv.org/pdf/2505.00022v1), we evaluated our dataset by training both a 1B Llama-style model and an 8B tokenizer-free hierarchical autoregressive transformer (HAT). A comparison on German-language benchmarks, including MMMLU, shows significant performance gains of Aleph-Alpha-GermanWeb over FineWeb2 alone. This advantage holds at the 8B scale even when FineWeb2 is enriched by human-curated high-quality data sources such as Wikipedia.

Here we provide code and data for recreation of the three parts of the dataset. We also share our trained model-based filters with accompanying inference scripts [here](https://huggingface.co/collections/Aleph-Alpha/aleph-alpha-germanweb-68010b712bf06d3479055d49), as well as the full prompts for synthetic data generation in our paper.

## (1) How to Set Up the Filtered Common Crawl Dataset

The "cc" dataset of AlephAlphaGermanWeb contains references to specific Common Crawl snapshots. It is based on the following six Common Crawl snapshots:

- CC-MAIN-2024-38
- CC-MAIN-2024-42
- CC-MAIN-2024-46
- CC-MAIN-2024-51
- CC-MAIN-2025-05
- CC-MAIN-2025-08

Our pipeline used the `nemo_curator.download.download_common_crawl` function from NeMo Curator version [0.6.0](https://github.com/NVIDIA/NeMo-Curator/tree/v0.6.0) to download the files and extract the text. The instructions below assume that you have already downloaded and processed the files using this function and that they are available locally in a folder named `/path/to/cc-snapshots`.

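When scripting the download step, it can help to keep the six snapshot names in one place and derive the earliest and latest from them. The sketch below does only that, using the standard library. The helper `parse_snapshot` and the `YYYY-WW` output form are illustrative assumptions; whether the NeMo Curator downloader expects a start/end pair in exactly this form should be checked against the v0.6.0 documentation.

```python
import re

# The six Common Crawl snapshots listed above.
SNAPSHOTS = [
    "CC-MAIN-2024-38",
    "CC-MAIN-2024-42",
    "CC-MAIN-2024-46",
    "CC-MAIN-2024-51",
    "CC-MAIN-2025-05",
    "CC-MAIN-2025-08",
]

_SNAPSHOT_PATTERN = re.compile(r"^CC-MAIN-(\d{4})-(\d{2})$")


def parse_snapshot(name):
    """Split a snapshot name into (year, week). Hypothetical helper."""
    match = _SNAPSHOT_PATTERN.match(name)
    if match is None:
        raise ValueError(f"Unexpected snapshot name: {name!r}")
    return int(match.group(1)), int(match.group(2))


# Sort chronologically and format the first and last as "YYYY-WW" (assumed form).
parsed = sorted(parse_snapshot(name) for name in SNAPSHOTS)
start_snapshot = f"{parsed[0][0]}-{parsed[0][1]:02d}"
end_snapshot = f"{parsed[-1][0]}-{parsed[-1][1]:02d}"
print(start_snapshot, end_snapshot)
```
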
The following code snippet demonstrates how to create a Hugging Face [`IterableDataset`](https://huggingface.co/docs/datasets/en/package_reference/main_classes#datasets.IterableDataset) containing all Common Crawl records referenced by the "cc" dataset:

```python
from itertools import islice
from datasets import load_dataset

# Load the dataset containing the filtered WARC IDs
filter_dataset = load_dataset("Aleph-Alpha/Aleph-Alpha-GermanWeb", name="cc", split="train")
filtered_warc_ids = frozenset(filter_dataset["warc_id"])

# Load the Common Crawl data and filter it
cc_dumps_download_path = "/path/to/cc-snapshots"
cc = load_dataset("parquet", data_dir=cc_dumps_download_path, streaming=True, split="train")
filtered_ds = cc.filter(lambda row: row["warc_id"] in filtered_warc_ids)

# Print the first 10 filtered records
for record in islice(filtered_ds, 10):
    print(record)
```

**Note:** This process loads all `warc_id`s of the records to be retained into memory, which requires approximately 3.5 GB of RAM. Depending on your hardware, it may take up to 10 minutes or more before the iteration begins.

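The filtering idea in the snippet above (hold the IDs to keep in a `frozenset`, then stream the records and test membership one at a time) can be tried out without downloading anything. The following is a minimal standard-library sketch of that pattern, not the dataset's actual pipeline. The function `filter_by_ids` and the toy records are hypothetical stand-ins for the real Common Crawl rows.

```python
from itertools import islice


def filter_by_ids(records, keep_ids, key="warc_id"):
    """Lazily yield only the records whose `key` value is in `keep_ids`."""
    for record in records:
        if record[key] in keep_ids:
            yield record


# Hypothetical toy rows standing in for extracted Common Crawl records.
toy_records = [
    {"warc_id": "a", "text": "eins"},
    {"warc_id": "b", "text": "zwei"},
    {"warc_id": "c", "text": "drei"},
]

# A frozenset gives constant-time membership checks on average.
keep = frozenset({"a", "c"})

# Nothing is read until iteration starts; islice caps how many rows are taken.
kept = list(islice(filter_by_ids(iter(toy_records), keep), 10))
print([record["warc_id"] for record in kept])
```

Because the generator is lazy, only the ID set has to fit in memory; the records themselves are processed one by one, which mirrors why the memory note above mentions the `warc_id`s and not the text.
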
## (2) How to Set Up the Filtered FineWeb 2 Dataset

The following code snippet demonstrates how to create a Hugging Face [`IterableDataset`](https://huggingface.co/docs/datasets/en/package_reference/main_classes#datasets.IterableDataset) containing all FineWeb 2 records referenced by the "fineweb2-high" dataset:

```python
import datasets

# Load the dataset containing the filtered IDs
fineweb_filtered = datasets.load_dataset("Aleph-Alpha/Aleph-Alpha-GermanWeb", name="fineweb2-high", split="train")
fineweb_ids = frozenset(fineweb_filtered["id"])

# Load the FineWeb 2 data and filter it
fineweb = datasets.load_dataset("HuggingFaceFW/fineweb-2", name="deu_Latn", split="train", streaming=True)
filtered_fineweb = fineweb.filter(lambda row: row["id"] in fineweb_ids)
```

**Note:** This process loads all `id`s of the records to be retained into memory, which requires approximately 3.5 GB of RAM. Depending on your hardware, it may take up to 10 minutes or more before the iteration begins.

## (3) Synthetic Dataset

The synthetic dataset contains the actual data and can be loaded as follows:

```python
import datasets

datasets.load_dataset("Aleph-Alpha/Aleph-Alpha-GermanWeb", name="synthetic", split="train")
```

The synthetic dataset contains two columns, `text` and `prompt_id`. The `text` is the post-processed, synthesised text generated by the LLM. The `prompt_id` is an integer between 0 and 4, and indicates the prompt template which was used for generation. These integers correspond, in order, to the following named prompts [in the appendix of our paper](https://arxiv.org/pdf/2505.00022v1): `rephrasing`, `summarisation`, `rephrasing in Wikipedia style`, `formulating questions`, and `extracting lists`.

## Save IterableDataset to Disk

The scripts above for loading the filtered CommonCrawl and FineWeb2 datasets will result in IterableDatasets, which don't have a `save_to_disk` function. In case you want to save the dataset to disk, you can use the following snippet.

```python
import datasets

# Continues from section (2): reuses `fineweb` and `fineweb_ids`.
# Keep only the `text` and `id` columns.
filtered_fineweb = fineweb.filter(lambda x: x["id"] in fineweb_ids) \
    .remove_columns(["dump", "url", "date", "file_path", "language", "language_score", "language_script", "minhash_cluster_size", "top_langs"])

features = datasets.Features({
    "text": datasets.Value("string"),
    "id": datasets.Value("string"),
})

# The lambda is a generator function that re-yields the streamed rows.
dataset = datasets.Dataset.from_generator(lambda: (yield from filtered_fineweb), features=features)
dataset.save_to_disk("/path/to/dataset", max_shard_size="4GB")
```
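
Returning to the `prompt_id` column of the synthetic dataset in section (3): since the integers 0 to 4 map, in order, to the five named prompts, a small lookup makes rows easier to inspect or group. This is an illustrative sketch using only the standard library. The names `PROMPT_NAMES` and `prompt_name`, and the toy rows, are assumptions introduced here and are not part of the dataset or its tooling.

```python
from collections import Counter

# Prompt names in the order given in section (3); the index is the prompt_id.
PROMPT_NAMES = (
    "rephrasing",
    "summarisation",
    "rephrasing in Wikipedia style",
    "formulating questions",
    "extracting lists",
)


def prompt_name(prompt_id):
    """Map a prompt_id (0 to 4) to its prompt name. Hypothetical helper."""
    if not 0 <= prompt_id < len(PROMPT_NAMES):
        raise ValueError(f"prompt_id must be between 0 and 4, got {prompt_id}")
    return PROMPT_NAMES[prompt_id]


# Hypothetical toy rows with the same two columns as the synthetic dataset.
toy_rows = [
    {"text": "Beispiel A", "prompt_id": 0},
    {"text": "Beispiel B", "prompt_id": 3},
    {"text": "Beispiel C", "prompt_id": 0},
]

# Count how many rows were produced by each prompt template.
counts = Counter(prompt_name(row["prompt_id"]) for row in toy_rows)
print(counts)
```

The same `prompt_name` function could be applied to rows of the loaded synthetic dataset, for example to select only the rows generated with one prompt template.
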

Provider:
maas
Created:
2025-08-28