bjoernp/oscar2023_deduped_filtered_1.1

Name: bjoernp/oscar2023_deduped_filtered_1.1
Creator: bjoernp
Published: 2023-11-13 09:18:16
License: no description provided

Source: Hugging Face. Updated 2023-11-13; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/bjoernp/oscar2023_deduped_filtered_1.1

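Summary:

Oscar 2023_01 DE Deduplicated is a filtered and deduplicated German text dataset derived from the 23.01 release of the OSCAR Corpus, which is produced by the OSCAR project (Open Super-large Crawled Aggregated coRpus). It is based on Common Crawl data from November/December 2022, was deduplicated with the MinHash algorithm, and falls in the size range of 10 million to 100 million documents. Each document carries metadata, including language identification, content type, and a harmful-content score (`harmful_pp`), and a filtering step uses these fields to remove low-quality documents. The dataset follows the original licensing of the OSCAR Corpus and uses the CC0 license. It is suited to pretraining large language models and to natural language processing research.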

Provider:

bjoernp

Summary of the original dataset card

Oscar 2023_01 DE Deduplicated

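Dataset overview

This is a filtered and deduplicated German subset of the 23.01 OSCAR Corpus, which is curated by the OSCAR project (Open Super-large Crawled Aggregated coRpus). OSCAR 23.01 is the January 2023 release of the OSCAR Corpus and is based on Common Crawl data from November/December 2022.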

Deduplication method

Deduplication used the MinHash implementation from the text-dedup library, with the following command:

```bash
python -m text_dedup.minhash \
  --path oscar-corpus/OSCAR-2301 \
  --name "de" \
  --cache_dir "../cache" \
  --split "train" \
  --column "text" \
  --batch_size 10000 \
  --output output/minhash_oscar_de_dedup
```
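To make the idea behind MinHash deduplication concrete, here is a minimal, standard-library sketch of how a MinHash signature approximates the Jaccard similarity of two documents. It is not text-dedup's implementation: the word-trigram shingling, the 128 hash functions, the BLAKE2b hashing, and the three example sentences are all assumptions chosen for illustration.

```python
import hashlib
import re


def shingles(text, n=3):
    """Return the set of word n-grams (shingles) of a text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def _hash(item, seed):
    """64-bit hash of a shingle; the seed selects one of many hash functions."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(items, num_perm=128):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    return [min(_hash(item, seed) for item in items) for seed in range(num_perm)]


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def exact_jaccard(set_a, set_b):
    """Exact Jaccard similarity, for comparison with the estimate."""
    return len(set_a & set_b) / len(set_a | set_b)


# Hypothetical example documents: doc_b is a near-duplicate of doc_a.
doc_a = "Der schnelle braune Fuchs springt über den faulen Hund im Garten"
doc_b = "Der schnelle braune Fuchs springt über den faulen Hund im Park"
doc_c = "Heute regnet es in Hamburg und morgen scheint die Sonne"

sh_a, sh_b, sh_c = shingles(doc_a), shingles(doc_b), shingles(doc_c)
sig_a, sig_b, sig_c = (minhash_signature(s) for s in (sh_a, sh_b, sh_c))

print("exact a/b:    ", exact_jaccard(sh_a, sh_b))
print("estimated a/b:", estimated_jaccard(sig_a, sig_b))
print("estimated a/c:", estimated_jaccard(sig_a, sig_c))
```

A deduplication tool would typically group documents whose estimated similarity exceeds a threshold and keep one document per group. Comparing fixed-length signatures rather than full shingle sets is what makes this feasible at the scale of 100 million documents.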

Deduplication statistics

Step	Runtime
Loading	10.64s
MinHashing	10574.02s
Clustering	12187.65s
Filtering	4198.70s
Saving	3560.06s
Total	30531.07s

Dataset	Number of documents
Before	103299215
After	53172498
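The two tables above can be condensed into a few derived figures. The short sketch below only does arithmetic on the reported numbers; the variable names are chosen here for illustration.

```python
# Runtimes in seconds, as reported in the table above.
step_seconds = {
    "Loading": 10.64,
    "MinHashing": 10574.02,
    "Clustering": 12187.65,
    "Filtering": 4198.70,
    "Saving": 3560.06,
}

# Document counts before and after deduplication, as reported above.
docs_before = 103_299_215
docs_after = 53_172_498

total_seconds = sum(step_seconds.values())
total_hours = total_seconds / 3600
docs_removed = docs_before - docs_after
kept_fraction = docs_after / docs_before

print(f"Total runtime: {total_seconds:.2f}s ({total_hours:.2f} h)")
print(f"Documents removed: {docs_removed:,}")
print(f"Fraction kept: {kept_fraction:.2%}")
```

The step times sum to the reported total of 30531.07 s, roughly 8.5 hours, and deduplication keeps about 51.5% of the documents, removing 50,126,717 of them.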

Dataset structure

```json
{
  "text": "English sentence phrase en français ????????????",
  "meta": {
    "warc_headers": {
      "warc-identified-content-language": "fra,eng",
      "warc-target-uri": "https://fr.wikipedia.org/wiki/...",
      "warc-record-id": "urn:uuid:29eaa920-d299-4b1d-b687-c72bd8d68116",
      "warc-type": "conversion",
      "content-length": "35298",
      "warc-refers-to": "urn:uuid:39e42055-0d94-4e45-9c6c-9e7056635d64",
      "warc-block-digest": "sha1:WFH2A5WHCS2H365GIAFYQPI7UOAMFGHB",
      "warc-date": "2022-11-26T09:45:47Z",
      "content-type": "text/plain"
    },
    "identification": {
      "label": "fr",
      "prob": 0.8938327
    },
    "harmful_pp": 4063.1814,
    "tlsh": "tlsh:T125315FF2B6088901EEA097015DB39B4600B...",
    "quality_warnings": [
      "short_sentences",
      "header",
      "footer"
    ],
    "categories": [
      "examen_pix",
      "liste_bu"
    ],
    "sentence_identifications": [
      { "label": "fr", "prob": 0.99837273 },
      { "label": "en", "prob": 0.9992377 },
      null
    ]
  }
}
```
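As a small illustration of how the metadata fields in such a record can be read, the sketch below parses an abbreviated copy of the sample record using only the standard library. The abbreviation and the variable names are choices made here. The handling of `null` entries assumes that `sentence_identifications` may contain missing values, as the sample suggests.

```python
import json
from urllib.parse import urlparse

# Abbreviated copy of the sample record; omitted fields are not needed here.
record_json = """
{
  "text": "English sentence phrase en français ????????????",
  "meta": {
    "warc_headers": {
      "warc-target-uri": "https://fr.wikipedia.org/wiki/...",
      "warc-date": "2022-11-26T09:45:47Z"
    },
    "identification": {"label": "fr", "prob": 0.8938327},
    "harmful_pp": 4063.1814,
    "quality_warnings": ["short_sentences", "header", "footer"],
    "categories": ["examen_pix", "liste_bu"],
    "sentence_identifications": [
      {"label": "fr", "prob": 0.99837273},
      {"label": "en", "prob": 0.9992377},
      null
    ]
  }
}
"""

record = json.loads(record_json)
meta = record["meta"]

# Host name of the page the text was extracted from.
host = urlparse(meta["warc_headers"]["warc-target-uri"]).hostname

# Document-level language identification.
doc_lang = meta["identification"]["label"]
doc_lang_prob = meta["identification"]["prob"]

# Per-entry language labels; JSON null becomes None and is passed through.
entry_langs = [
    ident["label"] if ident is not None else None
    for ident in meta["sentence_identifications"]
]

print(host, doc_lang, doc_lang_prob, entry_langs)
```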

Filtering method

The following code was used for filtering (the hyperparameters may differ slightly):

```python
from datasets import load_dataset, load_from_disk
import time

# Documents tagged with any of these categories are dropped.
blocked_categories = set([
    "adult",
    "aggressif",
    "malware",
    "phishing",
    "cryptojacking",
    "dangerous_material"
])

# Documents carrying any of these quality warnings are dropped.
# Note: the sample record in this card spells the warning "short_sentences"
# (with an underscore); the string below is kept as given.
blocked_quality_warnings = set([
    "tiny",
    "short sentences",
    "noisy"
])

harmful_ppl_threshold = 500
language_prob_threshold = 0.9

# Documents whose source URL contains any of these strings are dropped.
blocked_urls = set([
    "de.wikipedia.org",
    "tagesschau.de"
])

def filter_content(example):
    """Return True if the document passes all filters and should be kept."""
    has_blocked_category = False
    if "categories" in example["meta"] and example["meta"]["categories"] is not None:
        has_blocked_category = len(set(example["meta"]["categories"]).intersection(blocked_categories)) > 0
    has_blocked_quality_warnings = False
    if "quality_warnings" in example["meta"] and example["meta"]["quality_warnings"] is not None:
        has_blocked_quality_warnings = len(set(example["meta"]["quality_warnings"]).intersection(blocked_quality_warnings)) > 0
    has_blocked_url = False
    if "warc_headers" in example["meta"] and "warc-target-uri" in example["meta"]["warc_headers"] and example["meta"]["warc_headers"]["warc-target-uri"] is not None:
        has_blocked_url = any([url in example["meta"]["warc_headers"]["warc-target-uri"] for url in blocked_urls])
    # A harmful_pp value below the threshold causes the document to be dropped.
    has_harmful_ppl = example["meta"]["harmful_pp"] < harmful_ppl_threshold if "harmful_pp" in example["meta"] else False
    # Documents without a language identification are treated as failing the check.
    has_bad_german_identification = example["meta"]["identification"]["prob"] < language_prob_threshold if "identification" in example["meta"] else True
    return not (has_blocked_category or has_blocked_quality_warnings or has_blocked_url or has_harmful_ppl or has_bad_german_identification)

t_start = time.time()
ds = load_dataset("bjoernp/oscar2023_de_deduped", split="train", num_proc=128)
print(f"Loading took {time.time() - t_start}s")
print(f"Dataset size before filtering: {len(ds)}")
t_start = time.time()
ds = ds.filter(filter_content, num_proc=128)
print(f"Filtering took {time.time() - t_start}s")
print(f"Dataset size after filtering: {len(ds)}")
```
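To see how the individual checks in the filter behave, the sketch below restates the same rules as a hypothetical helper, `rejection_reasons`, which reports why a record would be dropped, and runs it on a few hand-made metadata dictionaries. The helper, its reason labels, and the toy records are illustrative assumptions, not part of the original pipeline. Only the thresholds and block lists are copied from the code above.

```python
# Block lists and thresholds copied from the filtering code in this card.
BLOCKED_CATEGORIES = {"adult", "aggressif", "malware", "phishing",
                      "cryptojacking", "dangerous_material"}
BLOCKED_QUALITY_WARNINGS = {"tiny", "short sentences", "noisy"}
BLOCKED_URLS = {"de.wikipedia.org", "tagesschau.de"}
HARMFUL_PPL_THRESHOLD = 500
LANGUAGE_PROB_THRESHOLD = 0.9


def rejection_reasons(meta):
    """Return the list of checks a record fails; an empty list means 'keep'."""
    reasons = []
    if set(meta.get("categories") or []) & BLOCKED_CATEGORIES:
        reasons.append("category")
    if set(meta.get("quality_warnings") or []) & BLOCKED_QUALITY_WARNINGS:
        reasons.append("quality_warning")
    uri = (meta.get("warc_headers") or {}).get("warc-target-uri") or ""
    if any(url in uri for url in BLOCKED_URLS):
        reasons.append("url")
    # A missing harmful_pp value does not cause rejection.
    if "harmful_pp" in meta and meta["harmful_pp"] < HARMFUL_PPL_THRESHOLD:
        reasons.append("harmful_pp")
    # A missing language identification does cause rejection.
    if "identification" not in meta or meta["identification"]["prob"] < LANGUAGE_PROB_THRESHOLD:
        reasons.append("language")
    return reasons


def make_meta(**overrides):
    """Build a toy record that passes every check, then apply overrides."""
    meta = {
        "categories": None,
        "quality_warnings": None,
        "warc_headers": {"warc-target-uri": "https://example.de/artikel"},
        "harmful_pp": 4063.1814,
        "identification": {"label": "de", "prob": 0.97},
    }
    meta.update(overrides)
    return meta


# Hypothetical records, each designed to trigger at most one or two checks.
clean = make_meta()
wiki = make_meta(warc_headers={"warc-target-uri": "https://de.wikipedia.org/wiki/Berlin"})
low_ppl = make_meta(harmful_pp=120.0)
mixed_lang = make_meta(identification={"label": "de", "prob": 0.8938327},
                       quality_warnings=["short_sentences", "header"])
adult = make_meta(categories=["adult", "liste_bu"])

for name, meta in [("clean", clean), ("wiki", wiki), ("low_ppl", low_ppl),
                   ("mixed_lang", mixed_lang), ("adult", adult)]:
    print(name, rejection_reasons(meta))
```

With the strings exactly as listed, a record tagged `short_sentences` (underscore) is not caught by the quality-warning check, because the block list contains `short sentences` (space). In this toy run, `mixed_lang` is rejected only for its language probability.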

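License

Following the original licensing scheme of the OSCAR Corpus, the dataset's metadata and annotations are released under the Creative Commons CC0 license ("no rights reserved").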
