mmarone/fineweb-edu-full-metadata

Name: mmarone/fineweb-edu-full-metadata
Creator: mmarone
Published: 2025-05-06 01:48:59
License: No description available

Hugging Face · Updated 2025-05-06 · Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/mmarone/fineweb-edu-full-metadata

Description:

[WIP]

# FineWeb-Edu with Metadata

This repo contains 3 versions of the FineWeb-Edu v1 dataset:

```
fwedu1-metaonly/
fwedu1-text-content-zstd/
fineweb-edu-1.0.0-meta-and-text/
```

These are all joinable via the `hash` column, which is [xxhash64](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.xxhash64.html) in pyspark, calculated on the text column. This hash is unique for every distinct text in the dataset. For convenience, this join is done for you in the third table.

`fwedu1-metaonly/` is just the metadata, exactly as it comes from the FineWeb-Edu v1 subset. This includes duplicates! There are XXX records. For instance, identical text content might have been found at several different urls, across many CC dumps. The advantage of storing this data separately is that it is MUCH smaller than the text data and still allows for useful analysis - and you can always join it back!

```
DataFrame[id: string, dump: string, url: string, file_path: string, language: string, language_score: double, token_count: bigint, score: double, int_score: bigint, hash: bigint]
```

`fwedu1-text-content-zstd/` is the deduplicated data: a table containing only the text content and the hash. This saves space - we don't need to store redundant copies of the text data.

```
DataFrame[hash: bigint, rebuilt_count: bigint, first_text: string]
```

`fineweb-edu-1.0.0-meta-and-text/` is the joined data, containing both the text data and the metadata. It has the count columns used in our work (to come), and the instance-level data that varies between copies (e.g. url) is collected into an array-of-structs column, `instances`.

```
DataFrame[hash: bigint, text: string, instances: array<struct<dump:string,file_path:string,id:string,url:string>>, language: string, language_score: double, token_count: bigint, score: double, int_score: bigint, split: string, original_doc_count: bigint, position: int, reversed_count: int, tiktoken_size: int]
```

This lets you easily run a query like this:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Reuse the active session if there is one (e.g. in the pyspark shell).
spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("fineweb-edu-1.0.0-meta-and-text")

# Keep rows whose instances span more than one distinct url.
filtered_df = df.filter(
    F.size(F.array_distinct(F.transform(F.col("instances"), lambda x: x.url))) > 1
)
print(filtered_df.count())
filtered_df.show()
# 57292242
# 57M documents are found at more than one url - many of these are trivial
# differences like http vs https, but some reflect more interesting patterns
# like migrations or rehosts.
```

This finds all duplicated text content that appears at distinct urls!

**NOTE: This was built on v1 of the FineWeb-Edu dataset, which has been updated since.**
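To make the relationship between the three tables concrete, here is a minimal pure-Python sketch on a handful of made-up rows. It is not the pipeline used to build this repo and it does not use Spark. The rows, the hash values (which are not real xxhash64 outputs), and the helper names `join_on_hash` and `multi_url_hashes` are all assumptions for illustration, and the `instances` entries carry only a subset of the real struct fields.

```python
from collections import defaultdict

# Toy stand-in for fwedu1-metaonly: one row per crawled instance, duplicates included.
# All values are invented; the hashes are NOT real xxhash64 outputs.
meta_rows = [
    {"id": "a", "dump": "CC-MAIN-2023-50", "url": "http://example.com/page", "hash": 101},
    {"id": "b", "dump": "CC-MAIN-2024-10", "url": "https://example.com/page", "hash": 101},
    {"id": "c", "dump": "CC-MAIN-2024-10", "url": "https://example.org/other", "hash": 202},
]

# Toy stand-in for fwedu1-text-content-zstd: one row per unique text.
text_rows = [
    {"hash": 101, "first_text": "Photosynthesis converts light into chemical energy."},
    {"hash": 202, "first_text": "The French Revolution began in 1789."},
]


def join_on_hash(meta_rows, text_rows):
    """Group metadata instances by hash and attach the deduplicated text."""
    instances = defaultdict(list)
    for row in meta_rows:
        instances[row["hash"]].append(
            {"dump": row["dump"], "id": row["id"], "url": row["url"]}
        )
    return [
        {"hash": t["hash"], "text": t["first_text"], "instances": instances[t["hash"]]}
        for t in text_rows
    ]


def multi_url_hashes(joined):
    """Return hashes whose text appears at more than one distinct url."""
    return [
        doc["hash"]
        for doc in joined
        if len({inst["url"] for inst in doc["instances"]}) > 1
    ]


joined = join_on_hash(meta_rows, text_rows)
print(multi_url_hashes(joined))  # prints [101]
```

In this toy data the http and https copies of the same page count as two distinct urls, which is the kind of trivial difference mentioned above.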

Provided by:

mmarone
