yentinglin/fineweb_miniseries

Source: Hugging Face. Updated 2024-05-03. Indexed 2024-06-12.
Download link:
https://hf-mirror.com/datasets/yentinglin/fineweb_miniseries
Resource description:
---
license: odc-by
task_categories:
- text-generation
language:
- en
pretty_name: FineWebMini
size_categories:
- 1B
- 10B
- 100B
- 350B
---

# FineWeb Miniseries Dataset

The FineWeb Miniseries Dataset is a collection of random subsets of the FineWeb dataset, created for training and experimenting with language models of different sizes. The subsets are generated from target token counts, providing a range of dataset sizes suitable for various computational resources and research purposes.

## Inspiration

The FineWeb Miniseries Dataset was inspired by a [tweet](https://x.com/karpathy/status/1786504106347221498) from Andrej Karpathy ([@karpathy](https://twitter.com/karpathy)) on May 4, 2024, where he was looking for a manageable ~1GB sample of a larger dataset for debugging and mentioned the idea of having subsets at different scales.

The goal of the FineWeb Miniseries Dataset is to make experimenting with large language models more approachable for the GPU & Disk Poor.

## Dataset Subsets

The dataset consists of the following subsets:

- **1B**: A subset containing approximately 1 billion GPT2 tokens.
- **10B**: A subset containing approximately 10 billion GPT2 tokens.
- **100B**: A subset containing approximately 100 billion GPT2 tokens.
- **350B**: A subset containing approximately 350 billion GPT2 tokens.

Each subset is created by randomly sampling rows from the original FineWeb dataset. The number of rows is derived from the target token count and FineWeb's average GPT2 tokens per row, so the token counts are approximate. The random sampling is performed with a fixed seed (42) to ensure reproducibility.

## Usage

To use the FineWeb Miniseries Dataset, load the desired subset with the Hugging Face Datasets library. Here's an example of how to load a subset:

```python
from datasets import load_dataset

# Load the "1B" subset
subset_1b = load_dataset("yentinglin/fineweb_miniseries", "1B")
```

Replace `"1B"` with the desired subset name (`"10B"`, `"100B"`, or `"350B"`) to load the corresponding subset.
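For machines with little disk space, it may help to stream rows instead of downloading a whole subset. The following is a minimal sketch, not part of the original README: `load_subset` and `preview_rows` are hypothetical helper names, and the `"train"` split is an assumption about how the subsets were pushed. `streaming=True` is a standard `load_dataset` option.

```python
from itertools import islice

# Subset names listed in the README.
VALID_SUBSETS = ("1B", "10B", "100B", "350B")


def load_subset(name, streaming=True):
    """Hypothetical helper: load one FineWeb Miniseries subset by name."""
    if name not in VALID_SUBSETS:
        raise ValueError(f"unknown subset {name!r}; expected one of {VALID_SUBSETS}")
    # Imported lazily so the name check works even without `datasets` installed.
    from datasets import load_dataset

    # Assumption: each subset exposes a "train" split.
    return load_dataset(
        "yentinglin/fineweb_miniseries", name, split="train", streaming=streaming
    )


def preview_rows(name, n=3):
    """Hypothetical helper: return the first n rows of a subset as a list."""
    return list(islice(load_subset(name, streaming=True), n))
```

Calling `preview_rows("1B")` should fetch only the first few rows over the network, which is usually enough to inspect the column names before committing to a full download.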

## Dataset Creation

The subsets of the FineWeb Miniseries Dataset are created using the following code:

```python
from datasets import load_dataset

# Load the "fineweb" dataset; split="train" returns a single Dataset,
# which supports .select() (a DatasetDict does not)
fw = load_dataset("HuggingFaceFW/fineweb", split="train")

# Calculate the average GPT2 tokens per row
average_tokens_per_row = 15352.9e9 / 22335106879

# Define the target token counts for each subset
target_token_counts = [1e9, 10e9, 100e9, 350e9]

# Shuffle the dataset
shuffled_dataset = fw.shuffle(seed=42)

# Create and push the subsets to the Hugging Face Hub
for target_tokens in target_token_counts:
    # Calculate the number of rows needed for the target token count
    num_rows = int(target_tokens / average_tokens_per_row)

    # Select a random subset of the shuffled dataset
    subset = shuffled_dataset.select(range(num_rows))

    # Push the subset to the Hugging Face Hub
    subset_name = f"{int(target_tokens/1e9)}B"
    subset.push_to_hub("yentinglin/fineweb_miniseries", subset_name)

    print(f"Pushed {subset_name} subset to the Hugging Face Hub.")
```

## Original Dataset

The FineWeb Miniseries Dataset is derived from the original FineWeb dataset, which can be found at [HuggingFaceFW/fineweb](https://huggingface.co/datasets/HuggingFaceFW/fineweb). The original dataset contains a large collection of web pages suitable for training language models.

## License

The FineWeb Miniseries Dataset is released under the same license as the original FineWeb dataset (`odc-by`, per this card's metadata). Please refer to the original dataset's license for more information.

## Citation

If you use the FineWeb Miniseries Dataset in your research or projects, please cite the original FineWeb dataset:

```
@software{penedo2024fineweb,
  author = {Penedo, Guilherme and Kydlíček, Hynek and von Werra, Leandro and Wolf, Thomas},
  title = {FineWeb},
  month = April,
  year = 2024,
  doi = {10.57967/hf/2092},
  url = {https://huggingface.co/datasets/HuggingFaceFW/fineweb}
}
```
Provider:
yentinglin
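Summary of original information

FineWeb Miniseries Dataset overview

Dataset description

The FineWeb Miniseries Dataset is a collection of random subsets of the FineWeb dataset, intended for training and experimenting with language models of different sizes. The subsets are generated from target token counts and suit a range of computational resources and research purposes.

Dataset subsets

The dataset contains the following subsets:

  • 1B: approximately 1 billion GPT2 tokens.
  • 10B: approximately 10 billion GPT2 tokens.
  • 100B: approximately 100 billion GPT2 tokens.
  • 350B: approximately 350 billion GPT2 tokens.

Each subset is created by randomly sampling rows from the original FineWeb dataset, with the number of rows derived from the target token count and FineWeb's average GPT2 tokens per row. The sampling uses a fixed seed (42) for reproducibility.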
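
One consequence of shuffling once with a fixed seed and then taking the first N rows is that the subsets should be nested: each smaller subset is a prefix of the larger ones. The toy sketch below illustrates the idea with Python's standard `random` module on a list of row indices. It is an illustration only: `prefix_subsets` is a hypothetical name, and the `datasets` library uses its own shuffling internals, so the actual row order will differ.

```python
import random


def prefix_subsets(num_rows, sizes, seed=42):
    """Shuffle row indices once with a fixed seed, then take prefixes."""
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)  # deterministic for a given seed
    return {size: indices[:size] for size in sizes}


# Toy stand-in for the real dataset: 1000 rows, three subset sizes.
toy = prefix_subsets(1000, [10, 100, 500])
again = prefix_subsets(1000, [10, 100, 500])

print(toy == again)                    # same seed, same subsets
print(toy[100][:10] == toy[10])        # smaller subset is a prefix of the larger
print(set(toy[100]) <= set(toy[500]))  # and therefore contained in it
```

If the published subsets were produced exactly this way, a model trained on the 1B subset has seen only rows that also appear in the 10B, 100B, and 350B subsets, which is worth keeping in mind when using one subset to evaluate a model trained on another.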

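Usage

Load the desired subset with the Hugging Face Datasets library. For example, the following code loads the "1B" subset:

    from datasets import load_dataset

    # Load the "1B" subset
    subset_1b = load_dataset("yentinglin/fineweb_miniseries", "1B")

Replace "1B" with another subset name ("10B", "100B", or "350B") to load the corresponding subset.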

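Dataset creation

The subsets are created by loading the original FineWeb dataset, computing the average number of GPT2 tokens per row, randomly sampling the number of rows that matches each target token count, and pushing each subset to the Hugging Face Hub.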
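
The row-count arithmetic behind this process can be reproduced offline. The sketch below uses the two totals that appear in the README's creation script (15352.9e9 tokens and 22335106879 rows); `rows_for_target` is a hypothetical helper name introduced here. The average works out to roughly 687 GPT2 tokens per row, so the 1B subset corresponds to about 1.45 million rows.

```python
# Totals taken from the README's creation script.
FINEWEB_TOKENS = 15352.9e9
FINEWEB_ROWS = 22_335_106_879

average_tokens_per_row = FINEWEB_TOKENS / FINEWEB_ROWS


def rows_for_target(target_tokens):
    """Number of rows to sample so the expected token count hits the target."""
    return int(target_tokens / average_tokens_per_row)


# Same targets and naming scheme as the creation script.
target_token_counts = [1e9, 10e9, 100e9, 350e9]
row_counts = {
    f"{int(target / 1e9)}B": rows_for_target(target)
    for target in target_token_counts
}

for name, rows in row_counts.items():
    print(f"{name}: {rows:,} rows")
```

Because the row count comes from a dataset-wide average rather than from tokenizing the sampled rows, the actual token count of each subset is only approximately equal to its target.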

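Original dataset

The FineWeb Miniseries Dataset is derived from the original FineWeb dataset, which contains a large collection of web pages suitable for training language models.

License

The FineWeb Miniseries Dataset is released under the same license as the original FineWeb dataset. Refer to the original dataset's license for details.

Citation

If you use this dataset in research or projects, cite the original FineWeb dataset.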
