BEE-spoke-data/open-web-math-minhash

Name: BEE-spoke-data/open-web-math-minhash
Creator: BEE-spoke-data
Published: 2023-10-12 02:08:46
License: no description provided

Hugging Face · Updated 2023-10-12 · Indexed 2024-06-15

Download link:

https://hf-mirror.com/datasets/BEE-spoke-data/open-web-math-minhash

Resource description:

---
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
- config_name: text-only
  data_files:
  - split: train
    path: text-only/train-*
dataset_info:
- config_name: default
  features:
  - name: url
    dtype: string
  - name: text
    dtype: string
  - name: date
    dtype: string
  - name: metadata
    dtype: string
  splits:
  - name: train
    num_bytes: 4467051029
    num_examples: 1820241
  download_size: 1772035124
  dataset_size: 4467051029
- config_name: text-only
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 2305854627
    num_examples: 1820241
  download_size: 1360869461
  dataset_size: 2305854627
license: odc-by
task_categories:
- text-generation
size_categories:
- 1M<n<10M
source_datasets: open-web-math/open-web-math
---

# Dataset Card for "open-web-math-minhash"

An attempt at a _"high quality sample"_ of `open-web-math/open-web-math` by aggressively applying `minhash` from text-dedup. The result is 1.82M rows, down from the original 6.3M (6,315,233 examples in the console log further down):

```
DatasetDict({
    train: Dataset({
        features: ['url', 'text', 'date', 'metadata'],
        num_rows: 1820241
    })
})
```

## Usage

Unless you need the metadata, load the `text-only` config, which is only 1.4 GB in 5 shards:

```python
from datasets import load_dataset

dataset_config = "text-only"
dataset = load_dataset("BEE-spoke-data/open-web-math-minhash", dataset_config)
```

## making of

On a high-RAM colab TPU (40 cores):

```python
from pathlib import Path
from tqdm.auto import tqdm

ds_name = "open-web-math/open-web-math"
# `ds_short_name` is used below but was never defined in the original snippet;
# deriving it from the repo name is an assumption.
ds_short_name = ds_name.split("/")[-1]
dataset_config = "default"
data_split = 'train'
text_column = 'text'
out_dir = Path(f"output/minhash/{ds_short_name}/{data_split}")
!mkdir -p $out_dir

# Near-duplicate removal: 5-gram shingles, similarity threshold 0.5, 64 permutations
!python -m text_dedup.minhash \
  --path $ds_name \
  --name $dataset_config \
  --split $data_split \
  --cache_dir "./cache" \
  --output $out_dir \
  --column $text_column \
  --ngram 5 --threshold 0.5 \
  --hash_func xxh3 --hash_bits 16 --num_perm 64 \
  --batch_size 10000

print(f"output dir is:\n\t{out_dir}")
!ls $out_dir
```

Console:

```sh
Resolving data files: 100% 114/114 [00:11<00:00, 9.79it/s]
Fingerprinting... (num_proc=40): 100% 6315233/6315233 [15:27<00:00, 6806.11 examples/s]
Iterating MinHashes...: 100% 632/632 [05:37<00:00, 1.87it/s]
Clustering...: 100% 14/14 [01:13<00:00, 5.22s/it]
Finding clusters... (num_proc=40): 100% 6315233/6315233 [10:57<00:00, 9602.90 examples/s]
Filtering clusters... (num_proc=40): 100% 6315233/6315233 [03:53<00:00, 27069.61 examples/s]
Saving the dataset (33/33 shards): 100% 1820241/1820241 [07:07<00:00, 4260.38 examples/s]
[10/11/23 23:41:46] INFO Loading :
```

## citation

```
@misc{paster2023openwebmath,
  title={OpenWebMath: An Open Dataset of High-Quality Mathematical Web Text},
  author={Keiran Paster and Marco Dos Santos and Zhangir Azerbayev and Jimmy Ba},
  year={2023},
  eprint={2310.06786},
  archivePrefix={arXiv},
  primaryClass={cs.AI}
}
```

Provided by:

BEE-spoke-data

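Summary of original information

Dataset overview

Dataset configurations

Default configuration

Config name: default
Data file path: data/train-*
Features:
- url: string
- text: string
- date: string
- metadata: string
Splits:
- train: 4467051029 bytes, 1820241 examples
Download size: 1772035124 bytes
Dataset size: 4467051029 bytes

Text-only configuration

Config name: text-only
Data file path: text-only/train-*
Features:
- text: string
Splits:
- train: 2305854627 bytes, 1820241 examples
Download size: 1360869461 bytes
Dataset size: 2305854627 bytes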
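
The raw byte counts above are easier to compare once converted. The following is a small sketch that derives gigabytes, average bytes per example, and the download-to-dataset size ratio for each config. The dictionary layout and the `size_stats` helper are illustrative names, not part of the dataset card; the numbers are copied from the summary.

```python
# Figures copied from the configuration summary; the structure is illustrative.
CONFIGS = {
    "default": {
        "num_bytes": 4_467_051_029,
        "download_size": 1_772_035_124,
        "num_examples": 1_820_241,
    },
    "text-only": {
        "num_bytes": 2_305_854_627,
        "download_size": 1_360_869_461,
        "num_examples": 1_820_241,
    },
}


def size_stats(cfg):
    """Derive human-readable size figures for one config (hypothetical helper)."""
    return {
        "dataset_gb": cfg["num_bytes"] / 1e9,
        "download_gb": cfg["download_size"] / 1e9,
        "bytes_per_example": cfg["num_bytes"] / cfg["num_examples"],
        # Fraction of the in-memory size that actually has to be downloaded.
        "download_ratio": cfg["download_size"] / cfg["num_bytes"],
    }


STATS = {name: size_stats(cfg) for name, cfg in CONFIGS.items()}

for name, s in STATS.items():
    print(
        f"{name}: {s['dataset_gb']:.2f} GB in memory, "
        f"{s['download_gb']:.2f} GB to download, "
        f"{s['bytes_per_example']:.0f} bytes/example"
    )
```

The `text-only` download works out to about 1.36 GB, which matches the "1.4 GB" quoted in the dataset card.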

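Dataset information

License: odc-by
Task category: text generation
Size category: 1M<n<10M
Source dataset: open-web-math/open-web-math

Dataset description

This dataset is an attempt at a high-quality sample of open-web-math/open-web-math: minhash deduplication reduces the original data from roughly 6M rows to 1.82M rows.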
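
To give a feel for what MinHash deduplication measures, here is a minimal, standard-library-only sketch that estimates Jaccard similarity between two texts. It is not the text-dedup implementation: the word-level 5-gram tokenisation, the `blake2b` hash (standing in for `xxh3`), and every function name are assumptions made for illustration. The actual pipeline also buckets and clusters signatures (see the "Clustering" steps in the console log), which is omitted here.

```python
import hashlib

NUM_PERM = 64    # mirrors --num_perm 64
NGRAM = 5        # mirrors --ngram 5
THRESHOLD = 0.5  # mirrors --threshold 0.5


def shingles(text, n=NGRAM):
    """Split text into a set of overlapping word n-grams (assumed tokenisation)."""
    tokens = text.lower().split()
    if len(tokens) < n:
        return {" ".join(tokens)} if tokens else set()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def _seeded_hash(seed, shingle):
    """64-bit hash of a shingle; the salt plays the role of one 'permutation'."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(text):
    """For each seed, keep the minimum hash over all shingles of the text."""
    shingle_set = shingles(text)
    if not shingle_set:
        raise ValueError("cannot compute a signature for empty text")
    return [
        min(_seeded_hash(seed, s) for s in shingle_set)
        for seed in range(NUM_PERM)
    ]


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def is_near_duplicate(text_a, text_b, threshold=THRESHOLD):
    """Flag two texts whose estimated similarity reaches the threshold."""
    sim = estimated_jaccard(minhash_signature(text_a), minhash_signature(text_b))
    return sim >= threshold
```

With a threshold as low as 0.5, documents sharing about half of their 5-gram shingles are already treated as duplicates, which is consistent with the card describing the filtering as aggressive.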

Usage example

```python
from datasets import load_dataset

dataset_config = "text-only"
dataset = load_dataset("BEE-spoke-data/open-web-math-minhash", dataset_config)
```
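
For a quick look at the data without downloading every shard, streaming mode can help. The sketch below is one possible approach, not something the dataset card prescribes: `preview` and `stream_preview` are hypothetical helper names, and `streaming=True` is a standard `load_dataset` option that yields rows lazily.

```python
from itertools import islice


def preview(rows, n=3, width=80):
    """Return the first `n` texts, each collapsed to one line and cut to `width` chars."""
    out = []
    for row in islice(rows, n):
        one_line = " ".join(row["text"].split())
        out.append(one_line[:width])
    return out


def stream_preview(n=3):
    """Preview the `text-only` config lazily (needs network access and `datasets`)."""
    # Imported here so `preview` stays usable without the `datasets` package.
    from datasets import load_dataset

    ds = load_dataset(
        "BEE-spoke-data/open-web-math-minhash",
        "text-only",
        split="train",
        streaming=True,  # iterate over rows without materialising the full split
    )
    return preview(ds, n=n)
```

Calling `stream_preview()` returns a short list of truncated strings. `preview` works on any iterable of dictionaries with a `text` key, so it can be tried on local rows first.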
