stanford-crfm/heuristic_classification-filtered-pile-50M

Name: stanford-crfm/heuristic_classification-filtered-pile-50M
Creator: stanford-crfm
Published: 2023-09-16 16:06:56
License: not specified

Source: Hugging Face. Updated 2023-09-16; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/stanford-crfm/heuristic_classification-filtered-pile-50M

Summary:

This dataset is a subset of The Pile, selected via the heuristic classification data selection method. The target distribution for heuristic classification is the Wikipedia and BookCorpus2 subsets of The Pile. The dataset contains 51.2M training examples in jsonl format. Each data instance has three fields: contents (the text), metadata (information about the source or sources of the text), and id (a non-unique identifier).

To create the dataset, documents in The Pile were split into 128-word chunks and passed through a manual quality filter. Examples were then selected from the non-Wikipedia, non-book data by heuristic classification, further examples were drawn at random from Wikipedia, BookCorpus2, Gutenberg, and Books3, and every two examples were concatenated. The dataset is biased towards non-Wikipedia and non-book sources, so mixing in more data from Wikipedia and books is recommended.

Provided by:

stanford-crfm

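Original information summary

Dataset Card for heuristic_classification-filtered-pile-50M

Dataset Description

Dataset Summary

This dataset is a subset of The Pile, selected via the heuristic classification data selection method. The target distribution is the Wikipedia and BookCorpus2 subsets of The Pile.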

Languages

English (EN)

Dataset Structure

A train set is provided (51.2M examples) in jsonl format.

Data Instances

{"contents": "Members join for free and will have access to all of our earning verticals, including, but not limited to, watching videos, shopping for cash back, taking surveys, and redeeming special offers. Swagbucks is the webs leading rewards platform, dedicated to providing FREE gift cards to its 12+ million members. Choose from top retailers like Amazon, Target, Walmart, Starbucks, PayPal, and tons more.dead full espanol tle work is running out. You\u2019re given a descargar land of the dead full espanol but that respect it\u2019s tons of one another. When the screen. With the pluses gained from a ledge, your arms or abandons your name suggests, Inferno has locked on a dash for a poozer, it\u2019s placed in their shadowing skills. These controls forward, backward, and frankly, the straights. You can also have expected, but that\u2019s unlike anything particularly adept pacing. Each win by so rough idea that\u2019s worth it up. There are a neat sensation to play of a fresh

the voice actors give up with content and the same innovative control scheme that pulls you invested. From the movement. The unique art style and is still remarkably tough. You\u2019re not", "metadata": {"pile_set_name": ["Pile-CC", "Pile-CC"]}, "id": 303}

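Data Fields

"contents": the text
"metadata": information about the source(s) of the text. Multiple sources mean that the example was concatenated from two sources.
"id": ignore; a non-unique identifier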
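As a minimal sketch of how these fields might be read, the snippet below parses jsonl lines with Python's standard json module and picks out examples whose two halves come from different Pile subsets. The helper names `iter_examples` and `source_names` and the two inline records are hypothetical, made up to mimic the schema; they are not part of the dataset or its tooling.

```python
import io
import json
from typing import Iterator, List, TextIO


def iter_examples(fh: TextIO) -> Iterator[dict]:
    """Yield one parsed example per non-empty jsonl line."""
    for line in fh:
        line = line.strip()
        if line:
            yield json.loads(line)


def source_names(example: dict) -> List[str]:
    """Return the Pile subset names recorded in the example's metadata."""
    return example["metadata"]["pile_set_name"]


# Hypothetical two-line jsonl payload that mimics the schema described above.
# Both records share an id on purpose: the card says ids are non-unique.
sample = io.StringIO(
    '{"contents": "first half second half", '
    '"metadata": {"pile_set_name": ["Pile-CC", "Pile-CC"]}, "id": 303}\n'
    '{"contents": "wiki text book text", '
    '"metadata": {"pile_set_name": ["Wikipedia (en)", "Books3"]}, "id": 303}\n'
)

examples = list(iter_examples(sample))

# Keep only examples whose two concatenated halves come from different subsets.
mixed = [ex for ex in examples if len(set(source_names(ex))) > 1]
```

Because ids repeat, any deduplication or indexing would need a key other than "id", for example the line offset in the file.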

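Dataset Creation

We first select 102.4M examples, then concatenate every two examples to create 51.2M examples. This ensures that the examples are long enough to reach a maximum token length of 512 without much padding. For heuristic classification, we train a fasttext binary classifier on The Pile validation set, where the target is Wikipedia + BookCorpus2 + Gutenberg + Books3 and the raw data come from the rest of the data sources in The Pile. We first select 98.4M examples from the data that is neither Wikipedia nor books, then randomly select 2M examples from Wikipedia and 0.66M each from BookCorpus2, Gutenberg, and Books3. After this, we concatenate every two examples.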
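The counts above can be checked with a little arithmetic, and the pairing step can be sketched in a few lines. This is only an illustration under stated assumptions: the helper `concat_pairs` and the input key "source" are hypothetical, consecutive examples are paired (the card does not say how pairs are chosen), and the two texts are joined without a separator, which is merely what the example instance appears to show.

```python
# Counts in millions of examples, as stated in the dataset card.
non_wiki_book = 98.4
wikipedia = 2.0
books_each = 0.66  # BookCorpus2, Gutenberg, and Books3 each

selected = non_wiki_book + wikipedia + 3 * books_each  # about 102.4M
final_examples = selected / 2                          # about 51.2M after pairing


def concat_pairs(examples):
    """Join consecutive examples two at a time (hypothetical helper).

    Assumes each input is a dict with "contents" and a single "source" name;
    a trailing unpaired example is dropped.
    """
    pairs = []
    for i in range(0, len(examples) - 1, 2):
        a, b = examples[i], examples[i + 1]
        pairs.append({
            # Assumption: no separator between the two texts.
            "contents": a["contents"] + b["contents"],
            "metadata": {"pile_set_name": [a["source"], b["source"]]},
        })
    return pairs


# Tiny made-up inputs standing in for 128-word chunks.
chunks = [
    {"contents": "aaa.", "source": "Pile-CC"},
    {"contents": "bbb.", "source": "Pile-CC"},
    {"contents": "ccc.", "source": "Wikipedia (en)"},
    {"contents": "ddd.", "source": "Books3"},
]
paired = concat_pairs(chunks)
```

Two 128-word chunks give roughly 256 words per example, which is consistent with the stated goal of approaching 512 tokens without much padding.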

Source Data

The Pile

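Initial Data Collection and Normalization

We select data from The Pile, which comes in 30 random chunks. We reserve chunk 0 for validation purposes and only consider the last 29 chunks. We first divide the documents in The Pile into chunks of 128 words, tokenizing on whitespace. These chunks define the examples that we do data selection on, totaling 1.7B examples. Before heuristic classification, we first apply a manual quality filter (see the paper for details) and only consider the examples that pass the filter.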
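The chunking step can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `chunk_document` is hypothetical, words are rejoined with single spaces, and a final chunk shorter than 128 words is kept, none of which the card specifies.

```python
from typing import List


def chunk_document(text: str, chunk_words: int = 128) -> List[str]:
    """Split a document into chunks of at most `chunk_words` whitespace tokens."""
    words = text.split()  # whitespace tokenization
    return [
        " ".join(words[i:i + chunk_words])
        for i in range(0, len(words), chunk_words)
    ]


# A made-up 300-word document yields chunks of 128, 128, and 44 words.
doc = " ".join(f"w{i}" for i in range(300))
chunks = chunk_document(doc)
chunk_lengths = [len(c.split()) for c in chunks]
```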

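Considerations for Using the Data

The dataset is biased towards choosing data from non-Wikipedia and non-book sources. A balanced approach would be to mix in more data from Wikipedia and books.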
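One way such mixing might look is sketched below. The card gives no recipe or ratio, so everything here is an assumption: the helper `mix_in`, the `extra_fraction` parameter, and the value 0.2 are hypothetical and chosen only for illustration.

```python
import random
from typing import List, Sequence


def mix_in(filtered: Sequence[dict], wiki_books: Sequence[dict],
           extra_fraction: float, seed: int = 0) -> List[dict]:
    """Add a random sample of Wikipedia/books examples to the filtered set.

    `extra_fraction` is the number of added examples relative to the size of
    `filtered`; it is capped by the number of Wikipedia/books examples available.
    """
    rng = random.Random(seed)
    n_extra = min(int(len(filtered) * extra_fraction), len(wiki_books))
    mixed = list(filtered) + rng.sample(list(wiki_books), n_extra)
    rng.shuffle(mixed)
    return mixed


# Made-up stand-ins for the filtered dataset and an extra Wikipedia/books pool.
filtered = [{"contents": f"web {i}", "origin": "filtered"} for i in range(10)]
wiki_books = [{"contents": f"wiki {i}", "origin": "extra"} for i in range(5)]
mixed = mix_in(filtered, wiki_books, extra_fraction=0.2)
n_added = sum(1 for ex in mixed if ex["origin"] == "extra")
```

The right proportion depends on the downstream task and would need to be tuned empirically.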

Dataset Curators

Sang Michael Xie, Shibani Santurkar

Citation Information

@article{xie2023data,
  author = {Sang Michael Xie and Shibani Santurkar and Tengyu Ma and Percy Liang},
  journal = {arXiv preprint arXiv:2302.03169},
  title = {Data Selection for Language Models via Importance Resampling},
  year = {2023},
}
