stanford-crfm/heuristic_classification-filtered-pile-50M
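Dataset Card for heuristic_classification-filtered-pile-50M
Dataset Description
Dataset Summary
This dataset is a subset of The Pile, selected with the heuristic classification data selection method. The target distribution is the Wikipedia and BookCorpus2 subsets of The Pile.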
Languages
English (EN)
Dataset Structure
A single train split is provided (51.2M examples) in jsonl format.
Data Instances
{"contents": "Members join for free and will have access to all of our earning verticals, including, but not limited to, watching videos, shopping for cash back, taking surveys, and redeeming special offers. Swagbucks is the webs leading rewards platform, dedicated to providing FREE gift cards to its 12+ million members. Choose from top retailers like Amazon, Target, Walmart, Starbucks, PayPal, and tons more.dead full espanol tle work is running out. You\u2019re given a descargar land of the dead full espanol but that respect it\u2019s tons of one another. When the screen. With the pluses gained from a ledge, your arms or abandons your name suggests, Inferno has locked on a dash for a poozer, it\u2019s placed in their shadowing skills. These controls forward, backward, and frankly, the straights. You can also have expected, but that\u2019s unlike anything particularly adept pacing. Each win by so rough idea that\u2019s worth it up. There are a neat sensation to play of a fresh\nthe voice actors give up with content and the same innovative control scheme that pulls you invested. From the movement. The unique art style and is still remarkably tough. You\u2019re not", "metadata": {"pile_set_name": ["Pile-CC", "Pile-CC"]}, "id": 303}
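Data Fields
"contents": the text of the example
"metadata": information about the source(s) of the text; multiple sources mean the example was concatenated from two sources
"id": ignore; this is not a unique identifier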
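As a minimal sketch of how these fields can be read, the snippet below parses JSONL text one line at a time using only the standard library. The two sample records are made up for illustration (shortened text, hypothetical second record); a real run would iterate over the downloaded train file instead of the in-memory `sample`.

```python
import io
import json


def iter_records(lines):
    """Yield one dict per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical two-line sample standing in for the real train file.
sample = io.StringIO(
    '{"contents": "first text", "metadata": {"pile_set_name": ["Pile-CC", "Pile-CC"]}, "id": 303}\n'
    '{"contents": "second text", "metadata": {"pile_set_name": ["Github", "ArXiv"]}, "id": 303}\n'
)

records = list(iter_records(sample))
for rec in records:
    # "id" is deliberately not used as a key, since it is not unique.
    print(rec["metadata"]["pile_set_name"], len(rec["contents"]))
```

Both sample records share the same "id" on purpose, to show why the position in the file, not "id", should be used if a unique key is needed.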
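Dataset Creation
We first select 102.4M examples, then concatenate every two examples to create 51.2M examples. This ensures that the examples are long enough to reach the maximum token length of 512 without much padding. We train a fasttext binary classifier for heuristic classification on The Pile validation set, where the target is Wikipedia + BookCorpus2 + Gutenberg + Books3 and the raw data come from the remaining data sources in The Pile. We first select 98.4M examples from the non-Wikipedia, non-book data, then randomly select 2M examples from Wikipedia and 0.66M each from BookCorpus2, Gutenberg, and Books3. After this, we concatenate every two examples.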
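The selection budget and the pairing step can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the card does not say which separator joins the two texts or how pairs are formed, so the newline separator, the adjacent pairing, and the per-example `source` field are hypothetical.

```python
def concatenate_pairs(examples, sep="\n"):
    """Join examples (0, 1), (2, 3), ... into single longer examples.

    `sep` and the input "source" field are assumptions for illustration.
    """
    if len(examples) % 2:
        raise ValueError("expected an even number of examples")
    return [
        {
            "contents": a["contents"] + sep + b["contents"],
            "metadata": {"pile_set_name": [a["source"], b["source"]]},
        }
        for a, b in zip(examples[0::2], examples[1::2])
    ]


# Selection budget from the card, in millions of examples.
budget = {
    "other sources": 98.4,
    "Wikipedia": 2.0,
    "BookCorpus2": 0.66,
    "Gutenberg": 0.66,
    "Books3": 0.66,
}
total_selected = sum(budget.values())  # about 102.4M before pairing

# Tiny made-up input to show the shape of the output.
toy = [
    {"contents": "a b c", "source": "Pile-CC"},
    {"contents": "d e f", "source": "Pile-CC"},
    {"contents": "g h i", "source": "Github"},
    {"contents": "j k l", "source": "ArXiv"},
]
pairs = concatenate_pairs(toy)
```

Pairing halves the example count, which matches the step from 102.4M selected examples to the 51.2M examples in the train split.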
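Source Data
The Pile
Initial Data Collection and Normalization
We select data from The Pile, which comes in 30 random chunks. We reserve chunk 0 for validation purposes and only consider the last 29 chunks. We first split the documents in The Pile into chunks of 128 words, using whitespace tokenization. These chunks define the examples on which we do data selection, totaling 1.7B examples. Before heuristic classification, we first apply a manual quality filter (see the paper for details) and only consider the examples that pass the filter.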
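The 128-word chunking step can be sketched as below. This is an assumption-laden illustration rather than the authors' preprocessing code: the card does not say how a trailing chunk shorter than 128 words is handled, so this version keeps it, and it rejoins words with single spaces, which discards the original whitespace.

```python
def chunk_by_words(document, chunk_size=128):
    """Split a document into chunks of at most `chunk_size` whitespace-separated words."""
    words = document.split()  # whitespace tokenization
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


# A made-up 300-word document yields chunks of 128, 128, and 44 words.
doc = " ".join(f"w{i}" for i in range(300))
chunks = chunk_by_words(doc)
chunk_lengths = [len(c.split()) for c in chunks]
```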
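Considerations for Using the Data
The dataset is biased towards data from non-Wikipedia and non-book sources. A more balanced approach would be to mix in more data from Wikipedia and books.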
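One way to check this bias on a sample of records is to tally the `pile_set_name` entries in the metadata. The helper below is a hypothetical sketch; `source_shares` is not part of any released tooling, and the three records are made up for illustration.

```python
from collections import Counter


def source_shares(records):
    """Return the fraction of pile_set_name entries belonging to each source."""
    counts = Counter()
    for rec in records:
        counts.update(rec["metadata"]["pile_set_name"])
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: n / total for name, n in counts.items()}


# Made-up records; each lists the two sources it was concatenated from.
toy_records = [
    {"metadata": {"pile_set_name": ["Pile-CC", "Pile-CC"]}},
    {"metadata": {"pile_set_name": ["Wikipedia (en)", "Pile-CC"]}},
    {"metadata": {"pile_set_name": ["Pile-CC", "Books3"]}},
]
shares = source_shares(toy_records)
```

Comparing these shares against a desired mix gives a rough idea of how much additional Wikipedia and book data would need to be mixed in.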
Dataset Curators
Sang Michael Xie, Shibani Santurkar
Citation Information
@article{xie2023data,
  author = {Sang Michael Xie and Shibani Santurkar and Tengyu Ma and Percy Liang},
  journal = {arXiv preprint arXiv:2302.03169},
  title = {Data Selection for Language Models via Importance Resampling},
  year = {2023},
}



