crumb/c4-benchfilter-nano
Source: Hugging Face. Updated 2023-10-22; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/crumb/c4-benchfilter-nano
Resource description:
---
language_creators:
- found
language:
- en
license: odc-by
source_datasets:
- c4
task_categories:
- text-generation
- fill-mask
task_ids:
- language-modeling
- masked-language-modeling
dataset_info:
  features:
  - name: text
    dtype: string
  - name: score
    dtype: float64
  splits:
  - name: train
    num_bytes: 373897649.51453334
    num_examples: 278115
  download_size: 242478448
  dataset_size: 373897649.51453334
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
size_categories:
- 100K<n<1M
---
# crumb/c4-benchfilter-nano
A 278k-sample subset derived from the first 3M samples of the C4 dataset, intended for cheap, short continued pretraining of language models. The aim is to optimize for benchmark scores without sacrificing generalization or generative modelling unrelated to chat or 'instruct' data.
Within the first 3M samples of C4, samples are scored against each of the
selected benchmark datasets (arc, truthful_qa, hellaswag, mmlu, humaneval) by
length-normalized n-gram overlap (the mean of tri-, quad-, and penta-gram
overlap), and the estimated top 10% for each benchmark is kept, with the
estimate based on 1k samples. The top-scoring sample sets for each benchmark
are then filtered again to their top 30% of scores, combined, and exact-match
de-duplicated. Finally, the top 3% of scores and any samples shorter than 200
characters are removed, because they likely contain exact large n-token
matches by chance, such as exact dates or times, that aren't actually relevant
to the data.\*
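As a rough illustration of the scoring described above, the sketch below computes a length-normalized n-gram overlap between a candidate text and a set of benchmark texts, averaged over n = 3, 4, 5. The card does not specify the tokenizer or the exact normalization, so lowercase whitespace tokenization, the helper names (`ngrams`, `build_reference`, `overlap_score`), and dividing by the candidate's n-gram count are all assumptions made for illustration, not the author's implementation.

```python
from typing import Iterable, List, Set, Tuple


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return all contiguous n-grams of a token list (empty if too short)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_reference(benchmark_texts: Iterable[str], n: int) -> Set[Tuple[str, ...]]:
    """Collect the set of n-grams seen in the benchmark samples.

    Assumption: lowercase whitespace tokenization; the card does not say
    which tokenizer was used.
    """
    reference: Set[Tuple[str, ...]] = set()
    for text in benchmark_texts:
        reference.update(ngrams(text.lower().split(), n))
    return reference


def overlap_score(text: str, benchmark_texts: List[str], orders=(3, 4, 5)) -> float:
    """Mean over n of (matching n-grams / total n-grams in the candidate).

    Assumption: "length normalized" is read here as dividing by the number
    of n-grams in the candidate text.
    """
    tokens = text.lower().split()
    per_order = []
    for n in orders:
        grams = ngrams(tokens, n)
        if not grams:
            # Text too short to form any n-gram of this order.
            per_order.append(0.0)
            continue
        reference = build_reference(benchmark_texts, n)
        hits = sum(1 for g in grams if g in reference)
        per_order.append(hits / len(grams))
    return sum(per_order) / len(per_order)


# Toy benchmark with a single made-up sentence.
benchmark = ["the capital of france is paris"]
print(overlap_score("we know the capital of france is paris today", benchmark))
print(overlap_score("completely unrelated words appear in this sentence", benchmark))
```

Normalizing by the candidate's own n-gram count keeps long documents from scoring highly just because they contain more n-grams; other readings of "length normalized" are equally plausible.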
\*Upon further examination, some of these samples are still present throughout the data, albeit at a much lower frequency than before. You might benefit from using `dataset.filter(lambda x: x['score'] > thresh)` for some threshold, but you risk losing high-quality samples as well, so this tradeoff should be examined carefully before training.
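The footnote's suggestion can be tried on a small in-memory example before touching the full dataset. The sketch below is plain Python over a list of rows mirroring the dataset's two features (`text`, `score`); the helper names `filter_by_score` and `retained_fraction`, the toy rows, and the candidate thresholds are hypothetical. It keeps the comparison direction used in the footnote (`score > thresh`); reversing it is a one-character change if the rows you want to drop are the high-scoring ones.

```python
from typing import Dict, List, Union

Row = Dict[str, Union[str, float]]


def filter_by_score(rows: List[Row], thresh: float) -> List[Row]:
    """Keep rows whose score is strictly above the threshold."""
    return [row for row in rows if row["score"] > thresh]


def retained_fraction(rows: List[Row], thresh: float) -> float:
    """Fraction of rows that survive the filter, to gauge how much data is lost."""
    if not rows:
        return 0.0
    return len(filter_by_score(rows, thresh)) / len(rows)


# Toy rows standing in for dataset records (values are made up).
rows: List[Row] = [
    {"text": "sample a", "score": 0.10},
    {"text": "sample b", "score": 0.25},
    {"text": "sample c", "score": 0.40},
    {"text": "sample d", "score": 0.55},
]

for thresh in (0.0, 0.2, 0.5):
    print(f"thresh={thresh}: keeps {retained_fraction(rows, thresh):.0%} of rows")
```

Sweeping a few thresholds like this, and reading some of the rows that each one removes, is one way to examine the tradeoff before committing to a value.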
Provided by:
crumb
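Summary of original information
Dataset overview
Basic information
- Language creators: found
- Language: English
- License: odc-by
- Source dataset: c4
- Task categories:
  - Text generation
  - Fill-mask
- Task IDs:
  - Language modeling
  - Masked language modeling
Dataset structure
- Features:
  - text: string
  - score: float (float64)
- Splits:
  - train:
    - Bytes: 373897649.51453334
    - Examples: 278115
- Download size: 242478448
- Dataset size: 373897649.51453334
Configuration
- Config name: default
- Data files:
  - split: train, path: data/train-*
Size category
- 100K<n<1M
Dataset description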
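- This dataset is a 278k-sample subset drawn from the first 3M samples of the C4 dataset, intended for cheap, short continued pretraining of language models to optimize for benchmark scores without sacrificing generalization or generative modelling unrelated to chat or 'instruct' data.
- Samples are selected as the estimated top 10% by length-normalized n-gram overlap (tri-, quad-, and penta-gram) with each selected benchmark dataset (arc, truthful_qa, hellaswag, mmlu, humaneval), based on 1k samples. The top-scoring samples for each benchmark are filtered again to the top 30% of scores, then combined and exact-match de-duplicated. The top 3% of scores and samples shorter than 200 characters are then removed, because they may contain exact large n-token matches by chance, such as exact dates or times, that are not actually relevant to the data.
- On further examination, some of these samples are still present in the data, although at a much lower frequency than before. You might benefit from using `dataset.filter(lambda x: x['score'] > thresh)` with some threshold, but you also risk losing high-quality samples, so this tradeoff should be weighed carefully before training.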
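The selection steps summarized in the list above can be sketched end to end on toy data. This is only an illustration under stated assumptions: the function names (`top_fraction`, `select`), the `{benchmark: {text: score}}` input layout, rounding when converting fractions to counts, and keeping a duplicate's best score across benchmarks are all hypothetical, since the card does not say how scores are combined or how ties are handled.

```python
from typing import Dict


def top_fraction(scores: Dict[str, float], fraction: float) -> Dict[str, float]:
    """Keep the highest-scoring `fraction` of entries (at least one)."""
    keep = max(1, round(len(scores) * fraction))
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:keep])


def select(per_benchmark: Dict[str, Dict[str, float]],
           first_cut: float = 0.10,
           second_cut: float = 0.30,
           drop_top: float = 0.03,
           min_chars: int = 200) -> Dict[str, float]:
    """Apply the staged filtering to {benchmark: {text: score}} mappings."""
    combined: Dict[str, float] = {}
    for scores in per_benchmark.values():
        # Top 10% per benchmark, then the top 30% of that subset.
        kept = top_fraction(top_fraction(scores, first_cut), second_cut)
        for text, score in kept.items():
            # Keying by text gives exact-match de-duplication.
            # Assumption: a duplicate keeps its best score.
            combined[text] = max(score, combined.get(text, float("-inf")))

    # Drop the very highest scores, which may be chance matches.
    ranked = sorted(combined.items(), key=lambda kv: kv[1], reverse=True)
    n_drop = int(len(ranked) * drop_top)
    survivors = ranked[n_drop:]

    # Drop samples shorter than the minimum character length.
    return {text: score for text, score in survivors if len(text) >= min_chars}


# Toy input: 1000 synthetic "documents" per benchmark with made-up scores.
per_benchmark = {
    name: {f"{name} doc {i} " + "x" * (i % 400): i / 1000 for i in range(1000)}
    for name in ("arc", "mmlu")
}
selected = select(per_benchmark)
print(f"{len(selected)} samples selected")
```

The default arguments mirror the percentages and the 200-character limit stated in the description; everything else about the control flow is a guess at one reasonable reading.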



