crumb/c4-benchfilter-nano

Name: crumb/c4-benchfilter-nano
Creator: crumb
Published: 2023-10-22 19:22:56
License: no description provided

Hugging Face, updated 2023-10-22, indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/crumb/c4-benchfilter-nano

Resource overview:

---
language_creators:
- found
language:
- en
license: odc-by
source_datasets:
- c4
task_categories:
- text-generation
- fill-mask
task_ids:
- language-modeling
- masked-language-modeling
dataset_info:
  features:
  - name: text
    dtype: string
  - name: score
    dtype: float64
  splits:
  - name: train
    num_bytes: 373897649.51453334
    num_examples: 278115
  download_size: 242478448
  dataset_size: 373897649.51453334
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
size_categories:
- 100K<n<1M
---

# crumb/c4-benchfilter-nano

A 278k-sample subset derived from the first 3M samples of the C4 dataset, intended for cheap, short continued pretraining of language models. The goal is to improve benchmark scores without sacrificing generalization or generative modelling unrelated to chat or 'instruct' data.

For each of the selected benchmark datasets (arc, truthful_qa, hellaswag, mmlu, humaneval), 1k benchmark samples are used to estimate a length-normalized n-gram overlap score (the mean of tri-, quad-, and penta-gram overlap) for samples within the first 3M samples of C4, and the estimated top 10% are kept. The top-scoring samples for each benchmark are then filtered again to the top 30% of scores, combined, and exact-match de-duplicated. Finally, the top 3% of scores and samples shorter than 200 characters are removed, because they likely contain large exact n-token matches by chance, such as exact dates or times, that aren't actually relevant to the data.\*

\*Upon further examination, some of these samples are still present throughout the data, albeit at a much lower frequency than before. You might benefit from using `dataset.filter(lambda x: x['score'] > thresh)` for some threshold, but you risk losing high-quality samples as well; this tradeoff should be examined carefully before training.
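The card does not spell out how the overlap score is computed. As a rough illustration only, the sketch below assumes lowercase whitespace tokenization, set-based matching, and normalization by the number of n-grams in the candidate text. The names `ngrams`, `build_references`, and `overlap_score` are hypothetical and not taken from the dataset's actual tooling.

```python
from typing import Dict, Iterable, List, Set, Tuple

Ngram = Tuple[str, ...]
NGRAM_SIZES = (3, 4, 5)  # tri-, quad-, and penta-grams, as named in the card


def ngrams(tokens: List[str], n: int) -> List[Ngram]:
    """Return all contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_references(benchmark_texts: Iterable[str]) -> Dict[int, Set[Ngram]]:
    """Collect, for each n, the set of n-grams seen in the benchmark samples."""
    refs: Dict[int, Set[Ngram]] = {n: set() for n in NGRAM_SIZES}
    for text in benchmark_texts:
        tokens = text.lower().split()  # assumption: whitespace tokenization
        for n in NGRAM_SIZES:
            refs[n].update(ngrams(tokens, n))
    return refs


def overlap_score(text: str, refs: Dict[int, Set[Ngram]]) -> float:
    """Mean over n of the fraction of the text's n-grams found in the benchmark."""
    tokens = text.lower().split()
    fractions = []
    for n in NGRAM_SIZES:
        grams = ngrams(tokens, n)
        if not grams:
            fractions.append(0.0)  # text too short to form any n-gram
            continue
        hits = sum(1 for g in grams if g in refs[n])
        fractions.append(hits / len(grams))  # length normalization
    return sum(fractions) / len(fractions)


# Toy data, invented for illustration.
benchmark = ["the capital of france is paris and it is large"]
related = "we learned that the capital of france is paris"
unrelated = "bananas grow quickly in warm tropical climates everywhere"

refs = build_references(benchmark)
print(overlap_score(related, refs) > overlap_score(unrelated, refs))
```

Dividing by the candidate's own n-gram count keeps long documents from scoring higher simply because they contain more n-grams.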

Provider:

crumb

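Summary of original information

Dataset overview

Basic information

Language creators: found
Language: English
License: odc-by
Source dataset: c4
Task categories:
- Text generation
- Fill mask
Task IDs:
- Language modeling
- Masked language modeling

Dataset structure

Features:
- text: string
- score: floating point (float64)
Splits:
- train:
  - Size in bytes: 373897649.51453334
  - Number of examples: 278115
Download size: 242478448 bytes
Dataset size: 373897649.51453334 bytes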
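As a quick sanity check on these figures, the following lines derive the average size per example and the ratio of download size to dataset size. The variable names are illustrative; the numbers are copied from the listing above.

```python
# Figures copied from the dataset structure listing.
num_bytes = 373897649.51453334
num_examples = 278115
download_size = 242478448

avg_bytes_per_example = num_bytes / num_examples
download_ratio = download_size / num_bytes

print(f"{avg_bytes_per_example:.0f} bytes per example on average")
print(f"download is {download_ratio:.0%} of the dataset size")
```

This works out to roughly 1.3 KB of text per example, with the download being about 65% of the dataset size.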

Configuration

Config name: default
Data files:
- split: train
  path: data/train-*
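Since there is a single `default` config with one `train` split, loading should need no extra arguments. This is a minimal sketch that assumes the Hugging Face `datasets` library is installed and the repository is reachable; `load_train_split` is a hypothetical helper name.

```python
REPO_ID = "crumb/c4-benchfilter-nano"


def load_train_split():
    """Load the only split of the default config (downloads on first use)."""
    # Imported lazily so that defining this helper does not require the
    # `datasets` package; calling it does.
    from datasets import load_dataset

    return load_dataset(REPO_ID, split="train")
```

Calling `load_train_split()` returns a dataset whose rows carry the `text` and `score` fields listed above, for example `ds[0]["score"]`.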

Size category

100K<n<1M

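Dataset description

This dataset is a 278k-sample subset drawn from the first 3 million samples of C4, intended for cheap, short continued pretraining of language models. It aims to improve benchmark scores without sacrificing generalization or generative modelling ability, and it is unrelated to chat or "instruct" data.
Based on 1k samples from each selected benchmark dataset (arc, truthful_qa, hellaswag, mmlu, humaneval), the C4 samples with the estimated top 10% of length-normalized n-gram overlap scores (tri-, quad-, and penta-grams) are selected. The top-scoring samples for each benchmark are filtered again to the top 30% of scores, then combined and exact-match de-duplicated. Finally, the top 3% of scores and samples shorter than 200 characters are removed, because they may contain large exact n-token matches by chance, such as exact dates or times, that are not actually relevant to the data.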
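The selection steps described above can be sketched roughly as follows. This is an illustration under assumptions, not the author's script: samples are `(text, score)` pairs that have already been scored per benchmark, a duplicated text keeps its highest score, and `top_percent` and `select` are hypothetical names.

```python
from typing import Dict, List, Tuple

Sample = Tuple[str, float]  # (text, score)


def top_percent(samples: List[Sample], pct: int) -> List[Sample]:
    """Keep the highest-scoring pct% of samples (at least one if non-empty)."""
    if not samples:
        return []
    k = max(1, len(samples) * pct // 100)
    return sorted(samples, key=lambda s: s[1], reverse=True)[:k]


def select(scored_by_benchmark: Dict[str, List[Sample]],
           min_chars: int = 200) -> List[Sample]:
    """Apply the 10% -> 30% -> merge/dedup -> drop top 3% -> length filter steps."""
    combined: Dict[str, float] = {}
    for samples in scored_by_benchmark.values():
        stage1 = top_percent(samples, 10)  # top 10% per benchmark
        stage2 = top_percent(stage1, 30)   # top 30% of those
        for text, score in stage2:
            # Exact-match de-duplication; keeping the best score is an assumption.
            combined[text] = max(score, combined.get(text, score))

    merged = sorted(combined.items(), key=lambda s: s[1], reverse=True)
    n_drop = len(merged) * 3 // 100        # drop the top 3% of scores
    trimmed = merged[n_drop:]
    return [(t, s) for t, s in trimmed if len(t) >= min_chars]
```

Integer percentages are used so that the cut-off counts are exact rather than subject to floating-point rounding.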
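Further inspection found that some of these samples are still present in the data, although far less frequently than before. You may benefit from applying `dataset.filter(lambda x: x['score'] > thresh)` with some threshold, but you may also lose high-quality samples, so this tradeoff should be weighed carefully before training.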
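One way to examine this tradeoff before training is to pick the threshold from a percentile of the score column and count what a cut would remove. The sketch below works on plain dictionaries so it runs without any download. The description writes the predicate as `score > thresh`, but since the leftover chance matches are said to sit among the highest scores, this sketch assumes the aim is to drop rows above a cutoff; flip the comparison if the opposite is intended. The function names are hypothetical.

```python
from typing import Dict, List

Row = Dict[str, object]  # expects "text" (str) and "score" (float) keys


def score_at_percentile(rows: List[Row], pct: int) -> float:
    """Nearest-rank percentile of the score column, for integer pct in 1..100."""
    scores = sorted(float(r["score"]) for r in rows)
    rank = (len(scores) * pct + 99) // 100  # integer ceiling of n * pct / 100
    return scores[max(0, min(len(scores), rank) - 1)]


def drop_high_scores(rows: List[Row], thresh: float) -> List[Row]:
    """Keep rows whose score is at or below the threshold."""
    return [r for r in rows if float(r["score"]) <= thresh]


# Toy rows, invented for illustration.
rows = [{"text": f"sample {i}", "score": i / 10} for i in range(1, 11)]
thresh = score_at_percentile(rows, 90)
kept = drop_high_scores(rows, thresh)
print(f"threshold={thresh}, kept {len(kept)} of {len(rows)} rows")
```

With the `datasets` library, the same predicate would be passed as `dataset.filter(lambda x: x["score"] <= thresh)`. Reading a sample of the rows that would be removed at a few different percentiles is a cheap way to judge how many good samples a given cutoff costs.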
