
crumb/c4-benchfilter-nano

Hugging Face, updated 2023-10-22, indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/crumb/c4-benchfilter-nano
Resource description:
---
language_creators:
- found
language:
- en
license: odc-by
source_datasets:
- c4
task_categories:
- text-generation
- fill-mask
task_ids:
- language-modeling
- masked-language-modeling
dataset_info:
  features:
  - name: text
    dtype: string
  - name: score
    dtype: float64
  splits:
  - name: train
    num_bytes: 373897649.51453334
    num_examples: 278115
  download_size: 242478448
  dataset_size: 373897649.51453334
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
size_categories:
- 100K<n<1M
---

# crumb/c4-benchfilter-nano

A 278k-sample derivation of the first 3M samples of the C4 dataset, intended for cheap, short continued pretraining of language models. The goal is to optimize for benchmark scores without sacrificing generalization or generative modelling unrelated to chat or 'instruct' data.

For each of the selected benchmark datasets (arc, truthful_qa, hellaswag, mmlu, humaneval), samples within the first 3M samples of C4 are scored by their length-normalized n-gram overlap with the benchmark (mean of tri-, quad-, and penta-gram overlap), and the estimated top 10% are kept, with the estimate based on 1k samples. The top-scoring sample sets for each benchmark are then filtered again to the top 30% of scores, combined, and exact-match de-duplicated. Finally, the top 3% of scores and samples shorter than 200 characters are removed, because they likely contain large exact n-token matches by chance, such as exact dates or times, that aren't actually relevant to the data.\*

\*Upon further examination, some of these samples are still present throughout the data, albeit at a much lower frequency than before. You might benefit from using `dataset.filter(lambda x: x['score'] > thresh)` for some threshold, but you risk losing high-quality samples as well. This tradeoff should be examined carefully before training.
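The scoring step described above can be sketched roughly as follows. This is only an illustration, not the author's code: whitespace tokenization, lowercasing, normalization by the number of n-grams in the candidate text, and the helper names `ngrams`, `build_reference`, `overlap_score`, and `top_fraction` are all assumptions, since the card does not specify them.

```python
from typing import Dict, Iterable, List, Set, Tuple

NGRAM_SIZES = (3, 4, 5)  # tri-, quad-, and penta-grams, as named in the card

Gram = Tuple[str, ...]


def ngrams(tokens: List[str], n: int) -> List[Gram]:
    """Return all contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_reference(benchmark_texts: Iterable[str], n: int) -> Set[Gram]:
    """Collect the set of n-grams seen in the benchmark samples.

    Assumption: lowercased whitespace tokens stand in for the real tokenizer.
    """
    ref: Set[Gram] = set()
    for text in benchmark_texts:
        ref.update(ngrams(text.lower().split(), n))
    return ref


def overlap_score(text: str, references: Dict[int, Set[Gram]]) -> float:
    """Mean over n of (matching n-grams / total n-grams in the text).

    Dividing by the n-gram count is one possible length normalization.
    """
    tokens = text.lower().split()
    per_n = []
    for n, ref in references.items():
        grams = ngrams(tokens, n)
        if not grams:  # text shorter than n tokens
            per_n.append(0.0)
            continue
        hits = sum(1 for g in grams if g in ref)
        per_n.append(hits / len(grams))
    return sum(per_n) / len(per_n)


def top_fraction(scored: List[Tuple[str, float]], fraction: float) -> List[Tuple[str, float]]:
    """Keep the highest-scoring fraction of (text, score) pairs."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]
```

Under these assumptions, one would build `references = {n: build_reference(benchmark_texts, n) for n in NGRAM_SIZES}` per benchmark, score each C4 sample with `overlap_score`, and apply `top_fraction` with 0.10 and then 0.30 before combining and de-duplicating.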
Provider:
crumb
Summary of original information

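Dataset overview

Basic information

  • Language creators: found
  • Language: English
  • License: odc-by
  • Source dataset: c4
  • Task categories:
    • Text generation
    • Fill-mask
  • Task IDs:
    • Language modeling
    • Masked language modeling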

Dataset structure

  • Features:
    • text: string
    • score: floating point (float64)
  • Splits:
    • train:
      • Size in bytes: 373897649.51453334
      • Number of examples: 278115
  • Download size: 242478448 bytes
  • Dataset size: 373897649.51453334 bytes

Configuration

  • Config name: default
  • Data files:
    • split: train
    • path: data/train-*

Size category

  • 100K<n<1M

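Dataset description

  • The dataset contains 278k samples derived from the first 3 million samples of C4. It is intended for cheap, short continued pretraining of language models, optimizing for benchmark scores without sacrificing generalization or generative modelling unrelated to chat or "instruct" data.
  • For each selected benchmark dataset (arc, truthful_qa, hellaswag, mmlu, humaneval), the estimated top 10% of C4 samples by length-normalized n-gram overlap (mean of tri-, quad-, and penta-gram overlap) are selected, with the estimate based on 1k samples. The top-scoring samples for each benchmark are then filtered again to the top 30% of scores, combined, and exact-match de-duplicated. Finally, the top 3% of scores and samples shorter than 200 characters are removed, because they likely contain large exact n-token matches by chance, such as exact dates or times, that are not actually relevant to the data.
  • Further examination found that some of these samples are still present in the data, although at a much lower frequency than before. You might benefit from using dataset.filter(lambda x: x['score'] > thresh) with some threshold, but you also risk losing high-quality samples, so this tradeoff should be weighed carefully before training.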
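To weigh that tradeoff, it can help to check how much data survives at a few candidate thresholds before committing to one. The sketch below does this in plain Python on rows shaped like the dataset's `text` and `score` features. The helper names `retained_fraction` and `keep_above` are hypothetical, and the comparison direction simply mirrors the card's `score > thresh` expression.

```python
from typing import Dict, Iterable, List


def retained_fraction(scores: List[float], thresholds: Iterable[float]) -> Dict[float, float]:
    """For each threshold, report the share of samples with score > threshold."""
    total = len(scores)
    if total == 0:
        return {t: 0.0 for t in thresholds}
    return {t: sum(s > t for s in scores) / total for t in thresholds}


def keep_above(rows: List[dict], thresh: float) -> List[dict]:
    """Plain-Python equivalent of dataset.filter(lambda x: x['score'] > thresh)."""
    return [row for row in rows if row["score"] > thresh]


# With the Hugging Face `datasets` library, the same filter would look like
# this (requires network access, so it is left as a comment here):
#   from datasets import load_dataset
#   dataset = load_dataset("crumb/c4-benchfilter-nano", split="train")
#   filtered = dataset.filter(lambda x: x["score"] > thresh)
```

Inspecting a handful of the rows that each threshold would drop is a cheap way to judge whether the filter removes chance matches or useful text.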