
patrickjmcbride/math-instruct-binned

Hugging Face, updated 2024-06-06, indexed 2024-06-15
Download link:
https://hf-mirror.com/datasets/patrickjmcbride/math-instruct-binned
Resource description:
---
dataset_info:
  features:
  - name: instruction
    dtype: string
  - name: context
    dtype: string
  - name: output
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: small
    num_bytes: 1864543974
    num_examples: 231315
  - name: medium
    num_bytes: 5961505276
    num_examples: 584345
  - name: large
    num_bytes: 2026722938
    num_examples: 154320
  download_size: 4895917273
  dataset_size: 9852772188
configs:
- config_name: default
  data_files:
  - split: small
    path: data/small-*
  - split: medium
    path: data/medium-*
  - split: large
    path: data/large-*
tags:
- math
- mathematics
- probability
- statistics
- algebra
- liner algebra
license: apache-2.0
task_categories:
- text-generation
size_categories:
- 100K<n<1M
---

This is a pre-binned, instruction-formatted version of Feynman Innovations' [Maths-College](https://huggingface.co/datasets/ajibawa-2023/Maths-College) dataset. Credit to [Feynman Innovations](https://huggingface.co/ajibawa-2023) for the base dataset.

It is formatted to be ready for fine-tuning an instruct model. The splits are based on the size of the full instruction ('text') after being tokenized with the Llama-3-8B-Instruct tokenizer (based on tiktoken).

A non-split version is available as [math-instruct-dataset](https://huggingface.co/datasets/patrickjmcbride/math-instruct-dataset).

Binned by length of tokenized 'text' field:

- small: [min, 1024)
- medium: [1024, 1536)
- large: [1536, max]

# Data Fields

The data fields are as follows:

- instruction: describes the task that the model needs to perform (all instructions are the same: "Write an educational piece related to the following text snippet:")
- context: additional context containing the math concept to explain
- output: an in-depth explanation of the concept from the context
- text: the instruction, context and output formatted with a prompt template to be used for fine-tuning

# Splits

- small: 231,315 (24%)
- medium: 584,345 (60%)
- large: 154,320 (16%)
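The binning rule described above can be sketched as a small function. This is a minimal illustration, not the author's actual preprocessing script: the function name `bin_for_length` and the constants are hypothetical, and the input is assumed to be the token count of the 'text' field as already produced by the Llama-3-8B-Instruct tokenizer.

```python
# Hypothetical constants mirroring the documented bin edges.
SMALL_MAX = 1024   # exclusive upper bound of the "small" bin
MEDIUM_MAX = 1536  # exclusive upper bound of the "medium" bin


def bin_for_length(n_tokens: int) -> str:
    """Return the split name for a tokenized 'text' length.

    Bins follow the dataset card: small is [min, 1024),
    medium is [1024, 1536), large is [1536, max].
    """
    if n_tokens < 0:
        raise ValueError("token count must be non-negative")
    if n_tokens < SMALL_MAX:
        return "small"
    if n_tokens < MEDIUM_MAX:
        return "medium"
    return "large"


# Boundary values show which side of each edge is inclusive.
for n in (1023, 1024, 1535, 1536):
    print(n, bin_for_length(n))
```

Because the lower edge of each bin is inclusive, a sample with exactly 1024 tokens lands in medium and one with exactly 1536 tokens lands in large.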
Provided by:
patrickjmcbride
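Summary of original information

Dataset overview

Data features

  • instruction: describes the task the model needs to perform (all instructions are the same: "Write an educational piece related to the following text snippet:")
  • context: additional context containing the math concept to explain
  • output: an in-depth explanation of the concept from the context
  • text: the instruction, context and output formatted with a prompt template for fine-tuning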
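To make the four-field schema concrete, here is a minimal sketch of a record check. The helper `validate_record` and the sample record are hypothetical and written only for illustration; the field names and the shared instruction string come from the dataset card. The comparison strips whitespace, on the assumption that stored instructions might carry trailing spaces or newlines.

```python
EXPECTED_INSTRUCTION = (
    "Write an educational piece related to the following text snippet:"
)
FIELDS = ("instruction", "context", "output", "text")


def validate_record(record: dict) -> None:
    """Raise if a record does not match the documented schema."""
    for name in FIELDS:
        if not isinstance(record.get(name), str):
            raise TypeError(f"field {name!r} must be a string")
    # The card states that every record shares the same instruction.
    if record["instruction"].strip() != EXPECTED_INSTRUCTION:
        raise ValueError("unexpected instruction text")


# Hypothetical record, invented for illustration only.
sample = {
    "instruction": EXPECTED_INSTRUCTION,
    "context": "A matrix is invertible if and only if ...",
    "output": "Invertibility of a matrix can be characterized ...",
    "text": "<prompt template applied to the three fields above>",
}
validate_record(sample)
print("record ok")
```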

Data splits

  • small: 231,315 samples, 24% of the total
  • medium: 584,345 samples, 60% of the total
  • large: 154,320 samples, 16% of the total
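The percentages can be reproduced from the sample counts. The short sketch below uses only the counts listed for the three splits; the variable names are illustrative.

```python
# Sample counts as listed for each split.
SPLIT_COUNTS = {"small": 231_315, "medium": 584_345, "large": 154_320}

total = sum(SPLIT_COUNTS.values())

# Share of each split, rounded to the nearest whole percent.
shares = {name: round(100 * n / total) for name, n in SPLIT_COUNTS.items()}

print(total)   # 969980
print(shares)  # {'small': 24, 'medium': 60, 'large': 16}
```

The total of 969,980 samples is consistent with the 100K<n<1M size category.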

Dataset size

  • Download size: 4,895,917,273 bytes
  • Dataset size: 9,852,772,188 bytes
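As a quick sanity check on these figures, the sketch below converts the byte counts to GiB and confirms that the per-split `num_bytes` values from the dataset metadata add up to the reported dataset size. The helper `to_gib` is an illustrative name, not part of any library.

```python
DOWNLOAD_BYTES = 4_895_917_273
DATASET_BYTES = 9_852_772_188

# Per-split num_bytes values from the dataset_info metadata.
SPLIT_BYTES = {
    "small": 1_864_543_974,
    "medium": 5_961_505_276,
    "large": 2_026_722_938,
}


def to_gib(n_bytes: int) -> float:
    """Convert a byte count to gibibytes (2**30 bytes)."""
    return n_bytes / 2**30


# The split sizes should account for the whole dataset.
assert sum(SPLIT_BYTES.values()) == DATASET_BYTES

ratio = DOWNLOAD_BYTES / DATASET_BYTES
print(f"download: {to_gib(DOWNLOAD_BYTES):.2f} GiB")
print(f"on disk:  {to_gib(DATASET_BYTES):.2f} GiB")
print(f"download / on-disk ratio: {ratio:.2f}")
```

The download is roughly half the size of the prepared dataset, so plan disk space around the larger figure.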

Configuration

  • default:
    • small: data path data/small-*
    • medium: data path data/medium-*
    • large: data path data/large-*
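Since each split maps to its own set of data files, a single bin can be loaded without fetching the others. The following is a minimal sketch assuming the Hugging Face `datasets` library is installed; the wrapper `load_split` is a hypothetical helper, and it defaults to streaming so that nothing is downloaded in full up front.

```python
REPO_ID = "patrickjmcbride/math-instruct-binned"
SPLITS = ("small", "medium", "large")


def load_split(split: str, streaming: bool = True):
    """Load one length bin of the dataset by split name."""
    if split not in SPLITS:
        raise ValueError(f"unknown split {split!r}; expected one of {SPLITS}")
    # Imported lazily so the name check above works without the library.
    # Assumes `pip install datasets` has been run.
    from datasets import load_dataset

    return load_dataset(REPO_ID, split=split, streaming=streaming)
```

For example, `next(iter(load_split("small")))` would fetch the first record of the small bin over the network. Passing `streaming=False` downloads and caches the whole split instead.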

Tags

  • math
  • mathematics
  • probability
  • statistics
  • algebra
  • liner algebra

License

  • apache-2.0

Task categories

  • text-generation

Size categories

  • 100K<n<1M