patrickjmcbride/math-instruct-binned

Name: patrickjmcbride/math-instruct-binned
Creator: patrickjmcbride
Published: 2024-06-06 00:10:19
License: apache-2.0

Source: Hugging Face. Updated 2024-06-06; indexed 2024-06-15.

Download link:

https://hf-mirror.com/datasets/patrickjmcbride/math-instruct-binned

Resource description:

---
dataset_info:
  features:
  - name: instruction
    dtype: string
  - name: context
    dtype: string
  - name: output
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: small
    num_bytes: 1864543974
    num_examples: 231315
  - name: medium
    num_bytes: 5961505276
    num_examples: 584345
  - name: large
    num_bytes: 2026722938
    num_examples: 154320
  download_size: 4895917273
  dataset_size: 9852772188
configs:
- config_name: default
  data_files:
  - split: small
    path: data/small-*
  - split: medium
    path: data/medium-*
  - split: large
    path: data/large-*
tags:
- math
- mathematics
- probability
- statistics
- algebra
- liner algebra
license: apache-2.0
task_categories:
- text-generation
size_categories:
- 100K<n<1M
---

This is a pre-binned, instruction-formatted version of Feynman Innovations' [Maths-College](https://huggingface.co/datasets/ajibawa-2023/Maths-College) dataset. Credit to [Feynman Innovations](https://huggingface.co/ajibawa-2023) for the base dataset.

It is formatted to be ready for fine-tuning an instruct model. The splits are based on the length of the full instruction ('text') after tokenization with the Llama-3-8B-Instruct tokenizer (based on tiktoken).

A non-split version is available as [math-instruct-dataset](https://huggingface.co/datasets/patrickjmcbride/math-instruct-dataset).

Binned by length of the tokenized 'text' field:

- small: [min, 1024)
- medium: [1024, 1536)
- large: [1536, max]

# Data Fields

The data fields are as follows:

- instruction: describes the task that the model needs to perform (all instructions are the same: "Write an educational piece related to the following text snippet:")
- context: additional context containing the math concept to explain
- output: an in-depth explanation of the concept from the context
- text: the instruction, context and output formatted with a prompt template to be used for fine-tuning

# Splits

- small: 231,315 (24%)
- medium: 584,345 (60%)
- large: 154,320 (16%)
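The binning rule described above can be sketched as a small function. This is only an illustration under stated assumptions: `assign_bin`, `SMALL_MAX` and `MEDIUM_MAX` are hypothetical names that do not come from the dataset card, and the token count is taken as a plain integer so the example does not depend on any particular tokenizer API.

```python
# Bin edges taken from the dataset card:
#   small: [min, 1024), medium: [1024, 1536), large: [1536, max]
SMALL_MAX = 1024    # hypothetical constant name
MEDIUM_MAX = 1536   # hypothetical constant name


def assign_bin(num_tokens: int) -> str:
    """Return the split name for a tokenized 'text' length (hypothetical helper)."""
    if num_tokens < 0:
        raise ValueError("num_tokens must be non-negative")
    if num_tokens < SMALL_MAX:
        return "small"
    if num_tokens < MEDIUM_MAX:
        return "medium"
    return "large"


# Check the behaviour at the bin edges.
for n in (1023, 1024, 1535, 1536):
    print(n, assign_bin(n))
```

In practice the token count would presumably come from the Llama-3-8B-Instruct tokenizer applied to the 'text' field, as the card states; how the author actually ran that step is not shown in the source.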

Provider:

patrickjmcbride

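Summary of original information

Dataset overview

Data features

instruction: describes the task the model needs to perform (all instructions are the same: "Write an educational piece related to the following text snippet:")
context: additional context containing the math concept to explain
output: an in-depth explanation of the concept from the context
text: the instruction, context and output formatted with a prompt template for fine-tuning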

Data splits

small: 231,315 examples, 24% of the total
medium: 584,345 examples, 60% of the total
large: 154,320 examples, 16% of the total
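The percentages listed for the splits can be re-derived from the example counts. The snippet below is a quick arithmetic check; the variable names are illustrative only.

```python
# Example counts per split, as listed in the dataset card.
split_sizes = {"small": 231_315, "medium": 584_345, "large": 154_320}

total = sum(split_sizes.values())

# Share of each split, rounded to the nearest whole percent.
shares = {name: round(100 * n / total) for name, n in split_sizes.items()}

print(total)   # 969980
print(shares)  # {'small': 24, 'medium': 60, 'large': 16}
```

The total of 969,980 examples is consistent with the 100K<n<1M size category reported for the dataset.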

Dataset size

Download size: 4,895,917,273 bytes
Dataset size: 9,852,772,188 bytes
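As a sanity check on these figures, the per-split `num_bytes` values from the dataset metadata should add up to the reported dataset size, and the download is roughly half that size. A minimal sketch, with illustrative variable names:

```python
# Per-split byte counts from the dataset_info metadata.
split_bytes = {
    "small": 1_864_543_974,
    "medium": 5_961_505_276,
    "large": 2_026_722_938,
}
DATASET_SIZE = 9_852_772_188   # reported dataset_size, in bytes
DOWNLOAD_SIZE = 4_895_917_273  # reported download_size, in bytes

total_bytes = sum(split_bytes.values())
ratio = DOWNLOAD_SIZE / DATASET_SIZE


def to_gib(num_bytes: int) -> float:
    """Convert bytes to gibibytes (2**30 bytes)."""
    return num_bytes / 2**30


print(total_bytes == DATASET_SIZE)
print(f"download/dataset ratio: {ratio:.3f}")
print(f"dataset: {to_gib(DATASET_SIZE):.2f} GiB, download: {to_gib(DOWNLOAD_SIZE):.2f} GiB")
```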

Configuration

default:
- small: data path is data/small-*
- medium: data path is data/medium-*
- large: data path is data/large-*
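The configuration maps each split to a path pattern. The sketch below shows how such patterns could be matched against file names using Python's standard `fnmatch` module. It assumes plain shell-style wildcard semantics, which may differ in detail from how Hugging Face resolves `data_files`; `split_for_path` and the shard file name are hypothetical.

```python
from fnmatch import fnmatchcase
from typing import Optional

# Split-to-pattern mapping from the "default" config.
DATA_FILES = {
    "small": "data/small-*",
    "medium": "data/medium-*",
    "large": "data/large-*",
}


def split_for_path(path: str) -> Optional[str]:
    """Return the split whose pattern matches `path`, or None (hypothetical helper)."""
    for split, pattern in DATA_FILES.items():
        if fnmatchcase(path, pattern):
            return split
    return None


# The shard name below is made up for illustration only.
print(split_for_path("data/medium-00000-of-00012.parquet"))  # medium
print(split_for_path("README.md"))                           # None
```

When loading through the Hugging Face `datasets` library, one would normally just pass the split name (for example `split="small"`) to `load_dataset` rather than resolving paths by hand.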

License

apache-2.0

Task categories

text-generation

Size categories

100K<n<1M
