StackMix-v0.1

Name: StackMix-v0.1
Creator: maas
Published: 2025-11-27 16:32:33
License: 暂无描述

魔搭社区2025-11-27 更新2025-05-10 收录

下载链接：

https://modelscope.cn/datasets/kalomaze/StackMix-v0.1

下载链接

链接失效反馈

官方服务：

资源简介：

An experimental small-ish dataset I whipped up in an afternoon. What I did: - Curated StackOverflow and other StackExchange subforums for some good examples that were **not** GPT generated, including writing advice & other real world questions. - Duplicated said examples with different prompt formatting (so it would generalize to both Alpaca & the official llama2-chat prompt layouts) - Two long context outliers to ensure long context works (a TV episode script, 10k tokens, and the first few chapters of 1984, 32k tokens.) - Another example which is a combination of the one shot responses one after the other in a long context (to help teach the model to sometimes ignore older parts of context when appropriate and not overfit/repeat) This comes out to about ~60k tokens total, give or take.

本数据集为一款实验性轻量型数据集，由我在一个下午内仓促搭建完成。具体制作流程如下： - 从StackOverflow及其他StackExchange子社区中精选了一批非GPT生成的优质示例，涵盖写作建议及其他真实场景下的问题。 - 针对上述示例，采用不同的提示词格式进行复刻，以确保模型可同时适配Alpaca与官方llama2-chat的提示词模板。 - 设置两个长上下文异常样本以验证长上下文处理能力：其一为一段10,000词元（Token）的电视剧本，其二为《一九八四》的前几章，共计32,000词元（Token）。 - 额外设置一个示例：将多个单样本响应按顺序拼接为长上下文，用于帮助模型学会在合适的场景下忽略旧的上下文内容，避免过拟合或重复生成。该数据集总词元数约为60,000词元，上下略有偏差。

提供机构：

maas

创建时间：

2025-05-06

5,000+

优质数据集

54 个

任务类型

进入经典数据集