StackMix-v0.1
收藏魔搭社区2025-11-27 更新2025-05-10 收录
下载链接:
https://modelscope.cn/datasets/kalomaze/StackMix-v0.1
下载链接
链接失效反馈官方服务:
资源简介:
An experimental small-ish dataset I whipped up in an afternoon.
What I did:
- Curated StackOverflow and other StackExchange subforums for some good examples that were **not** GPT generated, including writing advice & other real world questions.
- Duplicated said examples with different prompt formatting (so it would generalize to both Alpaca & the official llama2-chat prompt layouts)
- Two long context outliers to ensure long context works (a TV episode script, 10k tokens, and the first few chapters of 1984, 32k tokens.)
- Another example which is a combination of the one shot responses one after the other in a long context (to help teach the model to sometimes ignore older parts of context when appropriate and not overfit/repeat)
This comes out to about ~60k tokens total, give or take.
本数据集为一款实验性轻量型数据集,由我在一个下午内仓促搭建完成。
具体制作流程如下:
- 从StackOverflow及其他StackExchange子社区中精选了一批非GPT生成的优质示例,涵盖写作建议及其他真实场景下的问题。
- 针对上述示例,采用不同的提示词格式进行复刻,以确保模型可同时适配Alpaca与官方llama2-chat的提示词模板。
- 设置两个长上下文异常样本以验证长上下文处理能力:其一为一段10,000词元(Token)的电视剧本,其二为《一九八四》的前几章,共计32,000词元(Token)。
- 额外设置一个示例:将多个单样本响应按顺序拼接为长上下文,用于帮助模型学会在合适的场景下忽略旧的上下文内容,避免过拟合或重复生成。
该数据集总词元数约为60,000词元,上下略有偏差。
提供机构:
maas
创建时间:
2025-05-06



