tulu-v2-sft-mixture-olmo-2048
收藏魔搭社区2025-07-16 更新2025-05-31 收录
下载链接:
https://modelscope.cn/datasets/allenai/tulu-v2-sft-mixture-olmo-2048
下载链接
链接失效反馈官方服务:
资源简介:
# Dataset Card for Tulu V2 Mix (2048 OLMo version)
*Note the [ODC-BY license](https://opendatacommons.org/licenses/by/1-0/), indicating that different licenses apply to subsets of the data. This means that some portions of the dataset are non-commercial. We present the mixture as a research artifact.*
This is a modified version of the [Tulu V2 Mix](https://huggingface.co/datasets/allenai/tulu-v2-sft-mixture) used to train [OLMo-Instruct](https://huggingface.co/allenai/OLMo-7B-Instruct).
The two primary differences are: long conversations are resplit into 2048-token chunks, and the hardcoded subset has been replaced with similar examples about OLMo rather than Tulu.
Please see the original [Tulu V2 Mix dataset card](https://huggingface.co/datasets/allenai/tulu-v2-sft-mixture) for details!
### License
We are releasing this dataset under the terms of [ODC-BY](https://opendatacommons.org/licenses/by/1-0/). By using this, you are also bound by the [Common Crawl terms of use](https://commoncrawl.org/terms-of-use/) in respect of the content contained in the dataset.
# Tulu V2混合数据集卡片(2048 Token OLMo版本)
*注:本数据集采用[ODC-BY许可协议(ODC-BY)](https://opendatacommons.org/licenses/by/1-0/),数据集各子集适用不同许可条款。这意味着本数据集的部分内容不可用于商业用途。本混合数据集仅作为研究成果发布。*
本数据集是用于训练[OLMo-Instruct](https://huggingface.co/allenai/OLMo-7B-Instruct)的[Tulu V2混合数据集(Tulu V2 Mix)](https://huggingface.co/datasets/allenai/tulu-v2-sft-mixture)的修改版本。
本次修改主要包含两处核心调整:其一,将长对话重新切割为2048 Token的片段;其二,将原数据集中硬编码的子集替换为围绕OLMo而非Tulu的相似示例。
如需了解详细信息,请参阅原始的[Tulu V2混合数据集卡片](https://huggingface.co/datasets/allenai/tulu-v2-sft-mixture)。
### 许可协议
本数据集采用[ODC-BY许可协议(ODC-BY)](https://opendatacommons.org/licenses/by/1-0/)发布。使用本数据集的用户,同时需遵守[Common Crawl使用条款](https://commoncrawl.org/terms-of-use/)中与数据集所含内容相关的规定。
提供机构:
maas
创建时间:
2025-05-28



