SciRIFF-train-mix
Source: ModelScope community. Updated 2025-08-01; indexed 2025-05-31.
Download link:
https://modelscope.cn/datasets/allenai/SciRIFF-train-mix
Description:
# SciRIFF training mix
This dataset includes the training mix used to train the SciTulu models described in our paper [SciRIFF: A Resource to Enhance Language Model Instruction-Following over Scientific Literature](https://arxiv.org/abs/2406.07835).
It contains 35K instances from the [SciRIFF](https://huggingface.co/datasets/allenai/SciRIFF) dataset (1,000 instances per train task), together with a matching number of instances randomly sampled from the [Tulu V2 mix](https://huggingface.co/datasets/allenai/tulu-v2-sft-mixture). See the dataset cards of the component datasets for more information.
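The composition described above (a fixed number of instances per task, plus an equal number of randomly sampled general-purpose instances) can be illustrated with a small sketch. This is not the authors' mixing script: the function name `build_mix`, its parameters, the fixed seed, and the choice to take fewer instances when a task has fewer than `per_task` available are all assumptions made for illustration.

```python
import random
from typing import Dict, List


def build_mix(
    science_by_task: Dict[str, List[dict]],
    general_pool: List[dict],
    per_task: int = 1000,
    seed: int = 0,
) -> List[dict]:
    """Hypothetical helper: sample up to `per_task` instances per task, then
    add an equal number of instances drawn without replacement from
    `general_pool`."""
    rng = random.Random(seed)
    science: List[dict] = []
    for task in sorted(science_by_task):  # sorted so the result is reproducible
        instances = science_by_task[task]
        k = min(per_task, len(instances))  # assumption: small tasks contribute all they have
        science.extend(rng.sample(instances, k))
    if len(general_pool) < len(science):
        raise ValueError("general_pool is too small to match the science subset")
    general = rng.sample(general_pool, len(science))
    mix = science + general
    rng.shuffle(mix)  # interleave the two sources
    return mix
```

With 1,000 instances for each task on the science side, the matching general-purpose sample makes the two halves of the mix equal in size.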
You can load the dataset as follows:
```python
import datasets
ds = datasets.load_dataset("allenai/SciRIFF-train-mix")
```
The instances are formatted the same way as in the Tulu V2 mix, with the following fields:
- `dataset`: Identifier for the dataset this instance belongs to.
- `id`: Unique ID for the instance. For SciRIFF instances, this is the same as the `_instance_id` field in the [SciRIFF dataset](https://huggingface.co/datasets/allenai/SciRIFF).
- `messages`: A list of messages, for instance:
```python
[
    {"role": "user", "content": "<user message>"},
    {"role": "assistant", "content": "<model response>"}
]
```
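As a rough sketch of how these fields might be used, the helpers below count instances per source dataset and turn a `messages` list into (prompt, response) pairs. The helper names and the toy records are hypothetical and only follow the field layout described above; the real `dataset` identifiers may differ.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def count_by_source(records: Iterable[dict]) -> Counter:
    """Count instances per value of the `dataset` field."""
    return Counter(r["dataset"] for r in records)


def to_turn_pairs(messages: List[dict]) -> List[Tuple[str, str]]:
    """Pair each user message with the assistant message that follows it."""
    pairs = []
    pending_user = None
    for m in messages:
        if m["role"] == "user":
            pending_user = m["content"]
        elif m["role"] == "assistant" and pending_user is not None:
            pairs.append((pending_user, m["content"]))
            pending_user = None
    return pairs


# Toy records in the documented format (identifiers are made up).
toy_records = [
    {
        "dataset": "toy_science_task",
        "id": "toy_science_task:0",
        "messages": [
            {"role": "user", "content": "Summarize the abstract."},
            {"role": "assistant", "content": "The paper studies X."},
        ],
    },
    {
        "dataset": "toy_general",
        "id": "toy_general:0",
        "messages": [
            {"role": "user", "content": "Hello"},
            {"role": "assistant", "content": "Hi there."},
        ],
    },
]

source_counts = count_by_source(toy_records)
first_pairs = to_turn_pairs(toy_records[0]["messages"])
```

The same helpers should work on records from the loaded dataset, for example `count_by_source(ds["train"])`, assuming the dataset exposes a `train` split.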
For more information on how to train models using this dataset, see our GitHub repository: https://github.com/allenai/SciRIFF.
Provider:
maas
Created:
2025-05-27



