
SciRIFF-train-mix

ModelScope community · updated 2025-08-01 · indexed 2025-05-31
Download link:
https://modelscope.cn/datasets/allenai/SciRIFF-train-mix
Description:
# SciRIFF training mix

This dataset includes the training mix used to train the SciTulu models described in our paper [SciRIFF: A Resource to Enhance Language Model Instruction-Following over Scientific Literature](https://arxiv.org/abs/2406.07835).

It contains 35K instances from the [SciRIFF](https://huggingface.co/datasets/allenai/SciRIFF) dataset (1,000 instances per train task), together with a matching number of instances randomly sampled from the [Tulu V2 mix](https://huggingface.co/datasets/allenai/tulu-v2-sft-mixture). See the dataset cards of the component datasets for more information.

You can load the dataset like:

```python
import datasets

ds = datasets.load_dataset("allenai/SciRIFF-train-mix")
```

The instances are formatted the same way as in the Tulu V2 mix, with the following fields:

- `dataset`: Identifier for the dataset this instance belongs to.
- `id`: Unique ID for the instance. For SciRIFF instances, this is the same as the `_instance_id` field in the [SciRIFF dataset](https://huggingface.co/datasets/allenai/SciRIFF).
- `messages`: A list of messages, for instance:

```python
[
    {"role": "user", "content": [user message]},
    {"role": "assistant", "content": [model response]}
]
```

For more information on how to train models using this dataset, see our GitHub repository: https://github.com/allenai/SciRIFF.
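The field layout described above can be checked on a few records before training. The following is a minimal sketch, not part of the official SciRIFF tooling: the helper names (`validate_instance`, `count_by_source`, `render_conversation`) and the sample records, including their `dataset` and `id` values, are hypothetical and only mirror the documented `dataset` / `id` / `messages` structure.

```python
from collections import Counter

# The three fields documented in the dataset card.
REQUIRED_FIELDS = ("dataset", "id", "messages")


def validate_instance(instance):
    """Return True if an instance has the documented fields and well-formed messages."""
    if any(field not in instance for field in REQUIRED_FIELDS):
        return False
    messages = instance["messages"]
    if not isinstance(messages, list) or not messages:
        return False
    # Each message is expected to be a dict with string "role" and "content".
    return all(
        isinstance(m, dict)
        and isinstance(m.get("role"), str)
        and isinstance(m.get("content"), str)
        for m in messages
    )


def count_by_source(instances):
    """Count instances per value of the `dataset` field."""
    return Counter(inst["dataset"] for inst in instances)


def render_conversation(messages):
    """Join a message list into plain text, one 'role: content' line per message."""
    return "\n".join(f"{m['role']}: {m['content']}" for m in messages)


# Hypothetical records for illustration only; real identifiers will differ.
sample = [
    {
        "dataset": "example_sciriff_task",
        "id": "example_sciriff_task:0",
        "messages": [
            {"role": "user", "content": "Summarize the abstract."},
            {"role": "assistant", "content": "The paper introduces a dataset."},
        ],
    },
    {
        "dataset": "example_tulu_source",
        "id": "example_tulu_source:0",
        "messages": [
            {"role": "user", "content": "Hello"},
            {"role": "assistant", "content": "Hi there"},
        ],
    },
]

if __name__ == "__main__":
    print(all(validate_instance(inst) for inst in sample))
    print(count_by_source(sample))
    print(render_conversation(sample[0]["messages"]))
```

The same helpers should work on records from the loaded dataset, for example by iterating over one of its splits; the split name is not stated in the card, so inspect the object returned by `load_dataset` first.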

Provider:
maas
Created:
2025-05-27