cerebras/Synth-Long-SFT32K

Name: cerebras/Synth-Long-SFT32K
Creator: cerebras
Published: 2025-02-19 17:33:11
License: No description available

Source: Hugging Face. Updated 2025-02-19. Indexed 2025-04-08.

Download link:

https://hf-mirror.com/datasets/cerebras/Synth-Long-SFT32K


Resource overview:

This repository contains augmented versions of several datasets: Synthetic-ConvQA, NarrativeQA, and RAG-TGE. They are used for long instruction-following training, with a maximum sequence length of 32,768 per example. The augmentation uses the RAFT (Retrieval Augmented Fine-Tuning) approach: each dataset's (passage, question, answer) triple is transformed into (true_passage, distractor_passage_0, ..., distractor_passage_k, question, answer), and the true passage is shuffled to a random position in the context. For Synthetic-ConvQA, additional syntactic questions are added. The NarrativeQA variant creates two clustering assignments based on questions and passages, and adds further examples from the same passage and question clusters. RAG-TGE applies the same augmentation strategy as Synthetic-ConvQA, and a Chinese translation of the RAFT-augmented RAG-TGE dataset is also provided.
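The RAFT-style transformation described above can be illustrated with a minimal sketch. This is not the repository's actual pipeline: the function name `raft_augment`, the `distractor_pool` argument, the default `k`, and the output field names are all assumptions made for illustration, and distractors are simply sampled at random rather than retrieved.

```python
import random
from typing import Dict, List, Optional


def raft_augment(
    passage: str,
    question: str,
    answer: str,
    distractor_pool: List[str],
    k: int = 3,
    rng: Optional[random.Random] = None,
) -> Dict[str, object]:
    """Hypothetical helper: mix the true passage with up to k distractors.

    Assumption: distractors are drawn uniformly from `distractor_pool`;
    the real dataset may select them differently.
    """
    rng = rng or random.Random()

    # Never pick the true passage itself as a distractor.
    candidates = [p for p in distractor_pool if p != passage]
    distractors = rng.sample(candidates, min(k, len(candidates)))

    # Shuffle so the true passage lands at a random position in the context.
    passages = distractors + [passage]
    rng.shuffle(passages)

    return {
        "context": "\n\n".join(passages),
        "passages": passages,
        "true_index": passages.index(passage),
        "question": question,
        "answer": answer,
    }
```

Passing a seeded `random.Random` makes the distractor choice and the position of the true passage reproducible, which may help when comparing training runs.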

Provided by:

cerebras
