cerebras/Synth-Long-SFT32K

Hugging Face, updated 2025-02-19, indexed 2025-04-08
Download link:
https://hf-mirror.com/datasets/cerebras/Synth-Long-SFT32K
Description:
This repository contains augmented versions of several datasets: Synthetic-ConvQA, NarrativeQA, and RAG-TGE. They are intended for long-context instruction-following training, with a maximum sequence length of 32,768 per example. The augmentation uses the RAFT (Retrieval Augmented Fine-Tuning) approach: each dataset's (passage, question, answer) triple is transformed into (true_passage, distractor_passage_0, ..., distractor_passage_k, question, answer), and the true passage is shuffled to a random position in the context. For the Synthetic-ConvQA dataset, additional syntactic questions are added. The NarrativeQA variant creates two clustering assignments based on the questions and passages, then adds other examples from the same passage and question clusters. The RAG-TGE dataset applies the same augmentation strategy as Synthetic-ConvQA, and a Chinese translation of the RAFT-augmented RAG-TGE dataset is also provided.
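
The RAFT-style transformation described above can be illustrated with a minimal sketch. This is not the repository's actual pipeline: the function name `raft_augment`, the `distractor_pool` argument, the output field names, and the uniform random sampling of distractors are all assumptions made for illustration, and truncation to the 32,768-token limit is not handled.

```python
import random
from typing import Dict, Sequence


def raft_augment(
    passage: str,
    question: str,
    answer: str,
    distractor_pool: Sequence[str],
    k: int,
    rng: random.Random,
) -> Dict[str, object]:
    """Build one RAFT-style example (hypothetical helper, not the dataset's code)."""
    # Any pool passage other than the true one can serve as a distractor.
    candidates = [p for p in distractor_pool if p != passage]
    if len(candidates) < k:
        raise ValueError("not enough distractor passages in the pool")
    distractors = rng.sample(candidates, k)

    # Insert the true passage at a random position among the distractors.
    context = list(distractors)
    true_index = rng.randrange(k + 1)
    context.insert(true_index, passage)

    return {
        "context": context,        # k + 1 passages in shuffled order
        "true_index": true_index,  # where the true passage ended up
        "question": question,
        "answer": answer,
    }


pool = ["passage A", "passage B", "passage C", "passage D"]
example = raft_augment(
    "passage A", "What is A?", "A", pool, k=2, rng=random.Random(0)
)
```

Passing an explicit `random.Random` instance keeps the shuffling reproducible, which makes it easier to regenerate the same augmented set later.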
Provided by:
cerebras