oumi-synthetic-claims

Name: oumi-synthetic-claims
Creator: maas
Published: 2025-12-05 16:29:53
License: 暂无描述

魔搭社区2025-12-05 更新2025-04-12 收录

下载链接：

https://modelscope.cn/datasets/oumi-ai/oumi-synthetic-claims

下载链接

链接失效反馈

官方服务：

资源简介：

[![oumi logo](https://oumi.ai/logo_lockup_black.svg)](https://github.com/oumi-ai/oumi) [![Made with Oumi](https://badgen.net/badge/Made%20with/Oumi/%23085CFF?icon=https%3A%2F%2Foumi.ai%2Flogo_dark.svg)](https://github.com/oumi-ai/oumi) [![Documentation](https://img.shields.io/badge/Documentation-oumi-blue.svg)](https://oumi.ai/docs/en/latest/index.html) [![Blog](https://img.shields.io/badge/Blog-oumi-blue.svg)](https://oumi.ai/blog) [![Discord](https://img.shields.io/discord/1286348126797430814?label=Discord)](https://discord.gg/oumi) # oumi-ai/oumi-synthetic-claims **oumi-synthetic-claims** is a text dataset designed to fine-tune language models for **Claim Verification**. Prompts and responses were produced synthetically from **[Llama-3.1-405B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-405B-Instruct)**. **oumi-synthetic-claims** was used to train **[HallOumi-8B](https://huggingface.co/oumi-ai/HallOumi-8B)**, which achieves **77.2% Macro F1**, outperforming **SOTA models such as Claude Sonnet 3.5, OpenAI o1, etc.** - **Curated by:** [Oumi AI](https://oumi.ai/) using Oumi inference - **Language(s) (NLP):** English - **License:** [Llama 3.1 Community License](https://www.llama.com/llama3_1/license/) ## Uses  Use this dataset for supervised fine-tuning of LLMs for claim verification. Fine-tuning Walkthrough: https://oumi.ai/halloumi ## Out-of-Scope Use  This dataset is not well suited for producing generalized chat models. ## Dataset Structure  ``` { # Unique conversation identifier "conversation_id": str, # Data formatted to user + assistant turns in chat format # Example: [{'role': 'user', 'content': ...}, {'role': 'assistant', 'content': ...}] "messages": list[dict[str, str]], # Metadata for sample "metadata": dict[str, ...], } ``` ## Dataset Creation ### Curation Rationale  To enable the community to develop more reliable foundational models, we created this dataset for the purpose of training HallOumi. It was produced by running Oumi inference on Google Cloud. ### Source Data  The taxonomy used to produce our documents is outlined [here](https://docs.google.com/spreadsheets/d/1-Hvy-OyA_HMVNwLY_YRTibE33TpHsO-IeU7wVJ51h3Y). Documents were created synthetically using the following criteria: * Subject * Document Type * Information Richness #### Document Creation Prompt Example: ``` Create a document based on the following criteria: Subject: Crop Production - Focuses on the cultivation and harvesting of crops, including topics such as soil science, irrigation, fertilizers, and pest management. Document Type: News Article - 3-6 paragraphs reporting on news on a particular topic. Information Richness: Low - Document is fairly simple in construction and easy to understand, often discussing things at a high level and not getting too deep into technical details or specifics. Produce only the document and nothing else. Surround the document in <document> and </document> tags. Example: <document>This is a very short sentence.</document> ``` #### Response Prompt Example: ``` <document> ... </document> Make a claim that is supported/unsupported by the above document. ``` #### Data Collection and Processing  Responses were collected by running Oumi batch inference on Google Cloud. #### Personal and Sensitive Information  Data is not known or likely to contain any personal, sensitive, or private information. ## Bias, Risks, and Limitations  1. The source prompts are generated from Llama-3.1-405B-Instruct and may reflect any biases present in the model. 2. The responses produced will likely be reflective of any biases or limitations produced by Llama-3.1-405B-Instruct. ## Citation  **BibTeX:** ``` @misc{oumiSyntheticClaims, author = {Jeremiah Greer}, title = {Oumi Synthetic Claims}, month = {March}, year = {2025}, url = {https://huggingface.co/datasets/oumi-ai/oumi-synthetic-claims} } @software{oumi2025, author = {Oumi Community}, title = {Oumi: an Open, End-to-end Platform for Building Large Foundation Models}, month = {January}, year = {2025}, url = {https://github.com/oumi-ai/oumi} } ```

[![oumi logo](https://oumi.ai/logo_lockup_black.svg)](https://github.com/oumi-ai/oumi) [![基于Oumi构建](https://badgen.net/badge/Made%20with/Oumi/%23085CFF?icon=https%3A%2F%2Foumi.ai%2Flogo_dark.svg)](https://github.com/oumi-ai/oumi) [![文档](https://img.shields.io/badge/Documentation-oumi-blue.svg)](https://oumi.ai/docs/en/latest/index.html) [![博客](https://img.shields.io/badge/Blog-oumi-blue.svg)](https://oumi.ai/blog) [![Discord](https://img.shields.io/discord/1286348126797430814?label=Discord)](https://discord.gg/oumi) # oumi-ai/oumi-synthetic-claims **oumi-synthetic-claims** 是一款专为**声明验证（Claim Verification）**任务微调语言模型打造的文本数据集。数据集的提示词与回复均由 **[Llama-3.1-405B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-405B-Instruct)** 合成生成。本数据集曾用于训练 **[HallOumi-8B](https://huggingface.co/oumi-ai/HallOumi-8B)**，该模型的宏观F1值（Macro F1）可达77.2%，性能超越Claude Sonnet 3.5、OpenAI o1等当前最优模型（SOTA）。 - **数据整理方：** [Oumi AI](https://oumi.ai/) 依托Oumi推理引擎完成构建 - **自然语言处理（NLP）语言：** 英语 - **授权协议：** [Llama 3.1 社区许可协议](https://www.llama.com/llama3_1/license/) ## 数据集用途  本数据集可用于声明验证任务下大语言模型（LLMs）的监督微调。微调教程：https://oumi.ai/halloumi ## 不适用场景  本数据集不适用于构建通用对话模型。 ## 数据集结构  { # 唯一对话标识符 "conversation_id": 字符串类型, # 按照对话格式组织的用户与助手交互轮次数据 # 示例: [{'role': 'user', 'content': ...}, {'role': 'assistant', 'content': ...}] "messages": 字典字符串列表类型, # 样本元数据 "metadata": 任意键值字典类型, } ## 数据集构建 ### 整理逻辑  为助力社区开发更可靠的基础模型，我们构建本数据集用于训练HallOumi模型。数据集通过谷歌云平台运行Oumi推理引擎生成。 ### 源数据  本数据集文档所采用的分类体系详见 [此处](https://docs.google.com/spreadsheets/d/1-Hvy-OyA_HMVNwLY_YRTibE33TpHsO-IeU7wVJ51h3Y)。文档按照以下规则合成生成： * 主题 * 文档类型 * 信息丰富度 #### 文档生成提示词示例：根据以下规则生成一篇文档：主题：作物生产——聚焦作物种植与收获相关内容，涵盖土壤科学、灌溉、肥料及病虫害防治等话题。文档类型：新闻文章——3至6段，围绕特定主题报道新闻事件。信息丰富度：低——文档结构简洁易懂，通常仅进行高层次讨论，不深入技术细节或具体参数。仅输出文档内容，请勿附加其他信息。用<document>与</document>标签包裹文档。示例：<document>This is a very short sentence.</document> #### 回复生成提示词示例： <document> ... </document> 根据上述文档内容，生成一条符合或不符合文档支撑的声明。 #### 数据收集与处理  回复数据通过谷歌云平台运行Oumi批量推理引擎收集得到。 #### 个人与敏感信息  本数据集未包含，且大概率不会包含任何个人、敏感或隐私信息。 ## 偏差、风险与局限性  1. 源提示词由Llama-3.1-405B-Instruct生成，可能会反映该模型本身存在的各类偏差。 2. 生成的回复大概率会体现Llama-3.1-405B-Instruct所存在的偏差与局限性。 ## 引用声明  **BibTeX格式：** @misc{oumiSyntheticClaims, author = {Jeremiah Greer}, title = {Oumi Synthetic Claims}, month = {March}, year = {2025}, url = {https://huggingface.co/datasets/oumi-ai/oumi-synthetic-claims} } @software{oumi2025, author = {Oumi Community}, title = {Oumi: an Open, End-to-end Platform for Building Large Foundation Models}, month = {January}, year = {2025}, url = {https://github.com/oumi-ai/oumi} }

提供机构：

maas

创建时间：

2025-04-09

搜集汇总

数据集介绍

背景与挑战

背景概述

oumi-synthetic-claims是一个用于微调语言模型进行声明验证的文本数据集，由Llama-3.1-405B-Instruct生成，并用于训练性能优于SOTA模型的HallOumi-8B。该数据集由Oumi AI策划，语言为英语，采用Llama 3.1 Community License许可证，旨在通过监督微调提升模型在声明验证任务中的表现。

以上内容由遇见数据集搜集并总结生成