oumi-synthetic-claims
收藏魔搭社区2025-12-05 更新2025-04-12 收录
下载链接:
https://modelscope.cn/datasets/oumi-ai/oumi-synthetic-claims
下载链接
链接失效反馈官方服务:
资源简介:
[](https://github.com/oumi-ai/oumi)
[](https://github.com/oumi-ai/oumi)
[](https://oumi.ai/docs/en/latest/index.html)
[](https://oumi.ai/blog)
[](https://discord.gg/oumi)
# oumi-ai/oumi-synthetic-claims
**oumi-synthetic-claims** is a text dataset designed to fine-tune language models for **Claim Verification**.
Prompts and responses were produced synthetically from **[Llama-3.1-405B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-405B-Instruct)**.
**oumi-synthetic-claims** was used to train **[HallOumi-8B](https://huggingface.co/oumi-ai/HallOumi-8B)**, which achieves **77.2% Macro F1**, outperforming **SOTA models such as Claude Sonnet 3.5, OpenAI o1, etc.**
- **Curated by:** [Oumi AI](https://oumi.ai/) using Oumi inference
- **Language(s) (NLP):** English
- **License:** [Llama 3.1 Community License](https://www.llama.com/llama3_1/license/)
## Uses
<!-- This section describes suitable use cases for the dataset. -->
Use this dataset for supervised fine-tuning of LLMs for claim verification.
Fine-tuning Walkthrough: https://oumi.ai/halloumi
## Out-of-Scope Use
<!-- This section addresses misuse, malicious use, and uses that the dataset will not work well for. -->
This dataset is not well suited for producing generalized chat models.
## Dataset Structure
<!-- This section provides a description of the dataset fields, and additional information about the dataset structure such as criteria used to create the splits, relationships between data points, etc. -->
```
{
# Unique conversation identifier
"conversation_id": str,
# Data formatted to user + assistant turns in chat format
# Example: [{'role': 'user', 'content': ...}, {'role': 'assistant', 'content': ...}]
"messages": list[dict[str, str]],
# Metadata for sample
"metadata": dict[str, ...],
}
```
## Dataset Creation
### Curation Rationale
<!-- Motivation for the creation of this dataset. -->
To enable the community to develop more reliable foundational models, we created this dataset for the purpose of training HallOumi. It was produced by running Oumi inference on Google Cloud.
### Source Data
<!-- This section describes the source data (e.g. news text and headlines, social media posts, translated sentences, ...). -->
The taxonomy used to produce our documents is outlined [here](https://docs.google.com/spreadsheets/d/1-Hvy-OyA_HMVNwLY_YRTibE33TpHsO-IeU7wVJ51h3Y).
Documents were created synthetically using the following criteria:
* Subject
* Document Type
* Information Richness
#### Document Creation Prompt Example:
```
Create a document based on the following criteria:
Subject: Crop Production - Focuses on the cultivation and harvesting of crops, including topics such as soil science, irrigation, fertilizers, and pest management.
Document Type: News Article - 3-6 paragraphs reporting on news on a particular topic.
Information Richness: Low - Document is fairly simple in construction and easy to understand, often discussing things at a high level and not getting too deep into technical details or specifics.
Produce only the document and nothing else. Surround the document in <document> and </document> tags.
Example: <document>This is a very short sentence.</document>
```
#### Response Prompt Example:
```
<document>
...
</document>
Make a claim that is supported/unsupported by the above document.
```
#### Data Collection and Processing
<!-- This section describes the data collection and processing process such as data selection criteria, filtering and normalization methods, tools and libraries used, etc. -->
Responses were collected by running Oumi batch inference on Google Cloud.
#### Personal and Sensitive Information
<!-- State whether the dataset contains data that might be considered personal, sensitive, or private (e.g., data that reveals addresses, uniquely identifiable names or aliases, racial or ethnic origins, sexual orientations, religious beliefs, political opinions, financial or health data, etc.). If efforts were made to anonymize the data, describe the anonymization process. -->
Data is not known or likely to contain any personal, sensitive, or private information.
## Bias, Risks, and Limitations
<!-- This section is meant to convey both technical and sociotechnical limitations. -->
1. The source prompts are generated from Llama-3.1-405B-Instruct and may reflect any biases present in the model.
2. The responses produced will likely be reflective of any biases or limitations produced by Llama-3.1-405B-Instruct.
## Citation
<!-- If there is a paper or blog post introducing the dataset, the APA and Bibtex information for that should go in this section. -->
**BibTeX:**
```
@misc{oumiSyntheticClaims,
author = {Jeremiah Greer},
title = {Oumi Synthetic Claims},
month = {March},
year = {2025},
url = {https://huggingface.co/datasets/oumi-ai/oumi-synthetic-claims}
}
@software{oumi2025,
author = {Oumi Community},
title = {Oumi: an Open, End-to-end Platform for Building Large Foundation Models},
month = {January},
year = {2025},
url = {https://github.com/oumi-ai/oumi}
}
```
[](https://github.com/oumi-ai/oumi)
[](https://github.com/oumi-ai/oumi)
[](https://oumi.ai/docs/en/latest/index.html)
[](https://oumi.ai/blog)
[](https://discord.gg/oumi)
# oumi-ai/oumi-synthetic-claims
**oumi-synthetic-claims** 是一款专为**声明验证(Claim Verification)**任务微调语言模型打造的文本数据集。数据集的提示词与回复均由 **[Llama-3.1-405B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-405B-Instruct)** 合成生成。本数据集曾用于训练 **[HallOumi-8B](https://huggingface.co/oumi-ai/HallOumi-8B)**,该模型的宏观F1值(Macro F1)可达77.2%,性能超越Claude Sonnet 3.5、OpenAI o1等当前最优模型(SOTA)。
- **数据整理方:** [Oumi AI](https://oumi.ai/) 依托Oumi推理引擎完成构建
- **自然语言处理(NLP)语言:** 英语
- **授权协议:** [Llama 3.1 社区许可协议](https://www.llama.com/llama3_1/license/)
## 数据集用途
<!-- 本章节描述本数据集的适用场景。 -->
本数据集可用于声明验证任务下大语言模型(LLMs)的监督微调。
微调教程:https://oumi.ai/halloumi
## 不适用场景
<!-- 本章节说明数据集的误用、恶意使用场景,以及无法良好适配的应用场景。 -->
本数据集不适用于构建通用对话模型。
## 数据集结构
<!-- 本章节描述数据集的字段信息,以及数据集划分标准、数据点间关联关系等额外结构细节。 -->
{
# 唯一对话标识符
"conversation_id": 字符串类型,
# 按照对话格式组织的用户与助手交互轮次数据
# 示例: [{'role': 'user', 'content': ...}, {'role': 'assistant', 'content': ...}]
"messages": 字典字符串列表类型,
# 样本元数据
"metadata": 任意键值字典类型,
}
## 数据集构建
### 整理逻辑
<!-- 说明数据集构建的初衷。 -->
为助力社区开发更可靠的基础模型,我们构建本数据集用于训练HallOumi模型。数据集通过谷歌云平台运行Oumi推理引擎生成。
### 源数据
<!-- 本章节说明数据集的源数据来源,例如新闻文本与标题、社交媒体帖文、译句等。 -->
本数据集文档所采用的分类体系详见 [此处](https://docs.google.com/spreadsheets/d/1-Hvy-OyA_HMVNwLY_YRTibE33TpHsO-IeU7wVJ51h3Y)。
文档按照以下规则合成生成:
* 主题
* 文档类型
* 信息丰富度
#### 文档生成提示词示例:
根据以下规则生成一篇文档:
主题:作物生产——聚焦作物种植与收获相关内容,涵盖土壤科学、灌溉、肥料及病虫害防治等话题。
文档类型:新闻文章——3至6段,围绕特定主题报道新闻事件。
信息丰富度:低——文档结构简洁易懂,通常仅进行高层次讨论,不深入技术细节或具体参数。
仅输出文档内容,请勿附加其他信息。用<document>与</document>标签包裹文档。
示例:<document>This is a very short sentence.</document>
#### 回复生成提示词示例:
<document>
...
</document>
根据上述文档内容,生成一条符合或不符合文档支撑的声明。
#### 数据收集与处理
<!-- 本章节说明数据收集与处理流程,例如数据选择标准、过滤与归一化方法、所用工具与库等。 -->
回复数据通过谷歌云平台运行Oumi批量推理引擎收集得到。
#### 个人与敏感信息
<!-- 说明数据集是否包含可被视为个人、敏感或隐私的数据(例如:泄露地址、唯一可识别姓名或别名、种族或族裔出身、性取向、宗教信仰、政治观点、财务或健康数据等)。若已采取匿名化处理,请描述匿名化流程。 -->
本数据集未包含,且大概率不会包含任何个人、敏感或隐私信息。
## 偏差、风险与局限性
<!-- 本章节旨在说明数据集的技术与社会技术层面局限性。 -->
1. 源提示词由Llama-3.1-405B-Instruct生成,可能会反映该模型本身存在的各类偏差。
2. 生成的回复大概率会体现Llama-3.1-405B-Instruct所存在的偏差与局限性。
## 引用声明
<!-- 若有介绍本数据集的论文或博客文章,请在此处提供其APA与Bibtex引用格式信息。 -->
**BibTeX格式:**
@misc{oumiSyntheticClaims,
author = {Jeremiah Greer},
title = {Oumi Synthetic Claims},
month = {March},
year = {2025},
url = {https://huggingface.co/datasets/oumi-ai/oumi-synthetic-claims}
}
@software{oumi2025,
author = {Oumi Community},
title = {Oumi: an Open, End-to-end Platform for Building Large Foundation Models},
month = {January},
year = {2025},
url = {https://github.com/oumi-ai/oumi}
}
提供机构:
maas
创建时间:
2025-04-09



