oumi-synthetic-document-claims
收藏魔搭社区2025-10-09 更新2025-04-12 收录
下载链接:
https://modelscope.cn/datasets/oumi-ai/oumi-synthetic-document-claims
下载链接
链接失效反馈官方服务:
资源简介:
[](https://github.com/oumi-ai/oumi)
[](https://github.com/oumi-ai/oumi)
[](https://oumi.ai/docs/en/latest/index.html)
[](https://oumi.ai/blog)
[](https://discord.gg/oumi)
# oumi-ai/oumi-synthetic-document-claims
**oumi-synthetic-document-claims** is a text dataset designed to fine-tune language models for **Claim Verification**.
Prompts and responses were produced synthetically from **[Llama-3.1-405B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-405B-Instruct)**.
**oumi-synthetic-document-claims** was used to train **[HallOumi-8B](https://huggingface.co/oumi-ai/HallOumi-8B)**, which achieves **77.2% Macro F1**, outperforming **SOTA models such as Claude Sonnet 3.5, OpenAI o1, etc.**
- **Curated by:** [Oumi AI](https://oumi.ai/) using Oumi inference
- **Language(s) (NLP):** English
- **License:** [Llama 3.1 Community License](https://www.llama.com/llama3_1/license/)
## Uses
Use this dataset for supervised fine-tuning of LLMs for claim verification.
Fine-tuning Walkthrough: https://oumi.ai/halloumi
## Out-of-Scope Use
This dataset is not well suited for producing generalized chat models.
## Dataset Structure
```
{
# Unique conversation identifier
"conversation_id": str,
# Data formatted to user + assistant turns in chat format
# Example: [{'role': 'user', 'content': ...}, {'role': 'assistant', 'content': ...}]
"messages": list[dict[str, str]],
# Metadata for sample
"metadata": dict[str, ...],
}
```
## Dataset Creation
### Curation Rationale
To enable the community to develop more reliable foundational models, we created this dataset for the purpose of training HallOumi. It was produced by running Oumi inference on Google Cloud.
### Source Data
The taxonomy used to produce our documents is outlined [here](https://docs.google.com/spreadsheets/d/1-Hvy-OyA_HMVNwLY_YRTibE33TpHsO-IeU7wVJ51h3Y).
#### Document Creation:
Documents were created synthetically using the following criteria:
* Subject
* Document Type
* Information Richness
Example prompt:
```
Create a document based on the following criteria:
Subject: Crop Production - Focuses on the cultivation and harvesting of crops, including topics such as soil science, irrigation, fertilizers, and pest management.
Document Type: News Article - 3-6 paragraphs reporting on news on a particular topic.
Information Richness: Low - Document is fairly simple in construction and easy to understand, often discussing things at a high level and not getting too deep into technical details or specifics.
Produce only the document and nothing else. Surround the document in and tags.
Example: This is a very short sentence.
```
#### Request Creation
Requests were randomly assigned to one of a few types:
* Summarization (concise)
* Summarization (constrained)
* Summarization (full)
* Summarization (stylized)
* QA (irrelevant)
* QA (missing answer)
* QA (conflicting answer)
* QA (complete answer)
* QA (partial answer)
* QA (inferrable answer)
Example prompt:
```
Entrepreneurship 101: Turning Your Idea into a Reality
Starting a business can be a daunting task, especially for those who are new to the world of entrepreneurship. However, with the right mindset and a solid understanding of the basics, anyone can turn their idea into a successful venture. In this post, we'll cover the key steps to take when starting a business, from ideation to funding and beyond.
It all begins with an idea. Maybe you've identified a problem in your community that you'd like to solve, or perhaps you have a passion that you'd like to turn into a career. Whatever your idea may be, it's essential to take the time to refine it and make sure it's viable. Ask yourself questions like "Who is my target audience?" and "What sets my product or service apart from the competition?" Once you have a solid idea, it's time to start thinking about funding.
There are several options when it comes to funding a business, including bootstrapping, venture capital, and angel investors. Bootstrapping involves using your own savings or revenue to fund your business, while venture capital and angel investors involve seeking out external funding from investors. Each option has its pros and cons, and it's crucial to choose the one that best fits your business needs. For example, bootstrapping allows you to maintain control over your business, but it can also limit your growth potential. On the other hand, seeking out external funding can provide the resources you need to scale quickly, but it may require you to give up some equity.
Once you've secured funding, it's time to start thinking about strategies for success. This includes things like building a strong team, developing a marketing plan, and creating a solid business model. It's also essential to be flexible and adapt to changes in the market or unexpected setbacks. By staying focused and committed to your vision, you can overcome obstacles and build a successful business.
In conclusion, starting a business requires careful planning, hard work, and dedication. By refining your idea, securing funding, and developing strategies for success, you can turn your passion into a reality and achieve your entrepreneurial goals.
Create a request for the above document based on the following criteria:
Task: Summarization - Concise - Create a request for a summary that is short and concise (1-2 sentences).
Produce only the request and nothing else. Surround the request in and tags.
Example: This is a request.
```
#### Response
Response creation is straightforward, as it’s effectively just combining the context and request and sending this as an actual request to an LLM.
Prompt example:
```
Entrepreneurship 101: Turning Your Idea into a Reality
Starting a business can be a daunting task, especially for those who are new to the world of entrepreneurship. However, with the right mindset and a solid understanding of the basics, anyone can turn their idea into a successful venture. In this post, we'll cover the key steps to take when starting a business, from ideation to funding and beyond.
It all begins with an idea. Maybe you've identified a problem in your community that you'd like to solve, or perhaps you have a passion that you'd like to turn into a career. Whatever your idea may be, it's essential to take the time to refine it and make sure it's viable. Ask yourself questions like "Who is my target audience?" and "What sets my product or service apart from the competition?" Once you have a solid idea, it's time to start thinking about funding.
There are several options when it comes to funding a business, including bootstrapping, venture capital, and angel investors. Bootstrapping involves using your own savings or revenue to fund your business, while venture capital and angel investors involve seeking out external funding from investors. Each option has its pros and cons, and it's crucial to choose the one that best fits your business needs. For example, bootstrapping allows you to maintain control over your business, but it can also limit your growth potential. On the other hand, seeking out external funding can provide the resources you need to scale quickly, but it may require you to give up some equity.
Once you've secured funding, it's time to start thinking about strategies for success. This includes things like building a strong team, developing a marketing plan, and creating a solid business model. It's also essential to be flexible and adapt to changes in the market or unexpected setbacks. By staying focused and committed to your vision, you can overcome obstacles and build a successful business.
In conclusion, starting a business requires careful planning, hard work, and dedication. By refining your idea, securing funding, and developing strategies for success, you can turn your passion into a reality and achieve your entrepreneurial goals.
Summarize the document "Entrepreneurship 101: Turning Your Idea into a Reality" in 1-2 concise sentences, focusing on the main steps to start a business.
Only use information available in the document in your response.
```
#### Citation Generation
To generate responses from Llama 405B, we ran inference with HallOumi and the following 1-shot example through Llama 3.1 405B Instruct on GCP.
Example code:
```
INSTRUCTIONS = """You are an expert AI assistant, and your task is to analyze a provided answer and identify the relevant lines associated with the answer.
You will be given a context, a request, and a response, with each claim of the response separated by tags.
You must output the identifiers of the relevant sentences from the context, an explanation of what those relevant sentences indicate about the claim and whether or not it's supported, as well as a final value of or based on whether or not the claim is supported.
Note that claims which are unsupported likely still have relevant lines indicating that the claim is not supported (due to conflicting or missing information in an appropriate area)."""
EXAMPLE_REQUEST = "Make one or more claims about information in the documents."
EXAMPLE_CONTEXT = """Ipswich manager Mick McCarthy: "The irony was that poor old Alex Smithies cost them the second goal which set us up to win as comprehensively as we did.>He then kept it from being an embarrassing scoreline, but I'll take three.>"""
EXAMPLE_RESPONSE = """"""
EXAMPLE_OUTPUT = """"""
messages = [
{'role': 'system', 'content': INSTRUCTIONS},
{'role': 'user', 'content': f"{EXAMPLE_CONTEXT}{EXAMPLE_RESPONSE}
{'role': 'assistant', 'content': EXAMPLE_OUTPUT,
]
```
To annotate sentences with their appropriate sentence and response tags, we utilized [wtpsplit](https://github.com/segment-any-text/wtpsplit) to split the sentences, removed any empty elements in the split, and annotated them in-order from beginning to end.
After running inference, we performed some basic sanitation on the outputs to ensure outputs were consistent:
* Remove text before and after the final
* Ensure that all responses have the same number of claims (split by ) that they started with
* Remove newlines & start/end whitespace
* Ensure that , , and either or were present in every output claim.
Any samples which did not meet these criteria were generally removed from the data.
#### Data Collection and Processing
Responses were collected by running Oumi batch inference on Google Cloud.
#### Personal and Sensitive Information
Data is not known or likely to contain any personal, sensitive, or private information.
## Bias, Risks, and Limitations
1. The source prompts are generated from Llama-3.1-405B-Instruct and may reflect any biases present in the model.
2. The responses produced will likely be reflective of any biases or limitations produced by Llama-3.1-405B-Instruct.
## Citation
**BibTeX:**
```
@misc{oumiSyntheticDocumentClaims,
author = {Jeremiah Greer},
title = {Oumi Synthetic Document Claims},
month = {March},
year = {2025},
url = {https://huggingface.co/datasets/oumi-ai/oumi-synthetic-document-claims}
}
@software{oumi2025,
author = {Oumi Community},
title = {Oumi: an Open, End-to-end Platform for Building Large Foundation Models},
month = {January},
year = {2025},
url = {https://github.com/oumi-ai/oumi}
}
```
[](https://github.com/oumi-ai/oumi)
[](https://github.com/oumi-ai/oumi)
[](https://oumi.ai/docs/en/latest/index.html)
[](https://oumi.ai/blog)
[](https://discord.gg/oumi)
# oumi-ai/oumi-synthetic-document-claims
**oumi-synthetic-document-claims** 是一款用于**事实核查(Claim Verification)** 任务的大语言模型(Large Language Model, LLM)微调文本数据集。其提示词与响应均由**Llama-3.1-405B-Instruct** 合成生成。该数据集被用于训练**HallOumi-8B**,该模型实现了77.2%的宏F1值,性能优于当前顶尖(State-of-the-art, SOTA)模型,例如Claude Sonnet 3.5、OpenAI o1等。
- **数据整理:** [Oumi AI](https://oumi.ai/),基于Oumi推理框架完成
- **自然语言处理语言:** 英语
- **许可证:** [Llama 3.1社区许可证](https://www.llama.com/llama3_1/license/)
## 用途
将本数据集用于事实核查任务下的大语言模型监督微调。
微调教程:https://oumi.ai/halloumi
## 超出适用范围的用途
本数据集不适用于构建通用对话模型。
## 数据集结构
{
# 唯一对话标识符
"conversation_id": str,
# 采用对话格式的用户与助手轮次数据
# 示例: [{'role': 'user', 'content': ...}, {'role': 'assistant', 'content': ...}]
"messages": list[dict[str, str]],
# 样本元数据
"metadata": dict[str, ...],
}
## 数据集构建
### 整理依据
为助力社区开发更可靠的基础模型,我们构建本数据集以用于训练HallOumi。本数据集通过谷歌云平台运行Oumi推理生成。
### 源数据
用于生成文档的分类体系详见[此处](https://docs.google.com/spreadsheets/d/1-Hvy-OyA_HMVNwLY_YRTibE33TpHsO-IeU7wVJ51h3Y)。
#### 文档创建
文档基于以下准则合成生成:
* 主题
* 文档类型
* 信息丰富度
示例提示词:
基于以下准则创建一篇文档:
主题:作物生产——聚焦作物的栽培与收获,涵盖土壤科学、灌溉、肥料及病虫害管理等相关议题。
文档类型:新闻文章——3至6段的特定主题新闻报道。
信息丰富度:低——文档结构简洁易懂,通常仅进行宏观层面的讨论,不深入技术细节或具体参数。
仅输出文档内容,无需额外信息。将文档包裹在<start_of_document>和<end_of_document>标签中。
示例:这是一个非常简短的句子。
#### 请求创建
请求被随机分配至以下若干类型之一:
* 简洁式摘要
* 约束式摘要
* 完整式摘要
* 风格化摘要
* 无关问答
* 无答案问答
* 冲突答案问答
* 完整答案问答
* 部分答案问答
* 可推导答案问答
示例提示词:
创业入门:将想法转化为现实
创业对于新手而言往往充满挑战,但只要具备正确的心态与扎实的基础知识,任何人都能将想法转化为成功的事业。本文将介绍创业的关键步骤,从构思到融资及后续环节。
一切始于一个想法。你可能发现了社区中亟待解决的问题,或是拥有想要转化为职业的热爱之事。无论你的想法是什么,都需要花时间打磨并确认其可行性。可以自问:「我的目标受众是谁?」「我的产品或服务与竞品有何区别?」敲定可行的想法后,就该考虑融资了。
创业融资有多种选择,包括自筹资金、风险投资和天使投资。自筹资金指使用个人储蓄或营收支持创业,而风险投资与天使投资则是寻求外部投资者的资金支持。每种方式都有其利弊,选择最贴合业务需求的方案至关重要。例如,自筹资金可保留对业务的控制权,但可能限制增长潜力;反之,寻求外部投资可提供快速扩张所需的资源,但可能需要让渡部分股权。
敲定融资后,就该思考成功策略,包括组建强大的团队、制定营销计划及构建稳健的商业模式。同时,保持灵活性以适应市场变化或意外挫折也至关重要。只要专注并坚守愿景,就能克服障碍,构建成功的事业。
总而言之,创业需要周密的规划、辛勤的付出与 dedication。通过打磨想法、敲定融资及制定成功策略,你就能将热爱转化为现实,达成创业目标。
基于上述文档,按照以下准则创建请求:
任务:简洁式摘要——创建一个简短(1-2句话)的摘要请求。
仅输出请求内容,无需额外信息。将请求包裹在<start_of_request>和<end_of_request>标签中。
示例:这是一个请求。
#### 响应生成
响应生成流程较为简便,本质上是将上下文与请求拼接后,作为实际请求发送至大语言模型。
提示词示例:
创业入门:将想法转化为现实
创业对于新手而言往往充满挑战,但只要具备正确的心态与扎实的基础知识,任何人都能将想法转化为成功的事业。本文将介绍创业的关键步骤,从构思到融资及后续环节。
一切始于一个想法。你可能发现了社区中亟待解决的问题,或是拥有想要转化为职业的热爱之事。无论你的想法是什么,都需要花时间打磨并确认其可行性。可以自问:「我的目标受众是谁?」「我的产品或服务与竞品有何区别?」敲定可行的想法后,就该考虑融资了。
创业融资有多种选择,包括自筹资金、风险投资和天使投资。自筹资金指使用个人储蓄或营收支持创业,而风险投资与天使投资则是寻求外部投资者的资金支持。每种方式都有其利弊,选择最贴合业务需求的方案至关重要。例如,自筹资金可保留对业务的控制权,但可能限制增长潜力;反之,寻求外部投资可提供快速扩张所需的资源,但可能需要让渡部分股权。
敲定融资后,就该思考成功策略,包括组建强大的团队、制定营销计划及构建稳健的商业模式。同时,保持灵活性以适应市场变化或意外挫折也至关重要。只要专注并坚守愿景,就能克服障碍,构建成功的事业。
总而言之,创业需要周密的规划、辛勤的付出与 dedication。通过打磨想法、敲定融资及制定成功策略,你就能将热爱转化为现实,达成创业目标。
请基于1-2句简洁的语句总结「创业入门:将想法转化为现实」这篇文档,聚焦创业的核心步骤。仅使用文档中提供的信息生成响应。
#### 引用生成
为从Llama-3.1-405B-Instruct生成响应,我们在谷歌云平台上通过HallOumi及以下单样本示例调用该模型进行推理。
示例代码:
INSTRUCTIONS = """您是一名专业AI助手,任务为分析给定答案并识别与该答案相关的语句行。
您将获得上下文、请求及响应,响应中的每个声明均通过<claim>与</claim>标签分隔。
您必须输出上下文中相关语句的标识符、对这些相关语句的解释,以及这些语句是否支持该声明,最后根据声明是否被支持输出<supported>或<refuted>标签。
请注意,即使声明未被支持,通常也会存在相关语句表明该声明不被支持(因对应区域存在冲突或缺失信息)。"""
EXAMPLE_REQUEST = "针对文档中的信息提出一项或多项声明。"
EXAMPLE_CONTEXT = """伊普斯维奇主教练米克·麦卡锡:「具有讽刺意味的是,可怜的亚历克斯·史密斯斯丢掉了第二球,这让我们得以全面获胜。>他随后阻止了比分变得难堪,但我还是拿下了三分。>"""
EXAMPLE_RESPONSE = """"
EXAMPLE_OUTPUT = """"
messages = [
{'role': 'system', 'content': INSTRUCTIONS},
{'role': 'user', 'content': f"{EXAMPLE_CONTEXT}{EXAMPLE_RESPONSE}"},
{'role': 'assistant', 'content': EXAMPLE_OUTPUT},
]
为使用恰当的语句和响应标签标注句子,我们使用[wtpsplit](https://github.com/segment-any-text/wtpsplit)进行分句,移除拆分后的空元素,并按从前往后的顺序进行标注。推理完成后,我们对输出进行了基础清洗以确保一致性:
* 移除<start_of_response>之前及最终<end_of_response>之后的文本
* 确保所有响应的声明数量(按<claim>标签拆分)与初始数量一致
* 移除换行符及首尾空白字符
* 确保每个输出声明中均包含<claim>、</claim>以及<supported>或<refuted>标签。
未满足上述准则的样本将从数据集中移除。
#### 数据收集与处理
通过谷歌云平台运行Oumi批量推理收集响应数据。
#### 个人与敏感信息
本数据集未包含且大概率不存在任何个人、敏感或隐私信息。
## 偏差、风险与局限性
1. 源提示词由Llama-3.1-405B-Instruct生成,可能反映该模型存在的所有偏差。
2. 生成的响应大概率会反映Llama-3.1-405B-Instruct存在的偏差或局限性。
## 引用
**BibTeX格式引用:**
@misc{oumiSyntheticDocumentClaims,
author = {Jeremiah Greer},
title = {Oumi Synthetic Document Claims},
month = {March},
year = {2025},
url = {https://huggingface.co/datasets/oumi-ai/oumi-synthetic-document-claims}
}
@software{oumi2025,
author = {Oumi Community},
title = {Oumi: an Open, End-to-end Platform for Building Large Foundation Models},
month = {January},
year = {2025},
url = {https://github.com/oumi-ai/oumi}
}
提供机构:
maas
创建时间:
2025-04-09



