oumi-anli-subset
收藏魔搭社区2025-12-05 更新2025-04-12 收录
下载链接:
https://modelscope.cn/datasets/oumi-ai/oumi-anli-subset
下载链接
链接失效反馈官方服务:
资源简介:
[](https://github.com/oumi-ai/oumi)
[](https://github.com/oumi-ai/oumi)
[](https://oumi.ai/docs/en/latest/index.html)
[](https://oumi.ai/blog)
[](https://discord.gg/oumi)
# oumi-ai/oumi-anli-subset
**oumi-anli-subset** is a text dataset designed to fine-tune language models for **Claim Verification**.
Prompts were pulled from [ANLI](https://huggingface.co/datasets/facebook/anli) training sets with responses created from **[Llama-3.1-405B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-405B-Instruct)**.
**oumi-anli-subset** was used to train **[HallOumi-8B](https://huggingface.co/oumi-ai/HallOumi-8B)**, which achieves **77.2% Macro F1**, outperforming **SOTA models such as Claude Sonnet 3.5, OpenAI o1, etc.**
- **Curated by:** [Oumi AI](https://oumi.ai/) using Oumi inference
- **Language(s) (NLP):** English
- **License:** [CC-BY-NC-4.0](https://creativecommons.org/licenses/by-nc/4.0/deed.en), [Llama 3.1 Community License](https://www.llama.com/llama3_1/license/)
## Uses
<!-- This section describes suitable use cases for the dataset. -->
Use this dataset for supervised fine-tuning of LLMs for claim verification.
Fine-tuning Walkthrough: https://oumi.ai/halloumi
## Out-of-Scope Use
<!-- This section addresses misuse, malicious use, and uses that the dataset will not work well for. -->
This dataset is not well suited for producing generalized chat models.
## Dataset Structure
<!-- This section provides a description of the dataset fields, and additional information about the dataset structure such as criteria used to create the splits, relationships between data points, etc. -->
```
{
# Unique conversation identifier
"conversation_id": str,
# Data formatted to user + assistant turns in chat format
# Example: [{'role': 'user', 'content': ...}, {'role': 'assistant', 'content': ...}]
"messages": list[dict[str, str]],
# Metadata for sample
"metadata": dict[str, ...],
}
```
## Dataset Creation
### Curation Rationale
<!-- Motivation for the creation of this dataset. -->
To enable the community to develop more reliable foundational models, we created this dataset for the purpose of training HallOumi. It was produced by running Oumi inference on Google Cloud.
### Source Data
<!-- This section describes the source data (e.g. news text and headlines, social media posts, translated sentences, ...). -->
Queries were sourced from [ANLI](https://huggingface.co/datasets/facebook/anli).
#### Data Collection and Processing
<!-- This section describes the data collection and processing process such as data selection criteria, filtering and normalization methods, tools and libraries used, etc. -->
Responses were collected by running Oumi batch inference on Google Cloud.
#### Personal and Sensitive Information
<!-- State whether the dataset contains data that might be considered personal, sensitive, or private (e.g., data that reveals addresses, uniquely identifiable names or aliases, racial or ethnic origins, sexual orientations, religious beliefs, political opinions, financial or health data, etc.). If efforts were made to anonymize the data, describe the anonymization process. -->
Data is not known or likely to contain any personal, sensitive, or private information.
## Bias, Risks, and Limitations
<!-- This section is meant to convey both technical and sociotechnical limitations. -->
1. The source prompts are from [ANLI](https://huggingface.co/datasets/facebook/anli) and may reflect any biases in their data collection process.
2. The responses produced will likely be reflective of any biases or limitations produced by Llama-3.1-405B-Instruct.
## Citation
<!-- If there is a paper or blog post introducing the dataset, the APA and Bibtex information for that should go in this section. -->
**BibTeX:**
```
@misc{oumiANLISubset,
author = {Jeremiah Greer},
title = {Oumi ANLI Subset},
month = {March},
year = {2025},
url = {https://huggingface.co/datasets/oumi-ai/oumi-anli-subset}
}
@software{oumi2025,
author = {Oumi Community},
title = {Oumi: an Open, End-to-end Platform for Building Large Foundation Models},
month = {January},
year = {2025},
url = {https://github.com/oumi-ai/oumi}
}
```
[](https://github.com/oumi-ai/oumi)
[](https://github.com/oumi-ai/oumi)
[](https://oumi.ai/docs/en/latest/index.html)
[](https://oumi.ai/blog)
[](https://discord.gg/oumi)
# oumi-ai/oumi-anli-subset
**oumi-anli-subset** 是一款专为**主张验证(Claim Verification)**任务设计的文本数据集,可用于大语言模型的微调工作。该数据集的提示词源自[ANLI](https://huggingface.co/datasets/facebook/anli)训练集,回复则由**[Llama-3.1-405B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-405B-Instruct)**生成。本数据集曾用于训练**[HallOumi-8B](https://huggingface.co/oumi-ai/HallOumi-8B)**,该模型宏F1值(Macro F1)可达77.2%,性能超越Claude Sonnet 3.5、OpenAI o1等当前最优(State-of-the-Art,SOTA)模型。
- **数据整理方:** [Oumi AI](https://oumi.ai/),采用Oumi推理框架完成整理
- **自然语言处理所用语言:** 英语
- **授权协议:** [CC-BY-NC-4.0](https://creativecommons.org/licenses/by-nc/4.0/deed.en)、[Llama 3.1 社区许可协议](https://www.llama.com/llama3_1/license/)
## 数据集用途
本数据集适用于面向主张验证任务的大语言模型(Large Language Model,LLM)监督微调。
微调教程:https://oumi.ai/halloumi
## 不适用场景
本数据集不适用于构建通用对话模型。
## 数据集结构
该数据集的字段格式说明如下:
{
# 唯一对话标识符
"conversation_id": str,
# 按用户与助手对话轮次格式组织的数据
# 示例:[{'role': 'user', 'content': ...}, {'role': 'assistant', 'content': ...}]
"messages": list[dict[str, str]],
# 样本元数据
"metadata": dict[str, ...],
}
## 数据集构建
### 整理初衷
为推动社区开发更可靠的基础模型,我们创建本数据集以用于训练HallOumi模型。数据集通过谷歌云平台运行Oumi推理生成。
### 源数据
查询样本取自[ANLI](https://huggingface.co/datasets/facebook/anli)数据集。
#### 数据收集与处理
通过谷歌云平台运行Oumi批量推理生成回复样本。
#### 个人与敏感信息
经核查,本数据集未包含任何已知或疑似的个人、敏感或隐私信息。
## 偏差、风险与局限性
1. 源提示词取自[ANLI](https://huggingface.co/datasets/facebook/anli)数据集,可能反映其数据收集过程中存在的各类偏差。
2. 生成的回复大概率会体现Llama-3.1-405B-Instruct模型本身存在的偏差与局限性。
## 引用信息
**BibTeX格式:**
@misc{oumiANLISubset,
author = {Jeremiah Greer},
title = {Oumi ANLI Subset},
month = {March},
year = {2025},
url = {https://huggingface.co/datasets/oumi-ai/oumi-anli-subset}
}
@software{oumi2025,
author = {Oumi Community},
title = {Oumi: an Open, End-to-end Platform for Building Large Foundation Models},
month = {January},
year = {2025},
url = {https://github.com/oumi-ai/oumi}
}
提供机构:
maas
创建时间:
2025-04-09



