five

oumi-c2d-d2c-subset

收藏
魔搭社区2025-12-05 更新2025-04-12 收录
下载链接:
https://modelscope.cn/datasets/oumi-ai/oumi-c2d-d2c-subset
下载链接
链接失效反馈
官方服务:
资源简介:
[![oumi logo](https://oumi.ai/logo_lockup_black.svg)](https://github.com/oumi-ai/oumi) [![Made with Oumi](https://badgen.net/badge/Made%20with/Oumi/%23085CFF?icon=https%3A%2F%2Foumi.ai%2Flogo_dark.svg)](https://github.com/oumi-ai/oumi) [![Documentation](https://img.shields.io/badge/Documentation-oumi-blue.svg)](https://oumi.ai/docs/en/latest/index.html) [![Blog](https://img.shields.io/badge/Blog-oumi-blue.svg)](https://oumi.ai/blog) [![Discord](https://img.shields.io/discord/1286348126797430814?label=Discord)](https://discord.gg/oumi) # oumi-ai/oumi-c2d-d2c-subset **oumi-c2d-d2c-subset** is a text dataset designed to fine-tune language models for **Claim Verification**. Prompts were pulled from [C2D-and-D2C-MiniCheck](https://huggingface.co/datasets/lytang/C2D-and-D2C-MiniCheck) training sets with responses created from **[Llama-3.1-405B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-405B-Instruct)**. **oumi-c2d-d2c-subset** was used to train **[HallOumi-8B](https://huggingface.co/oumi-ai/HallOumi-8B)**, which achieves **77.2% Macro F1**, outperforming **SOTA models such as Claude Sonnet 3.5, OpenAI o1, etc.** - **Curated by:** [Oumi AI](https://oumi.ai/) using Oumi inference - **Language(s) (NLP):** English - **License:** [Llama 3.1 Community License](https://www.llama.com/llama3_1/license/) ## Uses <!-- This section describes suitable use cases for the dataset. --> Use this dataset for supervised fine-tuning of LLMs for claim verification. Fine-tuning Walkthrough: https://oumi.ai/halloumi ## Out-of-Scope Use <!-- This section addresses misuse, malicious use, and uses that the dataset will not work well for. --> This dataset is not well suited for producing generalized chat models. ## Dataset Structure <!-- This section provides a description of the dataset fields, and additional information about the dataset structure such as criteria used to create the splits, relationships between data points, etc. --> ``` { # Unique conversation identifier "conversation_id": str, # Data formatted to user + assistant turns in chat format # Example: [{'role': 'user', 'content': ...}, {'role': 'assistant', 'content': ...}] "messages": list[dict[str, str]], # Metadata for sample "metadata": dict[str, ...], } ``` ## Dataset Creation ### Curation Rationale <!-- Motivation for the creation of this dataset. --> To enable the community to develop more reliable foundational models, we created this dataset for the purpose of training HallOumi. It was produced by running Oumi inference on Google Cloud. ### Source Data <!-- This section describes the source data (e.g. news text and headlines, social media posts, translated sentences, ...). --> Queries were sourced from [C2D-and-D2C-MiniCheck](https://huggingface.co/datasets/lytang/C2D-and-D2C-MiniCheck). #### Data Collection and Processing <!-- This section describes the data collection and processing process such as data selection criteria, filtering and normalization methods, tools and libraries used, etc. --> Responses were collected by running Oumi batch inference on Google Cloud. #### Personal and Sensitive Information <!-- State whether the dataset contains data that might be considered personal, sensitive, or private (e.g., data that reveals addresses, uniquely identifiable names or aliases, racial or ethnic origins, sexual orientations, religious beliefs, political opinions, financial or health data, etc.). If efforts were made to anonymize the data, describe the anonymization process. --> Data is not known or likely to contain any personal, sensitive, or private information. ## Bias, Risks, and Limitations <!-- This section is meant to convey both technical and sociotechnical limitations. --> 1. The source prompts are from [C2D-and-D2C-MiniCheck](https://huggingface.co/datasets/lytang/C2D-and-D2C-MiniCheck) and may reflect any biases in their data collection process. 2. The responses produced will likely be reflective of any biases or limitations produced by Llama-3.1-405B-Instruct. ## Citation <!-- If there is a paper or blog post introducing the dataset, the APA and Bibtex information for that should go in this section. --> **BibTeX:** ``` @misc{oumiC2DAndD2CSubset, author = {Jeremiah Greer}, title = {Oumi C2D and D2C Subset}, month = {March}, year = {2025}, url = {https://huggingface.co/datasets/oumi-ai/oumi-c2d-d2c-subset} } @software{oumi2025, author = {Oumi Community}, title = {Oumi: an Open, End-to-end Platform for Building Large Foundation Models}, month = {January}, year = {2025}, url = {https://github.com/oumi-ai/oumi} } ```

[![oumi logo]("https://oumi.ai/logo_lockup_black.svg")](https://github.com/oumi-ai/oumi) [![Made with Oumi]("https://badgen.net/badge/Made%20with/Oumi/%23085CFF?icon=https%3A%2F%2Foumi.ai%2Flogo_dark.svg")](https://github.com/oumi-ai/oumi) [![Documentation]("https://img.shields.io/badge/Documentation-oumi-blue.svg")](https://oumi.ai/docs/en/latest/index.html) [![Blog]("https://img.shields.io/badge/Blog-oumi-blue.svg")](https://oumi.ai/blog) [![Discord]("https://img.shields.io/discord/1286348126797430814?label=Discord")](https://discord.gg/oumi) # oumi-ai/oumi-c2d-d2c-subset **oumi-c2d-d2c-subset** 是一款专为**主张验证(Claim Verification)**任务微调大语言模型(Large Language Model,简称LLM)而设计的文本数据集。其提示词取自[C2D-and-D2C-MiniCheck]("https://huggingface.co/datasets/lytang/C2D-and-D2C-MiniCheck")训练集,回复由**[Llama-3.1-405B-Instruct]("https://huggingface.co/meta-llama/Llama-3.1-405B-Instruct")**生成。本数据集被用于训练**[HallOumi-8B]("https://huggingface.co/oumi-ai/HallOumi-8B")**,该模型的宏F1值(Macro F1)可达77.2%,性能优于当前最优(SOTA)模型,如Claude Sonnet 3.5、OpenAI o1等。 - **数据整理:** [Oumi AI]("https://oumi.ai/"),依托Oumi推理工具完成 - **语言(自然语言处理,NLP):** 英语 - **授权协议:** [Llama 3.1社区许可协议]("https://www.llama.com/llama3_1/license/") ## 使用场景 本数据集可用于面向主张验证任务的大语言模型有监督微调。 微调教程:https://oumi.ai/halloumi ## 不适用场景 本数据集不适用于构建通用对话模型。 ## 数据集结构 { # 唯一对话标识符 "conversation_id": str, # 按照对话格式组织的用户与助手交互轮次数据 # 示例:[{'role': 'user', 'content': ...}, {'role': 'assistant', 'content': ...}] "messages": list[dict[str, str]], # 样本元数据 "metadata": dict[str, ...], } ## 数据集构建 ### 构建初衷 为助力社区开发更可靠的基础模型,我们创建本数据集以用于训练HallOumi。本数据集通过在谷歌云(Google Cloud)上运行Oumi推理任务生成。 ### 源数据 提示词查询样本取自[C2D-and-D2C-MiniCheck]("https://huggingface.co/datasets/lytang/C2D-and-D2C-MiniCheck")数据集。 #### 数据收集与处理 回复样本通过在谷歌云运行Oumi批量推理任务收集得到。 #### 个人与敏感信息 本数据集未包含、也大概率不会包含任何个人、敏感或隐私信息。 ## 偏差、风险与局限性 1. 本数据集的源提示词取自[C2D-and-D2C-MiniCheck]("https://huggingface.co/datasets/lytang/C2D-and-D2C-MiniCheck"),可能继承该数据集在数据收集过程中存在的各类偏差。 2. 生成的回复大概率会反映Llama-3.1-405B-Instruct本身存在的偏差与局限性。 ## 引用 ### BibTeX格式: @misc{oumiC2DAndD2CSubset, author = {Jeremiah Greer}, title = {Oumi C2D and D2C Subset}, month = {March}, year = {2025}, url = {https://huggingface.co/datasets/oumi-ai/oumi-c2d-d2c-subset} } @software{oumi2025, author = {Oumi Community}, title = {Oumi: an Open, End-to-end Platform for Building Large Foundation Models}, month = {January}, year = {2025}, url = {https://github.com/oumi-ai/oumi} }
提供机构:
maas
创建时间:
2025-04-09
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作