Download link:
https://modelscope.cn/datasets/toloka/beemo
# Dataset Card for Beemo
<img src="beemo.gif" width="65" height="65" />
<small> [GIF Source.](https://slackmojis.com/emojis/67173-bmo) </small>
## Dataset Description
Beemo (**B**enchmark of **e**xpert-**e**dited **m**achine-generated **o**utputs) is a benchmark for fine-grained machine-generated text detection, which consists of 6.5k texts written by humans, generated by ten open-source instruction-finetuned LLMs and edited by expert annotators for various use cases. Furthermore, each machine-generated text is edited by two state-of-the-art LLMs using several diverse editing prompts, which results in 13.1k machine-generated & LLM-edited texts. We make one of the first attempts to address more practical machine-generated text detection scenarios, where the user refines the LLM output or utilizes another LLM to make it more human-like.
We describe our high-level benchmark creation approach here and provide more details in our paper.
<small> Our benchmark is named after BMO (abbreviated from "Be MOre", phonetically spelled "Beemo"), one of the main characters of Adventure Time. </small>
* 📊 **Curated by**: Toloka, Penn State University, MIT Lincoln Laboratory, and University of Oslo.
* 🌐 **Language(s)**: English
* 👾 **Repository**: [github.com/Toloka/beemo](https://github.com/Toloka/beemo)
* 🗞️ **Paper**: [arxiv.org/abs/2411.04032](https://arxiv.org/abs/2411.04032) (to appear at NAACL 2025)
* 🪪 **License**: MIT
## Dataset Creation

Beemo's creation approach involves three steps:
* (a) 🤖 **Machine-generated Text Collection**: prompting an instruction-finetuned LLM;
* (b) 👩🏻‍🔬 **Expert-based Editing**: editing the LLM's output by an expert annotator;
* (c) 🦾 **LLM-based Editing**: editing the LLM's output by two state-of-the-art LLMs.
### 🤖 Machine-generated Text Collection
The [No Robots 🙅‍♂️🤖](https://huggingface.co/datasets/HuggingFaceH4/no_robots) dataset is used as the source of prompts and corresponding human-written texts across the following categories: Generation, Rewrite, Summarize, Open QA, and Closed QA. We randomly sample prompts and generate an output for each with one of ten open-source instruction-finetuned LLMs, using the default 🤗 HuggingFace chat templates and inference hyperparameters.
<details>
<summary><b>Instruction-finetuned LLMs</b></summary>
| Name | Base | SFT corpus | License | Paper |
|:-------------------------------------|:--------|:-------------------------------------------------------------------|:--------------|:--------------------------------------------------------------|
| [HuggingFaceH4/zephyr-7b-beta](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta) | Mistral-7B-v0.1 | UltraChat, UltraFeedback | MIT | [Tunstall et al. (2023)](https://arxiv.org/abs/2310.16944) |
| [allenai/tulu-2-7b](https://huggingface.co/allenai/tulu-2-7b) | Llama 2 7B | human-written and synthetic | AI2 ImpACT | [Ivison et al. (2023)](https://arxiv.org/abs/2311.10702) |
| [allenai/tulu-2-13b](https://huggingface.co/allenai/tulu-2-13b) | Llama 2 13B | human-written and synthetic | AI2 ImpACT | [Ivison et al. (2023)](https://arxiv.org/abs/2311.10702) |
| [google/gemma-2b-it](https://huggingface.co/google/gemma-2b-it) | Gemma 2B | human-written and synthetic | Gemma license | [Gemma Team et al. (2024)](https://arxiv.org/abs/2403.08295) |
| [google/gemma-7b-it](https://huggingface.co/google/gemma-7b-it) | Gemma 7B | human-written and synthetic | Gemma license | [Gemma Team et al. (2024)](https://arxiv.org/abs/2403.08295) |
| [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) | Llama 2 7B | Misc.| Llama license | [Touvron et al. (2023)](https://arxiv.org/abs/2307.09288) |
| [meta-llama/Llama-2-13b-chat-hf](https://huggingface.co/meta-llama/Llama-2-13b-chat-hf) | Llama 2 13B | Misc.| Llama license | [Touvron et al. (2023)](https://arxiv.org/abs/2307.09288) |
| [meta-llama/Llama-2-70b-chat-hf](https://huggingface.co/meta-llama/Llama-2-70b-chat-hf) | Llama 2 70B | Misc.| Llama license | [Touvron et al. (2023)](https://arxiv.org/abs/2307.09288) |
| [mistralai/Mistral-7B-Instruct-v0.1](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1) | Mistral-7B-v0.1 | Misc. | Apache-2.0 | [Jiang et al. (2023)](https://arxiv.org/abs/2310.06825) |
| [mistralai/Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) | Mixtral 8x7B | Misc.| Apache-2.0 | [Jiang et al. (2024)](https://arxiv.org/pdf/2401.04088) |
| [meta-llama/Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct) | Llama-3.1 | Misc. | Llama | [Dubey et al. (2024)](https://arxiv.org/abs/2407.21783) |
| [GPT-4o](https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence) | GPT-4 | Misc. | OpenAI | [OpenAI (2024)](https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/) |
<medium> Table 1: Overview of the instruction-finetuned LLMs used to create Beemo. ```GPT-4o``` and ```meta-llama/Llama-3.1-70B-Instruct``` are used only for LLM-based editing. </medium>
</details>
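The collection step can be pictured as pairing each sampled prompt with one of the ten generation models in Table 1. The sketch below assumes the pairing is uniform at random with a fixed seed; the card does not state the exact sampling procedure, and `assign_models`, the seed, and the example prompt ids are illustrative names rather than part of the Beemo code.

```python
import random

# Generation models from Table 1 (the two editing-only models are excluded).
GENERATION_MODELS = [
    "HuggingFaceH4/zephyr-7b-beta",
    "allenai/tulu-2-7b",
    "allenai/tulu-2-13b",
    "google/gemma-2b-it",
    "google/gemma-7b-it",
    "meta-llama/Llama-2-7b-chat-hf",
    "meta-llama/Llama-2-13b-chat-hf",
    "meta-llama/Llama-2-70b-chat-hf",
    "mistralai/Mistral-7B-Instruct-v0.1",
    "mistralai/Mixtral-8x7B-Instruct-v0.1",
]


def assign_models(prompt_ids, seed=0):
    """Hypothetical helper: pair each prompt id with one randomly chosen model."""
    rng = random.Random(seed)  # seeded so the assignment is reproducible
    return {pid: rng.choice(GENERATION_MODELS) for pid in prompt_ids}


# Placeholder prompt ids, used only for illustration.
assignment = assign_models(["prompt-a", "prompt-b", "prompt-c"])
```

Each assigned model would then be prompted through its own chat template to produce the `model_output` text.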
### 👩🏻‍🔬 Expert-based Editing
The machine-generated texts are edited by an in-house team of annotators with extensive experience in refining content produced by LLMs.
### 🦾 LLM-based Editing
The machine-generated texts are "humanized" by ```GPT-4o``` and ```meta-llama/Llama-3.1-70B-Instruct``` using three editing prompts.
<details>
<summary><b>Editing prompts</b></summary>
* P1: ```You are given a prompt and a text generated by AI using this prompt. Your task is to edit the AI-generated text to make it sound human-like and error-free. Ensure your overall edits do not exceed 40% of the generated text and the edited text follows the user request. Output only the edited text and do not explain your edits.\n\nPrompt: {prompt}\n\nAI text: {model_output}```
* P2: ```You are given a pair containing two components: (1) a user prompt for an AI assistant and (2) the AI assistant’s response. Refine the AI-generated response to make it sound more natural. Vary your editing patterns and the portions of text you choose to modify, and ensure your overall edits are 20-40% of the words in the response.\n\nUser prompt: {prompt}\n\nAI-generated response: {model_output}```
* P3: ```Modify a machine-generated response to a given prompt to make it appear more like it was written by a native English speaker. Ensure the revised version follows the user's intent. You should just give me the revised version without any other words.\n\nPrompt: {prompt}\n\nMachine-generated response: {model_output}```
</details>
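Each editing prompt is a template with two placeholders, `{prompt}` and `{model_output}`. As a minimal sketch of how one might fill such a template before sending it to an editing LLM, the snippet below uses P1 and the example instance from this card; `build_editing_prompt` is a hypothetical helper, not part of the Beemo code.

```python
# P1 template, copied from the list of editing prompts above.
P1_TEMPLATE = (
    "You are given a prompt and a text generated by AI using this prompt. "
    "Your task is to edit the AI-generated text to make it sound human-like "
    "and error-free. Ensure your overall edits do not exceed 40% of the "
    "generated text and the edited text follows the user request. Output "
    "only the edited text and do not explain your edits."
    "\n\nPrompt: {prompt}\n\nAI text: {model_output}"
)


def build_editing_prompt(template, prompt, model_output):
    """Hypothetical helper: fill the two placeholders of an editing template."""
    # Only the template is parsed by str.format; braces inside the
    # substituted texts are inserted verbatim.
    return template.format(prompt=prompt, model_output=model_output)


filled = build_editing_prompt(
    P1_TEMPLATE,
    prompt="Was Ozzie Smith ever voted into the Baseball Hall of Fame?",
    model_output="Yes, Ozzie Smith was elected to the Baseball Hall of Fame in 2",
)
print(filled)
```

The same helper works for P2 and P3, since all three templates use the same two placeholder names.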
## Dataset Structure
### Dataset Instances
🗒️ Each dataset instance looks as follows:
```
{
    'id': '1139',
    'category': 'Open QA',
    'model': 'mistralai/Mixtral-8x7B-Instruct-v0.1',
    'prompt_id': '615aadc25bc591493426c8ca2e60e79daeb9e6a7118199fe4efa8e3f41b01fc2',
    'prompt': 'Was Ozzie Smith ever voted into the Baseball Hall of Fame?',
    'model_output': 'Yes, Ozzie Smith was elected to the Baseball Hall of Fame in 2',
    'human_output': 'Ozzie Smith was inducted into the National Baseball Hall of Fame on July 28, 2002. He had been elected on his first ballot, the first time he was eligible, by receiving 91.7% of the votes cast. He was also one of 22 former players and personnel to be inducted into the St. Louis Cardinals Hall of Fame Museum in 2014 for the inaugural class.',
    'human_edits': 'Ozzie Smith was elected to the Baseball Hall of Fame in 2002.',
    'llama-3.1-70b_edits': [
        {'P1': 'Yes, Ozzie Smith was elected to the Baseball Hall of Fame in 2002.'},
        {'P2': 'Refined response: Yes, Ozzie Smith was indeed elected to the National Baseball Hall of Fame, securing his spot in 2002.'},
        {'P3': "Ozzie Smith was indeed elected to the National Baseball Hall of Fame in 2002, in his first year of eligibility, a testament to his illustrious 21-year career as one of baseball's all-time greatest defensive shortstops."}
    ],
    'gpt-4o_edits': [
        {'P1': 'Yes, Ozzie Smith was elected to the Baseball Hall of Fame in 2002.'},
        {'P2': 'Yes, Ozzie Smith was indeed inducted into the Baseball Hall of Fame in 2002.'},
        {'P3': 'Yes, Ozzie Smith was elected to the Baseball Hall of Fame in 2002.'}
    ]
}
```
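In the instance above, each `*_edits` field is a list of single-key dictionaries, one per editing prompt. A small sketch of how one might merge such a list into a single dictionary keyed by prompt name is shown below; `edits_by_prompt` is a hypothetical helper, and the data is copied from the `gpt-4o_edits` field of the example.

```python
def edits_by_prompt(edits):
    """Hypothetical helper: merge [{'P1': ...}, {'P2': ...}, ...] into one dict."""
    merged = {}
    for entry in edits:
        merged.update(entry)  # each entry holds exactly one prompt name
    return merged


gpt4o_edits = [
    {"P1": "Yes, Ozzie Smith was elected to the Baseball Hall of Fame in 2002."},
    {"P2": "Yes, Ozzie Smith was indeed inducted into the Baseball Hall of Fame in 2002."},
    {"P3": "Yes, Ozzie Smith was elected to the Baseball Hall of Fame in 2002."},
]

by_prompt = edits_by_prompt(gpt4o_edits)
print(by_prompt["P2"])
```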
### Dataset Fields
`id`: example id \
`category`: prompt category from No Robots 🙅‍♂️🤖 \
`model`: name of the instruction-finetuned LLM that generated the response \
`prompt_id`: prompt id from No Robots 🙅‍♂️🤖 \
`prompt`: describes the user request an LLM should perform \
`model_output`: the LLM's response to the prompt \
`human_output`: the human-written gold standard response to the prompt from No Robots 🙅‍♂️🤖 \
`human_edits`: the expert-edited version of the LLM's response \
`llama-3.1-70b_edits`: three versions of the LLM's response edited by ```meta-llama/Llama-3.1-70B-Instruct```, one per editing prompt (P1, P2, P3) \
`gpt-4o_edits`: three versions of the LLM's response edited by ```GPT-4o```, one per editing prompt (P1, P2, P3)
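For fine-grained detection experiments, it can help to unfold one instance into separate (text, label) pairs, one per text variant. The sketch below is one possible way to do this; `to_labeled_texts` and the label strings are assumptions made for illustration, not an official label scheme, and the toy instance uses placeholder texts.

```python
def to_labeled_texts(instance):
    """Hypothetical helper: return one (text, label) pair per text variant."""
    rows = [
        (instance["human_output"], "human-written"),
        (instance["model_output"], "machine-generated"),
        (instance["human_edits"], "expert-edited"),
    ]
    # Each *_edits field is a list of single-key dicts: [{'P1': ...}, ...].
    for field in ("llama-3.1-70b_edits", "gpt-4o_edits"):
        for entry in instance[field]:
            for prompt_name, text in entry.items():
                rows.append((text, f"llm-edited/{field}/{prompt_name}"))
    return rows


# Toy instance with placeholder texts, following the field layout above.
toy_instance = {
    "human_output": "human text",
    "model_output": "model text",
    "human_edits": "expert-edited text",
    "llama-3.1-70b_edits": [{"P1": "l1"}, {"P2": "l2"}, {"P3": "l3"}],
    "gpt-4o_edits": [{"P1": "g1"}, {"P2": "g2"}, {"P3": "g3"}],
}

rows = to_labeled_texts(toy_instance)
for text, label in rows:
    print(label, "->", text)
```

A full instance yields nine pairs: one human-written, one machine-generated, one expert-edited, and six LLM-edited texts.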
## Licensing Information
* The prompts (`prompt`) and human-written texts (`human_output`) from No Robots 🙅‍♂️🤖 are under the original dataset's license: CC-BY-NC-4.0.
* The machine-generated texts (`model_output`) and their LLM-edited versions (`llama-3.1-70b_edits` and `gpt-4o_edits`) are subject to the underlying LLMs' licensing terms mentioned in Table 1.
* The expert-edited machine-generated texts (`human_edits`) are available under the MIT license, unless otherwise specified in the underlying instruction-finetuned LLMs' licensing terms.
## Cite us
```
@article{artemova2024beemo,
title={Beemo: Benchmark of Expert-edited Machine-generated Outputs},
author={Artemova, Ekaterina and Lucas, Jason and Venkatraman, Saranya and Lee, Jooyoung and Tilga, Sergei and Uchendu, Adaku and Mikhailov, Vladislav},
journal={arXiv preprint arXiv:2411.04032},
year={2024}
}
```
## Contact us
* Vladislav Mikhailov (vladism@ifi.uio.no)
* Ekaterina Artemova (katya-art@toloka.ai)