Writing-Preference-Bench
Source: ModelScope community (魔搭社区) · Updated: 2026-01-06 · Indexed: 2025-11-03
Download link:
https://modelscope.cn/datasets/m-a-p/Writing-Preference-Bench
Description:
## 🔔 Introduction
**WritingPreferenceBench** is a cross-lingual benchmark for evaluating language models’ ability to recognize **subjective writing quality**—including creativity, stylistic sophistication, and emotional resonance—while neutralizing objective signals such as grammar, factuality, and length.
It contains **1,800 human-validated preference pairs** (1,200 English and 600 Chinese) across **8 creative writing genres** and **51 fine-grained categories**, where both responses are grammatically correct, factually accurate, and length-matched.
Empirical results show that standard **sequence-based reward models (SC-RM)** achieve only **52.7% mean accuracy**, while **generative reward models (GenRM)** that output reasoning chains reach up to **81.8%** (RM-R1-Qwen2.5-7B on the English split).
These findings demonstrate that **subjective preference modeling** requires structured reasoning rather than direct classification.
---
## 🧩 Benchmark Overview
WritingPreferenceBench adopts a **human-in-the-loop** data construction pipeline to isolate genuine subjective preferences:
1. **Query Design:**
   - 51 writing categories organized into 8 macro domains.
   - Authored and validated by professional creative writing instructors in both English and Chinese.
2. **Response Generation:**
   - 20 state-of-the-art language models (e.g., GPT-4.1, Claude-4, Gemini-2.5-Pro, Doubao-1.5-Pro).
   - 5 temperature-sampled outputs per query (T = 0.8).
3. **Human Evaluation:**
   - 11 expert annotators trained with an 8-hour rubric calibration.
   - 4-point creative writing scale (0–3).
   - Pairs retained only when ≥2 of 3 annotators agreed and Δscore ≥ 1.
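The retention rule in step 3 can be sketched as a small filter. This is only an illustration under stated assumptions: the card does not say how "agreed" is measured or how the three annotator scores are aggregated, so the sketch assumes an annotator "prefers" a response when they score it strictly higher, and it aggregates with the median. The function name `retain_pair` and its parameters are hypothetical.

```python
from statistics import median


def retain_pair(scores_a, scores_b, min_agree=2, min_gap=1):
    """Return "A" or "B" for the preferred response, or None if the pair is dropped.

    scores_a / scores_b: per-annotator scores (0-3) for responses A and B,
    aligned by annotator. Median aggregation is an assumption, not the
    documented procedure.
    """
    # Count annotators who strictly prefer each side.
    votes_a = sum(a > b for a, b in zip(scores_a, scores_b))
    votes_b = sum(b > a for a, b in zip(scores_a, scores_b))
    # Aggregated score gap between the two responses.
    gap = median(scores_a) - median(scores_b)
    if votes_a >= min_agree and gap >= min_gap:
        return "A"
    if votes_b >= min_agree and -gap >= min_gap:
        return "B"
    return None


print(retain_pair([3, 2, 2], [1, 1, 2]))  # two annotators prefer A, gap of 1
print(retain_pair([2, 2, 1], [2, 1, 2]))  # no majority, pair is dropped
```

Under these assumptions a pair survives only when a majority of annotators and the aggregated scores point the same way, which matches the intent of keeping pairs with a clear preference.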
---
## 📊 Dataset Statistics
| Language | #Pairs | #Categories | Mean Score Gap | Mean Length (Chosen) | Mean Length (Rejected) |
|-----------|---------|--------------|----------------|----------------------|------------------------|
| English | 1,200 | 51 | 1.31 | 1,450.3 | 839.9 |
| Chinese | 600 | 51 | 1.45 | 1,873.5 | 1,458.3 |
**Macro Domains:** Fiction · Non-Fiction · Functional Documents · Promotional & Communication · Funny · Poetry · Scriptwriting · Role-Playing
---
## 📦 Dataset Format
Each example in WritingPreferenceBench follows the structure below:
```json
{
  "prompt": "写一个抽象文学,你的角色是一个程序员,表达程序员上班、改代码到想发疯的口号,可以用一些Emoji表情。字数不用太长,越不符合现实逻辑越好。要突出自己已经上班了16个小时这一点。",
  "prompt_id": "ff1c392f-b49a-4748-a34d-2d7edcb3e4ee",
  "tag": "抽象文学-亚文化",
  "chosen": {
    "response": "16小时代码雨 ☔️☠️,键盘敲出灵魂斑驳🌪️。...",
    "score": 2,
    "model": "OpenAI-gpt4.1-mini",
    "completion_tokens": 159,
    "prompt_tokens": 70,
    "word_len": 133
  },
  "rejected": {
    "response": "# 《二进制的挽歌》 ...",
    "score": 1,
    "model": "Claude-4-Sonnet-nothinking",
    "completion_tokens": 420,
    "prompt_tokens": 96,
    "word_len": 245
  }
}
```
**Field Descriptions:**
- `prompt`: The writing instruction or creative query presented to the model.
- `prompt_id`: A UUID that uniquely identifies the query.
- `tag`: The fine-grained writing genre (e.g., “诗歌-现代” (poetry, modern), “抽象文学-亚文化” (abstract literature, subculture)).
- `chosen` / `rejected`: The two model responses compared under human annotation. Each contains:
  - `response`: The raw generated text.
  - `score`: Human-assigned quality score (0–3).
  - `model`: The source model that produced the response.
  - `completion_tokens`, `prompt_tokens`: Token usage statistics for reproducibility.
  - `word_len`: Character or word length of the response.
Each JSON object represents one **human preference pair**, where `chosen` has higher subjective quality than `rejected` according to expert annotation.
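As a minimal sketch of how such records might be read and sanity-checked, the snippet below assumes the data is stored as JSON Lines (one object per line). The on-disk layout is not stated in this card, and `parse_pair` and `load_pairs` are hypothetical helper names rather than part of any released tooling.

```python
import json

REQUIRED_TOP = ("prompt", "prompt_id", "tag", "chosen", "rejected")
REQUIRED_SIDE = ("response", "score", "model")


def parse_pair(line):
    """Parse one JSON object and check the fields described above."""
    record = json.loads(line)
    for key in REQUIRED_TOP:
        if key not in record:
            raise ValueError(f"missing top-level field: {key}")
    for side in ("chosen", "rejected"):
        for key in REQUIRED_SIDE:
            if key not in record[side]:
                raise ValueError(f"missing field: {side}.{key}")
    # Retained pairs have a score gap of at least 1 in favor of `chosen`.
    if record["chosen"]["score"] - record["rejected"]["score"] < 1:
        raise ValueError("chosen must outscore rejected by at least 1")
    return record


def load_pairs(path):
    """Read a JSON Lines file (assumed layout) into a list of validated pairs."""
    with open(path, encoding="utf-8") as fh:
        return [parse_pair(line) for line in fh if line.strip()]
```

The score check mirrors the retention criterion from the construction pipeline, so a failure here would point to a corrupted or re-labeled record rather than a valid pair.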
---
## 🧠 Evaluation Protocol
Two evaluation settings are supported:
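1. **Reward Model Scoring**
   Models output a scalar score for each response. Accuracy is the fraction of pairs in which the chosen response receives the strictly higher score:
   \[
   \mathrm{Acc} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\left[\mathrm{RM}\big(R^{(i)}_{\text{chosen}}\big) > \mathrm{RM}\big(R^{(i)}_{\text{rejected}}\big)\right],
   \]
   where \(N\) is the number of preference pairs.
2. **LLM-as-Judge Evaluation**
   LLMs are prompted with both responses and asked to select the preferred one based on creativity, emotional resonance, and stylistic flair.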
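The LLM-as-judge setting might be wired up roughly as follows. This is a sketch, not the authors' harness: the prompt wording in `JUDGE_TEMPLATE` is invented for illustration, `judge_fn` is a hypothetical callable that sends a prompt to some LLM and returns its reply, and shuffling the presentation order per pair is an added assumption intended to reduce position bias.

```python
import random

# Illustrative prompt wording only; the benchmark's actual judge prompt is not given here.
JUDGE_TEMPLATE = (
    "Writing task:\n{prompt}\n\n"
    "Response A:\n{a}\n\n"
    "Response B:\n{b}\n\n"
    "Considering creativity, emotional resonance, and stylistic flair, "
    "which response is better? Answer with a single letter: A or B."
)


def judge_accuracy(pairs, judge_fn, seed=0):
    """Fraction of pairs where the judge picks the human-chosen response."""
    if not pairs:
        raise ValueError("pairs must be non-empty")
    rng = random.Random(seed)
    correct = 0
    for pair in pairs:
        # Randomize which slot holds the chosen response (assumption).
        chosen_first = rng.random() < 0.5
        first, second = (
            (pair["chosen"], pair["rejected"])
            if chosen_first
            else (pair["rejected"], pair["chosen"])
        )
        text = JUDGE_TEMPLATE.format(
            prompt=pair["prompt"], a=first["response"], b=second["response"]
        )
        verdict = judge_fn(text).strip().upper()
        correct += verdict == ("A" if chosen_first else "B")
    return correct / len(pairs)
```

Replies that are neither "A" nor "B" count as errors in this sketch; a real harness would likely need more careful answer parsing.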
This structure enables consistent evaluation across **reward models**, **LLM judges**, and **cross-lingual experiments**.
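For the reward-model setting, the accuracy metric reduces to a pairwise comparison. The sketch below assumes a hypothetical `score_fn(prompt, response)` that returns a scalar; `length_baseline` is a toy stand-in used only to make the example runnable, not a model evaluated in this benchmark.

```python
def pairwise_accuracy(pairs, score_fn):
    """Fraction of pairs where the chosen response gets a strictly higher score."""
    if not pairs:
        raise ValueError("pairs must be non-empty")
    correct = 0
    for pair in pairs:
        s_chosen = score_fn(pair["prompt"], pair["chosen"]["response"])
        s_rejected = score_fn(pair["prompt"], pair["rejected"]["response"])
        # Ties count as errors, matching the strict inequality in the metric.
        correct += s_chosen > s_rejected
    return correct / len(pairs)


def length_baseline(prompt, response):
    """Toy scorer (assumption for illustration): longer responses score higher."""
    return len(response)
```

Running a trivial scorer such as `length_baseline` alongside a real reward model is one way to check how much of a model's accuracy could be explained by surface cues alone.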
---
## 🏁 Main Results
**Reward Models (Accuracy %)**
| Model | Type | EN | ZH | Avg |
|--------|------|----|----|-----|
| RM-R1-Qwen2.5-7B | Generative RM | **81.8** | 73.3 | 77.6 |
| RM-R1-DeepSeek-Qwen-14B | Generative RM | 62.5 | 62.6 | 62.6 |
| RM-Mistral-7B | Sequence Classifier | 62.6 | 55.6 | 59.1 |
| Nvidia/AceMath-7B | Sequence Classifier | 46.8 | 53.5 | 50.2 |
**LLM Judges (Zero-Shot Accuracy %)**
| Model | EN | ZH | Avg |
|--------|----|----|-----|
| Doubao-1.5-Pro | 68.7 | 62.5 | **65.6** |
| Gemini-2.5-Pro | 65.7 | 62.7 | 64.2 |
| Claude-4-Opus-thinking | 61.0 | 56.0 | 58.5 |
| OpenAI-o3-high | 48.1 | 42.0 | 45.1 |
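The tables do not say how the Avg column is computed. A quick check suggests it is the unweighted mean of the EN and ZH accuracies (rounded half up to one decimal) rather than a mean weighted by the 1,200 and 600 pair counts. The helper names below are hypothetical, and `Decimal` is used to avoid floating-point rounding surprises.

```python
from decimal import Decimal, ROUND_HALF_UP

ONE_DECIMAL = Decimal("0.1")


def macro_avg(en, zh):
    """Unweighted mean of the two language accuracies, rounded half up."""
    mean = (Decimal(en) + Decimal(zh)) / 2
    return mean.quantize(ONE_DECIMAL, rounding=ROUND_HALF_UP)


def pair_weighted_avg(en, zh, n_en=1200, n_zh=600):
    """Mean weighted by the number of pairs per language, for comparison."""
    mean = (Decimal(en) * n_en + Decimal(zh) * n_zh) / (n_en + n_zh)
    return mean.quantize(ONE_DECIMAL, rounding=ROUND_HALF_UP)


# RM-R1-Qwen2.5-7B: the table reports Avg = 77.6
print(macro_avg("81.8", "73.3"))          # 77.6, matches the table
print(pair_weighted_avg("81.8", "73.3"))  # 79.0, does not match
```

The same unweighted mean reproduces the Avg value for every row in both tables, so cross-model comparisons on Avg give English and Chinese equal weight despite the 2:1 difference in pair counts.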
---
## 📜 License
**WritingPreferenceBench** is distributed under the **[Open Data Commons Attribution License (ODC-BY)](https://opendatacommons.org/licenses/by/)**.
The dataset and documentation are released for research and educational use.
You are required to:
- Provide proper attribution to the authors.
- Respect the licenses of any referenced data included within derivative works.
---
## 📚 Citation
**BibTeX:**
```bibtex
@misc{ying2025writingpreferencebench,
title={WritingPreferenceBench: Evaluating Subjective Writing Preferences Across Cultures},
author={Shuangshuang Ying and Yunwen Li and Xingwei Qu and Xin Li and Sheng Jin and Minghao Liu and Zhoufutu Wen and Xeron Du and Tianyu Zheng and Yichi Zhang and Letian Ni and Yuyang Cheng and Qiguang Chen and Jingzhe Ding and Shengda Long and Wangchunshu Zhou and Jiazhan Feng and Wanjun Zhong and Libo Qin and Wenhao Huang and Wanxiang Che and Chenghua Lin and Ge Zhang},
year={2025},
eprint={2510.14616},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={}
}
```
Provider: maas
Created: 2025-10-17



