Arena-Write
Source: ModelScope (魔搭社区). Updated 2025-11-26; indexed 2025-07-19.
Download link: https://modelscope.cn/datasets/THU-KEG/Arena-Write
## 📚 Arena-Write Dataset
Arena-Write is a small-scale benchmark of **100 user writing tasks**, designed to evaluate long-form generation models in realistic scenarios. The tasks span diverse formats such as social posts, essays, and reports, and many require outputs of more than 2,000 words.
Project page: https://huggingface.co/THU-KEG/
### 📄 Data Format
Each data sample is a JSON object with the following fields:
```json
{
"idx": 1,
"question": "Write a social media post about Lei Feng spirit, within 200 characters.",
"type": "Community Forum",
"length": 200,
"baseline_response": ""
}
```
- `question`: A real-world user writing prompt
- `type`: Scenario tag (e.g., Community Forum, Essay)
- `length`: Expected output length
- `baseline_response`: Outputs from **six** strong base models (e.g., GPT-4o, DeepSeek-R1)
> Each task is answered by several base models to support pairwise comparison during evaluation.
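
As a minimal sketch of how a record in this format could be read and sanity-checked, the snippet below parses one sample and verifies the documented fields. The helper name `parse_sample` and the type checks are assumptions for illustration, not part of the dataset's tooling. Because the card does not show how `baseline_response` is structured when populated, the sketch only checks that the field is present.

```python
import json

# Field names come from the data format above; the expected Python types
# are an assumption inferred from the single example record.
REQUIRED_FIELDS = {"idx": int, "question": str, "type": str, "length": int}


def parse_sample(raw: str) -> dict:
    """Parse one JSON record and check that the documented fields exist."""
    sample = json.loads(raw)
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in sample:
            raise ValueError(f"missing field: {name}")
        if not isinstance(sample[name], expected_type):
            raise ValueError(f"field {name} should be {expected_type.__name__}")
    # The structure of baseline_response is not specified, so only presence is checked.
    if "baseline_response" not in sample:
        raise ValueError("missing field: baseline_response")
    return sample


raw_record = """
{
  "idx": 1,
  "question": "Write a social media post about Lei Feng spirit, within 200 characters.",
  "type": "Community Forum",
  "length": 200,
  "baseline_response": ""
}
"""

sample = parse_sample(raw_record)
print(sample["type"], sample["length"])
```
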
### 🧪 Evaluation Protocol
- **Pairwise Comparison**: Model outputs are compared against the baseline responses by LLM judges. Each pair is evaluated twice, with the order of the two responses flipped, to reduce position bias.
- **Elo Scoring**: Results are aggregated into Elo scores to track model performance.
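
The two steps above can be sketched as follows. This is an illustration under stated assumptions, not the benchmark's actual evaluation code: the judge is a hypothetical callable that returns `"A"`, `"B"`, or `"tie"`, and the rating update is the classic online Elo formula with an assumed K-factor of 32 and an assumed starting rating of 1000. The card does not specify any of these details.

```python
from typing import Callable, List, Tuple

# Hypothetical judge signature: (question, first_response, second_response)
# -> "A" if the first-shown response wins, "B" if the second wins, else "tie".
Judge = Callable[[str, str, str], str]


def compare_both_orders(judge: Judge, question: str,
                        candidate: str, baseline: str) -> List[float]:
    """Score the candidate (1 win, 0.5 tie, 0 loss) in both presentation orders."""
    scores = []
    # Order 1: candidate shown first, so "A" means the candidate wins.
    verdict = judge(question, candidate, baseline)
    scores.append({"A": 1.0, "B": 0.0}.get(verdict, 0.5))
    # Order 2: baseline shown first, so "B" means the candidate wins.
    verdict = judge(question, baseline, candidate)
    scores.append({"A": 0.0, "B": 1.0}.get(verdict, 0.5))
    return scores


def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_update(r_a: float, r_b: float, score_a: float,
               k: float = 32.0) -> Tuple[float, float]:
    """Apply one Elo update; k=32 is an assumed K-factor."""
    e_a = expected_score(r_a, r_b)
    new_a = r_a + k * (score_a - e_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b


def longer_wins(question: str, first: str, second: str) -> str:
    """Toy stand-in for an LLM judge: prefers the longer response."""
    if len(first) > len(second):
        return "A"
    if len(first) < len(second):
        return "B"
    return "tie"


def always_first(question: str, first: str, second: str) -> str:
    """Toy judge with pure position bias: always picks the first response."""
    return "A"


scores = compare_both_orders(longer_wins, "q", "a much longer draft", "short")
biased_scores = compare_both_orders(always_first, "q", "x", "y")

candidate_rating, baseline_rating = 1000.0, 1000.0  # assumed starting ratings
for s in scores:
    candidate_rating, baseline_rating = elo_update(candidate_rating, baseline_rating, s)
print(scores, biased_scores, round(candidate_rating, 1), round(baseline_rating, 1))
```

With the purely position-biased toy judge, the candidate wins one order and loses the other, so the two evaluations cancel out. That is the effect the flipped-order protocol is meant to achieve.
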
### 📖 Citation
If you use this dataset, please cite:
```bibtex
@misc{wu2025longwriterzeromasteringultralongtext,
title={LongWriter-Zero: Mastering Ultra-Long Text Generation via Reinforcement Learning},
author={Yuhao Wu and Yushi Bai and Zhiqiang Hu and Roy Ka-Wei Lee and Juanzi Li},
year={2025},
eprint={2506.18841},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2506.18841},
}
```
Provider: maas
Created: 2025-07-15



