
Arena-Write

Source: ModelScope community (updated 2025-11-26, indexed 2025-07-19)
Download link:
https://modelscope.cn/datasets/THU-KEG/Arena-Write
Resource description:
## 📚 Arena-Write Dataset

Arena-Write is a small-scale benchmark of **100 user writing tasks**, designed to evaluate long-form generation models in realistic scenarios. The tasks span diverse formats such as social posts, essays, and reports, and many require outputs of more than 2,000 words.

Project page: https://huggingface.co/THU-KEG/

### 📄 Data Format

Each data sample is a JSON object with the following fields:

```json
{
  "idx": 1,
  "question": "Write a social media post about Lei Feng spirit, within 200 characters.",
  "type": "Community Forum",
  "length": 200,
  "baseline_response": ""
}
```

- `question`: A real-world user writing prompt
- `type`: Scenario tag (e.g., Community Forum, Essay)
- `length`: Expected output length
- `baseline_response`: Outputs from **six** strong base models (e.g., GPT-4o, DeepSeek-R1)

> Each task is answered by several base models to support pairwise comparison during evaluation.

### 🧪 Evaluation Protocol

- **Pairwise Comparison**: Model outputs are compared against the baseline responses by LLM judges. Each pair is evaluated twice, with the order of the two responses flipped, to reduce position bias.
- **Elo Scoring**: The pairwise results are aggregated into Elo scores to track model performance.

### 📖 Citation

If you use this dataset, please cite:

```bibtex
@misc{wu2025longwriterzeromasteringultralongtext,
  title={LongWriter-Zero: Mastering Ultra-Long Text Generation via Reinforcement Learning},
  author={Yuhao Wu and Yushi Bai and Zhiqiang Hu and Roy Ka-Wei Lee and Juanzi Li},
  year={2025},
  eprint={2506.18841},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2506.18841},
}
```
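
The evaluation protocol described above (pairwise comparison in both orders, then Elo aggregation) can be illustrated with a minimal sketch. This is not the authors' evaluation code. The judge signature, the `length_judge` stand-in, the starting rating of 1000, and the K-factor of 32 are all assumptions made for illustration; the dataset card does not specify them.

```python
from typing import Callable, Dict

# Hypothetical judge signature (an assumption, not from the dataset card):
# (question, response_shown_first, response_shown_second) -> "A", "B", or "tie",
# where "A" means the response shown first is better.
Judge = Callable[[str, str, str], str]


def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expected score of a player rated r_a against one rated r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def compare_both_orders(judge: Judge, question: str,
                        candidate: str, baseline: str) -> float:
    """Judge the pair twice with flipped order; return the candidate's mean score in [0, 1]."""
    first = judge(question, candidate, baseline)   # candidate shown first
    second = judge(question, baseline, candidate)  # candidate shown second
    score = {"A": 1.0, "B": 0.0, "tie": 0.5}[first]
    score += {"A": 0.0, "B": 1.0, "tie": 0.5}[second]
    return score / 2.0


def update_elo(ratings: Dict[str, float], a: str, b: str,
               score_a: float, k: float = 32.0) -> None:
    """Apply one Elo update in place; k=32 is an assumed K-factor."""
    e_a = expected_score(ratings[a], ratings[b])
    ratings[a] += k * (score_a - e_a)
    ratings[b] += k * ((1.0 - score_a) - (1.0 - e_a))


def length_judge(question: str, first: str, second: str) -> str:
    """Toy stand-in for an LLM judge: prefers the longer response."""
    if len(first) == len(second):
        return "tie"
    return "A" if len(first) > len(second) else "B"


if __name__ == "__main__":
    ratings = {"candidate": 1000.0, "baseline": 1000.0}  # assumed starting ratings
    s = compare_both_orders(length_judge, "Write a short post.",
                            "a somewhat longer draft", "short")
    update_elo(ratings, "candidate", "baseline", s)
    print(ratings)
```

Averaging the two orderings means a judge that always favors whichever response it sees first contributes a score of 0.5, that is, no net preference, which is the point of flipping the order.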

Provider: maas
Created: 2025-07-15